Without this patch, applications with two different stack
regions (eg: native stack vs JIT stack) get truncated
callchains even when RBP chaining is present. GDB shows proper
stack traces and the frame pointer chaining is intact.
This patch disables the (fp < RSP) check, hoping that other checks
in the code save the day for us. In our limited testing, this
didn't seem to break anything.
In the long term, we could potentially have userspace advise
the kernel on the range of valid stack addresses, so we don't
spend a lot of time unwinding from bogus addresses.
Signed-off-by: Arun Sharma <asharma@fb.com>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334961696-19580-2-git-send-email-asharma@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Afaict there's no need to (incompletely) iterate the
MEM_UOPS_RETIRED.* umask state.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Implement rudimentary IVB perf support. The SDM states its identical
to SNB with exception of the exact event tables, but a quick look
suggests they're similar enough.
Also mark SNB-EP as broken for now.
Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there's finally a chip with working PEBS (IvyBridge), we can
enable the hardware and implement cycles:p for SNB/IVB.
Cc: Stephane Eranian <eranian@google.com>
Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Zheng Yan reported that event group validation can wreck event state
when Intel extra_reg allocation changes event state.
Validation shouldn't change any persistent state. Cloning events in
validate_{event,group}() isn't really pretty either, so add a few
special cases to avoid modifying the event state.
The code is restructured to minimize the special case impact.
Reported-by: Zheng Yan <zheng.z.yan@linux.intel.com>
Acked-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338903031.28282.175.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 316ad24830 ("sched/x86: Rewrite set_cpu_sibling_map()")
broke the booted_cores accounting.
The problem is that the booted_cores accounting needs all the
sibling links set up. So restore the second loop and add a comment as
to why its needed.
On qemu booted with -smp sockets=1,cores=2,threads=2;
Before:
$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 1
cpu cores : 4
cpu cores : 3
With the patch:
$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 2
cpu cores : 2
cpu cores : 2
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120531073738.GH7511@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In current Linux, percpu variable `vector_irq' is not cleared on
offlined cpus while disabling devices' irqs. If the cpu that has
the disabled irqs in vector_irq is hotplugged,
__setup_vector_irq() hits invalid irq vector and may crash.
This bug can be reproduced as following;
# echo 0 > /sys/devices/system/cpu/cpu7/online
# modprobe -r some_driver_using_interrupts # vector_irq@cpu7 uncleared
# echo 1 > /sys/devices/system/cpu/cpu7/online # kernel may crash
This patch fixes this bug by clearing vector_irq in
__clear_irq_vector() even if the cpu is offlined.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: ltc-kernel@ml.yrl.intra.hitachi.co.jp
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/4FC340BE.7080101@hitachi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When rebooting our 24 CPU Westmere servers with 3.4-rc6, we
always see this warning msg:
Restarting system.
machine restart
------------[ cut here ]------------
WARNING: at arch/x86/kernel/smp.c:125
native_smp_send_reschedule+0x74/0xa7() Hardware name: X8DTN
Modules linked in: igb [last unloaded: scsi_wait_scan]
Pid: 1, comm: systemd-shutdow Not tainted 3.4.0-rc6+ #22
Call Trace:
<IRQ> [<ffffffff8102a41f>] warn_slowpath_common+0x7e/0x96
[<ffffffff8102a44c>] warn_slowpath_null+0x15/0x17
[<ffffffff81018cf7>] native_smp_send_reschedule+0x74/0xa7
[<ffffffff810561c1>] trigger_load_balance+0x279/0x2a6
[<ffffffff81050112>] scheduler_tick+0xe0/0xe9
[<ffffffff81036768>] update_process_times+0x60/0x70
[<ffffffff81062f2f>] tick_sched_timer+0x68/0x92
[<ffffffff81046e33>] __run_hrtimer+0xb3/0x13c
[<ffffffff81062ec7>] ? tick_nohz_handler+0xd0/0xd0
[<ffffffff810474f2>] hrtimer_interrupt+0xdb/0x198
[<ffffffff81019a35>] smp_apic_timer_interrupt+0x81/0x94
[<ffffffff81655187>] apic_timer_interrupt+0x67/0x70
<EOI> [<ffffffff8101a3c4>] ? default_send_IPI_mask_allbutself_phys+0xb4/0xc4
[<ffffffff8101c680>] physflat_send_IPI_allbutself+0x12/0x14
[<ffffffff81018db4>] native_nmi_stop_other_cpus+0x8a/0xd6
[<ffffffff810188ba>] native_machine_shutdown+0x50/0x67
[<ffffffff81018926>] machine_shutdown+0xa/0xc
[<ffffffff8101897e>] native_machine_restart+0x20/0x32
[<ffffffff810189b0>] machine_restart+0xa/0xc
[<ffffffff8103b196>] kernel_restart+0x47/0x4c
[<ffffffff8103b2e6>] sys_reboot+0x13e/0x17c
[<ffffffff8164e436>] ? _raw_spin_unlock_bh+0x10/0x12
[<ffffffff810fcac9>] ? bdi_queue_work+0xcf/0xd8
[<ffffffff810fe82f>] ? __bdi_start_writeback+0xae/0xb7
[<ffffffff810e0d64>] ? iterate_supers+0xa3/0xb7
[<ffffffff816547a2>] system_call_fastpath+0x16/0x1b
---[ end trace 320af5cb1cb60c5b ]---
The root cause seems to be the
default_send_IPI_mask_allbutself_phys() takes quite some time (I
measured it could be several ms) to complete sending NMIs to all
the other 23 CPUs, and for HZ=250/1000 system, the time is long
enough for a timer interrupt to happen, which will in turn
trigger to kick load balance to a stopped CPU and cause this
warning in native_smp_send_reschedule().
So disabling the local irq before stop_other_cpu() can fix this
problem (tested 25 times reboot ok), and it is fine as there
should be nobody caring the timer interrupt in such reboot
stage.
The latest 3.4 kernel slightly changes this behavior by sending
REBOOT_VECTOR first and only send NMI_VECTOR if the REBOOT_VCTOR
fails, and this patch is still needed to prevent the problem.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120530231541.4c13433a@feng-i7
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit 82f7af09 ("x86/mce: Cleanup timer mess) dropped the
initialization of the per cpu timer interval. Duh :(
Restore the previous behaviour.
Reported-by: Chen Gong <gong.chen@linux.intel.com>
Cc: bp@amd64.org
Cc: tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the HW implements round-robin interrupt delivery, this
enables multiple cpu's (which are part of the user specified
interrupt smp_affinity mask and belong to the same x2apic
cluster) to service the interrupt.
Also if the platform supports Power Aware Interrupt Routing,
then this enables the interrupt to be routed to an idle cpu or a
busy cpu depending on the perf/power bias tunable.
We are now grouping all the cpu's in a cluster to one vector
domain. So that will limit the total number of interrupt sources
handled by Linux. Previously we support "cpu-count *
available-vectors-per-cpu" interrupt sources but this will now
reduce to "cpu-count/16 * available-vectors-per-cpu".
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yinghai@kernel.org
Cc: gorcunov@openvz.org
Cc: agordeev@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337644682-19854-2-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Until now, irq_cfg domain is mostly static. Either all CPU's
(used by flat mode) or one CPU (first CPU in the irq afffinity
mask) to which irq is being migrated (this is used by the rest
of apic modes).
Upcoming x2apic cluster mode optimization patch allows the irq
to be sent to any CPU in the x2apic cluster (if supported by the
HW). So irq_cfg domain changes on the fly (depending on which
CPU in the x2apic cluster is online).
Instead of checking for any intersection between the new irq
affinity mask and the current irq_cfg domain, check if the new
irq affinity mask is a subset of the current irq_cfg domain.
Otherwise proceed with updating the irq_cfg domain aswell as
assigning vector's on all the CPUs specified in the new mask.
This also cleans up a workaround in updating irq_cfg domain for
legacy irq's that are handled by the IO-APIC.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yinghai@kernel.org
Cc: gorcunov@openvz.org
Cc: agordeev@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337644682-19854-1-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use a more current logging style:
- Bare printks should have a KERN_<LEVEL> for consistency's sake
- Add pr_fmt where appropriate
- Neaten some macro definitions
- Convert some Ok output to OK
- Use "%s: ", __func__ in pr_fmt for summit
- Convert some printks to pr_<level>
Message output is not identical in all cases.
Signed-off-by: Joe Perches <joe@perches.com>
Cc: levinsasha928@gmail.com
Link: http://lkml.kernel.org/r/1337655007.24226.10.camel@joe2Laptop
[ merged two similar patches, tidied up the changelog ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some subarchitectures (such as vSMP) need to slightly adjust the
underlying APIC structure. Add an APIC post-initialization callback
to 'struct x86_platform_ops' for this purpose and use it for
adjusting the APIC structure on vSMP systems.
Signed-off-by: Ido Yariv <ido@wizery.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
Link: http://lkml.kernel.org/r/1338675095-27260-1-git-send-email-ido@wizery.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In commit 82f7af09 (x86/mce: Cleanup timer mess), Thomas just forgot
the "/ 2" there while cleaning up.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Remove NULL assignment of dattr_cur
sched: Remove the last NULL entry from sched_feat_names
sched: Make sched_feat_names const
sched/rt: Fix SCHED_RR across cgroups
sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
sched: Make sure to not re-read variables after validation
sched: Fix SD_OVERLAP
sched: Don't try allocating memory from offline nodes
sched/nohz: Fix rq->cpu_load calculations some more
sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask
ipi_call_lock/unlock() lock resp. unlock call_function.lock. This lock
protects only the call_function data structure itself, but it's
completely unrelated to cpu_online_mask. The mask to which the IPIs
are sent is calculated before call_function.lock is taken in
smp_call_function_many(), so the locking around set_cpu_online() is
pointless and can be removed.
[ tglx: Massaged changelog ]
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: ralf@linux-mips.org
Cc: sshtylyov@mvista.com
Cc: david.daney@cavium.com
Cc: nikunj@linux.vnet.ibm.com
Cc: paulmck@linux.vnet.ibm.com
Cc: axboe@kernel.dk
Cc: peterz@infradead.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1338275765-3217-7-git-send-email-yong.zhang0@gmail.com
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Dell Precision M6600 is known to require PCI reboot, so add it to
the reboot blacklist in pci_reboot_dmi_table[].
https://bugzilla.kernel.org/show_bug.cgi?id=42749
cc: x86@kernel.org
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Pull straggler x86 fixes from Peter Anvin:
"Three groups of patches:
- EFI boot stub documentation and the ability to print error messages;
- Removal for PTRACE_ARCH_PRCTL for x32 (obsolete interface which
should never have been ported, and the port is broken and
potentially dangerous.)
- ftrace stack corruption fixes. I'm not super-happy about the
technical implementation, but it is probably the least invasive in
the short term. In the future I would like a single method for
nesting the debug stack, however."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
x86, efi: Add EFI boot stub documentation
x86, efi; Add EFI boot stub console support
x86, efi: Only close open files in error path
ftrace/x86: Do not change stacks in DEBUG when calling lockdep
x86: Allow nesting of the debug stack IDT setting
x86: Reset the debug_stack update counter
ftrace: Use breakpoint method to update ftrace caller
ftrace: Synchronize variable setting with breakpoints
When I added x32 ptrace to 3.4 kernel, I also include PTRACE_ARCH_PRCTL
support for x32 GDB For ARCH_GET_FS/GS, it takes a pointer to int64. But
at user level, ARCH_GET_FS/GS takes a pointer to int32. So I have to add
x32 ptrace to glibc to handle it with a temporary int64 passed to kernel and
copy it back to GDB as int32. Roland suggested that PTRACE_ARCH_PRCTL
is obsolete and x32 GDB should use fs_base and gs_base fields of
user_regs_struct instead.
Accordingly, remove PTRACE_ARCH_PRCTL completely from the x32 code to
avoid possible memory overrun when pointer to int32 is passed to
kernel.
Link: http://lkml.kernel.org/r/CAMe9rOpDzHfS7NH7m1vmD9QRw8SSj4Sc%2BaNOgcWm_WJME2eRsQ@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org> v3.4
If we end up calling do_notify_resume() with !user_mode(refs), it
does nothing (do_signal() explicitly bails out and we can't get there
with TIF_NOTIFY_RESUME in such situations). Then we jump to
resume_userspace_sig, which rechecks the same thing and bails out
to resume_kernel, thus breaking the loop.
It's easier and cheaper to check *before* calling do_notify_resume()
and bail out to resume_kernel immediately. And kill the check in
do_signal()...
Note that on amd64 we can't get there with !user_mode() at all - asm
glue takes care of that.
Acked-and-reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Does block_sigmask() + tracehook_signal_handler(); called when
sigframe has been successfully built. All architectures converted
to it; block_sigmask() itself is gone now (merged into this one).
I'm still not too happy with the signature, but that's a separate
story (IMO we need a structure that would contain signal number +
siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
signal_delivered(), handle_signal() and probably setup...frame() -
take one).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(),
added set_current_blocked() that will exclude unblockable signals, switched
open-coded instances to it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
replace boilerplate "should we use ->saved_sigmask or ->blocked?"
with calls of obvious inlined helper...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
first fruits of ..._restore_sigmask() helpers: now we can take
boilerplate "signal didn't have a handler, clear RESTORE_SIGMASK
and restore the blocked mask from ->saved_mask" into a common
helper. Open-coded instances switched...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When both DYNAMIC_FTRACE and LOCKDEP are set, the TRACE_IRQS_ON/OFF
will call into the lockdep code. The lockdep code can call lots of
functions that may be traced by ftrace. When ftrace is updating its
code and hits a breakpoint, the breakpoint handler will call into
lockdep. If lockdep happens to call a function that also has a breakpoint
attached, it will jump back into the breakpoint handler resetting
the stack to the debug stack and corrupt the contents currently on
that stack.
The 'do_sym' call that calls do_int3() is protected by modifying the
IST table to point to a different location if another breakpoint is
hit. But the TRACE_IRQS_OFF/ON are outside that protection, and if
a breakpoint is hit from those, the stack will get corrupted, and
the kernel will crash:
[ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
[ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
[ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
[ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
[ 1013.310600] CPU 2
[ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1013.401848]
[ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
[ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff
[ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046
[ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
[ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
[ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
[ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
[ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
[ 1013.586108] FS: 0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
[ 1013.609458] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
[ 1013.717028] Stack:
[ 1013.724131] ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
[ 1013.745918] cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
[ 1013.767870] ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
[ 1013.790021] Call Trace:
[ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
[ 1013.861443] RIP [<ffff88014630a000>] 0xffff880146309fff
[ 1013.884466] RSP <ffff88014780f408>
[ 1013.901507] CR2: 0000000000000002
The solution was to reuse the NMI functions that change the IDT table to make the debug
stack keep its current stack (in kernel mode) when hitting a breakpoint:
call debug_stack_set_zero
TRACE_IRQS_ON
call debug_stack_reset
If the TRACE_IRQS_ON happens to hit a breakpoint then it will keep the current stack
and not crash the box.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the NMI handler runs, it checks if it preempted a debug handler
and if that handler is using the debug stack. If it is, it changes the
IDT table not to update the stack, otherwise it will reset the debug
stack and corrupt the debug handler it preempted.
Now that ftrace uses breakpoints to change functions from nops to
callers, many more places may hit a breakpoint. Unfortunately this
includes some of the calls that lockdep performs. Which causes issues
with the debug stack. It too needs to change the debug stack before
tracing (if called from the debug handler).
Allow the debug_stack_set_zero() and debug_stack_reset() to be nested
so that the debug handlers can take advantage of them too.
[ Used this_cpu_*() over __get_cpu_var() as suggested by H. Peter Anvin ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When an NMI goes off and it sees that it preempted the debug stack,
to keep the debug stack safe, it changes the IDT to point to one that
does not modify the stack on breakpoint (to allow breakpoints in NMIs).
But the variable that gets set to know to undo it on exit never gets
cleared on exit. Thus every NMI will reset it on exit the first time
it is done even if it does not need to be reset.
[ Added H. Peter Anvin's suggestion to use this_cpu_read/write ]
Cc: <stable@vger.kernel.org> # v3.3
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On boot up and module load, it is fine to modify the code directly,
without the use of breakpoints. This is because boot up modification
is done before SMP is initialized, thus the modification is serial,
and module load is done before the module executes.
But after that we must use a SMP safe method to modify running code.
Otherwise, if we are running the function tracer and update its
function (by starting off the stack tracer, or perf tracing)
the change of the function called by the ftrace trampoline is done
directly. If this is being executed on another CPU, that CPU may
take a GPF and crash the kernel.
The breakpoint method is used to change the nops at all the functions, but
the change of the ftrace callback handler itself was still using a
direct modification. If tracing was enabled and the function callback
was changed then another CPU could fault if it was currently calling
the original callback. This modification must use the breakpoint method
too.
Note, the direct method is still used for boot up and module load.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the function tracer starts modifying the code via breakpoints
it sets a variable (modifying_ftrace_code) to inform the breakpoint
handler to call the ftrace int3 code.
But there's no synchronization between setting this code and the
handler, thus it is possible for the handler to be called on another
CPU before it sees the variable. This will cause a kernel crash as
the int3 handler will not know what to do with it.
I originally added smp_mb()'s to force the visibility of the variable
but H. Peter Anvin suggested that I just make it atomic.
[ Added comments as suggested by Peter Zijlstra ]
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull second pile of signal handling patches from Al Viro:
"This one is just task_work_add() series + remaining prereqs for it.
There probably will be another pull request from that tree this
cycle - at least for helpers, to get them out of the way for per-arch
fixes remaining in the tree."
Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
had brought in commit 97fd75b7b8 ("kernel/irq/manage.c: use the
pr_foo() infrastructure to prefix printks") which changed one of the
pr_err() calls that this merge moves around.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
keys: kill task_struct->replacement_session_keyring
keys: kill the dummy key_replace_session_keyring()
keys: change keyctl_session_to_parent() to use task_work_add()
genirq: reimplement exit_irq_thread() hook via task_work_add()
task_work_add: generic process-context callbacks
avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
parisc: need to check NOTIFY_RESUME when exiting from syscall
move key_repace_session_keyring() into tracehook_notify_resume()
TIF_NOTIFY_RESUME is defined on all targets now
Use unsigned long for dealing with jiffies not int. Rename the
callback to something sensible. Use __this_cpu_read/write for
accessing per cpu data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
When boot on sun G5+ with 4T mem, see an overflow in mtrr cleanup as below.
*BAD*gran_size: 2G chunk_size: 2G num_reg: 10 lose cover RAM:
-18014398505283592M
This is because 1<<31 sign extended. Use an unsigned long constant to
fix it. Useful for mem larger than or equal to 4T.
-v2: Use 64bit constant instead of explicit type conversion as suggested
by Yinghai. Description updated too.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: http://lkml.kernel.org/r/4FC5A77F.6060505@oracle.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Commit commit 8e7fbcbc2 ("sched: Remove stale power aware scheduling
remnants and dysfunctional knobs") made a boo-boo with removing the
power aware scheduling muck from the x86 topology bits.
We should unconditionally use the llc_shared mask for multi-core.
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/n/tip-lsksc2kfyeveb13avh327p0d@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 trampoline rework from H. Peter Anvin:
"This code reworks all the "trampoline"/"realmode" code (various bits
that need to live in the first megabyte of memory, most but not all of
which runs in real mode at some point) in the kernel into a single
object. The main reason for doing this is that it eliminates the last
place in the kernel where we needed pages to be mapped RWX. This code
separates all that code into proper R/RW/RX pages."
Fix up conflicts in arch/x86/kernel/Makefile (mca removed next to reboot
code), and arch/x86/kernel/reboot.c (reboot code moved around in one
branch, modified in this one), and arch/x86/tools/relocs.c (mostly same
code came in earlier due to working around the ld bugs just before the
3.4 release).
Also remove stale x86-relocs entry from scripts/.gitignore as per Peter
Anvin.
* commit '61f5446169046c217a5479517edac3a890c3bee7': (36 commits)
x86, realmode: Move end signature into header.S
x86, relocs: When printing an error, say relative or absolute
x86, relocs: More relocations which may end up as absolute
x86, relocs: Workaround for binutils 2.22.52.0.1 section bug
xen-acpi-processor: Add missing #include <xen/xen.h>
acpi, bgrd: Add missing <linux/io.h> to drivers/acpi/bgrt.c
x86, realmode: Change EFER to a single u64 field
x86, realmode: Move kernel/realmode.c to realmode/init.c
x86, realmode: Move not-common bits out of trampoline_common.S
x86, realmode: Mask out EFER.LMA when saving trampoline EFER
x86, realmode: Fix no cache bits test in reboot_32.S
x86, realmode: Make sure all generated files are listed in targets
x86, realmode: build fix: remove duplicate build
x86, realmode: read cr4 and EFER from kernel for 64-bit trampoline
x86, realmode: fixes compilation issue in tboot.c
x86, realmode: move relocs from scripts/ to arch/x86/tools
x86, realmode: header for trampoline code
x86, realmode: flattened rm hierachy
x86, realmode: don't copy real_mode_header
x86, realmode: fix 64-bit wakeup sequence
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPv7OiAAoJEKurIx+X31iBQxEP/RV8YO4nrozhHY597qabzfJc
4YoCOma+wUyhXPDmZI80XvrIlcCq7TEJL1HTaAyA5rnyYvRHpM+uXDCRmbJDI4e0
gA42/Y8+lbmR6BLY8sdptCXIWxw/d8wYEKK2BgNsPhkJxODGW/gVAws93erist/v
yepq+GwI0QGAeRlO6AYgE7NwQmHXK5AdfH3phHUTVABVIUGH5+Zp6471FTB+hYy5
aNvUnL0hw8vxrbpDL/Le359etPqC6wsELCIQ9wVtWCD0/UJM6Yd3j0+CKQ7q/KHU
7zMcP+OCGTJ3koMhEbFIOnxuswWDGq5y/qIzSXMEEemGqgxqFUvX3wUeZ3HFAFNx
nJ7ZaA813t7Bud4G4WwESxMGQpxI7FTvxnF1ow3IlRsMtV4ffvAS9xvWi0GQJrVY
xixK7G87PGAm6fP9Zbb/lQlRO8gD498j4rfI9MOsUuY9QgFNcH2eg6c4O0iHDpos
WkSgUaM49Q610JslrxsXp+BZZLBF/wbcjcFiQGFAWOIKTKgRQ99+dXAQY7fw9CIf
/wNl9MkbOvJdPL9OfLTmAYAMXyaXbOX8qcvItwqcBsUT0AV863NdIXtS4BXBOrMs
5u16CDX1ieFAlA2dzhynvE0Zd1Ws6wfe5W/BgtQ+H175uHFr8pHAxsBTX8GSNrXG
/bSWWrR3CIBRHoWCJMmH
=kG4e
-----END PGP SIGNATURE-----
Merge tag 'x86-mce-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull x86/mce merge window patches from Tony Luck:
"Including two that make error_context() checks less sucky"
* tag 'x86-mce-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
x86/mce: Add instruction recovery signatures to mce-severity table
x86/mce: Fix check for processor context when machine check was taken.
MCE: Fix vm86 handling for 32bit mce handler
x86/mce Add validation check before GHES error is recorded
x86/mce: Avoid reading every machine check bank register twice.
Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
"These patches contain two major updates for DMA mapping subsystem
(mainly for ARM architecture). First one is Contiguous Memory
Allocator (CMA) which makes it possible for device drivers to allocate
big contiguous chunks of memory after the system has booted.
The main difference from the similar frameworks is the fact that CMA
allows to transparently reuse the memory region reserved for the big
chunk allocation as a system memory, so no memory is wasted when no
big chunk is allocated. Once the alloc request is issued, the
framework migrates system pages to create space for the required big
chunk of physically contiguous memory.
For more information one can refer to nice LWN articles:
- 'A reworked contiguous memory allocator':
http://lwn.net/Articles/447405/
- 'CMA and ARM':
http://lwn.net/Articles/450286/
- 'A deep dive into CMA':
http://lwn.net/Articles/486301/
- and the following thread with the patches and links to all previous
versions:
https://lkml.org/lkml/2012/4/3/204
The main client for this new framework is ARM DMA-mapping subsystem.
The second part provides a complete redesign in ARM DMA-mapping
subsystem. The core implementation has been changed to use common
struct dma_map_ops based infrastructure with the recent updates for
new dma attributes merged in v3.4-rc2. This allows to use more than
one implementation of dma-mapping calls and change/select them on the
struct device basis. The first client of this new infractructure is
dmabounce implementation which has been completely cut out of the
core, common code.
The last patch of this redesign update introduces a new, experimental
implementation of dma-mapping calls on top of generic IOMMU framework.
This lets ARM sub-platform to transparently use IOMMU for DMA-mapping
calls if one provides required IOMMU hardware.
For more information please refer to the following thread:
http://www.spinics.net/lists/arm-kernel/msg175729.html
The last patch merges changes from both updates and provides a
resolution for the conflicts which cannot be avoided when patches have
been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."
Acked by Andrew Morton <akpm@linux-foundation.org>:
"Yup, this one please. It's had much work, plenty of review and I
think even Russell is happy with it."
* 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
ARM: dma-mapping: use PMD size for section unmap
cma: fix migration mode
ARM: integrate CMA with DMA-mapping subsystem
X86: integrate CMA with DMA-mapping subsystem
drivers: add Contiguous Memory Allocator
mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
mm: extract reclaim code from __alloc_pages_direct_reclaim()
mm: Serialize access to min_free_kbytes
mm: page_isolation: MIGRATE_CMA isolation functions added
mm: mmzone: MIGRATE_CMA migration type added
mm: page_alloc: change fallbacks array handling
mm: page_alloc: introduce alloc_contig_range()
mm: compaction: export some of the functions
mm: compaction: introduce isolate_freepages_range()
mm: compaction: introduce map_pages()
mm: compaction: introduce isolate_migratepages_range()
mm: page_alloc: remove trailing whitespace
ARM: dma-mapping: add support for IOMMU mapper
ARM: dma-mapping: use alloc, mmap, free from dma_ops
ARM: dma-mapping: remove redundant code and do the cleanup
...
Conflicts:
arch/x86/include/asm/dma-mapping.h
Pull KVM changes from Avi Kivity:
"Changes include additional instruction emulation, page-crossing MMIO,
faster dirty logging, preventing the watchdog from killing a stopped
guest, module autoload, a new MSI ABI, and some minor optimizations
and fixes. Outside x86 we have a small s390 and a very large ppc
update.
Regarding the new (for kvm) rebaseless workflow, some of the patches
that were merged before we switch trees had to be rebased, while
others are true pulls. In either case the signoffs should be correct
now."
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h.
I suspect the kvm_para.h resolution ends up doing the "do I have cpuid"
check effectively twice (it was done differently in two different
commits), but better safe than sorry ;)
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits)
KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block
KVM: s390: onereg for timer related registers
KVM: s390: epoch difference and TOD programmable field
KVM: s390: KVM_GET/SET_ONEREG for s390
KVM: s390: add capability indicating COW support
KVM: Fix mmu_reload() clash with nested vmx event injection
KVM: MMU: Don't use RCU for lockless shadow walking
KVM: VMX: Optimize %ds, %es reload
KVM: VMX: Fix %ds/%es clobber
KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
KVM: VMX: unlike vmcs on fail path
KVM: PPC: Emulator: clean up SPR reads and writes
KVM: PPC: Emulator: clean up instruction parsing
kvm/powerpc: Add new ioctl to retreive server MMU infos
kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM
KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler
KVM: PPC: Book3S: Enable IRQs during exit handling
KVM: PPC: Fix PR KVM on POWER7 bare metal
KVM: PPC: Fix stbux emulation
KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields
...
The interrupt chip irq_set_affinity() functions copy the affinity mask
to irq_data->affinity but return 0, i.e. IRQ_SET_MASK_OK.
IRQ_SET_MASK_OK causes the core code to do another redundant copy.
Return IRQ_SET_MASK_OK_NOCOPY to avoid this.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Keping Chen <chenkeping@huawei.com>
Link: http://lkml.kernel.org/r/1333120296-13563-4-git-send-email-jiang.liu@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This reverts commit cf8ff6b6ab.
Just found this commit is a function duplicatation of commit 6b617e22
"x86/platform: Add a wallclock_init func to x86_init.timers ops".
Let's revert it and sorry for the noise.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timer updates from Thomas Gleixner.
Various trivial conflict fixups in arch Kconfig due to addition of
unrelated entries nearby. And one slightly more subtle one for sparc32
(new user of GENERIC_CLOCKEVENTS), fixed up as per Thomas.
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
timekeeping: Fix a few minor newline issues.
time: remove obsolete declaration
ntp: Fix a stale comment and a few stray newlines.
ntp: Correct TAI offset during leap second
timers: Fixup the Kconfig consolidation fallout
x86: Use generic time config
unicore32: Use generic time config
um: Use generic time config
tile: Use generic time config
sparc: Use: generic time config
sh: Use generic time config
score: Use generic time config
s390: Use generic time config
openrisc: Use generic time config
powerpc: Use generic time config
mn10300: Use generic time config
mips: Use generic time config
microblaze: Use generic time config
m68k: Use generic time config
m32r: Use generic time config
...
Pull user-space probe instrumentation from Ingo Molnar:
"The uprobes code originates from SystemTap and has been used for years
in Fedora and RHEL kernels. This version is much rewritten, reviews
from PeterZ, Oleg and myself shaped the end result.
This tree includes uprobes support in 'perf probe' - but SystemTap
(and other tools) can take advantage of user probe points as well.
Sample usage of uprobes via perf, for example to profile malloc()
calls without modifying user-space binaries.
First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.
If you don't know which function you want to probe you can pick one
from 'perf top' or can get a list all functions that can be probed
within libc (binaries can be specified as well):
$ perf probe -F -x /lib/libc.so.6
To probe libc's malloc():
$ perf probe -x /lib64/libc.so.6 malloc
Added new event:
probe_libc:malloc (on 0x7eac0)
You can now use it in all perf tools, such as:
perf record -e probe_libc:malloc -aR sleep 1
Make use of it to create a call graph (as the flat profile is going to
look very boring):
$ perf record -e probe_libc:malloc -gR make
[ perf record: Woken up 173 times to write data ]
[ perf record: Captured and wrote 44.190 MB perf.data (~1930712
$ perf report | less
32.03% git libc-2.15.so [.] malloc
|
--- malloc
29.49% cc1 libc-2.15.so [.] malloc
|
--- malloc
|
|--0.95%-- 0x208eb1000000000
|
|--0.63%-- htab_traverse_noresize
11.04% as libc-2.15.so [.] malloc
|
--- malloc
|
7.15% ld libc-2.15.so [.] malloc
|
--- malloc
|
5.07% sh libc-2.15.so [.] malloc
|
--- malloc
|
4.99% python-config libc-2.15.so [.] malloc
|
--- malloc
|
4.54% make libc-2.15.so [.] malloc
|
--- malloc
|
|--7.34%-- glob
| |
| |--93.18%-- 0x41588f
| |
| --6.82%-- glob
| 0x41588f
...
Or:
$ perf report -g flat | less
# Overhead Command Shared Object Symbol
# ........ ............. ............. ..........
#
32.03% git libc-2.15.so [.] malloc
27.19%
malloc
29.49% cc1 libc-2.15.so [.] malloc
24.77%
malloc
11.04% as libc-2.15.so [.] malloc
11.02%
malloc
7.15% ld libc-2.15.so [.] malloc
6.57%
malloc
...
The core uprobes design is fairly straightforward: uprobes probe
points register themselves at (inode:offset) addresses of
libraries/binaries, after which all existing (or new) vmas that map
that address will have a software breakpoint injected at that address.
vmas are COW-ed to preserve original content. The probe points are
kept in an rbtree.
If user-space executes the probed inode:offset instruction address
then an event is generated which can be recovered from the regular
perf event channels and mmap-ed ring-buffer.
Multiple probes at the same address are supported, they create a
dynamic callback list of event consumers.
The basic model is further complicated by the XOL speedup: the
original instruction that is probed is copied (in an architecture
specific fashion) and executed out of line when the probe triggers.
The XOL area is a single vma per process, with a fixed number of
entries (which limits probe execution parallelism).
The API: uprobes are installed/removed via
/sys/kernel/debug/tracing/uprobe_events, the API is integrated to
align with the kprobes interface as much as possible, but is separate
to it.
Injecting a probe point is privileged operation, which can be relaxed
by setting perf_paranoid to -1.
You can use multiple probes as well and mix them with kprobes and
regular PMU events or tracepoints, when instrumenting a task."
Fix up trivial conflicts in mm/memory.c due to previous cleanup of
unmap_single_vma().
* 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
perf probe: Detect probe target when m/x options are absent
perf probe: Provide perf interface for uprobes
tracing: Fix kconfig warning due to a typo
tracing: Provide trace events interface for uprobes
tracing: Extract out common code for kprobes/uprobes trace events
tracing: Modify is_delete, is_return from int to bool
uprobes/core: Decrement uprobe count before the pages are unmapped
uprobes/core: Make background page replacement logic account for rss_stat counters
uprobes/core: Optimize probe hits with the help of a counter
uprobes/core: Allocate XOL slots for uprobes use
uprobes/core: Handle breakpoint and singlestep exceptions
uprobes/core: Rename bkpt to swbp
uprobes/core: Make order of function parameters consistent across functions
uprobes/core: Make macro names consistent
uprobes: Update copyright notices
uprobes/core: Move insn to arch specific structure
uprobes/core: Remove uprobe_opcode_sz
uprobes/core: Make instruction tables volatile
uprobes: Move to kernel/events/
uprobes/core: Clean up, refactor and improve the code
...
Pull first series of signal handling cleanups from Al Viro:
"This is just the first part of the queue (about a half of it);
assorted fixes all over the place in signal handling.
This one ends with all sigsuspend() implementations switched to
generic one (->saved_sigmask-based).
With this, a bunch of assorted old buglets are fixed and most of the
missing bits of NOTIFY_RESUME hookup are in place. Two more fixes sit
in arm and um trees respectively, and there's a couple of broken ones
that need obvious fixes - parisc and avr32 check TIF_NOTIFY_RESUME
only on one of two codepaths; fixes for that will happen in the next
series"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (55 commits)
unicore32: if there's no handler we need to restore sigmask, syscall or no syscall
xtensa: add handling of TIF_NOTIFY_RESUME
microblaze: drop 'oldset' argument of do_notify_resume()
microblaze: handle TIF_NOTIFY_RESUME
score: add handling of NOTIFY_RESUME to do_notify_resume()
m68k: add TIF_NOTIFY_RESUME and handle it.
sparc: kill ancient comment in sparc_sigaction()
h8300: missing checks of __get_user()/__put_user() return values
frv: missing checks of __get_user()/__put_user() return values
cris: missing checks of __get_user()/__put_user() return values
powerpc: missing checks of __get_user()/__put_user() return values
sh: missing checks of __get_user()/__put_user() return values
sparc: missing checks of __get_user()/__put_user() return values
avr32: struct old_sigaction is never used
m32r: struct old_sigaction is never used
xtensa: xtensa_sigaction doesn't exist
alpha: tidy signal delivery up
score: don't open-code force_sigsegv()
cris: don't open-code force_sigsegv()
blackfin: don't open-code force_sigsegv()
...
Pull the MCA deletion branch from Paul Gortmaker:
"It was good that we could support MCA machines back in the day, but
realistically, nobody is using them anymore. They were mostly limited
to 386-sx 16MHz CPU and some 486 class machines and never more than
64MB of RAM. Even the enthusiast hobbyist community seems to have
dried up close to ten years ago, based on what you can find searching
various websites dedicated to the relatively short lived hardware.
So lets remove the support relating to CONFIG_MCA. There is no point
carrying this forward, wasting cycles doing routine maintenance on it;
wasting allyesconfig build time on validating it, wasting I/O on git
grep'ping over it, and so on."
Let's see if anybody screams. It generally has compiled, and James
Bottomley pointed out that there was a MCA extension from NCR that
allowed for up to 4GB of memory and PPro-class machines. So in *theory*
there may be users out there.
But even James (technically listed as a maintainer) doesn't actually
have a system, and while Alan Cox claims to have a machine in his cellar
that he offered to anybody who wants to take it off his hands, he didn't
argue for keeping MCA support either.
So we could bring it back. But somebody had better speak up and talk
about how they have actually been using said MCA hardware with modern
kernels for us to do that. And David already took the patch to delete
all the networking driver code (commit a5e371f61a: "drivers/net:
delete all code/drivers depending on CONFIG_MCA").
* 'delete-mca' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
MCA: delete all remaining traces of microchannel bus support.
scsi: delete the MCA specific drivers and driver code
serial: delete the MCA specific 8250 support.
arm: remove ability to select CONFIG_MCA
Instruction recovery cases are very similar to the data recovery one
we already have. Just trade out for a new MCACOD value.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Linus pointed out that there was no value is checking whether m->ip
was zero - because zero is a legimate value. If we have a reliable
(or faked in the VM86 case) "m->cs" we can use it to tell whether we
were in user mode or kernelwhen the machine check hit.
Reported-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
When running on 32bit the mce handler could misinterpret
vm86 mode as ring 0. This can affect whether it does recovery
or not; it was possible to panic when recovery was actually
possible.
Fix this by always forcing vm86 to look like ring 3.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull perf fixes from Ingo Molnar:
- Leftover AMD PMU driver fix fix from the end of the v3.4
stabilization cycle.
- Late tools/perf/ changes that missed the first round:
* endianness fixes
* event parsing improvements
* libtraceevent fixes factored out from trace-cmd
* perl scripting engine fixes related to libtraceevent,
* testcase improvements
* perf inject / pipe mode fixes
* plus a kernel side fix
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Update event scheduling constraints for AMD family 15h models
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "sched, perf: Use a single callback into the scheduler"
perf evlist: Show event attribute details
perf tools: Bump default sample freq to 4 kHz
perf buildid-list: Work better with pipe mode
perf tools: Fix piped mode read code
perf inject: Fix broken perf inject -b
perf tools: rename HEADER_TRACE_INFO to HEADER_TRACING_DATA
perf tools: Add union u64_swap type for swapping u64 data
perf tools: Carry perf_event_attr bitfield throught different endians
perf record: Fix documentation for branch stack sampling
perf target: Add cpu flag to sample_type if target has cpu
perf tools: Always try to build libtraceevent
perf tools: Rename libparsevent to libtraceevent in Makefile
perf script: Rename struct event to struct event_format in perl engine
perf script: Explicitly handle known default print arg type
perf tools: Add hardcoded name term for pmu events
perf tools: Separate 'mem:' event scanner bits
perf tools: Use allocated list for each parsed event
perf tools: Add support for displaying event parser debug info
perf test: Move parse event automated tests to separated object
Pull x86 reboot changes from Ingo Molnar:
"The biggest change is a gentler method of rebooting/stopping via IRQs
first and then via NMIs. There are several cleanups in the tree as
well."
* 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/reboot: Update nonmi_ipi parameter
x86/reboot: Use NMI to assist in shutting down if IRQ fails
Revert "x86, reboot: Use NMI instead of REBOOT_VECTOR to stop cpus"
x86/reboot: Clean up coding style
x86/reboot: Reduce to a single DMI table for reboot quirks
Pull x86 platform changes from Ingo Molnar:
"This tree includes assorted platform driver updates and a preparatory
series for a platform with custom DMA remapping semantics (sta2x11 I/O
hub)."
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vsmp: Fix number of CPUs when vsmp is disabled
keyboard: Use BIOS Keyboard variable to set Numlock
x86/olpc/xo1/sci: Report RTC wakeup events
x86/olpc/xo1/sci: Produce wakeup events for buttons and switches
x86, platform: Initial support for sta2x11 I/O hub
x86: Introduce CONFIG_X86_DMA_REMAP
x86-32: Introduce CONFIG_X86_DEV_DMA_OPS
Pull MCE updates from Ingo Molnar:
"This tree updates/fixes MCE hardware support, it makes the APIC LVT
thresholding interrupt optional because a subset of AMD F15h models
don't support it."
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, MCE, AMD: Disable error thresholding bank 4 on some models
x86, MCE, AMD: Hide interrupt_enable sysfs node
x86, MCE, AMD: Make APIC LVT thresholding interrupt optional
Pull fpu state cleanups from Ingo Molnar:
"This tree streamlines further aspects of FPU handling by eliminating
the prepare_to_copy() complication and moving that logic to
arch_dup_task_struct().
It also fixes the FPU dumps in threaded core dumps, removes and old
(and now invalid) assumption plus micro-optimizes the exit path by
avoiding an FPU save for dead tasks."
Fixed up trivial add-add conflict in arch/sh/kernel/process.c that came
in because we now do the FPU handling in arch_dup_task_struct() rather
than the legacy (and now gone) prepare_to_copy().
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, fpu: drop the fpu state during thread exit
x86, xsave: remove thread_has_fpu() bug check in __sanitize_i387_state()
coredump: ensure the fpu state is flushed for proper multi-threaded core dump
fork: move the real prepare_to_copy() users to arch_dup_task_struct()
Pull exception table generation updates from Ingo Molnar:
"The biggest change here is to allow the build-time sorting of the
exception table, to speed up booting. This is achieved by the
architecture enabling BUILDTIME_EXTABLE_SORT. This option is enabled
for x86 and MIPS currently.
On x86 a number of fixes and changes were needed to allow build-time
sorting of the exception table, in particular a relocation invariant
exception table format was needed. This required the abstracting out
of exception table protocol and the removal of 20 years of accumulated
assumptions about the x86 exception table format.
While at it, this tree also cleans up various other aspects of
exception handling, such as early(er) exception handling for
rdmsr_safe() et al.
All in one, as the result of these changes the x86 exception code is
now pretty nice and modern. As an added bonus any regressions in this
code will be early and violent crashes, so if you see any of those,
you'll know whom to blame!"
Fix up trivial conflicts in arch/{mips,x86}/Kconfig files due to nearby
modifications of other core architecture options.
* 'x86-extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
Revert "x86, extable: Disable presorted exception table for now"
scripts/sortextable: Handle relative entries, and other cleanups
x86, extable: Switch to relative exception table entries
x86, extable: Disable presorted exception table for now
x86, extable: Add _ASM_EXTABLE_EX() macro
x86, extable: Remove open-coded exception table entries in arch/x86/ia32/ia32entry.S
x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/xsave.h
x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/kvm_host.h
x86, extable: Remove the now-unused __ASM_EX_SEC macros
x86, extable: Remove open-coded exception table entries in arch/x86/xen/xen-asm_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/um/checksum_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/usercopy_32.c
x86, extable: Remove open-coded exception table entries in arch/x86/lib/putuser.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/getuser.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/csum-copy_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_nocache_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/checksum_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/kernel/test_rodata.c
x86, extable: Remove open-coded exception table entries in arch/x86/kernel/entry_64.S
...
Pull x86/urgent branch from Ingo Molnar:
"These are the fixes left over from the very end of the v3.4
stabilization cycle, plus one more fix."
Ugh. Those KERN_CONT additions are just pointless. I think they came
as a reaction to some of the early (broken) printk() work - but that was
fixed before it was merged.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, relocs: Build clean fix
x86, printk: Add missing KERN_CONT to NMI selftest
x86: Fix boot on Twinhead H12Y
Got bitten again by the BIT() macro:
arch/x86/kernel/cpu/mcheck/mce.c: In function '__mcheck_cpu_apply_quirks':
arch/x86/kernel/cpu/mcheck/mce.c:1453:6: warning: left shift
count >= width of type arch/x86/kernel/cpu/mcheck/mce.c:1454:7: warning: left shift count >= width of type
Fix it already.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337684026-19740-2-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull trivial updates from Jiri Kosina:
"As usual, it's mostly typo fixes, redundant code elimination and some
documentation updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
edac, mips: don't change code that has been removed in edac/mips tree
xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
lib: Change mail address of Oskar Schirmer
net: Change mail address of Oskar Schirmer
arm/m68k: Change mail address of Sebastian Hess
i2c: Change mail address of Oskar Schirmer
net: Fix tcp_build_and_update_options comment in struct tcp_sock
atomic64_32.h: fix parameter naming mismatch
Kconfig: replace "--- help ---" with "---help---"
c2port: fix bogus Kconfig "default no"
edac: Fix spelling errors.
qla1280: Remove redundant NULL check before release_firmware() call
remoteproc: remove redundant NULL check before release_firmware()
qla2xxx: Remove redundant NULL check before release_firmware() call.
aic94xx: Get rid of redundant NULL check before release_firmware() call
tehuti: delete redundant NULL check before release_firmware()
qlogic: get rid of a redundant test for NULL before call to release_firmware()
bna: remove redundant NULL test before release_firmware()
tg3: remove redundant NULL test before release_firmware() call
typhoon: get rid of redundant conditional before all to release_firmware()
...
Pull x86/apic changes from Ingo Molnar:
"Most of the changes are about helping virtualized guest kernels
achieve better performance."
Fix up trivial conflicts with the iommu updates to arch/x86/kernel/apic/io_apic.c
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Implement EIO micro-optimization
x86/apic: Add apic->eoi_write() callback
x86/apic: Use symbolic APIC_EOI_ACK
x86/apic: Fix typo EIO_ACK -> EOI_ACK and document it
x86/xen/apic: Add missing #include <xen/xen.h>
x86/apic: Only compile local function if used with !CONFIG_GENERIC_PENDING_IRQ
x86/apic: Fix UP boot crash
x86: Conditionally update time when ack-ing pending irqs
xen/apic: implement io apic read with hypercall
Revert "xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'"
xen/x86: Implement x86_apic_ops
x86/apic: Replace io_apic_ops with x86_io_apic_ops.
Pull scheduler changes from Ingo Molnar:
"The biggest change is the cleanup/simplification of the load-balancer:
instead of the current practice of architectures twiddling scheduler
internal data structures and providing the scheduler domains in
colorfully inconsistent ways, we now have generic scheduler code in
kernel/sched/core.c:sched_init_numa() that looks at the architecture's
node_distance() parameters and (while not fully trusting it) deducts a
NUMA topology from it.
This inevitably changes balancing behavior - hopefully for the better.
There are various smaller optimizations, cleanups and fixlets as well"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
sched: Remove stale power aware scheduling remnants and dysfunctional knobs
sched/debug: Fix printing large integers on 32-bit platforms
sched/fair: Improve the ->group_imb logic
sched/nohz: Fix rq->cpu_load[] calculations
sched/numa: Don't scale the imbalance
sched/fair: Revert sched-domain iteration breakage
sched/x86: Rewrite set_cpu_sibling_map()
sched/numa: Fix the new NUMA topology bits
sched/numa: Rewrite the CONFIG_NUMA sched domain support
sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
sched/fair: Add some serialization to the sched_domain load-balance walk
sched/fair: Let minimally loaded cpu balance the group
sched: Change rq->nr_running to unsigned int
x86/numa: Check for nonsensical topologies on real hw as well
x86/numa: Hard partition cpu topology masks on node boundaries
x86/numa: Allow specifying node_distance() for numa=fake
x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
sched: Update documentation and comments
sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()
Pull perf changes from Ingo Molnar:
"Lots of changes:
- (much) improved assembly annotation support in perf report, with
jump visualization, searching, navigation, visual output
improvements and more.
- kernel support for AMD IBS PMU hardware features. Notably 'perf
record -e cycles:p' and 'perf top -e cycles:p' should work without
skid now, like PEBS does on the Intel side, because it takes
advantage of IBS transparently.
- the libtracevents library: it is the first step towards unifying
tracing tooling and perf, and it also gives a tracing library for
external tools like powertop to rely on.
- infrastructure: various improvements and refactoring of the UI
modules and related code
- infrastructure: cleanup and simplification of the profiling
targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.)
- tons of robustness fixes all around
- various ftrace updates: speedups, cleanups, robustness
improvements.
- typing 'make' in tools/ will now give you a menu of projects to
build and a short help text to explain what each does.
- ... and lots of other changes I forgot to list.
The perf record make bzImage + perf report regression you reported
should be fixed."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits)
tracing: Remove kernel_lock annotations
tracing: Fix initial buffer_size_kb state
ring-buffer: Merge separate resize loops
perf evsel: Create events initially disabled -- again
perf tools: Split term type into value type and term type
perf hists: Fix callchain ip printf format
perf target: Add uses_mmap field
ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
ftrace: Make ftrace_modify_all_code() global for archs to use
ftrace: Return record ip addr for ftrace_location()
ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
ftrace: Speed up search by skipping pages by address
ftrace: Remove extra helper functions
ftrace: Sort all function addresses, not just per page
tracing: change CPU ring buffer state from tracing_cpumask
tracing: Check return value of tracing_dentry_percpu()
ring-buffer: Reset head page before running self test
ring-buffer: Add integrity check at end of iter read
ring-buffer: Make addition of pages in ring buffer atomic
...
Pull percpu updates from Tejun Heo:
"Contains Alex Shi's three patches to remove percpu_xxx() which overlap
with this_cpu_xxx(). There shouldn't be any functional change."
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: remove percpu_xxx() functions
x86: replace percpu_xxx funcs with this_cpu_xxx
net: replace percpu_xxx funcs with this_cpu_xxx or __this_cpu_xxx
guts of saved_sigmask-based sigsuspend/rt_sigsuspend. Takes
kernel sigset_t *.
Open-coded instances replaced with calling it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull security subsystem updates from James Morris:
"New notable features:
- The seccomp work from Will Drewry
- PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
- Longer security labels for Smack from Casey Schaufler
- Additional ptrace restriction modes for Yama by Kees Cook"
Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
apparmor: fix long path failure due to disconnected path
apparmor: fix profile lookup for unconfined
ima: fix filename hint to reflect script interpreter name
KEYS: Don't check for NULL key pointer in key_validate()
Smack: allow for significantly longer Smack labels v4
gfp flags for security_inode_alloc()?
Smack: recursive tramsmute
Yama: replace capable() with ns_capable()
TOMOYO: Accept manager programs which do not start with / .
KEYS: Add invalidation support
KEYS: Do LRU discard in full keyrings
KEYS: Permit in-place link replacement in keyring list
KEYS: Perform RCU synchronisation on keys prior to key destruction
KEYS: Announce key type (un)registration
KEYS: Reorganise keys Makefile
KEYS: Move the key config into security/keys/Kconfig
KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
Yama: remove an unused variable
samples/seccomp: fix dependencies on arch macros
Yama: add additional ptrace scopes
...
Pull smp hotplug cleanups from Thomas Gleixner:
"This series is merily a cleanup of code copied around in arch/* and
not changing any of the real cpu hotplug horrors yet. I wish I'd had
something more substantial for 3.5, but I underestimated the lurking
horror..."
Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
arch/sparc/include/asm/thread_info_32.h
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
um: Remove leftover declaration of alloc_task_struct_node()
task_allocator: Use config switches instead of magic defines
sparc: Use common threadinfo allocator
score: Use common threadinfo allocator
sh-use-common-threadinfo-allocator
mn10300: Use common threadinfo allocator
powerpc: Use common threadinfo allocator
mips: Use common threadinfo allocator
hexagon: Use common threadinfo allocator
m32r: Use common threadinfo allocator
frv: Use common threadinfo allocator
cris: Use common threadinfo allocator
x86: Use common threadinfo allocator
c6x: Use common threadinfo allocator
fork: Provide kmemcache based thread_info allocator
tile: Use common threadinfo allocator
fork: Provide weak arch_release_[task_struct|thread_info] functions
fork: Move thread info gfp flags to header
fork: Remove the weak insanity
sh: Remove cpu_idle_wait()
...
Pull core locking updates from Ingo Molnar:
"This update:
- extends and simplifies x86 NMI callback handling code to enhance
and fix the HP hw-watchdog driver
- simplifies the x86 NMI callback handling code to fix a kmemcheck
bug.
- enhances the hung-task debugger"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix the type of the nmiaction.flags field
x86/nmi: Fix page faults by nmiaction if kmemcheck is enabled
x86/nmi: Add new NMI queues to deal with IO_CHK and SERR
watchdog, hpwdt: Remove priority option for NMI callback
hung task debugging: Inject NMI when hung and going to panic
Pull iommu core changes from Ingo Molnar:
"The IOMMU changes in this cycle are mostly about factoring out
Intel-VT-d specific IRQ remapping details and introducing struct
irq_remap_ops, in preparation for AMD specific hardware."
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
iommu: Fix off by one in dmar_get_fault_reason()
irq_remap: Fix the 'sub_handle' uninitialized warning
irq_remap: Fix UP build failure
irq_remap: Fix compiler warning with CONFIG_IRQ_REMAP=y
iommu: rename intr_remapping.[ch] to irq_remapping.[ch]
iommu: rename intr_remapping references to irq_remapping
x86, iommu/vt-d: Clean up interfaces for interrupt remapping
iommu/vt-d: Convert MSI remapping setup to remap_ops
iommu/vt-d: Convert free_irte into a remap_ops callback
iommu/vt-d: Convert IR set_affinity function to remap_ops
iommu/vt-d: Convert IR ioapic-setup to use remap_ops
iommu/vt-d: Convert missing apic.c intr-remapping call to remap_ops
iommu/vt-d: Make intr-remapping initialization generic
iommu: Rename intr_remapping files to intel_intr_remapping
Fix this behaviour:
----------------
| NMI testsuite:
--------------------
remote IPI:
ok |
local IPI:
ok |
Revealed due to a new modification to printk().
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Link: http://lkml.kernel.org/r/1336492573-17530-3-git-send-email-levinsasha928@gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch adds support for CMA to dma-mapping subsystem for x86
architecture that uses common pci-dma/pci-nommu implementation. This
allows to test CMA on KVM/QEMU and a lot of common x86 boxes.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Fixes for perf/core:
- Rename some perf_target methods to avoid double negation, from Namhyung Kim.
- Revert change to use per task events with inheritance, from Namhyung Kim.
- Events should start disabled till children starts running, from David Ahern.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPtS5qAAoJEKurIx+X31iB+bAP/1FjUa2Nd53X89HFc6DoLktA
4AshM/JENhAfSbpTfGhg10ZOuwUa8sQ85Sf6yz1CsW6mEiJK/bDFrR1g2KrmejyL
owgQvV6ukPABzfB27tWyXSVVBPmLkJviedLDautVpgPEPVuqauntmpe7fW51b5pf
92MxvYZ6AzgbDIjVaXP7e+kPomvgyM1C/UEvCgoyEcw81h5dchU9NSdXNBS67JS/
uOsMiMJyoNI46haYYbyFgMq3RmpYuxTLFj7qFDlUltyjP+vIvyLs38Ae/vkRMNfV
sYXWRUQlRpvqg4MDIFVZx8FWTufzTm0BMS+Be7JkXKWdF3DAksq6FprOWIxfYi+d
PMxwTFeSJzTINb9n9MiLt3TmuRy3mu37QWd28qJaJciNMkYWbclPqyJmjwsuAMKg
hKSy2FvewIDHTAGOkwaVjS+L8O7j3TNRIAbweFA1d1K4rt6oSfwdn6GZrAb6MTvx
oV0Fe1nAyY9mucyjknBTim3RYZ9qQ7H9SjL8JoaGihSzi988MgBun+iEQOQiwjNl
YJNGIv0wb31qCKFtxU0A8rA0sFdhQRoLgigJfETb5a+2ghtG323oH6ZVWTnB8bp/
g1XOpus222/iJdL2AizMiJeUNV1IZ0SeLn2GoTFQIy9Uf11Zdgu5wLJumUva8EC6
JAACbpBlYufftIn278OH
=tk2M
-----END PGP SIGNATURE-----
Merge tag 'linus-mce-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull a machine check recovery fix from Tony Luck.
I really don't like how the MCE code does some of the things it does,
but this does seem to be an improvement.
* tag 'linus-mce-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
x86/mce: Only restart instruction after machine check recovery if it is safe
Merge reason: We are going to queue up a dependent patch:
"perf tools: Move parse event automated tests to separated object"
That depends on:
commit e7c72d8
perf tools: Add 'G' and 'H' modifiers to event parsing
Conflicts:
tools/perf/builtin-stat.c
Conflicted with the recent 'perf_target' patches when checking the
result of perf_evsel open routines to see if a retry is needed to cope
with older kernels where the exclude guest/host perf_event_attr bits
were not used.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We know both register and value for eoi beforehand,
so there's no need to check it and no need to do math
to calculate the msr. Saves instructions/branches
on each EOI when using x2apic.
I looked at the objdump output to verify that the
generated code looks right and actually is shorter.
The real improvemements will be on the KVM guest side
though, those come in a later patch.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: gleb@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/e019d1a125316f10d3e3a4b2f6bda41473f4fb72.1337184153.git.mst@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add eoi_write callback so that kvm can override
eoi accesses without touching the rest of the apic.
As a side-effect, this will enable a micro-optimization
for apics using msr.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: gleb@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/0df425d746c49ac2ecc405174df87752869629d2.1337184153.git.mst@redhat.com
[ tidied it up a bit ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Hardware with MCA bus is limited to 386 and 486 class machines
that are now 20+ years old and typically with less than 32MB
of memory. A quick search on the internet, and you see that
even the MCA hobbyist/enthusiast community has lost interest
in the early 2000 era and never really even moved ahead from
the 2.4 kernels to the 2.6 series.
This deletes anything remaining related to CONFIG_MCA from core
kernel code and from the x86 architecture. There is no point in
carrying this any further into the future.
One complication to watch for is inadvertently scooping up
stuff relating to machine check, since there is overlap in
the TLA name space (e.g. arch/x86/boot/mca.c).
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: x86@kernel.org
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Pull perf, x86 and scheduler updates from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing: Do not enable function event with enable
perf stat: handle ENXIO error for perf_event_open
perf: Turn off compiler warnings for flex and bison generated files
perf stat: Fix case where guest/host monitoring is not supported by kernel
perf build-id: Fix filename size calculation
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kvm: KVM paravirt kernels don't check for CPUID being unavailable
x86: Fix section annotation of acpi_map_cpu2node()
x86/microcode: Ensure that module is only loaded on supported Intel CPUs
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption
It's been broken forever (i.e. it's not scheduling in a power
aware fashion), as reported by Suresh and others sending
patches, and nobody cares enough to fix it properly ...
so remove it to make space free for something better.
There's various problems with the code as it stands today, first
and foremost the user interface which is bound to topology
levels and has multiple values per level. This results in a
state explosion which the administrator or distro needs to
master and almost nobody does.
Furthermore large configuration state spaces aren't good, it
means the thing doesn't just work right because it's either
under so many impossibe to meet constraints, or even if
there's an achievable state workloads have to be aware of
it precisely and can never meet it for dynamic workloads.
So pushing this kind of decision to user-space was a bad idea
even with a single knob - it's exponentially worse with knobs
on every node of the topology.
There is a proposal to replace the user interface with a single
3 state knob:
sched_balance_policy := { performance, power, auto }
where 'auto' would be the preferred default which looks at things
like Battery/AC mode and possible cpufreq state or whatever the hw
exposes to show us power use expectations - but there's been no
progress on it in the past many months.
Aside from that, the actual implementation of the various knobs
is known to be broken. There have been sporadic attempts at
fixing things but these always stop short of reaching a mergable
state.
Therefore this wholesale removal with the hopes of spurring
people who care to come forward once again and work on a
coherent replacement.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To remove duplicate code, have the ftrace arch_ftrace_update_code()
use the generic ftrace_modify_all_code(). This requires that the
default ftrace_replace_code() becomes a weak function so that an
arch may override it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is no need to save any active fpu state to the task structure
memory if the task is dead. Just drop the state instead.
For example, this saved some 1770 xsave's during the system boot
of a two socket Xeon system.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-4-git-send-email-suresh.b.siddha@intel.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Code paths like fork(), exit() and signal handling flush the fpu
state explicitly to the structures in memory.
BUG_ON() in __sanitize_i387_state() is checking that the fpu state
is not live any more. But for preempt kernels, task can be scheduled
out and in at any place and the preload_fpu logic during context switch
can make the fpu registers live again.
For example, consider a 64-bit Task which uses fpu frequently and as such
you will find its fpu_counter mostly non-zero. During its time slice, kernel
used fpu by doing kernel_fpu_begin/kernel_fpu_end(). After this, in the same
scheduling slice, task-A got a signal to handle. Then during the signal
setup path we got preempted when we are just before the sanitize_i387_state()
in arch/x86/kernel/xsave.c:save_i387_xstate(). And when we come back we
will have the fpu registers live that can hit the bug_on.
Similarly during core dump, other threads can context-switch in and out
(because of spurious wakeups while waiting for the coredump to finish in
kernel/exit.c:exit_mm()) and the main thread dumping core can run into this
bug when it finds some other thread with its fpu registers live on some other cpu.
So remove the paranoid check for now, even though it caught a bug in the
multi-threaded core dump case (fixed in the previous patch).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-3-git-send-email-suresh.b.siddha@intel.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Historical prepare_to_copy() is mostly a no-op, duplicated for majority of
the architectures and the rest following the x86 model of flushing the extended
register state like fpu there.
Remove it and use the arch_dup_task_struct() instead.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-1-git-send-email-suresh.b.siddha@intel.com
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If we've determined we can't do what the user asked, trying to do
something else isn't going to make the user's life better.
Without this the screen scrolls a bit and then you get a panic
anyway, and it's nice not to have so much scroll after the real
problem in bug reports.
Link: http://lkml.kernel.org/r/1337190206-12121-1-git-send-email-pjones@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Keep all the realmode code together, including initialization (only
the rm/ subdirectory is actually built as real-mode code, anyway.)
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Some AMD processors apparently #GP(0) if EFER.LMA is set in WRMSR,
rather than ignoring it. Thus, we need to mask it out.
Reported-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Borislav Petkov <bp@alien8.de>
Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-24-git-send-email-jarkko.sakkinen@intel.com
Change kstack_setup() and code_bytes_setup() in kernel/dumpstack.c
to call kstrtoul() instead of calling obsoleted simple_strtoul().
Signed-off-by: Shuah Khan <shuahkhan@gmail.com>
Link: http://lkml.kernel.org/r/1336327084.2897.15.camel@lorien2
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Change set_corruption_check() and set_corruption_check_period()
in kernel/check.c to call kstrtoul() instead of calling
obsoleted simple_strtoul().
Signed-off-by: Shuah Khan <shuahkhan@gmail.com>
Link: http://lkml.kernel.org/r/1336326908.2897.12.camel@lorien2
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Section 15.3.1.2 of the software developer manual has this to say about the
RIPV bit in the IA32_MCG_STATUS register:
RIPV (restart IP valid) flag, bit 0 — Indicates (when set) that program
execution can be restarted reliably at the instruction pointed to by the
instruction pointer pushed on the stack when the machine-check exception
is generated. When clear, the program cannot be reliably restarted at
the pushed instruction pointer.
We need to save the state of this bit in do_machine_check() and use it
in mce_notify_process() to force a signal; even if memory_failure() says
it made a complete recovery ... e.g. replaced a clean LRU page.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Since percpu_xxx() serial functions are duplicated with this_cpu_xxx().
Removing percpu_xxx() definition and replacing them by this_cpu_xxx()
in code. There is no function change in this patch, just preparation for
later percpu_xxx serial function removing.
On x86 machine the this_cpu_xxx() serial functions are same as
__this_cpu_xxx() without no unnecessary premmpt enable/disable.
Thanks for Stephen Rothwell, he found and fixed a i386 build error in
the patch.
Also thanks for Andrew Morton, he kept updating the patchset in Linus'
tree.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Acked-by: Christoph Lameter <cl@gentwo.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit ad7687dde ("x86/numa: Check for nonsensical topologies on real
hw as well") is broken in that the condition can trigger for valid
setups but only changes the end result for invalid setups with no real
means of discerning between those.
Rewrite set_cpu_sibling_map() to make the code clearer and make sure
to only warn when the check changes the end result.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-klcwahu3gx467uhfiqjyhdcs@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In case CONFIG_X86_VSMP is not set, limit the number of CPUs to
the number of CPUs of the first board.
Also make CONFIG_X86_VSMP depend on CONFIG_SMP, as there's
little point in having a vsmp machine with a single CPU.
Signed-off-by: Shai Fultheim <shai@scalemp.com>
[ido@wizery.com: rebased, fixed minor coding-style issues]
Signed-off-by: Ido Yariv <ido@wizery.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update the nonmi_ipi parameter to reflect the simple change
instead of the previous complicated one. There should be less
of a need to use it but there may still be corner cases on older
hardware that stumble into NMI issues.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1336761675-24296-4-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For v3.3, I added code to use the NMI to stop other cpus in the
panic case. The idea was to make sure all cpus on the system
were definitely halted to help serialize the panic path to
execute the rest of the code on a single cpu.
The main problem it was trying to solve was how to stop a cpu
that was spinning with its irqs disabled. A IPI irq would be
stuck and couldn't get in there, but an NMI could.
Things were great until we had another conversation about some
pstore changes. Because some of the backend pstore still uses
spinlocks to protect the device access, things could get ugly if
a panic happened and we were stuck spinning on a lock.
Now with the NMI shutting down cpus, we could assume no other
cpus were running and just bust the spin lock and proceed.
The counter argument was, well if you do that the backend could
be in a screwed up state and you might not be able to save
anything as a result. If we could have just given the cpu a
little more time to finish things, we could have grabbed the
spin lock cleanly and everything would have been fine.
Well, how do give a cpu a 'little more time' in the panic case?
For the most part you can't without spinning on the lock and
even in that case, how long do you spin for?
So instead of making it ugly in the pstore code, just mimic the
idea that stop_machine had, which is block on an IRQ IPI until
the remote cpu has re-enabled interrupts and left the critical
region. Which is what happens now using REBOOT_IRQ.
Then leave the NMI case for those cpus that are truly stuck
after a short time. This leaves the current behaviour alone and
just handle a corner case. Most systems should never have to
enter the NMI code and if they do, print out a message in case
the NMI itself causes another issue.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1336761675-24296-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts commit 3603a2512f.
Originally I wanted a better hammer to shutdown cpus during
panic. However, this really steps on the toes of various
spinlocks in the panic path. Sometimes it is easier to wait for
the IRQ to become re-enabled to indictate the cpu left the
critical region and then shutdown the cpu.
The next patch moves the NMI addition after the IRQ part. To
make it easier to see the logic of everything, revert this patch
and apply the next simpler patch.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1336761675-24296-2-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull KVM fixes from Avi Kivity:
"Two asynchronous page fault fixes (one guest, one host), a powerpc
page refcount fix, and an ia64 build fix."
* git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: ia64: fix build due to typo
KVM: PPC: Book3S HV: Fix refcounting of hugepages
KVM: Do not take reference to mm during async #PF
KVM: ensure async PF event wakes up vcpu from halt
The value of IbsOpCurCnt rolls over when it reaches IbsOpMaxCnt. Thus,
it is reset to zero by hardware. To get the correct count we need to
add the max count to it in case we received an ibs sample (valid bit
set).
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-13-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After disabling IBS there could be still incomming NMIs with samples
that even have the valid bit cleared. Mark all this NMIs as handled to
avoid spurious interrupt messages.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-12-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When disabling ibs there might be the case where hardware continuously
generates interrupts. This is described in erratum #420 (Instruction-
Based Sampling Engine May Generate Interrupt that Cannot Be Cleared).
To avoid this we must clear the counter mask first and then clear the
enable bit. This patch implements this.
See Revision Guide for AMD Family 10h Processors, Publication #41322.
Note: We now keep track of the last read ibs config value which is
then used to disable ibs. To update the config value we pass now a
pointer to the functions reading it.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-11-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the last hw period is too short we might hit the irq handler which
biases the results. Thus try to have a max last period that triggers
the sw overflow.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-10-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are cases where the remaining period is smaller than the minimal
possible value. In this case the counter is restarted with the minimal
period. This is of no use as the interrupt handler will trigger
immediately again and most likely hits itself. This biases the
results.
So, if the remaining period is within the min range, we better do not
restart the counter and instead trigger the overflow.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-9-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds support for precise event sampling with IBS. There are
two counting modes to count either cycles or micro-ops. If the
corresponding performance counter events (hw events) are setup with
the precise flag set, the request is redirected to the ibs pmu:
perf record -a -e cpu-cycles:p ... # use ibs op counting cycle count
perf record -a -e r076:p ... # same as -e cpu-cycles:p
perf record -a -e r0C1:p ... # use ibs op counting micro-ops
Each ibs sample contains a linear address that points to the
instruction that was causing the sample to trigger. With ibs we have
skid 0. Thus, ibs supports precise levels 1 and 2. Samples are marked
with the PERF_EFLAGS_EXACT flag set. In rare cases the rip is invalid
when IBS was not able to record the rip correctly. Then the
PERF_EFLAGS_EXACT flag is cleared and the rip is taken from pt_regs.
V2:
* don't drop samples in precise level 2 if rip is invalid, instead
support the PERF_EFLAGS_EXACT flag
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120502103309.GP18810@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Each IBS sample contains a linear address of the instruction that
caused the sample to trigger. This address is more precise than the
rip that was taken from the interrupt handler's stack. Update the rip
with that address. We use this in the next patch to implement
precise-event sampling on AMD systems using IBS.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-6-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fixing profiling at a fixed frequency, in this case the freq value and
sample period was setup incorrectly. Since sampling periods are
adjusted we also allow periods that have lower 4 bits set.
Another fix is the setup of the hw counter: If we modify
hwc->sample_period, we also need to update hwc->last_period and
hwc->period_left.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-5-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We always need to pass the last sample period to
perf_sample_data_init(), otherwise the event distribution will be
wrong. Thus, modifiyng the function interface with the required period
as argument. So basically a pattern like this:
perf_sample_data_init(&data, ~0ULL);
data.period = event->hw.last_period;
will now be like that:
perf_sample_data_init(&data, ~0ULL, event->hw.last_period);
Avoids unininitialized data.period and simplifies code.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The last sw period was not correctly updated on overflow and thus led
to wrong distribution of events. We always need to properly initialize
data.period in struct perf_sample_data.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of only checking nonsensical topologies on numa-emu, do it
on real hardware as well, and print a warning.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/n/tip-re15l0jqjtpz709oxozt2zoh@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When using numa=fake= you can get weird topologies where LLCs can span
nodes and other such nonsense. Cure this by hard partitioning these
masks on node boundaries.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/n/tip-di5vwjm96q5vrb76opwuflwx@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
What was called show_registers() so far already showed a stack
trace for kernel faults, and kernel_stack_pointer() isn't even
valid to be used for faults from user mode, hence it was
pointless for show_regs() to call show_trace() after
show_registers().
Simply rename show_registers() to show_regs() and eliminate
the old definition.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/4FAA3D3902000078000826E1@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Added header for trampoline code that can be used to supply
input data to it. This makes interface between real mode code
and kernel cleaner and simpler. Replaced two confusing pointers
to level4 pgt in trampoline_64.S with a single pointer to the
beginning of the page table.
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-21-git-send-email-jarkko.sakkinen@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Simplified hierarchy under rm directory to a flat
directory because it is not anymore really justified
to have own directory for wakeup code. It only adds
more complexity.
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-20-git-send-email-jarkko.sakkinen@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
There were number of issues in wakeup sequence:
- Wakeup stack was placed in hardcoded address.
- NX bit in EFER was not enabled.
- Initialization incorrectly set physical address
of secondary_startup_64.
- Some alignment issues.
This patch fixes these issues and in addition:
- Unifies coding conventions in .S files.
- Sets alignments of code and data right.
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-18-git-send-email-jarkko.sakkinen@intel.com
Originally-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Set proper permissions for rodata, text and data, removing the
realmode trampoline area as a remaining RWX memory mapping in the
kernel.
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-8-git-send-email-jarkko.sakkinen@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Migrated ACPI wakeup code to the real-mode blob.
Code existing in .x86_trampoline can be completely
removed. Static descriptor table in wakeup_asm.S is
courtesy of H. Peter Anvin.
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-7-git-send-email-jarkko.sakkinen@intel.com
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Migrated SMP trampoline code to the real mode blob.
SMP trampoline code is not yet removed from
.x86_trampoline because it is needed by the wakeup
code.
[ hpa: always enable compiling startup_32_smp in head_32.S... it is
only a few instructions which go into .init on UP builds, and it makes
the rest of the code less #ifdef ugly. ]
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-6-git-send-email-jarkko.sakkinen@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Implements relocator for real mode code that is called
as part of setup_arch(). Processes segment relocations
and linear relocations. Real-mode code is relocated to
a free hole below 1 MB.
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Link: http://lkml.kernel.org/r/1336501366-28617-4-git-send-email-jarkko.sakkinen@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull two percpu fixes from Tejun Heo:
"One adds missing KERN_CONT on split printk()s and the other makes
the percpu allocator avoid using PMD_SIZE as atom_size on x86_32.
Using PMD_SIZE led to vmalloc area exhaustion on certain
configurations (x86_32 android) and the only cost of using PAGE_SIZE
instead is static percpu area not being aligned to large page
mapping."
* 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu, x86: don't use PMD_SIZE as embedded atom_size on 32bit
percpu: use KERN_CONT in pcpu_dump_alloc_info()
With the embed percpu first chunk allocator, x86 uses either PAGE_SIZE
or PMD_SIZE for atom_size. PMD_SIZE is used when CPU supports PSE so
that percpu areas are aligned to PMD mappings and possibly allow using
PMD mappings in vmalloc areas in the future. Using larger atom_size
doesn't waste actual memory; however, it does require larger vmalloc
space allocation later on for !first chunks.
With reasonably sized vmalloc area, PMD_SIZE shouldn't be a problem
but x86_32 at this point is anything but reasonable in terms of
address space and using larger atom_size reportedly leads to frequent
percpu allocation failures on certain setups.
As there is no reason to not use PMD_SIZE on x86_64 as vmalloc space
is aplenty and most x86_64 configurations support PSE, fix the issue
by always using PMD_SIZE on x86_64 and PAGE_SIZE on x86_32.
v2: drop cpu_has_pse test and make x86_64 always use PMD_SIZE and
x86_32 PAGE_SIZE as suggested by hpa.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Yanmin Zhang <yanmin.zhang@intel.com>
Reported-by: ShuoX Liu <shuox.liu@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <4F97BA98.6010001@intel.com>
Cc: stable@vger.kernel.org
The only difference is the free_thread_info function, which frees
xstate.
Use the new arch_release_task_struct() function instead and switch
over to the core allocator.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120505150141.559556763@linutronix.de
Cc: x86@kernel.org
The local function io_apic_level_ack_pending() is only called
from io_apic_level_ack_pending(). The later function is only
compiled if CONFIG_GENERIC_PENDING_IRQ is defined. Move the
io_apic_level_ack_pending() to the existing #ifdef
CONFIG_GENERIC_PENDING_IRQ code block.
This will remove the following warning message during compiling
without CONFIG_GENERIC_PENDING_IRQ defined:
* arch/x86/kernel/apic/io_apic.c:382: warning: ‘io_apic_level_ack_pending’ defined but not used
Signed-off-by: Márton Németh <nm127@freemail.hu>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Link: http://lkml.kernel.org/r/1336461860.2296.3.camel@sbsiddha-mobl2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 31b3c9d723 ("xen/x86: Implement x86_apic_ops") implemented
this:
... without considering that on UP the function pointer might be NULL.
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/n/tip-3pfty0ml4yp62phbkchichh0@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The checks that exist in mwait_usable() for "idle=" kernel
parameters are insufficient. As a result, mwait_usable() can
return 1 even if "idle=nomwait" or "idle=poll" or "idle=halt"
parameters are passed.
Of these cases, incorrect handling of idle=nomwait is a
universal problem since mwait can get used for usual CPU idling.
However the rest of the cases are problematic only during CPU
Hotplug (offline) because, in the CPU offline path, the function
mwait_play_dead() is called, which might result in mwait being
used in the offline CPUs, if mwait_usable() happens to return 1.
Fix these issues by checking for the boot time "idle=" kernel
parameter properly in mwait_usable().
The first issue (usual cpu idling) is demonstrated below:
Before applying the patch (dmesg snippet):
[ 0.000000] Command line: [...] idle=nomwait
[ 0.000000] Kernel command line: [...] idle=nomwait
[ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
[ 0.140606] using mwait in idle threads. <======= mwait being used
[ 4.303986] cpuidle: using governor ladder
[ 4.308232] cpuidle: using governor menu
After applying the patch:
[ 0.000000] Command line: [...] idle=nomwait
[ 0.000000] Kernel command line: [...] idle=nomwait
[ 0.000000] RCU dyntick-idle grace-period acceleration is enabled.
[ 4.264100] cpuidle: using governor ladder
[ 4.268342] cpuidle: using governor menu
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: venki@google.com
Cc: suresh.b.siddha@intel.com
Cc: Borislav Petkov <bp@amd64.org>
Cc: lenb@kernel.org
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/4F9E37B8.30001@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On virtual environments, apic_read could take a long time. As a
result, under certain conditions the ack pending loop may exit
without any queued irqs left, but after more than one second. A
warning will be printed needlessly in this case.
If the loop is about to exit regardless of max_loops, don't
update it.
Signed-off-by: Shai Fultheim <shai@scalemp.com>
[ rebased and reworded the commit message]
Signed-off-by: Ido Yariv <ido@wizery.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1334873552-31346-1-git-send-email-ido@wizery.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patchset introduces a generic ops-interface for
accessing interrupt remapping hardware on x86. It factors
out the VT-d specific code from io_apic.c and moves it to
drivers/iommu. These changes will be used to add support for
AMD interrupt remapping hardware.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPp8ZZAAoJECvwRC2XARrj0o0P/22YaTNWvl1ZGDnO2Z913pEM
hbE0RPsht/2paCR+oXggu3vv3D0BzY5SkC0Hmjr4po4hiwDNXHj+V1QQKHFEMwuO
gzWatXCW/dDvIbDNHuCgG76n0kNmDMeps/MgKGrG4seV7zQ2jMfjaWFvJ2UKF/fJ
D1Jm+7QLr+MLM/cT64WG5S/8/Pi1N6CYbdXwkPTlFTnKRS+NODWGMo9fgc7VIsI/
9A1haZp+aV/TvGqklDlSh6CgIUBZVt/cffpftJiuH4L3o5mq1sXaFcK4Whu8PxGD
OwmbLe4PKXl39piPe4x8NGGrj0xnk+WN17JUQpefxFIbBkCjfGakNqLxFMKbxH8k
AtWeERge3Jd4jcfbmlXQ0GKaL2+BQUzIq0VHadHwsifCKs0KP7qG+q4Ev+02+Xbo
gfC5UtemoffI4fXznvNIE0Hcp8EYONmtksPJ2iKOxy+t7uDlESDMDspXf1IOLZzO
BODOdDlnsPCCWxT9RsJziVkD6C/hSZRwQIavZchBeMele4+YMpym+POlTJ5lx9KJ
TVXYZmRv8sUgzP67s0WuDNlB2SEQq8fHNYIHrCPGqcmp/VHCFzj1nSe5jRu3FdnB
Yu44BOoTibYss/aVy21hbB99SRMkpUWfImhH2eZLO/dahtlUMiI9ZnnYBJ62Ol7u
bBAnLHtq8WkVZH6drSCe
=H+Nt
-----END PGP SIGNATURE-----
Merge tag 'intr-remapping-ops-for-ingo' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu into core/iommu
- This patchset introduces a generic ops-interface for
accessing interrupt remapping hardware on x86. It factors
out the VT-d specific code from io_apic.c and moves it to
drivers/iommu. These changes will be used to add support for
AMD interrupt remapping hardware.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On some architectures (such as vSMP), it is possible to have
CPUs with a different number of cores sharing the same cache.
The current implementation implicitly assumes that all CPUs will
have the same number of cores sharing caches, and as a result,
different CPUs can end up with the same l2/l3 ids.
Fix this by masking out the shared cache bits, instead of
shifting the APICID. By doing so, it is guaranteed that the
generated cache ids are always unique.
Signed-off-by: Shai Fultheim <shai@scalemp.com>
[ rebased, simplified, and reworded the commit message]
Signed-off-by: Ido Yariv <ido@wizery.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/r/1334873351-31142-1-git-send-email-ido@wizery.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is particularly to be able to specify "hpet=force,verbose",
as "force" ought to be a primary candidate for also wanting to
use "verbose".
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F79D120020000780007C031@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While Linux itself has been calling hpet_disable() for quite a
while, having e.g. a secondary (kexec) kernel depend on such
behavior of the primary (crashed) environment is fragile. It
particularly broke until very recently when the primary
environment was Xen based, as that hypervisor did not clear any
of the HPET settings it may have used.
Rather than blindly (and incompletely) clearing certain HPET
settings in hpet_disable(), latch the config register settings
during boot and restore then here.
(Note on the hpet_set_mode() change: Now that we're clearing the
level bit upon initialization, there's no need anymore to do so
here.)
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F79D0BB020000780007C02D@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Exit early when there's no support for a particular CPU family.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Jones <davej@redhat.com>
Cc: tigran@aivazian.fsnet.co.uk
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/4F8BDB58.6070007@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make the file names consistent with the naming conventions of irq subsystem.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Make the code consistent with the naming conventions of irq subsystem.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Remove the Intel specific interfaces from dmar.h and remove
asm/irq_remapping.h which is only used for io_apic.c anyway.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The operation for releasing a remapping entry is iommu
specific too.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The function to set interrupt affinity with interrupt
remapping enabled is Intel specific too. So move it to the
irq_remap_ops too.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The IOAPIC setup routine for interrupt remapping is VT-d
specific. Move it to the irq_remap_ops and add a call helper
function.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Convert these calls too:
* Disable of remapping hardware
* Reenable of remapping hardware
* Enable fault handling
With that all of arch/x86/kernel/apic/apic.c is converted to
use the generic intr-remapping interface.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch introduces irq_remap_ops to hold implementation
specific function pointer to handle interrupt remapping. As
the first part the initialization functions for VT-d are
converted to these ops.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Commit ce7e5d2d19 ("x86: fix broken TASK_SIZE for ia32_aout") breaks
kernel builds when "CONFIG_IA32_AOUT=m" with
ERROR: "set_personality_ia32" [arch/x86/ia32/ia32_aout.ko] undefined!
make[1]: *** [__modpost] Error 1
The entry point needs to be exported.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Acked-by: Al Viro <viro@zeniv.linux.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It turned to be totally unneeded. The reason the code was introduced is
so that KVM can prefault swapped in page, but prefault can fail even
if mm is pinned since page table can change anyway. KVM handles this
situation correctly though and does not inject spurious page faults.
Fixes:
"INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected" warning while
running LTP inside a KVM guest using the recent -next kernel.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Same code. Use the generic version. The special Makefile treatment is
pointless anyway as init_task.o contains only data which is handled by
the linker script. So no point on being treated like head text.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20120503085035.739963562@linutronix.de
Cc: x86@kernel.org
If CONFIG_KPROBES is not set, then linux/kprobes.h will not include
asm/kprobes.h needed by x86/ftrace.c for the BREAKPOINT macro.
The x86/ftrace.c file should just include asm/kprobes.h as it does not
need the rest of kprobes.
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ
oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2
MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX
U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9
JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za
xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0=
=4d4w
-----END PGP SIGNATURE-----
Merge tag 'v3.4-rc5' into next
Linux 3.4-rc5
Merge to pull in prerequisite change for Smack:
86812bb0de
Requested by Casey.
Which makes the code fit within the rest of the x86_ops functions.
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
[v1: Changed x86_apic -> x86_ioapic per Yinghai Lu <yinghai@kernel.org> suggestion]
[v2: Rebased on tip/x86/urgent and redid to match Ingo's syntax style]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Turn off MC4_MISC thresholding banks on models which have them but that
particular processor implementation does not supply applicable error
sources to be counted.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Depending on whether the box supports the APIC LVT interrupt for
thresholding, we want to show the 'interrupt_enable' sysfs node or not.
Make that the case by adding it to the default sysfs attributes only if
it is supported.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Currently, the APIC LVT interrupt for error thresholding is implicitly
enabled. However, there are models in the F15h range which do not enable
it. Make the code machinery which sets up the APIC interrupt support
an optional setting and add an ->interrupt_capable member to the bank
representation mirroring that capability and enable the interrupt offset
programming only if it is true.
Simplify code and fixup comment style while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
As ftrace function tracing would require modifying code that could
be executed in NMI context, which is not stopped with stop_machine(),
ftrace had to do a complex algorithm with various stages of setup
and memory barriers to make it work.
With the new breakpoint method, this is no longer required. The changes
to the code can be done without any problem in NMI context, as well as
without stop machine altogether. Remove the complex code as it is
no longer needed.
Also, a lot of the notrace annotations could be removed from the
NMI code as it is now safe to trace them. With the exception of
do_nmi itself, which does some special work to handle running in
the debug stack. The breakpoint method can cause NMIs to double
nest the debug stack if it's not setup properly, and that is done
in do_nmi(), thus that function must not be traced.
(Note the arch sh may want to do the same)
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This method changes x86 to add a breakpoint to the mcount locations
instead of calling stop machine.
Now that iret can be handled by NMIs, we perform the following to
update code:
1) Add a breakpoint to all locations that will be modified
2) Sync all cores
3) Update all locations to be either a nop or call (except breakpoint
op)
4) Sync all cores
5) Remove the breakpoint with the new code.
6) Sync all cores
[
Added updates that Masami suggested:
Use unlikely(modifying_ftrace_code) in int3 trap to keep kprobes efficient.
Don't use NOTIFY_* in ftrace handler in int3 as it is not a notifier.
]
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
BIOS will switch off the corresponding feature flag on family
15h models 10h-1fh non-desktop CPUs.
The topology extension CPUID leafs are required to detect which
cores belong to the same compute unit. (thread siblings mask is
set accordingly and also correct information about L1i and L2
cache sharing depends on this).
W/o this patch we wouldn't see which cores belong to the same
compute unit and also cache sharing information for L1i and L2
would be incorrect on such systems.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now the return value of cmpxchg() is used to match an event. The
change removes the duplicate event comparison and traverses the list
until an event was removed. This also fixes the following warning:
arch/x86/kernel/cpu/perf_event_amd.c:170: warning: value computed is not used
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333643084-26776-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20120420124557.246929343@linutronix.de
Preparatory patch to use the generic idle thread allocation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20120420124557.176604405@linutronix.de
A function name represents the pointer to it - no need to take the
address of it. (Fixing this helps us introduce some macro magic
around register_nmi_handler() in the future.)
Cc: Robert Richter <robert.richter@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Provide systems that do not support x2apic cluster mode
a mechanism to select x2apic physical mode using the
FADT FORCE_APIC_PHYSICAL_DESTINATION_MODE bit.
Changes from v1: (based on Suresh's comments)
- removed #ifdef CONFIG_ACPI
- removed #include <linux/acpi.h>
Signed-off-by: Greg Pearson <greg.pearson@hp.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1335313436-32020-1-git-send-email-greg.pearson@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch tries to fix the problem of page fault exception
caused by accessing nmiaction structure in nmi if kmemcheck
is enabled.
If kmemcheck is enabled, the memory allocated through slab are
in pages that are marked non-present, so that some checks could
be done in the page fault handling code ( e.g. whether the
memory is read before written to ).
As nmiaction is allocated in this way, so it resides in a
non-present page. Then there is a page fault while the nmi code
accessing the nmiaction structure, which would then cause a
warning by WARN_ON_ONCE(in_nmi()) in kmemcheck_fault(), called
by do_page_fault().
This significantly simplifies the code as well, as the whole
dynamic allocation dance goes away.
v2: as Peter suggested, changed the nmiaction to use static
storage.
v3: as Peter suggested, use macro to shorten the codes. Also
keep the original usage of register_nmi_handler, so users of
this call doesn't need change.
Tested-by: Seiji Aguchi <seiji.aguchi@hds.com>
Fixes: https://lkml.org/lkml/2012/3/2/356
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
[ simplified the wrappers ]
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: thomas.mingarelli@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1333051877-15755-4-git-send-email-dzickus@redhat.com
[ tidied the patch a bit ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In discussions with Thomas Mingarelli about hpwdt, he explained
to me some issues they were some when using their virtual NMI
button to test the hpwdt driver.
It turns out the virtual NMI button used on HP's machines do no
send unknown NMIs but instead send IO_CHK NMIs. The way the
kernel code is written, the hpwdt driver can not register itself
against that type of NMI and therefore can not successfully
capture system information before panic'ing.
To solve this I created two new NMI queues to allow driver to
register against the IO_CHK and SERR NMIs. Or in the hpwdt all
three (if you include unknown NMIs too).
The change is straightforward and just mimics what the unknown
NMI does.
Reported-and-tested-by: Thomas Mingarelli <thomas.mingarelli@hp.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1333051877-15755-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
way the code checks for already disabled indices.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJPkEFNAAoJEBLB8Bhh3lVKEigP/1hrNM0oE3IQHejNYkZ2kbj0
DL4WkALiHCgWJVcr/lk9PvTaY7aIAo8HFgvbUKF4PcBvnFZvu6rckJPIsWgsE1wD
+OHDaO27vyYKMaA9ImvqYneB8hcl528EDlZ7ssfTepQXPyx99SoTYc9ToT1+znVp
qQUC+d67MTLl/eHL+i6rbQfYdVaXGaVTuAIpzQn3oTqUDLrmXKe+60oTgnC0zgSW
kQ63Vo8MHi9CpzCe0JhZwvK3d8sPzrBktNisaEbdHe+0Gk5zvcEiPzE2nIyvLXjz
cf0QoQFQHzniCdFQoYjzLafT3aItBsN1v29gOrydPT6LxMZ8wO5k+8Suu1NcyI9R
RLJR61wKwSzi4JQ/1+LAqoDQHrldATKPCM74BLYiNTi8OGeqda+10COJQmLFITzM
9mn9fg/ZPg8V+z+0e2zObYcFVRtUCIs+XRWIzjhuPICdRRnwoNT+b9HyrLZ5JY1X
BH3nHon0W4Bm5jEBhOb02XLdzBMtF3vRM4mNYCzpoS/tDFRszyOuV63FXkPurVbe
hT9ZKhDZ63YV8ycwx2jqZgZAFuySi719kG3aUMDNMJYbUEi6D0QRKH9ngE6JAvwH
rw1hX65sNX9jbBOAvcDTq2780SpV3dpO1TywLip9uSWIk6DYyEVHbr8AdWwnLlfb
fdq9y8V/3+NC9aTNf/e3
=VwUL
-----END PGP SIGNATURE-----
Merge tag 'l3-fix-for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent
A small L3 cache index disable fix from Srivatsa Bhat which unifies the
way the code checks for already disabled indices.
( Pulling it into v3.4 despite the v3.5 tag - the fix is small and we better
keep the same code across kernel versions for such user facing interfaces. )
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With commit a2ef5c4fd4
"ACPI: Move module parameter gts and bfs to sleep.c" the
wake_sleep_flags is required when calling acpi_enter_sleep_state.
The assembler code in wakeup_*.S did not do that. One solution
is to call it from assembler and stick the wake_sleep_flags on
the stack (for 32-bit) or in %esi (for 64-bit). hpa and rafael
both suggested however to create a wrapper function to call
acpi_enter_sleep_state and call said wrapper function
("acpi_enter_s3") from assembler.
For 32-bit, the acpi_enter_s3 ends up looking as so:
push %ebp
mov %esp,%ebp
sub $0x8,%esp
movzbl 0xc1809314,%eax [wake_sleep_flags]
movl $0x3,(%esp)
mov %eax,0x4(%esp)
call 0xc12d1fa0 <acpi_enter_sleep_state>
leave
ret
And 64-bit:
movzbl 0x9afde1(%rip),%esi [wake_sleep_flags]
push %rbp
mov $0x3,%edi
mov %rsp,%rbp
callq 0xffffffff812e9800 <acpi_enter_sleep_state>
leaveq
retq
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
[v2: Remove extra assembler operations, per hpa review]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1335150198-21899-3-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
With commit a2ef5c4fd4
"ACPI: Move module parameter gts and bfs to sleep.c" the wake_sleep_flags
is required when calling acpi_enter_sleep_state, which means
that if there are functions outside the sleep.c code they
can't get the wake_sleep_flags values.
This converts the function in to a exported value and converts
the module config operands to a function.
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Lin Ming <ming.m.lin@intel.com>
[v2: Parameters can be turned on/off dynamically]
[v3: unsigned char -> u8]
[v4: val -> kp->arg]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1335150198-21899-2-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
When GHES error record is logged into mcelog kernel buffer, a validation
check for physical address is necessary, which prevents reporting an
invalid physical address.
[Since physical address is the only useful element in this error record,
we drop generating the record completely if we don't have a valid address]
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Remove open-coded exception table entries in arch/x86/kernel/test_rodata.c,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
Remove open-coded exception table entries in arch/x86/kernel/entry_64.S,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
Remove open-coded exception table entries in arch/x86/kernel/entry_32.S,
and replace them with _ASM_EXTABLE() macros; this will allow us to
change the format and type of the exception table entries.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: David Daney <david.daney@cavium.com>
Link: http://lkml.kernel.org/r/CA%2B55aFyijf43qSu3N9nWHEBwaGbb7T2Oq9A=9EyR=Jtyqfq_cQ@mail.gmail.com
If we get an exception during early boot, walk the exception table to
see if we should intercept it. The main use case for this is to allow
rdmsr_safe()/wrmsr_safe() during CPU initialization.
Since the exception table is currently sorted at runtime, and fairly
late in startup, this code walks the exception table linearly. We
obviously don't need to worry about modules, however: none have been
loaded at this point.
This patch changes the early IDT setup to look a lot more like x86-64:
we now install handlers for all 32 exception vectors. The output of
the early exception handler has changed somewhat as it directly
reflects the stack frame of the exception handler, and the stack frame
has been somewhat restructured.
Finally, centralize the code that can and should be run only once.
[ v2: Use early_fixup_exception() instead of linear search ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1334794610-5546-6-git-send-email-hpa@zytor.com
If we get an exception during early boot, walk the exception table to
see if we should intercept it. The main use case for this is to allow
rdmsr_safe()/wrmsr_safe() during CPU initialization.
Since the exception table is currently sorted at runtime, and fairly
late in startup, this code walks the exception table linearly. We
obviously don't need to worry about modules, however: none have been
loaded at this point.
[ v2: Use early_fixup_exception() instead of linear search ]
Link: http://lkml.kernel.org/r/1334794610-5546-5-git-send-email-hpa@zytor.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
GET_CR2_INTO_RCX is asinine: it is only used in one place, the actual
paravirt call returns the value in %rax, not %rcx; and the one place
that wants it wants the result in %r9. We actually generate as a
result of this call:
call ...
movq %rax, %rcx
xorq %rax, %rax /* this value isn't even used... */
movq %rcx, %r9
At least make the macro do what the paravirt call does, which is put
the value into %rax.
Nevermind the fact that the macro clobbers all the volatile registers.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1334794610-5546-4-git-send-email-hpa@zytor.com
Cc: Glauber de Oliveira Costa <glommer@parallels.com>
Merge reason: development work has dependency on kvm patches merged
upstream.
Conflicts:
Documentation/feature-removal-schedule.txt
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the L3 disable slot is already in use, return -EEXIST instead of
-EINVAL. The caller, store_cache_disable(), checks this return value to
print an appropriate warning.
Also, we want to signal with -EEXIST that the current index we're
disabling has actually been already disabled on the node:
$ echo 12 > /sys/devices/system/cpu/cpu3/cache/index3/cache_disable_0
$ echo 12 > /sys/devices/system/cpu/cpu3/cache/index3/cache_disable_0
-bash: echo: write error: File exists
$ echo 12 > /sys/devices/system/cpu/cpu3/cache/index3/cache_disable_1
-bash: echo: write error: File exists
$ echo 12 > /sys/devices/system/cpu/cpu5/cache/index3/cache_disable_1
-bash: echo: write error: File exists
The old code would say
-bash: echo: write error: Invalid argument
for disable slot 1 when playing the example above with no output in
dmesg, which is clearly misleading.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120419070053.GB16645@elgon.mountain
[Boris: add testing for the other index too]
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Reading machine check bank registers is slow. There is a trend of
increasing the number of banks, and the number of cores. The main section
of do_machine_check() is a serialized section where each cpu in turn
checks every bank. Even on a little two socket SandyBridge-EP system
that multiplies out as:
2 sockets * 8 cores * 2 hyperthreads * 20 banks = 640 MSRs
We already scan the banks in parallel in mce_no_way_out() to see if there
is a fatal error anywhere in the system. If we build a cache of VALID
bits during this scan, we can avoid uselessly re-reading banks that have
no data. Note that this cache is only a hint. If the valid bit is set in a
shared bank, all cpus that share that bank will see it during the parallel
scan, but the first to find it in the sequential scan will (usually) clear
the bank.
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Current APIC code assumes MSR_IA32_APICBASE is present for all systems.
Pentium Classic P5 and friends didn't have this MSR. MSR_IA32_APICBASE
was introduced as an architectural MSR by Intel @ P6.
Code paths that can touch this MSR invalidly are when vendor == Intel &&
cpu-family == 5 and APIC bit is set in CPUID - or when you simply pass
lapic on the kernel command line, on a P5.
The below patch stops Linux incorrectly interfering with the
MSR_IA32_APICBASE for P5 class machines. Other code paths exist that
touch the MSR - however those paths are not currently reachable for a
conformant P5.
Signed-off-by: Bryan O'Donoghue <bryan.odonoghue@linux.intel.com>
Link: http://lkml.kernel.org/r/4F8EEDD3.1080404@linux.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>
Starting from 7e16838d "i387: support lazy restore of FPU state"
we assume that fpu_owner_task doesn't need restore_fpu_checking()
on the context switch, its FPU state should match what we already
have in the FPU on this CPU.
However, debugger can change the tracee's FPU state, in this case
we should reset fpu.last_cpu to ensure fpu_lazy_restore() can't
return true.
Change init_fpu() to do this, it is called by user_regset->set()
methods.
Reported-by: Jan Kratochvil <jan.kratochvil@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20120416204815.GB24884@redhat.com
Cc: <stable@vger.kernel.org> v3.3
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
It's only called from amd.c:srat_detect_node(). The introduced
condition for calling the fixup code is true for all AMD
multi-node processors, e.g. Magny-Cours and Interlagos. There we
have 2 NUMA nodes on one socket. Thus there are cores having
different numa-node-id but with equal phys_proc_id.
There is no point to print error messages in such a situation.
The confusing/misleading error message was introduced with
commit 64be4c1c24 ("x86: Add
x86_init platform override to fix up NUMA core numbering").
Remove the default fixup function (especially the error message)
and replace it by a NULL pointer check, move the
Numascale-specific condition for calling the fixup into the
fixup-function itself and slightly adapt the comment.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: <stable@kernel.org>
Cc: <sp@numascale.com>
Cc: <bp@amd64.org>
Cc: <daniel@numascale-asia.com>
Link: http://lkml.kernel.org/r/20120402160648.GR27684@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
microcode driver is loaded on unsupported platforms.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPh/jXAAoJEBLB8Bhh3lVKhFUP/25nGddMi/wl4D76Ef8eJJRu
lYVJGqatSCJIV+51DFaFnC9JQ2C10bMfs5it6tNNPTaTpTTss3b/opFs9M2CXJF5
GHY7XlIc7fC7v+Vur9Jgmz4QAldYVYtARDkx6uIuADiD6LvSsug384uFy6Yscdnh
s02JTa9PeGYOz0RtTPxzrsGPablDukcfXV/Is/dLNygD+mHs8HdLWNRBDEZfDRea
9nH+zqwa/SbkGPZHbN8wIGLD7V+QMLWbY5YRvOHzlU+L5Hnjfik25+XlsuKKSv+B
v6/HNV8YluAGNixIFlm9EkrzNXTMe8smhM50+8b0CjOX/n2ldi89VVheOsgoPaNa
dTwyTcrfY71I/B/nyp4pIswEt4P96c2HiZpGO6b22tGxDP7EYtXPKka1v7reBP/O
atDmK+3++JOzgaqbKjZkdiLawyzAxHgkLZAr02EsCb0IFF8s6a0HI23s9zOEem3T
pifrcYE8j8VgUIvGBUwWWzKONqiXpXPAKU/bwqphQIsnOqIaXOTaCuyDI5PjQX9l
Zri/UxXvoBEq+fuobM3W7dhFQ3AnN5h0Ri3L55Eak9DVJU8vQ3ZfbxAwXGBeVtut
EDp9LkT4wP0WTY8PfJaZUQoIZ72m0g2sO2eEkcwPuT6ljQb7Rdsr1y5F3KzP4kRF
o79lhzWZBAv3cBYvocNS
=ZgVv
-----END PGP SIGNATURE-----
Merge tag 'microcode-fix-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent
Pull from Borislav Petkov a two-patch fix from Andreas taking care of a sysfs
warning when the microcode driver is loaded on unsupported platforms.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge in latest upstream (and the latest perf development tree),
to prepare for tooling changes, and also to pick up v3.4 MM
changes that the uprobes code needs to take care of.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Enable support for seccomp filter on x86:
- syscall_get_arch()
- syscall_get_arguments()
- syscall_rollback()
- syscall_set_return_value()
- SIGSYS siginfo_t support
- secure_computing is called from a ptrace_event()-safe context
- secure_computing return value is checked (see below).
SECCOMP_RET_TRACE and SECCOMP_RET_TRAP may result in seccomp needing to
skip a system call without killing the process. This is done by
returning a non-zero (-1) value from secure_computing. This change
makes x86 respect that return value.
To ensure that minimal kernel code is exposed, a non-zero return value
results in an immediate return to user space (with an invalid syscall
number).
Signed-off-by: Will Drewry <wad@chromium.org>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
v18: rebase and tweaked change description, acked-by
v17: added reviewed by and rebased
v..: all rebases since original introduction.
Signed-off-by: James Morris <james.l.morris@oracle.com>
Exit early when there's no support for a particular CPU family. Also,
fixup the "no support for this CPU vendor" to be issued only when the
driver is attempted to be loaded on an unsupported vendor.
Cc: stable@vger.kernel.org
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20120411163849.GE4794@alberich.amd.com
[Boris: add a commit msg because Andreas is lazy]
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Pull x86 fixes from Thomas Gleixner.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Use correct byte-sized register constraint in __add()
x86: Use correct byte-sized register constraint in __xchg_op()
x86: vsyscall: Use NULL instead 0 for a pointer argument
check_and_clear_guest_paused does not need to be exported as it isn't used
by any modules, remove the export.
Signed-off-by: Eric B Munson <emunson@mgebm.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When a host stops or suspends a VM it will set a flag to show this. The
watchdog will use these functions to determine if a softlockup is real, or the
result of a suspended VM.
Signed-off-by: Eric B Munson <emunson@mgebm.net>
asm-generic changes Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Pull a few KVM fixes from Avi Kivity:
"A bunch of powerpc KVM fixes, a guest and a host RCU fix (unrelated),
and a small build fix."
* 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Resolve RCU vs. async page fault problem
KVM: VMX: vmx_set_cr0 expects kvm->srcu locked
KVM: PMU: Fix integer constant is too large warning in kvm_pmu_set_msr()
KVM: PPC: Book3S: PR: Fix preemption
KVM: PPC: Save/Restore CR over vcpu_run
KVM: PPC: Book3S HV: Save and restore CR in __kvmppc_vcore_entry
KVM: PPC: Book3S HV: Fix kvm_alloc_linear in case where no linears exist
KVM: PPC: Book3S: Compile fix for ppc32 in HIOR access code
Merge batch of fixes from Andrew Morton:
"The simple_open() cleanup was held back while I wanted for laggards to
merge things.
I still need to send a few checkpoint/restore patches. I've been
wobbly about merging them because I'm wobbly about the overall
prospects for success of the project. But after speaking with Pavel
at the LSF conference, it sounds like they're further toward
completion than I feared - apparently davem is at the "has stopped
complaining" stage regarding the net changes. So I need to go back
and re-review those patchs and their (lengthy) discussion."
* emailed from Andrew Morton <akpm@linux-foundation.org>: (16 patches)
memcg swap: use mem_cgroup_uncharge_swap fix
backlight: add driver for DA9052/53 PMIC v1
C6X: use set_current_blocked() and block_sigmask()
MAINTAINERS: add entry for sparse checker
MAINTAINERS: fix REMOTEPROC F: typo
alpha: use set_current_blocked() and block_sigmask()
simple_open: automatically convert to simple_open()
scripts/coccinelle/api/simple_open.cocci: semantic patch for simple_open()
libfs: add simple_open()
hugetlbfs: remove unregister_filesystem() when initializing module
drivers/rtc/rtc-88pm860x.c: fix rtc irq enable callback
fs/xattr.c:setxattr(): improve handling of allocation failures
fs/xattr.c:listxattr(): fall back to vmalloc() if kmalloc() failed
fs/xattr.c: suppress page allocation failure warnings from sys_listxattr()
sysrq: use SEND_SIG_FORCED instead of force_sig()
proc: fix mount -t proc -o AAA
Many users of debugfs copy the implementation of default_open() when
they want to support a custom read/write function op. This leads to a
proliferation of the default_open() implementation across the entire
tree.
Now that the common implementation has been consolidated into libfs we
can replace all the users of this function with simple_open().
This replacement was done with the following semantic patch:
<smpl>
@ open @
identifier open_f != simple_open;
identifier i, f;
@@
-int open_f(struct inode *i, struct file *f)
-{
(
-if (i->i_private)
-f->private_data = i->i_private;
|
-f->private_data = i->i_private;
)
-return 0;
-}
@ has_open depends on open @
identifier fops;
identifier open.open_f;
@@
struct file_operations fops = {
...
-.open = open_f,
+.open = simple_open,
...
};
</smpl>
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"Page ready" async PF can kick vcpu out of idle state much like IRQ.
We need to tell RCU about this.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
3.4: Fix an an Smatch warning that appeared in the 3.4 merge window
3.0: Fix kgdb test suite with SMP for all archs without HW single stepping
2.6.36: Fix kgdb sw breakpoints with CONFIG_DEBUG_RODATA=y limitations on x86
2.6.26: Fix oops on kgdb test suite with CONFIG_DEBUG_RODATA
Fix kgdb test suite with SMP for all archs with HW single stepping
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPedocAAoJEIciOldedpOjn7EP/397Rh0zmRlG8oQwMEJcK3E5
gaRyBNpkGoU3ekHXHx/nzgQ/CS9opzW7nBZDu8weWLjRKMT4RyHfuJcWyu525GvQ
SnoiX2ZUzP315d8llCYwXmaCEYA7lHQi4T2bGMlDSn1J8kS235EQxllgEfhXDdEC
DxRWgHABG2UR62C62sGKbPaMMDO9TcNcrAQK27LDLTS7pKLmYqBWBdZKgWzBM/Pr
AF8vakqSgUw3Aq9qrLge+483uT7uhMoUJofxRppWtm1QgnDcTmri9LOagiazDotz
RQliRGwVxj9hEo5mLEiQtI0N1kIGCAsK0+9aUJEZRXovRBR9kvqaqHT4c5xdhznr
VKYvqqTcHBkKLIfNXFvQZnn2cXtNVNqve9CZZwdBJaFYEkaR7ZVQqE6f2xq8KAb2
RmhvzlEUyLU+89YKkH66uSa22VLSazkeH+4b8AJ4JxYDEab3BHoBCe8axcBQrTsj
7X5NOs7V3Oj+4J3bS1fbUbxq4t0dfpLLyg8e/lELWtT+Kq7nQRzA2XHRZAMTve8M
T0cTdrwtUbgY9ZMTpywNB2KlPgTvhWOyfYbH6/Kcks7ecSXlkow3edXoiUbw79iE
hP8vcMWbT2Rv3IbLkSMFZEQGAG9qL1YyGv4NDmLOoljO1c/Bi3WQIR5aI+di6asV
Z5q5s/bmGa4+OhFFITSd
=SW2N
-----END PGP SIGNATURE-----
Merge tag 'for_linus-3.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull KGDB/KDB regression fixes from Jason Wessel:
- Fix a Smatch warning that appeared in the 3.4 merge window
- Fix kgdb test suite with SMP for all archs without HW single stepping
- Fix kgdb sw breakpoints with CONFIG_DEBUG_RODATA=y limitations on x86
- Fix oops on kgdb test suite with CONFIG_DEBUG_RODATA
- Fix kgdb test suite with SMP for all archs with HW single stepping
* tag 'for_linus-3.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
x86,kgdb: Fix DEBUG_RODATA limitation using text_poke()
kgdb,debug_core: pass the breakpoint struct instead of address and memory
kgdbts: (2 of 2) fix single step awareness to work correctly with SMP
kgdbts: (1 of 2) fix single step awareness to work correctly with SMP
kgdbts: Fix kernel oops with CONFIG_DEBUG_RODATA
kdb: Fix smatch warning on dbg_io_ops->is_console
Pull DMA mapping branch from Marek Szyprowski:
"Short summary for the whole series:
A few limitations have been identified in the current dma-mapping
design and its implementations for various architectures. There exist
more than one function for allocating and freeing the buffers:
currently these 3 are used dma_{alloc, free}_coherent,
dma_{alloc,free}_writecombine, dma_{alloc,free}_noncoherent.
For most of the systems these calls are almost equivalent and can be
interchanged. For others, especially the truly non-coherent ones
(like ARM), the difference can be easily noticed in overall driver
performance. Sadly not all architectures provide implementations for
all of them, so the drivers might need to be adapted and cannot be
easily shared between different architectures. The provided patches
unify all these functions and hide the differences under the already
existing dma attributes concept. The thread with more references is
available here:
http://www.spinics.net/lists/linux-sh/msg09777.html
These patches are also a prerequisite for unifying DMA-mapping
implementation on ARM architecture with the common one provided by
dma_map_ops structure and extending it with IOMMU support. More
information is available in the following thread:
http://thread.gmane.org/gmane.linux.kernel.cross-arch/12819
More works on dma-mapping framework are planned, especially in the
area of buffer sharing and managing the shared mappings (together with
the recently introduced dma_buf interface: commit d15bd7ee44
"dma-buf: Introduce dma buffer sharing mechanism").
The patches in the current set introduce a new alloc/free methods
(with support for memory attributes) in dma_map_ops structure, which
will later replace dma_alloc_coherent and dma_alloc_writecombine
functions."
People finally started piping up with support for merging this, so I'm
merging it as the last of the pending stuff from the merge window.
Looks like pohmelfs is going to wait for 3.5 and more external support
for merging.
* 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
common: DMA-mapping: add NON-CONSISTENT attribute
common: DMA-mapping: add WRITE_COMBINE attribute
common: dma-mapping: introduce mmap method
common: dma-mapping: remove old alloc_coherent and free_coherent methods
Hexagon: adapt for dma_map_ops changes
Unicore32: adapt for dma_map_ops changes
Microblaze: adapt for dma_map_ops changes
SH: adapt for dma_map_ops changes
Alpha: adapt for dma_map_ops changes
SPARC: adapt for dma_map_ops changes
PowerPC: adapt for dma_map_ops changes
MIPS: adapt for dma_map_ops changes
X86 & IA64: adapt for dma_map_ops changes
common: dma-mapping: introduce generic alloc() and free() methods
Pull x86 fixes from Ingo Molnar.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kvm: Call restore_sched_clock_state() only after %gs is initialized
x86: Use -mno-avx when available
x86: Remove the ancient and deprecated disable_hlt() and enable_hlt() facility
x86: Preserve lazy irq disable semantics in fixup_irqs()
Steven reported his P4 not booting properly, the missing format
attributes cause a NULL ptr deref. Cure this by adding the
missing format specification.
I took the format description out of the comment near
p4_config_pack*() and hope that comment is still relatively
accurate.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Bruno Prémont <bonbons@linux-vserver.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1332859842.16159.227.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf updates and fixes from Ingo Molnar:
"It's mostly fixes, but there's also two late items:
- preliminary GTK GUI support for perf report
- PMU raw event format descriptors in sysfs, to be parsed by tooling
The raw event format in sysfs is a new ABI. For example for the 'CPU'
PMU we have:
aldebaran:~> ll /sys/bus/event_source/devices/cpu/format/*
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/any
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/cmask
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/edge
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/event
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/inv
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/offcore_rsp
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/pc
-r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/umask
those lists of fields contain a specific format:
aldebaran:~> cat /sys/bus/event_source/devices/cpu/format/offcore_rsp
config1:0-63
So, those who wish to specify raw events can now use the following
event format:
-e cpu/cmask=1,event=2,umask=3
Most people will not want to specify any events (let alone raw
events), they'll just use whatever default event the tools use.
But for more obscure PMU events that have no cross-architecture
generic events the above syntax is more usable and a bit more
structured than specifying hex numbers."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
perf tools: Remove auto-generated bison/flex files
perf annotate: Fix off by one symbol hist size allocation and hit accounting
perf tools: Add missing ref-cycles event back to event parser
perf annotate: addr2line wants addresses in same format as objdump
perf probe: Finder fails to resolve function name to address
tracing: Fix ent_size in trace output
perf symbols: Handle NULL dso in dso__name_len
perf symbols: Do not include libgen.h
perf tools: Fix bug in raw sample parsing
perf tools: Fix display of first level of callchains
perf tools: Switch module.h into export.h
perf: Move mmap page data_head offset assertion out of header
perf: Fix mmap_page capabilities and docs
perf diff: Fix to work with new hists design
perf tools: Fix modifier to be applied on correct events
perf tools: Fix various casting issues for 32 bits
perf tools: Simplify event_read_id exit path
tracing: Fix ftrace stack trace entries
tracing: Move the tracing_on/off() declarations into CONFIG_TRACING
perf report: Add a simple GTK2-based 'perf report' browser
...
Pull ACPI & Power Management changes from Len Brown:
- ACPI 5.0 after-ripples, ACPICA/Linux divergence cleanup
- cpuidle evolving, more ARM use
- thermal sub-system evolving, ditto
- assorted other PM bits
Fix up conflicts in various cpuidle implementations due to ARM cpuidle
cleanups (ARM at91 self-refresh and cpu idle code rewritten into
"standby" in asm conflicting with the consolidation of cpuidle time
keeping), trivial SH include file context conflict and RCU tracing fixes
in generic code.
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (77 commits)
ACPI throttling: fix endian bug in acpi_read_throttling_status()
Disable MCP limit exceeded messages from Intel IPS driver
ACPI video: Don't start video device until its associated input device has been allocated
ACPI video: Harden video bus adding.
ACPI: Add support for exposing BGRT data
ACPI: export acpi_kobj
ACPI: Fix logic for removing mappings in 'acpi_unmap'
CPER failed to handle generic error records with multiple sections
ACPI: Clean redundant codes in scan.c
ACPI: Fix unprotected smp_processor_id() in acpi_processor_cst_has_changed()
ACPI: consistently use should_use_kmap()
PNPACPI: Fix device ref leaking in acpi_pnp_match
ACPI: Fix use-after-free in acpi_map_lsapic
ACPI: processor_driver: add missing kfree
ACPI, APEI: Fix incorrect APEI register bit width check and usage
Update documentation for parameter *notrigger* in einj.txt
ACPI, APEI, EINJ, new parameter to control trigger action
ACPI, APEI, EINJ, limit the range of einj_param
ACPI, APEI, Fix ERST header length check
cpuidle: power_usage should be declared signed integer
...
Conflicts:
drivers/acpi/acpica/hwsleep.c
Text conflict between:
2feec47d4c
(ACPICA: ACPI 5: Support for new FADT SleepStatus, SleepControl registers)
which removed #include "actables.h"
and
09f98a825a
(x86, acpi, tboot: Have a ACPI os prepare sleep instead of calling tboot_sleep.)
which removed #include <linux/tboot.h>
The resolution is to remove them both.
Signed-off-by: Len Brown <len.brown@intel.com>
When processor is being hot-added to the system, acpi_map_lsapic invokes
ACPI _MAT method to find APIC ID and flags, verifies that returned structure
is indeed ACPI's local APIC structure, and that flags contain MADT_ENABLED
bit. Then saves APIC ID, frees structure - and accesses structure when
computing arguments for acpi_register_lapic call. Which sometime leads
to acpi_register_lapic call being made with second argument zero, failing
to bring processor online with error 'Unable to map lapic to logical cpu
number'.
As lapic->lapic_flags & ACPI_MADT_ENABLED was already confirmed to be non-zero
few lines above, we can just pass unconditional ACPI_MADT_ENABLED to the
acpi_register_lapic.
Signed-off-by: Petr Vandrovec <petr@vmware.com>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Currently when a CPU is off-lined it enters either MWAIT-based idle or,
if MWAIT is not desired or supported, HLT-based idle (which places the
processor in C1 state). This patch allows processors without MWAIT
support to stay in states deeper than C1.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The X86_32-only disable_hlt/enable_hlt mechanism was used by the
32-bit floppy driver. Its effect was to replace the use of the
HLT instruction inside default_idle() with cpu_relax() - essentially
it turned off the use of HLT.
This workaround was commented in the code as:
"disable hlt during certain critical i/o operations"
"This halt magic was a workaround for ancient floppy DMA
wreckage. It should be safe to remove."
H. Peter Anvin additionally adds:
"To the best of my knowledge, no-hlt only existed because of
flaky power distributions on 386/486 systems which were sold to
run DOS. Since DOS did no power management of any kind,
including HLT, the power draw was fairly uniform; when exposed
to the much hhigher noise levels you got when Linux used HLT
caused some of these systems to fail.
They were by far in the minority even back then."
Alan Cox further says:
"Also for the Cyrix 5510 which tended to go castors up if a HLT
occurred during a DMA cycle and on a few other boxes HLT during
DMA tended to go astray.
Do we care ? I doubt it. The 5510 was pretty obscure, the 5520
fixed it, the 5530 is probably the oldest still in any kind of
use."
So, let's finally drop this.
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Stephen Hemminger <shemminger@vyatta.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/n/tip-3rhk9bzf0x9rljkv488tloib@git.kernel.org
[ If anyone cares then alternative instruction patching could be
used to replace HLT with a one-byte NOP instruction. Much simpler. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 cleanups from Peter Anvin:
"The biggest textual change is the cleanup to use symbolic constants
for x86 trap values.
The only *functional* change and the reason for the x86/x32 dependency
is the move of is_ia32_task() into <asm/thread_info.h> so that it can
be used in other code that needs to understand if a system call comes
from the compat entry point (and therefore uses i386 system call
numbers) or not. One intended user for that is the BPF system call
filter. Moving it out of <asm/compat.h> means we can define it
unconditionally, returning always true on i386."
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Move is_ia32_task to asm/thread_info.h from asm/compat.h
x86: Rename trap_no to trap_nr in thread_struct
x86: Use enum instead of literals for trap values
Pull x32 support for x86-64 from Ingo Molnar:
"This tree introduces the X32 binary format and execution mode for x86:
32-bit data space binaries using 64-bit instructions and 64-bit kernel
syscalls.
This allows applications whose working set fits into a 32 bits address
space to make use of 64-bit instructions while using a 32-bit address
space with shorter pointers, more compressed data structures, etc."
Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}
* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
x32: Fix alignment fail in struct compat_siginfo
x32: Fix stupid ia32/x32 inversion in the siginfo format
x32: Add ptrace for x32
x32: Switch to a 64-bit clock_t
x32: Provide separate is_ia32_task() and is_x32_task() predicates
x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
x86/x32: Fix the binutils auto-detect
x32: Warn and disable rather than error if binutils too old
x32: Only clear TIF_X32 flag once
x32: Make sure TS_COMPAT is cleared for x32 tasks
fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
fs: Fix close_on_exec pointer in alloc_fdtable
x32: Drop non-__vdso weak symbols from the x32 VDSO
x32: Fix coding style violations in the x32 VDSO code
x32: Add x32 VDSO support
x32: Allow x32 to be configured
x32: If configured, add x32 system calls to system call tables
x32: Handle process creation
x32: Signal-related system calls
x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
...
There has long been a limitation using software breakpoints with a
kernel compiled with CONFIG_DEBUG_RODATA going back to 2.6.26. For
this particular patch, it will apply cleanly and has been tested all
the way back to 2.6.36.
The kprobes code uses the text_poke() function which accommodates
writing a breakpoint into a read-only page. The x86 kgdb code can
solve the problem similarly by overriding the default breakpoint
set/remove routines and using text_poke() directly.
The x86 kgdb code will first attempt to use the traditional
probe_kernel_write(), and next try using a the text_poke() function.
The break point install method is tracked such that the correct break
point removal routine will get called later on.
Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@vger.kernel.org # >= 2.6.36
Inspried-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpusets: Remove an unused variable
sched/rt: Improve pick_next_highest_task_rt()
sched: Fix select_fallback_rq() vs cpu_active/cpu_online
sched/x86/smp: Do not enable IRQs over calibrate_delay()
sched: Fix compiler warning about declared inline after use
MAINTAINERS: Update email address for SCHEDULER and PERF EVENTS
Pull x86 updates from Ingo Molnar.
This touches some non-x86 files due to the sanitized INLINE_SPIN_UNLOCK
config usage.
Fixed up trivial conflicts due to just header include changes (removing
headers due to cpu_idle() merge clashing with the <asm/system.h> split).
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic/amd: Be more verbose about LVT offset assignments
x86, tls: Off by one limit check
x86/ioapic: Add io_apic_ops driver layer to allow interception
x86/olpc: Add debugfs interface for EC commands
x86: Merge the x86_32 and x86_64 cpu_idle() functions
x86/kconfig: Remove CONFIG_TR=y from the defconfigs
x86: Stop recursive fault in print_context_stack after stack overflow
x86/io_apic: Move and reenable irq only when CONFIG_GENERIC_PENDING_IRQ=y
x86/apic: Add separate apic_id_valid() functions for selected apic drivers
locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage
x86/kconfig: Update defconfigs
x86: Fix excessive MSR print out when show_msr is not specified
Pull timer core updates from Thomas Gleixner.
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ia64: vsyscall: Add missing paranthesis
alarmtimer: Don't call rtc_timer_init() when CONFIG_RTC_CLASS=n
x86: vdso: Put declaration before code
x86-64: Inline vdso clock_gettime helpers
x86-64: Simplify and optimize vdso clock_gettime monotonic variants
kernel-time: fix s/then/than/ spelling errors
time: remove no_sync_cmos_clock
time: Avoid scary backtraces when warning of > 11% adj
alarmtimer: Make sure we initialize the rtctimer
ntp: Fix leap-second hrtimer livelock
x86, tsc: Skip refined tsc calibration on systems with reliable TSC
rtc: Provide flag for rtc devices that don't support UIE
ia64: vsyscall: Use seqcount instead of seqlock
x86: vdso: Use seqcount instead of seqlock
x86: vdso: Remove bogus locking in update_vsyscall_tz()
time: Remove bogus comments
time: Fix change_clocksource locking
time: x86: Fix race switching from vsyscall to non-vsyscall clock
The default irq_disable() sematics are to mark the interrupt disabled,
but keep it unmasked. If the interrupt is delivered while marked
disabled, the low level interrupt handler masks it and marks it
pending. This is important for detecting wakeup interrupts during
suspend and for edge type interrupts to avoid losing interrupts.
fixup_irqs() moves the interrupts away from an offlined cpu. For
certain interrupt types it needs to mask the interrupt line before
changing the affinity. After affinity has changed the interrupt line
is unmasked again, but only if it is not marked disabled.
This breaks the lazy irq disable semantics and causes problems in
suspend as the interrupt can be lost or wakeup functionality is
broken.
Check irqd_irq_masked() instead of irqd_irq_disabled() because
irqd_irq_masked() is only set, when the core code actually masked the
interrupt line. If it's not set, we unmask the interrupt and let the
lazy irq disable logic deal with an eventually incoming interrupt.
[ tglx: Massaged changelog and added a comment ]
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A05DFB3@SHSMSX101.ccr.corp.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge third batch of patches from Andrew Morton:
- Some MM stragglers
- core SMP library cleanups (on_each_cpu_mask)
- Some IPI optimisations
- kexec
- kdump
- IPMI
- the radix-tree iterator work
- various other misc bits.
"That'll do for -rc1. I still have ~10 patches for 3.4, will send
those along when they've baked a little more."
* emailed from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
backlight: fix typo in tosa_lcd.c
crc32: add help text for the algorithm select option
mm: move hugepage test examples to tools/testing/selftests/vm
mm: move slabinfo.c to tools/vm
mm: move page-types.c from Documentation to tools/vm
selftests/Makefile: make `run_tests' depend on `all'
selftests: launch individual selftests from the main Makefile
radix-tree: use iterators in find_get_pages* functions
radix-tree: rewrite gang lookup using iterator
radix-tree: introduce bit-optimized iterator
fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
nbd: rename the nbd_device variable from lo to nbd
pidns: add reboot_pid_ns() to handle the reboot syscall
sysctl: use bitmap library functions
ipmi: use locks on watchdog timeout set on reboot
ipmi: simplify locking
ipmi: fix message handling during panics
ipmi: use a tasklet for handling received messages
ipmi: increase KCS timeouts
ipmi: decrease the IPMI message transaction time in interrupt mode
...
crashkernel reservation need know the total memory size. Current
get_total_mem simply use max_pfn - min_low_pfn. It is wrong because it
will including memory holes in the middle.
Especially for kvm guest with memory > 0xe0000000, there's below in qemu
code: qemu split memory as below:
if (ram_size >= 0xe0000000 ) {
above_4g_mem_size = ram_size - 0xe0000000;
below_4g_mem_size = 0xe0000000;
} else {
below_4g_mem_size = ram_size;
}
So for 4G mem guest, seabios will insert a 512M usable region beyond of
4G. Thus in above case max_pfn - min_low_pfn will be more than original
memsize.
Fixing this issue by using memblock_phys_mem_size() to get the total
memsize.
Signed-off-by: Dave Young <dyoung@redhat.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
/5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
NypQthI85pc=
=G9mT
-----END PGP SIGNATURE-----
Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system
Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.
I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.
The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().
This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuituous route involving asm/system.h.
The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).
These patches break asm/system.h up into the following core pieces:
(1) asm/barrier.h
Move memory barriers here. This already done for MIPS and Alpha.
(2) asm/switch_to.h
Move switch_to() and related stuff here.
(3) asm/exec.h
Move arch_align_stack() here. Other process execution related bits
could perhaps go here from asm/processor.h.
(4) asm/cmpxchg.h
Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().
(5) asm/bug.h
Move die() and related bits.
(6) asm/auxvec.h
Move AT_VECTOR_SIZE_ARCH here.
Other arch headers are created as needed on a per-arch basis."
Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it..
* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
Delete all instances of asm/system.h
Remove all #inclusions of asm/system.h
Add #includes needed to permit the removal of asm/system.h
Move all declarations of free_initmem() to linux/mm.h
Disintegrate asm/system.h for OpenRISC
Split arch_align_stack() out from asm-generic/system.h
Split the switch_to() wrapper out of asm-generic/system.h
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
Create asm-generic/barrier.h
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
Disintegrate asm/system.h for Xtensa
Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
Disintegrate asm/system.h for Tile
Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for SH
Disintegrate asm/system.h for Score
Disintegrate asm/system.h for S390
Disintegrate asm/system.h for PowerPC
Disintegrate asm/system.h for PA-RISC
Disintegrate asm/system.h for MN10300
...
Pull kvm updates from Avi Kivity:
"Changes include timekeeping improvements, support for assigning host
PCI devices that share interrupt lines, s390 user-controlled guests, a
large ppc update, and random fixes."
This is with the sign-off's fixed, hopefully next merge window we won't
have rebased commits.
* 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: Convert intx_mask_lock to spin lock
KVM: x86: fix kvm_write_tsc() TSC matching thinko
x86: kvmclock: abstract save/restore sched_clock_state
KVM: nVMX: Fix erroneous exception bitmap check
KVM: Ignore the writes to MSR_K7_HWCR(3)
KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask
KVM: PMU: add proper support for fixed counter 2
KVM: PMU: Fix raw event check
KVM: PMU: warn when pin control is set in eventsel msr
KVM: VMX: Fix delayed load of shared MSRs
KVM: use correct tlbs dirty type in cmpxchg
KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
KVM: x86 emulator: Allow PM/VM86 switch during task switch
KVM: SVM: Fix CPL updates
KVM: x86 emulator: VM86 segments must have DPL 3
KVM: x86 emulator: Fix task switch privilege checks
arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice
KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation
KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
...
Add information about LVT offset assignments to better debug firmware
bugs related to this. See following examples.
# dmesg | grep -i 'offset\|ibs'
LVT offset 0 assigned for vector 0xf9
[Firmware Bug]: cpu 0, try to use APIC500 (LVT offset 0) for vector 0x10400, but the register is already in use for vector 0xf9 on another cpu
[Firmware Bug]: cpu 0, IBS interrupt offset 0 not available (MSRC001103A=0x0000000000000100)
Failed to setup IBS, -22
In this case the BIOS assigns both offsets for MCE (0xf9) and IBS
(0x400) vectors to offset 0, which is why the second APIC setup (IBS)
failed.
With correct setup you get:
# dmesg | grep -i 'offset\|ibs'
LVT offset 0 assigned for vector 0xf9
LVT offset 1 assigned for vector 0x400
IBS: LVT offset 1 assigned
perf: AMD IBS detected (0x00000007)
oprofile: AMD IBS detected (0x00000007)
Note: The vector includes also the message type to handle also NMIs
(0x400). In the firmware bug message the format is the same as of the
APIC500 register and includes the mask bit (bit 16) in addition.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These are used as offsets into an array of GDT_ENTRY_TLS_ENTRIES members
so GDT_ENTRY_TLS_ENTRIES is one past the end of the array.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: http://lkml.kernel.org/r/20120324075250.GA28258@elgon.mountain
Cc: <stable@vger.kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Adapt core x86 and IA64 architecture code for dma_map_ops changes: replace
alloc/free_coherent with generic alloc/free methods.
Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@samsung.com>
Acked-by: Kyungmin Park <kyungmin.park@samsung.com>
[removed swiotlb related changes and replaced it with wrappers,
merged with IA64 patch to avoid inter-patch dependences in intel-iommu code]
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Xen dom0 needs to paravirtualize IO operations to the IO APIC,
so add a io_apic_ops for it to intercept. Do this as ops
structure because there's at least some chance that another
paravirtualized environment may want to intercept these.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: jwboyer@redhat.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/1332385090-18056-2-git-send-email-konrad.wilk@oracle.com
[ Made all the affected code easier on the eyes ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We should not ever enable IRQs until we're fully set up. This opens up
a window where interrupts can hit the cpu and interrupts can do
wakeups, wakeups need state that isn't set-up yet, in particular this
cpu isn't elegible to run tasks, so if any cpu-affine task that got
created in CPU_UP_PREPARE manages to get a wakeup, its affinity mask
will get broken and we'll run into lots of 'interesting' problems.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-yaezmlbriluh166tfkgni22m@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both functions are mostly identical.
The differences are:
- x86_32's cpu_idle() makes use of check_pgt_cache(), which is a
nop on both x86_32 and x86_64.
- x86_64's cpu_idle() uses enter/__exit_idle/(), on x86_32 these
function are a nop.
- In contrast to x86_32, x86_64 calls rcu_idle_enter/exit() in
the innermost loop because idle notifications need RCU.
Calling these function on x86_32 also in the innermost loop
does not hurt.
So we can merge both functions.
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: paulmck@linux.vnet.ibm.com
Cc: josh@joshtriplett.org
Cc: tj@kernel.org
Link: http://lkml.kernel.org/r/1332709204-22496-1-git-send-email-richard@nod.at
Signed-off-by: Ingo Molnar <mingo@kernel.org>
"[RFC - PATCH 0/7] consolidation of BUG support code."
https://lkml.org/lkml/2012/1/26/525
--
The changes shown here are to unify linux's BUG support under
the one <linux/bug.h> file. Due to historical reasons, we have
some BUG code in bug.h and some in kernel.h -- i.e. the support for
BUILD_BUG in linux/kernel.h predates the addition of linux/bug.h,
but old code in kernel.h wasn't moved to bug.h at that time. As
a band-aid, kernel.h was including <asm/bug.h> to pseudo link them.
This has caused confusion[1] and general yuck/WTF[2] reactions.
Here is an example that violates the principle of least surprise:
CC lib/string.o
lib/string.c: In function 'strlcat':
lib/string.c:225:2: error: implicit declaration of function 'BUILD_BUG_ON'
make[2]: *** [lib/string.o] Error 1
$
$ grep linux/bug.h lib/string.c
#include <linux/bug.h>
$
We've included <linux/bug.h> for the BUG infrastructure and yet we
still get a compile fail! [We've not kernel.h for BUILD_BUG_ON.]
Ugh - very confusing for someone who is new to kernel development.
With the above in mind, the goals of this changeset are:
1) find and fix any include/*.h files that were relying on the
implicit presence of BUG code.
2) find and fix any C files that were consuming kernel.h and
hence relying on implicitly getting some/all BUG code.
3) Move the BUG related code living in kernel.h to <linux/bug.h>
4) remove the asm/bug.h from kernel.h to finally break the chain.
During development, the order was more like 3-4, build-test, 1-2.
But to ensure that git history for bisect doesn't get needless
build failures introduced, the commits have been reorderd to fix
the problem areas in advance.
[1] https://lkml.org/lkml/2012/1/3/90
[2] https://lkml.org/lkml/2012/1/17/414
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPbNwpAAoJEOvOhAQsB9HWrqYP/A0t9VB0nK6e42F0OR2P14MZ
GJFtf1B++wwioIrx+KSWSRfSur1C5FKhDbxLR3I/pvkAYl4+T4JvRdMG6xJwxyip
CC1kVQQNDjWVVqzjz2x6rYkOffx6dUlw/ERyIyk+OzP+1HzRIsIrugMqbzGLlX0X
y0v2Tbd0G6xg1DV8lcRdp95eIzcGuUvdb2iY2LGadWZczEOeSXx64Jz3QCFxg3aL
LFU4oovsg8Nb7MRJmqDvHK/oQf5vaTm9WSrS0pvVte0msSQRn8LStYdWC0G9BPCS
GwL86h/eLXlUXQlC5GpgWg1QQt5i2QpjBFcVBIG0IT5SgEPMx+gXyiqZva2KwbHu
LKicjKtfnzPitQnyEV/N6JyV1fb1U6/MsB7ebU5nCCzt9Gr7MYbjZ44peNeprAtu
HMvJ/BNnRr4Ha6nPQNu952AdASPKkxmeXFUwBL1zUbLkOX/bK/vy1ujlcdkFxCD7
fP3t7hghYa737IHk0ehUOhrE4H67hvxTSCKioLUAy/YeN1IcfH/iOQiCBQVLWmoS
AqYV6ou9cqgdYoyila2UeAqegb+8xyubPIHt+lebcaKxs5aGsTg+r3vq5juMDAPs
iwSVYUDcIw9dHer1lJfo7QCy3QUTRDTxh+LB9VlHXQICgeCK02sLBOi9hbEr4/H8
Ko9g8J3BMxcMkXLHT9ud
=PYQT
-----END PGP SIGNATURE-----
Merge tag 'bug-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
Pull <linux/bug.h> cleanup from Paul Gortmaker:
"The changes shown here are to unify linux's BUG support under the one
<linux/bug.h> file. Due to historical reasons, we have some BUG code
in bug.h and some in kernel.h -- i.e. the support for BUILD_BUG in
linux/kernel.h predates the addition of linux/bug.h, but old code in
kernel.h wasn't moved to bug.h at that time. As a band-aid, kernel.h
was including <asm/bug.h> to pseudo link them.
This has caused confusion[1] and general yuck/WTF[2] reactions. Here
is an example that violates the principle of least surprise:
CC lib/string.o
lib/string.c: In function 'strlcat':
lib/string.c:225:2: error: implicit declaration of function 'BUILD_BUG_ON'
make[2]: *** [lib/string.o] Error 1
$
$ grep linux/bug.h lib/string.c
#include <linux/bug.h>
$
We've included <linux/bug.h> for the BUG infrastructure and yet we
still get a compile fail! [We've not kernel.h for BUILD_BUG_ON.] Ugh -
very confusing for someone who is new to kernel development.
With the above in mind, the goals of this changeset are:
1) find and fix any include/*.h files that were relying on the
implicit presence of BUG code.
2) find and fix any C files that were consuming kernel.h and hence
relying on implicitly getting some/all BUG code.
3) Move the BUG related code living in kernel.h to <linux/bug.h>
4) remove the asm/bug.h from kernel.h to finally break the chain.
During development, the order was more like 3-4, build-test, 1-2. But
to ensure that git history for bisect doesn't get needless build
failures introduced, the commits have been reorderd to fix the problem
areas in advance.
[1] https://lkml.org/lkml/2012/1/3/90
[2] https://lkml.org/lkml/2012/1/17/414"
Fix up conflicts (new radeon file, reiserfs header cleanups) as per Paul
and linux-next.
* tag 'bug-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
kernel.h: doesn't explicitly use bug.h, so don't include it.
bug: consolidate BUILD_BUG_ON with other bug code
BUG: headers with BUG/BUG_ON etc. need linux/bug.h
bug.h: add include of it to various implicit C users
lib: fix implicit users of kernel.h for TAINT_WARN
spinlock: macroize assert_spin_locked to avoid bug.h dependency
x86: relocate get/set debugreg fcns to include/asm/debugreg.
Sigh, warnings are there for a reason.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: John Stultz <john.stultz@linaro.org>
After printing out the first line of a stack backtrace,
print_context_stack() calls print_ftrace_graph_addr() to check
if it's making a graph of function calls, usually not the case.
But unfortunate ordering of assignments causes this to oops if
an earlier stack overflow corrupted threadinfo->task. Reorder
to avoid that irritation.
( The fact that there was a stack overflow may often be more
interesting than the stack that can now be shown; but
integrating that information with this stacktrace is awkward,
so leave it to overflow reporting. )
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120323225648.15DD5A033B@akpm.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull additional x86 fixes from Peter Anvin:
- address a long-standing bug related to when a kernel-spawned process
gets a signal on an i386 kernel compiled without CONFIG_VM86.
- fix the newly introduced build warning in arch/x86/boot.
- fix a typo in the i386 system call table which affects building some
libcs.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-32: Fix endless loop when processing signals for kernel tasks
x86, boot: Correct CFLAGS for hostprogs
x86-32: Fix typo for mq_getsetattr in syscall table
Merge second batch of patches from Andrew Morton:
- various misc things
- core kernel changes to prctl, exit, exec, init, etc.
- kernel/watchdog.c updates
- get_maintainer
- MAINTAINERS
- the backlight driver queue
- core bitops code cleanups
- the led driver queue
- some core prio_tree work
- checkpatch udpates
- largeish crc32 update
- a new poll() feature for the v4l guys
- the rtc driver queue
- fatfs
- ptrace
- signals
- kmod/usermodehelper updates
- coredump
- procfs updates
* emailed from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
seq_file: add seq_set_overflow(), seq_overflow()
proc-ns: use d_set_d_op() API to set dentry ops in proc_ns_instantiate().
procfs: speed up /proc/pid/stat, statm
procfs: add num_to_str() to speed up /proc/stat
proc: speed up /proc/stat handling
fs/proc/kcore.c: make get_sparsemem_vmemmap_info() static
coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP
coredump: remove VM_ALWAYSDUMP flag
kmod: make __request_module() killable
kmod: introduce call_modprobe() helper
usermodehelper: ____call_usermodehelper() doesn't need do_exit()
usermodehelper: kill umh_wait, renumber UMH_* constants
usermodehelper: implement UMH_KILLABLE
usermodehelper: introduce umh_complete(sub_info)
usermodehelper: use UMH_WAIT_PROC consistently
signal: zap_pid_ns_processes: s/SEND_SIG_NOINFO/SEND_SIG_FORCED/
signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig()
signal: cosmetic, s/from_ancestor_ns/force/ in prepare_signal() paths
signal: give SEND_SIG_FORCED more power to beat SIGNAL_UNKILLABLE
Hexagon: use set_current_blocked() and block_sigmask()
...
Use for_each_clear_bit() to iterate over all the cleared bit in a
memory region.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This renames for_each_set_bit_cont() to for_each_set_bit_from() because
it is analogous to list_for_each_entry_from() in list.h rather than
list_for_each_entry_continue().
This doesn't remove for_each_set_bit_cont() for now.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We used to store the wall-to-monotonic offset and the realtime base.
It's faster to precompute the monotonic base.
This is about a 3% speedup on Sandy Bridge for CLOCK_MONOTONIC.
It's much more impressive for CLOCK_MONOTONIC_COARSE.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Pull PCI changes (including maintainer change) from Jesse Barnes:
"This pull has some good cleanups from Bjorn and Yinghai, as well as
some more code from Yinghai to better handle resource re-allocation
when enabled.
There's also a new initcall_debug feature from Arjan which will print
out quirk timing information to help identify slow quirks for fixing
or refinement (Yinghai sent in a few patches to do just that once the
new debug code landed).
Beyond that, I'm handing off PCI maintainership to Bjorn Helgaas.
He's been a core PCI and Linux contributor for some time now, and has
kindly volunteered to take over. I just don't feel I have the time
for PCI review and work that it deserves lately (I've taken on some
other projects), and haven't been as responsive lately as I'd like, so
I approached Bjorn asking if he'd like to manage things. He's going
to give it a try, and I'm confident he'll do at least as well as I
have in keeping the tree managed, patches flowing, and keeping things
stable."
Fix up some fairly trivial conflicts due to other cleanups (mips device
resource fixup cleanups clashing with list handling cleanup, ppc iseries
removal clashing with pci_probe_only cleanup etc)
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci: (112 commits)
PCI: Bjorn gets PCI hotplug too
PCI: hand PCI maintenance over to Bjorn Helgaas
unicore32/PCI: move <asm-generic/pci-bridge.h> include to asm/pci.h
sparc/PCI: convert devtree and arch-probed bus addresses to resource
powerpc/PCI: allow reallocation on PA Semi
powerpc/PCI: convert devtree bus addresses to resource
powerpc/PCI: compute I/O space bus-to-resource offset consistently
arm/PCI: don't export pci_flags
PCI: fix bridge I/O window bus-to-resource conversion
x86/PCI: add spinlock held check to 'pcibios_fwaddrmap_lookup()'
PCI / PCIe: Introduce command line option to disable ARI
PCI: make acpihp use __pci_remove_bus_device instead
PCI: export __pci_remove_bus_device
PCI: Rename pci_remove_behind_bridge to pci_stop_and_remove_behind_bridge
PCI: Rename pci_remove_bus_device to pci_stop_and_remove_bus_device
PCI: print out PCI device info along with duration
PCI: Move "pci reassigndev resource alignment" out of quirks.c
PCI: Use class for quirk for usb host controller fixup
PCI: Use class for quirk for ti816x class fixup
PCI: Use class for quirk for intel e100 interrupt fixup
...
- Fix KDB keyboard repeat scan codes and leaked keyboard events
- Fix kernel crash with kdb_printf() for users who compile new kdb_printf()'s
in early code
- Return all segment registers to gdb on x86_64
Features:
- KDB/KGDB hook the reboot notifier and end user can control if it stops,
detaches or does nothing (updated docs as well)
- Notify users who use CONFIG_DEBUG_RODATA to use hw breakpoints
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPbJ17AAoJEIciOldedpOjQeIP/AkUxQFJ7O4aLrLYHl62EHnh
spkgkd+nBzIzcKyV73alkrVBaR2WE2822aAPQmAPBP8/X283DZJJjqgDCUNVI1Mf
CIZ7g8AQHRnS+bAZmof5Jss4malZn4byLvG/cfpOivrsye+4A8MdrAKKM3pYWNVy
4xABkcEknVEEamdNEhHrcPd+xehretfw7+9mmU8hfjqHb/5cXFB7JwDcf4tF7ozT
MDyN4xKtOn1W/ftQl0t6izksCUuPyqKzIfUyAy0j6AwTgsEavXu56S52T1UoB2ZI
JcwLn/ZpN4eGCWVodY04R3jzaMtKFb6ImY40wsb5wl3CU3Ecy5syMU6z4fg3cvjH
/KE6xWF61j4yiE5lzjeJVtKyxwalthzrr56qU2uEwrsEVmo3SOnW9sm0cwouqa7j
/TAMlhZuGgbZGesFwdaUKI5OLGoki+pRQ0Gaq3TsbZwpPC5Bimkq0bIvruruKJCP
QWVkEvb5TZgxCFS3AvniePOm7Hc2wD9zXB3OfN3o91pCom0ryDBIthcLlwhVeNCo
Jd67pnJJNVULPF/1qVicZihKHxvG3DUb4E9pUcbgJplBke+isi+3eHOvnQrYFjIG
iCamE9qvVbsQm/OFV8MOJ5mfPs9R+nb/jNzTO8JDmBc8AL7nRDV3AFGjW68x/KWT
ERcqNEGJ4QuVAxfejq76
=SXu9
-----END PGP SIGNATURE-----
Merge tag 'for_linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb
Pull KGDB/KDB updates from Jason Wessel:
"Fixes:
- Fix KDB keyboard repeat scan codes and leaked keyboard events
- Fix kernel crash with kdb_printf() for users who compile new
kdb_printf()'s in early code
- Return all segment registers to gdb on x86_64
Features:
- KDB/KGDB hook the reboot notifier and end user can control if it
stops, detaches or does nothing (updated docs as well)
- Notify users who use CONFIG_DEBUG_RODATA to use hw breakpoints"
* tag 'for_linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb:
kdb: Add message about CONFIG_DEBUG_RODATA on failure to install breakpoint
kdb: Avoid using dbg_io_ops until it is initialized
kgdb,debug_core: add the ability to control the reboot notifier
KDB: Fix usability issues relating to the 'enter' key.
kgdb,debug-core,gdbstub: Hook the reboot notifier for debugger detach
kgdb: Respect that flush op is optional
kgdb: x86: Return all segment registers also in 64-bit mode
This patch removes dead code from certain .config variations.
When CONFIG_GENERIC_PENDING_IRQ=n irq move and reenable code is
never get executed, nor do_unmask_irq variable updates its init
value. Move the code under CONFIG_GENERIC_PENDING_IRQ macro.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120320141935.GA24806@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As suggested by Suresh Siddha and Yinghai Lu:
For x2apic pre-enabled systems, apic driver is set already early
through early_acpi_boot_init()/early_acpi_process_madt()/
acpi_parse_madt()/default_acpi_madt_oem_check() path so that
apic_id_valid() checking will be sufficient during MADT and SRAT
parsing.
For non-x2apic pre-enabled systems, all apic ids should be less
than 255.
This allows us to substitute the checks in
arch/x86/kernel/acpi/boot.c::acpi_parse_x2apic() and
arch/x86/mm/srat.c::acpi_numa_x2apic_affinity_init() with
apic->apic_id_valid().
In addition we can avoid feigning the x2apic cpu feature in the
NumaChip apic code.
The following apic drivers have separate apic_id_valid()
functions which will accept x2apic type IDs :
x2apic_phys
x2apic_cluster
x2apic_uv_x
apic_numachip
Signed-off-by: Steffen Persvold <sp@numascale.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/1331925935-13372-1-git-send-email-sp@numascale.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dave found:
| During bootup, I now have 162 messages like this..
| [ 0.227346] MSR0000001b: 00000000fee00900
| [ 0.227465] MSR00000021: 0000000000000001
| [ 0.227584] MSR0000002a: 00000000c1c81400
|
| commit 21c3fcf3e3 looks suspect.
| It claims that it will only print these out if show_msr= is
| passed, but that doesn't seem to be the case.
Fix it by changing to the version that checks the index.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1332477103-4595-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Complete the syscall-less self-profiling feature and address
all complaints, namely:
- capabilities, so we can detect what is actually available at runtime
Add a capabilities field to perf_event_mmap_page to indicate
what is actually available for use.
- on x86: RDPMC weirdness due to being 40/48 bits and not sign-extending
properly.
- ABI documentation as to how all this stuff works.
Also improve the documentation for the new features.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1332433596.2487.33.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
. Short term fix for 'diff' tool breakage related to perf.data files
with multiple events. From Jiri Olsa
. Cleanup for event id tracepoint reading routine, from Borislav Petkov
. 32-bit compilation fixes from Jiri Olsa
. Event parsing modifier assignment fixes from Jiri Olsa
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJPa24hAAoJENZQFvNTUqpARrEQAIXfsXG2X45twEFHJ0oOmvZI
7F28pMFM0OBJdSq/MKyO86+j5wtbtKtjCcVZfVm1ukkeD7KnKQv+iyBw1rEuE/dx
IJIIS+zkCEAje7s64diNax/rpvZXp+k92pKKo/JGjalsVcw63VF5n0Ej3JA+n8Qp
anKyJd2SJFogoljTzRnFnZ+zsbn5CWVwb3NVRhH8t+jfKKOsmw8Nq5ds5ysuRf8x
+Xpm3YAsyKzAJMd05Uo5no+Ne4A5TukySiarZl3wiQ0TrwM9mVxQVAU5iVkAUZkb
Bp1dgBOgG1RNudcRSgbq3NuO7TGFby6DqODgXvo1TzpLyixKbWUz3JI0mEya5xCL
eBxJ2UTQpruBg+XJ3C0ys74TkMcQUeyxhqbwPu1WhMbP7cIJK1W1hJ728qWRG94a
uxFMFlEfc0d9kxlsXVvxUumiyhEw5gPtwdbQf7fo2xpMPQmHCC3yS7DdMO6FBfUW
oqPC7gb3aSTmHnKuYvwbT9EcOL27YHU+Y/4sRJ/QfohArgc4h60dUQrE9fWz7dXH
/eCXDvMnBs2+z52dNLq/mkoQMP4iFzWqboDNn5GM4jI2LmXd/hxBGL6lb2Vzm8CI
QYi5xC4E6VStSkJG2DTtvaYTclWOjo+ZrsGz7sDXQV3MgZ1wiZIWcGiPjeFBWU25
mHWkZT3of/MAbewmeklc
=V7d9
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Cleanups and fixes for perf/core:
. Short term fix for 'diff' tool breakage related to perf.data files
with multiple events. From Jiri Olsa
. Cleanup for event id tracepoint reading routine, from Borislav Petkov
. 32-bit compilation fixes from Jiri Olsa
. Event parsing modifier assignment fixes from Jiri Olsa
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The problem occurs on !CONFIG_VM86 kernels [1] when a kernel-mode task
returns from a system call with a pending signal.
A real-life scenario is a child of 'khelper' returning from a failed
kernel_execve() in ____call_usermodehelper() [ kernel/kmod.c ].
kernel_execve() fails due to a pending SIGKILL, which is the result of
"kill -9 -1" (at least, busybox's init does it upon reboot).
The loop is as follows:
* syscall_exit_work:
- work_pending: // start_of_the_loop
- work_notify_sig:
- do_notify_resume()
- do_signal()
- if (!user_mode(regs)) return;
- resume_userspace // TIF_SIGPENDING is still set
- work_pending // so we call work_pending => goto
// start_of_the_loop
More information can be found in another LKML thread:
http://www.serverphorums.com/read.php?12,457826
[1] the problem was also seen on MIPS.
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Link: http://lkml.kernel.org/r/1332448765.2299.68.camel@dimm
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Even if the content is always 0, gdb expects us to return also ds,
es, fs, and gs while in x86-64 mode. Do this to avoid ugly errors on
"info registers".
[jason.wessel@windriver.com: adjust NUMREGBYTES for two new regs]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Pull x86 "urgent" leftovers from Ingo Molnar:
"Pending x86/urgent bits that were not high prio enough to warrant
-rc-less v3.3-final inclusion."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Fix pointer math issue in handle_ramdisks()
x86/ioapic: Add register level checks to detect bogus io-apic entries
x86, mce: Fix rcu splat in drain_mce_log_buffer()
x86, memblock: Move mem_hole_size() to .init
Pull x86 platform changes from Ingo Molnar.
Removes the Moorestown platform that nobody ever used.
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform: Move APIC ID validity check into platform APIC code
x86/olpc/xo15/sci: Enable lid close wakeup control
x86/geode/net5501: Add platform driver for Soekris Engineering net5501
x86/geode/alix2: Supplement driver to include GPIO button support
x86/mid/powerbtn: Use MSIC read/write instead of ipc_scu
x86/mid/thermal: Turn off thermistor
x86/mid/thermal: Add msic_thermal alias
x86/mid/thermal: Convert to use Intel MSIC API
x86/mid/scu_ipc: Remove Moorestown support
x86/mid: Kill off Moorestown
x86/mrst: Add msic_thermal platform support
x86/config: Select MSIC MFD driver on Intel Medfield platform
x86/mid: Remove Intel Moorestown
x86/mrst: Set ISA bus type for fake MP IRQs
x86/ioapic: Use legacy_pic to set correct gsi-irq mapping
Pull MCE changes from Ingo Molnar.
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix return value of mce_chrdev_read() when erst is disabled
x86/mce: Convert static array of pointers to per-cpu variables
x86/mce: Replace hard coded hex constants with symbolic defines
x86/mce: Recognise machine check bank signature for data path error
x86/mce: Handle "action required" errors
x86/mce: Add mechanism to safely save information in MCE handler
x86/mce: Create helper function to save addr/misc when needed
HWPOISON: Add code to handle "action required" errors.
HWPOISON: Clean up memory_failure() vs. __memory_failure()
Pull x86/fpu changes from Ingo Molnar.
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
i387: Split up <asm/i387.h> into exported and internal interfaces
i387: Uninline the generic FP helpers that we expose to kernel modules
Pull trivial x86 branches from Ingo Molnar: small one-liners to fix up
details.
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Remove some noise from boot log when starting cpus
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, boot: Fix port argument to inl() function
* 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpufeature: Add CPU features from Intel document 319433-012A
* 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86_64: Record stack pointer before task execution begins
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/UV: Lower UV rtc clocksource rating
Pull x86/asm changes from Ingo Molnar
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Include probe_roms.h in probe_roms.c
x86/32: Print control and debug registers for kerenel context
x86: Tighten dependencies of CPU_SUP_*_32
x86/numa: Improve internode cache alignment
x86: Fix the NMI nesting comments
x86-64: Improve insn scheduling in SAVE_ARGS_IRQ
x86-64: Fix CFI annotations for NMI nesting code
bitops: Add missing parentheses to new get_order macro
bitops: Optimise get_order()
bitops: Adjust the comment on get_order() to describe the size==0 case
x86/spinlocks: Eliminate TICKET_MASK
x86-64: Handle byte-wise tail copying in memcpy() without a loop
x86-64: Fix memcpy() to support sizes of 4Gb and above
x86-64: Fix memset() to support sizes of 4Gb and above
x86-64: Slightly shorten copy_page()
If the required size is bigger than cached_hole_size it is better to
search from free_area_cache - it is easier to get a free region,
specifically for the 64 bit process whose address space is large enough
Do it just as hugetlb_get_unmapped_area_topdown() in arch/x86/mm/hugetlbpage.c
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some cases it may happen that pmd_none_or_clear_bad() is called with
the mmap_sem hold in read mode. In those cases the huge page faults can
allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
false positive from pmd_bad() that will not like to see a pmd
materializing as trans huge.
It's not khugepaged causing the problem, khugepaged holds the mmap_sem
in write mode (and all those sites must hold the mmap_sem in read mode
to prevent pagetables to go away from under them, during code review it
seems vm86 mode on 32bit kernels requires that too unless it's
restricted to 1 thread per process or UP builds). The race is only with
the huge pagefaults that can convert a pmd_none() into a
pmd_trans_huge().
Effectively all these pmd_none_or_clear_bad() sites running with
mmap_sem in read mode are somewhat speculative with the page faults, and
the result is always undefined when they run simultaneously. This is
probably why it wasn't common to run into this. For example if the
madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
fault, the hugepage will not be zapped, if the page fault runs first it
will be zapped.
Altering pmd_bad() not to error out if it finds hugepmds won't be enough
to fix this, because zap_pmd_range would then proceed to call
zap_pte_range (which would be incorrect if the pmd become a
pmd_trans_huge()).
The simplest way to fix this is to read the pmd in the local stack
(regardless of what we read, no need of actual CPU barriers, only
compiler barrier needed), and be sure it is not changing under the code
that computes its value. Even if the real pmd is changing under the
value we hold on the stack, we don't care. If we actually end up in
zap_pte_range it means the pmd was not none already and it was not huge,
and it can't become huge from under us (khugepaged locking explained
above).
All we need is to enforce that there is no way anymore that in a code
path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
can run into a hugepmd. The overhead of a barrier() is just a compiler
tweak and should not be measurable (I only added it for THP builds). I
don't exclude different compiler versions may have prevented the race
too by caching the value of *pmd on the stack (that hasn't been
verified, but it wouldn't be impossible considering
pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
and there's no external function called in between pmd_trans_huge and
pmd_none_or_clear_bad).
if (pmd_trans_huge(*pmd)) {
if (next-addr != HPAGE_PMD_SIZE) {
VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
split_huge_page_pmd(vma->vm_mm, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
continue;
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
Because this race condition could be exercised without special
privileges this was reported in CVE-2012-1179.
The race was identified and fully explained by Ulrich who debugged it.
I'm quoting his accurate explanation below, for reference.
====== start quote =======
mapcount 0 page_mapcount 1
kernel BUG at mm/huge_memory.c:1384!
At some point prior to the panic, a "bad pmd ..." message similar to the
following is logged on the console:
mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
the page's PMD table entry.
143 void pmd_clear_bad(pmd_t *pmd)
144 {
-> 145 pmd_ERROR(*pmd);
146 pmd_clear(pmd);
147 }
After the PMD table entry has been cleared, there is an inconsistency
between the actual number of PMD table entries that are mapping the page
and the page's map count (_mapcount field in struct page). When the page
is subsequently reclaimed, __split_huge_page() detects this inconsistency.
1381 if (mapcount != page_mapcount(page))
1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
1383 mapcount, page_mapcount(page));
-> 1384 BUG_ON(mapcount != page_mapcount(page));
The root cause of the problem is a race of two threads in a multithreaded
process. Thread B incurs a page fault on a virtual address that has never
been accessed (PMD entry is zero) while Thread A is executing an madvise()
system call on a virtual address within the same 2 MB (huge page) range.
virtual address space
.---------------------.
| |
| |
.-|---------------------|
| | |
| | |<-- B(fault)
| | |
2 MB | |/////////////////////|-.
huge < |/////////////////////| > A(range)
page | |/////////////////////|-'
| | |
| | |
'-|---------------------|
| |
| |
'---------------------'
- Thread A is executing an madvise(..., MADV_DONTNEED) system call
on the virtual address range "A(range)" shown in the picture.
sys_madvise
// Acquire the semaphore in shared mode.
down_read(¤t->mm->mmap_sem)
...
madvise_vma
switch (behavior)
case MADV_DONTNEED:
madvise_dontneed
zap_page_range
unmap_vmas
unmap_page_range
zap_pud_range
zap_pmd_range
//
// Assume that this huge page has never been accessed.
// I.e. content of the PMD entry is zero (not mapped).
//
if (pmd_trans_huge(*pmd)) {
// We don't get here due to the above assumption.
}
//
// Assume that Thread B incurred a page fault and
.---------> // sneaks in here as shown below.
| //
| if (pmd_none_or_clear_bad(pmd))
| {
| if (unlikely(pmd_bad(*pmd)))
| pmd_clear_bad
| {
| pmd_ERROR
| // Log "bad pmd ..." message here.
| pmd_clear
| // Clear the page's PMD entry.
| // Thread B incremented the map count
| // in page_add_new_anon_rmap(), but
| // now the page is no longer mapped
| // by a PMD entry (-> inconsistency).
| }
| }
|
v
- Thread B is handling a page fault on virtual address "B(fault)" shown
in the picture.
...
do_page_fault
__do_page_fault
// Acquire the semaphore in shared mode.
down_read_trylock(&mm->mmap_sem)
...
handle_mm_fault
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
// We get here due to the above assumption (PMD entry is zero).
do_huge_pmd_anonymous_page
alloc_hugepage_vma
// Allocate a new transparent huge page here.
...
__do_huge_pmd_anonymous_page
...
spin_lock(&mm->page_table_lock)
...
page_add_new_anon_rmap
// Here we increment the page's map count (starts at -1).
atomic_set(&page->_mapcount, 0)
set_pmd_at
// Here we set the page's PMD entry which will be cleared
// when Thread A calls pmd_clear_bad().
...
spin_unlock(&mm->page_table_lock)
The mmap_sem does not prevent the race because both threads are acquiring
it in shared mode (down_read). Thread B holds the page_table_lock while
the page's map count and PMD table entry are updated. However, Thread A
does not synchronize on that lock.
====== end quote =======
[akpm@linux-foundation.org: checkpatch fixes]
Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Jones <davej@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [2.6.38+]
Cc: Mark Salter <msalter@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This branch takes the PowerPC irq_host infrastructure (reverse mapping
from Linux IRQ numbers to hardware irq numbering), generalizes it,
renames it to irq_domain, and makes it available to all architectures.
Originally the plan has been to create an all-new irq_domain
implementation which addresses some of the powerpc shortcomings such
as not handling 1:1 mappings well, but doing that proved to be far
more difficult and invasive than generalizing the working code and
refactoring it in-place. So, this branch rips out the 'new'
irq_domain and replaces it with the modified powerpc version (in a
fully bisectable way of course). It converts all users over to the
new API and makes irq_domain selectable on any architecture.
No architecture is forced to enable irq_domain, but the infrastructure
is required for doing OpenFirmware style irq translations. It will
even work on SPARC even though SPARC has it's own mechanism for
translating irqs at boot time. MIPS, microblaze, embedded x86 and c6x
are converted too.
The resulting irq_domain code is probably still too verbose and can be
optimized more, but that can be done incrementally and is a task for
follow-on patches.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPZ1yiAAoJEEFnBt12D9kB4yIQAJvCfTPL65sCYVD6i9RnVHtR
ahwddtd0AtT+UYLU8Xg2fZgVi6cmupDGnqkBixzZD3xxSTERqm7Snqa0ugklfeAi
B6Zqf/K17H5hJNaoQ3fkNauow8m7ZYOeEH2vVUvkb3woWS9Wm7OGd+BvcIBgYSGe
Aaoumhu7kDxFkii0qz3x/+kvsb6DRp2HtSPWj+APL/kNjdiO4JBOihtcc/lX6d47
bsZLiEMzHUFV4ApJNwqmfDnf54oMrHmrRJxgQHIMjeJC5or9I3Do8wDGe/aTF5xO
5GVpxCQsTlJMjTBWlAFtpTwCJB6y76EHQrHc7WzLlq8OJSsxApOke8M0BzXFrfMy
CU7UUpTvNZTLpZibLCEQKemv1+oNOkfFylsHxfek2MCqx0W6W4FHEGV3qE/GtgV9
+vurA9hNNp7VM0FGRGigcUr3woYdHLdEVQrlnL7Z9AgBu1W44MZLaai7iRVZOeCT
ZQ9++v2PJJ8vHT8kdkgTdiRpnEhmv84MX/GBT7ilWFEMIVeT5zhGkIBojzNgyzGc
7cvermmM0P8h+unkDgmzmSbDxo0PboqVKeoO71AOBhA6MmR9iom7XkuNdHhoOwy2
4A5xT1srbhJDbuv15BBREBV24TywpZ4a1+4nwQT4L1fXe+HfCxeEWexGcKQMRcIt
dAelOHTQ+ZGkOKvXeW05
=ruGA
-----END PGP SIGNATURE-----
Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull irq_domain support for all architectures from Grant Likely:
"Generialize powerpc's irq_host as irq_domain
This branch takes the PowerPC irq_host infrastructure (reverse mapping
from Linux IRQ numbers to hardware irq numbering), generalizes it,
renames it to irq_domain, and makes it available to all architectures.
Originally the plan has been to create an all-new irq_domain
implementation which addresses some of the powerpc shortcomings such
as not handling 1:1 mappings well, but doing that proved to be far
more difficult and invasive than generalizing the working code and
refactoring it in-place. So, this branch rips out the 'new'
irq_domain and replaces it with the modified powerpc version (in a
fully bisectable way of course). It converts all users over to the
new API and makes irq_domain selectable on any architecture.
No architecture is forced to enable irq_domain, but the infrastructure
is required for doing OpenFirmware style irq translations. It will
even work on SPARC even though SPARC has it's own mechanism for
translating irqs at boot time. MIPS, microblaze, embedded x86 and c6x
are converted too.
The resulting irq_domain code is probably still too verbose and can be
optimized more, but that can be done incrementally and is a task for
follow-on patches."
* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: (31 commits)
dt: fix twl4030 for non-dt compile on x86
mfd: twl-core: Add IRQ_DOMAIN dependency
devicetree: Add empty of_platform_populate() for !CONFIG_OF_ADDRESS (sparc)
irq_domain: Centralize definition of irq_dispose_mapping()
irq_domain/mips: Allow irq_domain on MIPS
irq_domain/x86: Convert x86 (embedded) to use common irq_domain
ppc-6xx: fix build failure in flipper-pic.c and hlwd-pic.c
irq_domain/microblaze: Convert microblaze to use irq_domains
irq_domain/powerpc: Replace custom xlate functions with library functions
irq_domain/powerpc: constify irq_domain_ops
irq_domain/c6x: Use library of xlate functions
irq_domain/c6x: constify irq_domain structures
irq_domain/c6x: Convert c6x to use generic irq_domain support.
irq_domain: constify irq_domain_ops
irq_domain: Create common xlate functions that device drivers can use
irq_domain: Remove irq_domain_add_simple()
irq_domain: Remove 'new' irq_domain in favour of the ppc one
mfd: twl-core.c: Fix the number of interrupts managed by twl4030
of/address: add empty static inlines for !CONFIG_OF
irq_domain: Add support for base irq and hwirq in legacy mappings
...
Assorted extensions and fixes including:
* Introduction of early/late suspend/hibernation device callbacks.
* Generic PM domains extensions and fixes.
* devfreq updates from Axel Lin and MyungJoo Ham.
* Device PM QoS updates.
* Fixes of concurrency problems with wakeup sources.
* System suspend and hibernation fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIcBAABAgAGBQJPZww5AAoJEKhOf7ml8uNsiBYQAL9YGso7KypZhLspNxvAKuZr
iHyme2F7OdOiUfo40DVH5tRuEsQvLOl0S+9ukWLrzQotKBsMfym05jtbGN9m6Ygh
Z793sx3eRI3mltekJ9yrOxH6BOBDMWMkwY8ztU/X5aYDNirgJ/qtAjSK4BvWXBrz
APeaUReVnLdaNP8SnhHfne/KPsHk++NKZvAAva7E6RwtZn4KV6bfiBPGb8yvY8pP
m4cg1S5QEduMy+zQJ8+IlEHR91bt9spUyRwbhw6ZHCNzNeu4iEZT8DVt1O1sIRbO
LsNcClqsd40nr781SoF8N9GmGUxlUDr46bS3FSsDkYzn8uyxGEsv00edJZtPwIm5
7nPuYat3Ke1YsON0Kcd/wkBGXqw/Rjfp3F1bnHjpVx/0oM/6MPrFNnIwvpHspejG
kN3770idYJ17dLckhcsbYsLdy8yirITILDzvHT0AAaZ9z4Lr9Pm56WwFZLyb/lhR
2cqK8Bb8W9YvcVsKV8YqkyBVrygWMe+c56KoAoUBiSNxvW6LphmXFBj5QiFMs8s8
Xh8H7xU96FKbpNMIAZ1+bpI4zgulQG4xPXI9pKbhMfjaMUgj2zQeO8/t0WlB1M0z
+kEUcYHJnXrRrObQuHEFXZdIjy/E0fdUboMIrlLt0gm97OxnG6imPseQp6/leQkC
t+L4Aq6TOUofUU86d4cI
=IGhc
-----END PGP SIGNATURE-----
Merge tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates for 3.4 from Rafael Wysocki:
"Assorted extensions and fixes including:
* Introduction of early/late suspend/hibernation device callbacks.
* Generic PM domains extensions and fixes.
* devfreq updates from Axel Lin and MyungJoo Ham.
* Device PM QoS updates.
* Fixes of concurrency problems with wakeup sources.
* System suspend and hibernation fixes."
* tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits)
PM / Domains: Check domain status during hibernation restore of devices
PM / devfreq: add relation of recommended frequency.
PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on()
PM / shmobile: Make CMT driver use pm_genpd_dev_always_on()
PM / shmobile: Make TMU driver use pm_genpd_dev_always_on()
PM / Domains: Introduce "always on" device flag
PM / Domains: Fix hibernation restore of devices, v2
PM / Domains: Fix handling of wakeup devices during system resume
sh_mmcif / PM: Use PM QoS latency constraint
tmio_mmc / PM: Use PM QoS latency constraint
PM / QoS: Make it possible to expose PM QoS latency constraints
PM / Sleep: JBD and JBD2 missing set_freezable()
PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case
PM / Freezer: Remove references to TIF_FREEZE in comments
PM / Sleep: Add more wakeup source initialization routines
PM / Hibernate: Enable usermodehelpers in hibernate() error path
PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
PM / Sleep: Fix race conditions related to wakeup source timer function
PM / Sleep: Fix possible infinite loop during wakeup source destruction
PM / Hibernate: print physical addresses consistently with other parts of kernel
...
Pull kmap_atomic cleanup from Cong Wang.
It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().
Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.
* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
drbd: remove the second argument of k[un]map_atomic()
zcache: remove the second argument of k[un]map_atomic()
gma500: remove the second argument of k[un]map_atomic()
dm: remove the second argument of k[un]map_atomic()
tomoyo: remove the second argument of k[un]map_atomic()
sunrpc: remove the second argument of k[un]map_atomic()
rds: remove the second argument of k[un]map_atomic()
net: remove the second argument of k[un]map_atomic()
mm: remove the second argument of k[un]map_atomic()
lib: remove the second argument of k[un]map_atomic()
power: remove the second argument of k[un]map_atomic()
kdb: remove the second argument of k[un]map_atomic()
udf: remove the second argument of k[un]map_atomic()
ubifs: remove the second argument of k[un]map_atomic()
squashfs: remove the second argument of k[un]map_atomic()
reiserfs: remove the second argument of k[un]map_atomic()
ocfs2: remove the second argument of k[un]map_atomic()
ntfs: remove the second argument of k[un]map_atomic()
...
Here's the big driver core merge for 3.4-rc1.
Lots of various things here, sysfs fixes/tweaks (with the nlink breakage
reverted), dynamic debugging updates, w1 drivers, hyperv driver updates,
and a variety of other bits and pieces, full information in the
shortlog.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iEYEABECAAYFAk9neCsACgkQMUfUDdst+ylyQwCfY2eizvzw5HhjQs8gOiBRDADe
yrgAnj1Zan2QkoCnQIFJNAoxqNX9yAhd
=biH6
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches for 3.4-rc1 from Greg KH:
"Here's the big driver core merge for 3.4-rc1.
Lots of various things here, sysfs fixes/tweaks (with the nlink
breakage reverted), dynamic debugging updates, w1 drivers, hyperv
driver updates, and a variety of other bits and pieces, full
information in the shortlog."
* tag 'driver-core-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (78 commits)
Tools: hv: Support enumeration from all the pools
Tools: hv: Fully support the new KVP verbs in the user level daemon
Drivers: hv: Support the newly introduced KVP messages in the driver
Drivers: hv: Add new message types to enhance KVP
regulator: Support driver probe deferral
Revert "sysfs: Kill nlink counting."
uevent: send events in correct order according to seqnum (v3)
driver core: minor comment formatting cleanups
driver core: move the deferred probe pointer into the private area
drivercore: Add driver probe deferral mechanism
DS2781 Maxim Stand-Alone Fuel Gauge battery and w1 slave drivers
w1_bq27000: Only one thread can access the bq27000 at a time.
w1_bq27000 - remove w1_bq27000_write
w1_bq27000: remove unnecessary NULL test.
sysfs: Fix memory leak in sysfs_sd_setsecdata().
intel_idle: Revert change of auto_demotion_disable_flags for Nehalem
w1: Fix w1_bq27000
driver-core: documentation: fix up Greg's email address
powernow-k6: Really enable auto-loading
powernow-k7: Fix CPU family number
...
Pull timer changes for v3.4 from Ingo Molnar
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
ntp: Fix integer overflow when setting time
math: Introduce div64_long
cs5535-clockevt: Allow the MFGPT IRQ to be shared
cs5535-clockevt: Don't ignore MFGPT on SMP-capable kernels
x86/time: Eliminate unused irq0_irqs counter
clocksource: scx200_hrt: Fix the build
x86/tsc: Reduce the TSC sync check time for core-siblings
timer: Fix bad idle check on irq entry
nohz: Remove ts->Einidle checks before restarting the tick
nohz: Remove update_ts_time_stat from tick_nohz_start_idle
clockevents: Leave the broadcast device in shutdown mode when not needed
clocksource: Load the ACPI PM clocksource asynchronously
clocksource: scx200_hrt: Convert scx200 to use clocksource_register_hz
clocksource: Get rid of clocksource_calc_mult_shift()
clocksource: dbx500: convert to clocksource_register_hz()
clocksource: scx200_hrt: use pr_<level> instead of printk
time: Move common updates to a function
time: Reorder so the hot data is together
time: Remove most of xtime_lock usage in timekeeping.c
ntp: Add ntp_lock to replace xtime_locking
...
Pull scheduler changes for v3.4 from Ingo Molnar
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
printk: Make it compile with !CONFIG_PRINTK
sched/x86: Fix overflow in cyc2ns_offset
sched: Fix nohz load accounting -- again!
sched: Update yield() docs
printk/sched: Introduce special printk_sched() for those awkward moments
sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
sched: Cleanup cpu_active madness
sched: Fix load-balance wreckage
sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
sched: Ditch per cgroup task lists for load-balancing
sched: Rename load-balancing fields
sched: Move load-balancing arguments into helper struct
sched/rt: Do not submit new work when PI-blocked
sched/rt: Prevent idle task boosting
sched/wait: Add __wake_up_all_locked() API
sched/rt: Document scheduler related skip-resched-check sites
sched/rt: Use schedule_preempt_disabled()
sched/rt: Add schedule_preempt_disabled()
sched/rt: Do not throttle when PI boosting
sched/rt: Keep period timer ticking when rt throttling is active
...
Pull perf events changes for v3.4 from Ingo Molnar:
- New "hardware based branch profiling" feature both on the kernel and
the tooling side, on CPUs that support it. (modern x86 Intel CPUs
with the 'LBR' hardware feature currently.)
This new feature is basically a sophisticated 'magnifying glass' for
branch execution - something that is pretty difficult to extract from
regular, function histogram centric profiles.
The simplest mode is activated via 'perf record -b', and the result
looks like this in perf report:
$ perf record -b any_call,u -e cycles:u branchy
$ perf report -b --sort=symbol
52.34% [.] main [.] f1
24.04% [.] f1 [.] f3
23.60% [.] f1 [.] f2
0.01% [k] _IO_new_file_xsputn [k] _IO_file_overflow
0.01% [k] _IO_vfprintf_internal [k] _IO_new_file_xsputn
0.01% [k] _IO_vfprintf_internal [k] strchrnul
0.01% [k] __printf [k] _IO_vfprintf_internal
0.01% [k] main [k] __printf
This output shows from/to branch columns and shows the highest
percentage (from,to) jump combinations - i.e. the most likely taken
branches in the system. "branches" can also include function calls
and any other synchronous and asynchronous transitions of the
instruction pointer that are not 'next instruction' - such as system
calls, traps, interrupts, etc.
This feature comes with (hopefully intuitive) flat ascii and TUI
support in perf report.
- Various 'perf annotate' visual improvements for us assembly junkies.
It will now recognize function calls in the TUI and by hitting enter
you can follow the call (recursively) and back, amongst other
improvements.
- Multiple threads/processes recording support in perf record, perf
stat, perf top - which is activated via a comma-list of PIDs:
perf top -p 21483,21485
perf stat -p 21483,21485 -ddd
perf record -p 21483,21485
- Support for per UID views, via the --uid paramter to perf top, perf
report, etc. For example 'perf top --uid mingo' will only show the
tasks that I am running, excluding other users, root, etc.
- Jump label restructurings and improvements - this includes the
factoring out of the (hopefully much clearer) include/linux/static_key.h
generic facility:
struct static_key key = STATIC_KEY_INIT_FALSE;
...
if (static_key_false(&key))
do unlikely code
else
do likely code
...
static_key_slow_inc();
...
static_key_slow_inc();
...
The static_key_false() branch will be generated into the code with as
little impact to the likely code path as possible. the
static_key_slow_*() APIs flip the branch via live kernel code patching.
This facility can now be used more widely within the kernel to
micro-optimize hot branches whose likelihood matches the static-key
usage and fast/slow cost patterns.
- SW function tracer improvements: perf support and filtering support.
- Various hardenings of the perf.data ABI, to make older perf.data's
smoother on newer tool versions, to make new features integrate more
smoothly, to support cross-endian recording/analyzing workflows
better, etc.
- Restructuring of the kprobes code, the splitting out of 'optprobes',
and a corner case bugfix.
- Allow the tracing of kernel console output (printk).
- Improvements/fixes to user-space RDPMC support, allowing user-space
self-profiling code to extract PMU counts without performing any
system calls, while playing nice with the kernel side.
- 'perf bench' improvements
- ... and lots of internal restructurings, cleanups and fixes that made
these features possible. And, as usual this list is incomplete as
there were also lots of other improvements
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits)
perf report: Fix annotate double quit issue in branch view mode
perf report: Remove duplicate annotate choice in branch view mode
perf/x86: Prettify pmu config literals
perf report: Enable TUI in branch view mode
perf report: Auto-detect branch stack sampling mode
perf record: Add HEADER_BRANCH_STACK tag
perf record: Provide default branch stack sampling mode option
perf tools: Make perf able to read files from older ABIs
perf tools: Fix ABI compatibility bug in print_event_desc()
perf tools: Enable reading of perf.data files from different ABI rev
perf: Add ABI reference sizes
perf report: Add support for taken branch sampling
perf record: Add support for sampling taken branch
perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK
x86/kprobes: Split out optprobe related code to kprobes-opt.c
x86/kprobes: Fix a bug which can modify kernel code permanently
x86/kprobes: Fix instruction recovery on optimized path
perf: Add callback to flush branch_stack on context switch
perf: Disable PERF_SAMPLE_BRANCH_* when not supported
perf/x86: Add LBR software filter support for Intel CPUs
...
Pull irq/core changes for v3.4 from Ingo Molnar
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Remove paranoid warnons and bogus fixups
genirq: Flush the irq thread on synchronization
genirq: Get rid of unnecessary IRQTF_DIED flag
genirq: No need to check IRQTF_DIED before stopping a thread handler
genirq: Get rid of unnecessary irqaction field in task_struct
genirq: Fix incorrect check for forced IRQ thread handler
softirq: Reduce invoke_softirq() code duplication
genirq: Fix long-term regression in genirq irq_set_irq_type() handling
x86-32/irq: Don't switch to irq stack for a user-mode irq
Upon resume from hibernation, CPU 0's hvclock area contains the old
values for system_time and tsc_timestamp. It is necessary for the
hypervisor to update these values with uptodate ones before the CPU uses
them.
Abstract TSC's save/restore sched_clock_state functions and use
restore_state to write to KVM_SYSTEM_TIME MSR, forcing an update.
Also move restore_sched_clock_state before __restore_processor_state,
since the later calls CONFIG_LOCK_STAT's lockstat_clock (also for TSC).
Thanks to Igor Mammedov for tracking it down.
Fixes suspend-to-disk with kvmclock.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Fix the following section warnings :
WARNING: vmlinux.o(.text+0x49dbc): Section mismatch in reference
from the function acpi_map_cpu2node() to the variable
.cpuinit.data:__apicid_to_node The function acpi_map_cpu2node()
references the variable __cpuinitdata __apicid_to_node. This is
often because acpi_map_cpu2node lacks a __cpuinitdata
annotation or the annotation of __apicid_to_node is wrong.
WARNING: vmlinux.o(.text+0x49dc1): Section mismatch in reference
from the function acpi_map_cpu2node() to the function
.cpuinit.text:numa_set_node() The function acpi_map_cpu2node()
references the function __cpuinit numa_set_node(). This is often
because acpi_map_cpu2node lacks a __cpuinit annotation or the
annotation of numa_set_node is wrong.
WARNING: vmlinux.o(.text+0x526e77): Section mismatch in
reference from the function prealloc_protection_domains() to the
function .init.text:alloc_passthrough_domain() The function
prealloc_protection_domains() references the function __init
alloc_passthrough_domain(). This is often because
prealloc_protection_domains lacks a __init annotation or the annotation of alloc_passthrough_domain is wrong.
Signed-off-by: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1331810188-24785-1-git-send-email-sp@numascale.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Adding sysfs group 'format' attribute for pmu device that
contains a syntax description on how to construct raw events.
The event configuration is described in following
struct pefr_event_attr attributes:
config
config1
config2
Each sysfs attribute within the format attribute group,
describes mapping of name and bitfield definition within
one of above attributes.
eg:
"/sys/...<dev>/format/event" contains "config:0-7"
"/sys/...<dev>/format/umask" contains "config:8-15"
"/sys/...<dev>/format/usr" contains "config:16"
the attribute value syntax is:
line: config ':' bits
config: 'config' | 'config1' | 'config2"
bits: bits ',' bit_term | bit_term
bit_term: VALUE '-' VALUE | VALUE
Adding format attribute definitions for x86 cpu pmus.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/n/tip-vhdk5y2hyype9j63prymty36@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
While running the latest Linux as guest under VMware in highly
over-committed situations, we have seen cases when the refined TSC
algorithm fails to get a valid tsc_start value in
tsc_refine_calibration_work from multiple attempts. As a result the
kernel keeps on scheduling the tsc_irqwork task for later. Subsequently
after several attempts when it gets a valid start value it goes through
the refined calibration and either bails out or uses the new results.
Given that the kernel originally read the TSC frequency from the
platform, which is the best it can get, I don't think there is much
value in refining it.
So for systems which get the TSC frequency from the platform we
should skip the refined tsc algorithm.
We can use the TSC_RELIABLE cpu cap flag to detect this, right now it is
set only on VMware and for Moorestown Penwell both of which have there
own TSC calibration methods.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Dirk Brandewie <dirk.brandewie@gmail.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: stable@kernel.org
[jstultz: Reworked to simply not schedule the refining work,
rather then scheduling the work and bombing out later]
Signed-off-by: John Stultz <john.stultz@linaro.org>
The update of the vdso data happens under xtime_lock, so adding a
nested lock is pointless. Just use a seqcount to sync the readers.
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Changing the sequence count in update_vsyscall_tz() is completely
pointless.
The vdso code copies the data unprotected. There is no point to change
this as sys_tz is nowhere protected at all. See sys_gettimeofday().
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Move APIC ID validity check into platform APIC code, so it can
be overridden when needed. For NumaChip systems, always trust
MADT, as it's constructed with high APIC IDs.
Behaviour verifies on standard x86 systems and on NumaChip
systems with this, and compile-tested with allyesconfig.
Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Reviewed-by: Steffen Persvold <sp@numascale.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1331709454-27966-1-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQEcBAABAgAGBQJPW8yUAAoJEHm+PkMAQRiGhFIH/RGUPxGmUkJv8EP5I4HDA4dJ
c6/PrzZCHs8rxzYzvn7ojXqZGXTOAA5ZgS9A6LkJ2sxMFvgMnkpFi6B4CwMzizS3
vLWo/HNxbiTCNGFfQrhQB8O58uNI8wOBa87lrQfkXkDqN0cFhdjtIxeY1BD9LXIo
qbWysGxCcZhJWHapsQ3NZaVJQnIK5vA/+mhyYP4HzbcHI3aWnbIEZ8GQKeY28Ch0
+pct5UQBjZavV5SujaW0Xd65oIiycm8XHAQw6FxQy//DfaabauWgFteR162Q/oew
xxUBDOHF3nO1bdteHHaYqxig0j1MbIHsqxTnE/neR8UryF04//1SFF7DYuY/1pg=
=SV5V
-----END PGP SIGNATURE-----
Merge tag 'v3.3-rc7' into x86/mce
Merge reason: Update from an ancient -rc1 base to an almost-final stable kernel.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Uprobes uses exception notifiers to get to know if a thread hit
a breakpoint or a singlestep exception.
When a thread hits a uprobe or is singlestepping post a uprobe
hit, the uprobe exception notifier sets its TIF_UPROBE bit,
which will then be checked on its return to userspace path
(do_notify_resume() ->uprobe_notify_resume()), where the
consumers handlers are run (in task context) based on the
defined filters.
Uprobe hits are thread specific and hence we need to maintain
information about if a task hit a uprobe, what uprobe was hit,
the slot where the original instruction was copied for xol so
that it can be singlestepped with appropriate fixups.
In some cases, special care is needed for instructions that are
executed out of line (xol). These are architecture specific
artefacts, such as handling RIP relative instructions on x86_64.
Since the instruction at which the uprobe was inserted is
executed out of line, architecture specific fixups are added so
that the thread continues normal execution in the presence of a
uprobe.
Postpone the signals until we execute the probed insn.
post_xol() path does a recalc_sigpending() before return to
user-mode, this ensures the signal can't be lost.
Uprobes relies on DIE_DEBUG notification to notify if a
singlestep is complete.
Adds x86 specific uprobe exception notifiers and appropriate
hooks needed to determine a uprobe hit and subsequent post
processing.
Add requisite x86 fixups for xol for uprobes. Specific cases
needing fixups include relative jumps (x86_64), calls, etc.
Where possible, we check and skip singlestepping the
breakpointed instructions. For now we skip single byte as well
as few multibyte nop instructions. However this can be extended
to other instructions too.
Credits to Oleg Nesterov for suggestions/patches related to
signal, breakpoint, singlestep handling code.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120313180011.29771.89027.sendpatchset@srdronam.in.ibm.com
[ Performed various cleanliness edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
.. as appropiately. As tboot_sleep now returns values.
remove tboot_sleep_wrapper.
Suggested-and-Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Joseph Cihula <joseph.cihula@intel.com>
[v1: Return -1/0/+1 instead of ACPI_xx values]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The ACPI suspend path makes a call to tboot_sleep right before
it writes the PM1A, PM1B values. We replace the direct call to
tboot via an registration callback similar to __acpi_register_gsi.
CC: Len Brown <len.brown@intel.com>
Acked-by: Joseph Cihula <joseph.cihula@intel.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
[v1: Added __attribute__ ((unused))]
[v2: Introduced a wrapper instead of changing tboot_sleep return values]
[v3: Added return value AE_CTRL_SKIP for acpi_os_sleep_prepare]
Signed-off-by: Tang Liang <liang.tang@oracle.com>
[v1: Fix compile issues on IA64 and PPC64]
[v2: Fix where __acpi_os_prepare_sleep==NULL and did not go in sleep properly]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When a machine boots up, the TSC generally gets reset. However,
when kexec is used to boot into a kernel, the TSC value would be
carried over from the previous kernel. The computation of
cycns_offset in set_cyc2ns_scale is prone to an overflow, if the
machine has been up more than 208 days prior to the kexec. The
overflow happens when we multiply *scale, even though there is
enough room to store the final answer.
We fix this issue by decomposing tsc_now into the quotient and
remainder of division by CYC2NS_SCALE_FACTOR and then performing
the multiplication separately on the two components.
Refactor code to share the calculation with the previous
fix in __cycles_2_ns().
Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: john stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are precedences of trap number being referred to as
trap_nr. However thread struct refers trap number as trap_no.
Change it to trap_nr.
Also use enum instead of left-over literals for trap values.
This is pure cleanup, no functional change intended.
Suggested-by: Ingo Molnar <mingo@eltu.hu>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120312092555.5379.942.sendpatchset@srdronam.in.ibm.com
[ Fixed the math-emu build ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If a function takes struct uprobe or struct arch_uprobe, then it
is passed as the first parameter.
This is pure cleanup, no functional change intended.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Linux-mm <linux-mm@kvack.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120312092530.5379.18394.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the recent changes to clear_IO_APIC_pin() which tries to
clear remoteIRR bit explicitly, some of the users started to see
"Unable to reset IRR for apic .." messages.
Close look shows that these are related to bogus IO-APIC entries
which return's all 1's for their io-apic registers. And the
above mentioned error messages are benign. But kernel should
have ignored such io-apic's in the first place.
Check if register 0, 1, 2 of the listed io-apic are all 1's and
ignore such io-apic.
Reported-by: Álvaro Castillo <midgoon@gmail.com>
Tested-by: Jon Dufresne <jon@jondufresne.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yinghai@kernel.org
Cc: kernel-team@fedoraproject.org
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1331577393.31585.94.camel@sbsiddha-desk.sc.intel.com
[ Performed minor cleanup of affected code. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I got somewhat tired of having to decode hex numbers..
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Robert Richter <robert.richter@amd.com>
Link: http://lkml.kernel.org/n/tip-0vsy1sgywc4uar3mu1szm0rg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Stepan found:
CPU0 CPUn
_cpu_up()
__cpu_up()
boostrap()
notify_cpu_starting()
set_cpu_online()
while (!cpu_active())
cpu_relax()
<PREEMPT-out>
smp_call_function(.wait=1)
/* we find cpu_online() is true */
arch_send_call_function_ipi_mask()
/* wait-forever-more */
<PREEMPT-in>
local_irq_enable()
cpu_notify(CPU_ONLINE)
sched_cpu_active()
set_cpu_active()
Now the purpose of cpu_active is mostly with bringing down a cpu, where
we mark it !active to avoid the load-balancer from moving tasks to it
while we tear down the cpu. This is required because we only update the
sched_domain tree after we brought the cpu-down. And this is needed so
that some tasks can still run while we bring it down, we just don't want
new tasks to appear.
On cpu-up however the sched_domain tree doesn't yet include the new cpu,
so its invisible to the load-balancer, regardless of the active state.
So instead of setting the active state after we boot the new cpu (and
consequently having to wait for it before enabling interrupts) set the
cpu active before we set it online and avoid the whole mess.
Reported-by: Stepan Moskovchenko <stepanm@codeaurora.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The traps are referred to by their numbers and it can be difficult to
understand them while reading the code without context. This patch adds
enumeration of the trap numbers and replaces the numbers with the correct
enum for x86.
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20120310000710.GA32667@www.outflux.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch implements 64 bit counter support for IBS. The
sampling period is no longer limited to the hw counter width.
The functions perf_event_set_period() and
perf_event_try_update() can be used as generic functions. They
can replace similar code that is duplicate across architectures.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1323968199-9326-5-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add code to control the IBS pmu. We need to maintain per-cpu
states. Since some states are used and changed by the nmi
handler, access to these states must be atomic.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1323968199-9326-4-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements code to handle ibs interrupts. If ibs data
is available a raw perf_event data sample is created and sent
back to the userland. This patch only implements the storage of
ibs data in the raw sample, but this could be extended in a
later patch by generating generic event data such as the rip
from the ibs sampling data.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1323968199-9326-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements perf configuration for AMD IBS. The IBS
pmu is selected using the type attribute in sysfs. There are two
types of ibs pmus, for instruction fetch (IBS_FETCH) and for
instruction execution (IBS_OP):
/sys/bus/event_source/devices/ibs_fetch/type
/sys/bus/event_source/devices/ibs_op/type
Except for the sample period IBS can only be set up with raw
config values and raw data samples. The event attributes for the
syscall should be programmed like this (IBS_FETCH):
type = get_pmu_type("/sys/bus/event_source/devices/ibs_fetch/type");
memset(&attr, 0, sizeof(attr));
attr.type = type;
attr.sample_type = PERF_SAMPLE_CPU | PERF_SAMPLE_RAW;
attr.config = IBS_FETCH_CONFIG_DEFAULT;
This implementation does not yet support 64 bit counters. It is
limited to the hardware counter bit width which is 20 bits. 64
bit support can be added later.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1323968199-9326-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While for a user mode register dump it may be reasonable to skip
those (albeit x86-64 doesn't do so), for kernel mode dumps these
should be printed to make sure all information possibly
necessary for analysis is available.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F58889202000078000770E7@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a bug in kprobes which can modify kernel code
permanently at run-time. In the result, kernel can
crash when it executes the modified code.
This bug can happen when we put two probes enough near
and the first probe is optimized. When the second probe
is set up, it copies a byte which is already modified
by the first probe, and executes it when the probe is hit.
Even worse, the first probe and the second probe are removed
respectively, the second probe writes back the copied
(modified) instruction.
To fix this bug, kprobes always recovers the original
code and copies the first byte from recovered instruction.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: systemtap@sourceware.org
Cc: anderson@redhat.com
Link: http://lkml.kernel.org/r/20120305133215.5982.31991.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Current probed-instruction recovery expects that only breakpoint
instruction modifies instruction. However, since kprobes jump
optimization can replace original instructions with a jump,
that expectation is not enough. And it may cause instruction
decoding failure on the function where an optimized probe
already exists.
This bug can reproduce easily as below:
1) find a target function address (any kprobe-able function is OK)
$ grep __secure_computing /proc/kallsyms
ffffffff810c19d0 T __secure_computing
2) decode the function
$ objdump -d vmlinux --start-address=0xffffffff810c19d0 --stop-address=0xffffffff810c19eb
vmlinux: file format elf64-x86-64
Disassembly of section .text:
ffffffff810c19d0 <__secure_computing>:
ffffffff810c19d0: 55 push %rbp
ffffffff810c19d1: 48 89 e5 mov %rsp,%rbp
ffffffff810c19d4: e8 67 8f 72 00 callq
ffffffff817ea940 <mcount>
ffffffff810c19d9: 65 48 8b 04 25 40 b8 mov %gs:0xb840,%rax
ffffffff810c19e0: 00 00
ffffffff810c19e2: 83 b8 88 05 00 00 01 cmpl $0x1,0x588(%rax)
ffffffff810c19e9: 74 05 je ffffffff810c19f0 <__secure_computing+0x20>
3) put a kprobe-event at an optimize-able place, where no
call/jump places within the 5 bytes.
$ su -
# cd /sys/kernel/debug/tracing
# echo p __secure_computing+0x9 > kprobe_events
4) enable it and check it is optimized.
# echo 1 > events/kprobes/p___secure_computing_9/enable
# cat ../kprobes/list
ffffffff810c19d9 k __secure_computing+0x9 [OPTIMIZED]
5) put another kprobe on an instruction after previous probe in
the same function.
# echo p __secure_computing+0x12 >> kprobe_events
bash: echo: write error: Invalid argument
# dmesg | tail -n 1
[ 1666.500016] Probing address(0xffffffff810c19e2) is not an instruction boundary.
6) however, if the kprobes optimization is disabled, it works.
# echo 0 > /proc/sys/debug/kprobes-optimization
# cat ../kprobes/list
ffffffff810c19d9 k __secure_computing+0x9
# echo p __secure_computing+0x12 >> kprobe_events
(no error)
This is because kprobes doesn't recover the instruction
which is overwritten with a relative jump by another kprobe
when finding instruction boundary.
It only recovers the breakpoint instruction.
This patch fixes kprobes to recover such instructions.
With this fix:
# echo p __secure_computing+0x9 > kprobe_events
# echo 1 > events/kprobes/p___secure_computing_9/enable
# cat ../kprobes/list
ffffffff810c1aa9 k __secure_computing+0x9 [OPTIMIZED]
# echo p __secure_computing+0x12 >> kprobe_events
# cat ../kprobes/list
ffffffff810c1aa9 k __secure_computing+0x9 [OPTIMIZED]
ffffffff810c1ab2 k __secure_computing+0x12 [DISABLED]
Changes in v4:
- Fix a bug to ensure optimized probe is really optimized
by jump.
- Remove kprobe_optready() dependency.
- Cleanup code for preparing optprobe separation.
Changes in v3:
- Fix a build error when CONFIG_OPTPROBE=n. (Thanks, Ingo!)
To fix the error, split optprobe instruction recovering
path from kprobes path.
- Cleanup comments/styles.
Changes in v2:
- Fix a bug to recover original instruction address in
RIP-relative instruction fixup.
- Moved on tip/master.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: systemtap@sourceware.org
Cc: anderson@redhat.com
Link: http://lkml.kernel.org/r/20120305133209.5982.36568.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
X32 ptrace is a hybrid of 64bit ptrace and compat ptrace with 32bit
address and longs. It use 64bit ptrace to access the full 64bit
registers. PTRACE_PEEKUSR and PTRACE_POKEUSR are only allowed to access
segment and debug registers. PTRACE_PEEKUSR returns the lower 32bits
and PTRACE_POKEUSR zero-extends 32bit value to 64bit. It works since
the upper 32bits of segment and debug registers of x32 process are always
zero. GDB only uses PTRACE_PEEKUSR and PTRACE_POKEUSR to access
segment and debug registers.
[ hpa: changed TIF_X32 test to use !is_ia32_task() instead, and moved
the system call number to the now-unused 521 slot. ]
Signed-off-by: "H.J. Lu" <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/1329696488-16970-1-git-send-email-hpa@zytor.com
With branch stack sampling, it is possible to filter by priv levels.
In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.
We need a callback on context switch to either flush the branch stack
or save it. This patch adds a new callback in struct pmu which is called
during context switches. The callback is called only when necessary.
That is when a system-wide context has, at least, one event which
uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for
per-thread context.
In this version, the Intel x86 code simply flushes (resets) the LBR
on context switches (fills it with zeroes). Those zeroed branches are
then filtered out by the SW filter.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds an internal sofware filter to complement
the (optional) LBR hardware filter.
The software filter is necessary:
- as a substitute when there is no HW LBR filter (e.g., Atom, Core)
- to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere)
- to provide finer grain filtering (e.g., all processors)
Sometimes the LBR HW filter cannot distinguish between two types
of branches. For instance, to capture syscall as CALLS, it is necessary
to enable the LBR_FAR filter which will also capture JMP instructions.
Thus, a second pass is necessary to filter those out, this is what the
SW filter can do.
The SW filter is built on top of the internal x86 disassembler. It
is a best effort filter especially for user level code. It is subject
to the availability of the text page of the program.
The SW filter is enabled on all Intel processors. It is bypassed
when the user is capturing all branches at all priv levels.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-9-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements PERF_SAMPLE_BRANCH support for Intel
x86processors. It connects PERF_SAMPLE_BRANCH to the actual LBR.
The patch adds the hooks in the PMU irq handler to save the LBR
on counter overflow for both regular and PEBS modes.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-8-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The patch adds a restriction for Intel Atom LBR support. Only
steppings 10 (PineView) and more recent are supported. Older models
do not have a functional LBR. Their LBR does not freeze on PMU
interrupt which makes LBR unusable in the context of perf_events.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-7-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_*
filters to the actual Intel x86LBR filters, whenever they exist.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-6-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If precise sampling is enabled on Intel x86 then perf_event uses PEBS.
To correct for the off-by-one error of PEBS, perf_event uses LBR when
precise_sample > 1.
On Intel x86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR,
therefore both features must be coordinated as they may not
configure LBR the same way.
For PEBS, LBR needs to capture all branches at the priv level of
the associated event.
This patch checks that the branch type and priv level of BRANCH_STACK
is compatible with that of the PEBS LBR requirement, thereby allowing:
$ perf record -b any,u -e instructions:upp ....
But:
$ perf record -b any_call,u -e instructions:upp
Is not possible.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Intel LBR on some recent processor is capable
of filtering branches by type. The filter is configurable
via the LBR_SELECT MSR register.
There are limitation on how this register can be used.
On Nehalem/Westmere, the LBR_SELECT is shared by the two HT threads
when HT is on. It is private to each core when HT is off.
On SandyBridge, the LBR_SELECT register is private to each thread
when HT is on. It is private to each core when HT is off.
The kernel must manage the sharing of LBR_SELECT. It allows
multiple users on the same logical CPU to use LBR_SELECT as
long as they program it with the same value. Across sibling
CPUs (HT threads), the same restriction applies on NHM/WSM.
This patch implements this sharing logic by leveraging the
mechanism put in place for managing the offcore_response
shared MSR.
We modify __intel_shared_reg_get_constraints() to cause
x86_get_event_constraint() to be called because LBR may
be associated with events that may be counter constrained.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the LBR definitions for NHM/WSM/SNB and Core.
It also adds the definitions for the architected LBR MSR:
LBR_SELECT, LBRT_TOS.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the ability to sample taken branches to the
perf_event interface.
The ability to capture taken branches is very useful for all
sorts of analysis. For instance, basic block profiling, call
counts, statistical call graph.
This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel x86 it is
implemented on top of the Last Branch Record (LBR) facility.
To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.
Sampled taken branches may be filtered by type and/or priv
levels.
The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.
Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.
The following generic filters are currently defined:
- PERF_SAMPLE_USER
only branches whose targets are at the user level
- PERF_SAMPLE_KERNEL
only branches whose targets are at the kernel level
- PERF_SAMPLE_HV
only branches whose targets are at the hypervisor level
- PERF_SAMPLE_ANY
any type of branches (subject to priv levels filters)
- PERF_SAMPLE_ANY_CALL
any call branches (may incl. syscall on some arch)
- PERF_SAMPLE_ANY_RET
any return branches (may incl. syscall returns on some arch)
- PERF_SAMPLE_IND_CALL
indirect call branches
Obviously filter may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event are used. It
is possible to collect branches at a priv level different from the
associated event. Use of kernel, hv priv levels is subject to permissions
and availability (hv).
The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, the executed code. Therefore
each sample contains the number of taken branches it contains.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When kvm guest uses kvmclock, it may hang on vcpu hot-plug.
This is caused by an overflow in pvclock_get_nsec_offset,
u64 delta = tsc - shadow->tsc_timestamp;
which in turn is caused by an undefined values from percpu
hv_clock that hasn't been initialized yet.
Uninitialized clock on being booted cpu is accessed from
start_secondary
-> smp_callin
-> smp_store_cpu_info
-> identify_secondary_cpu
-> mtrr_ap_init
-> mtrr_restore
-> stop_machine_from_inactive_cpu
-> queue_stop_cpus_work
...
-> sched_clock
-> kvm_clock_read
which is well before x86_cpuinit.setup_percpu_clockev call in
start_secondary, where percpu clock is initialized.
This patch introduces a hook that allows to setup/initialize
per_cpu clock early and avoid overflow due to reading
- undefined values
- old values if cpu was offlined and then onlined again
Another possible early user of this clock source is ftrace that
accesses it to get timestamps for ring buffer entries. So if
mtrr_ap_init is moved from identify_secondary_cpu to past
x86_cpuinit.setup_percpu_clockev in start_secondary, ftrace
may cause the same overflow/hang on cpu hot-plug anyway.
More complete description of the problem:
https://lkml.org/lkml/2012/2/2/101
Credits to Marcelo Tosatti <mtosatti@redhat.com> for hook idea.
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It turned out that a performance counter on AMD does not
count at all when the GO or HO bit is set in the control
register and SVM is disabled in EFER.
This patch works around this issue by masking out the HO bit
in the performance counter control register when SVM is not
enabled.
The GO bit is not touched because it is only set when the
user wants to count in guest-mode only. So when SVM is
disabled the counter should not run at all and the
not-counting is the intended behaviour.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Avi Kivity <avi@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: stable@vger.kernel.org # v3.2
Link: http://lkml.kernel.org/r/1330523852-19566-1-git-send-email-joerg.roedel@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Specify the data structures for the 64-bit ioctls with explicit sizing
and padding so that the x32 kernel will correctly use the 64-bit forms
of these ioctls. Note that these ioctls are bogus in both forms on
both 32 and 64 bits; even on 64 bits the maximum MTRR size is only 44
bits long.
Note that nothing really is supposed to use these ioctls and that the
preferred interface is text strings on /proc/mtrr, or better yet,
nothing at all (use /sys/bus/pci/devices/*/resource*_wc for write
combining; that uses PAT not MTRRs.)
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: H. J. Lu <hjl.tools@gmail.com>
Tested-by: Nitin A. Kamble <nitin.a.kamble@intel.com>
Link: http://lkml.kernel.org/n/tip-vwvnlu3hjmtkwvij4qxtm90l@git.kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
With bug.h currently living right in linux/kernel.h there
are files that use BUG_ON and friends but are not including
the header explicitly. Fix them up so we can remove the
presence in kernel.h file.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Since we already have a debugreg.h header file, move the
assoc. get/set functions to it. In addition to it being the
logical home for them, it has a secondary advantage. The
functions that are moved use BUG(). So we really need to
have linux/bug.h in scope. But asm/processor.h is used about
600 times, vs. only about 15 for debugreg.h -- so adding bug.h
to the latter reduces the amount of time we'll be processing
it during a compile.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/AMD: Fix UP build error
x86: Specify a size for the cmp in the NMI handler
x86/nmi: Test saved %cs in NMI to determine nested NMI case
x86/amd: Fix L1i and L2 cache sharing information for AMD family 15h processors
x86/microcode: Remove noisy AMD microcode warning
Commit eab9e6137f ("x86-64: Fix CFI data for interrupt frames")
introduced a DW_CFA_def_cfa_expression in the SAVE_ARGS_IRQ
macro. To later define the CFA using a simple register+offset
rule both register and offset need to be supplied. Just using
CFI_DEF_CFA_REGISTER leaves the offset undefined. So use
CFI_DEF_CFA with reg+off explicitly at the end of
common_interrupt.
Signed-off-by: Mark Wielaard <mjw@redhat.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/1330079527-30711-1-git-send-email-mjw@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After all, this code is being run once at boot only (if
configured in at all).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/4F478C010200007800074A3D@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
task->thread.usersp is unusable immediately after a binary is exec()'d
until it undergoes a context switch cycle. The start_thread() function
called during execve() saves the stack pointer into pt_regs and into
old_rsp, but fails to record it into task->thread.usersp.
Because of this, KSTK_ESP(task) returns an incorrect value for a
64-bit program until the task is switched out and back in since
switch_to swaps %rsp values in and out into task->thread.usersp.
Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/1330273075-2949-1-git-send-email-siddhesh.poyarekar@gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
If a process has a non-x32 ia32 personality and changes to x32, the
process would keep its TS_COMPAT flag. x32 uses the presence of the
x32 flag on a syscall to determine compat status, so make sure
TS_COMPAT is cleared.
Signed-off-by: Bobby Powers <bobbypowers@gmail.com>
Link: http://lkml.kernel.org/r/1330230338-25077-1-git-send-email-bobbypowers@gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Some of the comments for the nesting NMI algorithm were stale and
had some references to some prototypes that were first tried.
I also updated the comments to be a little easier to understand
the flow of the code. It definitely needs the documentation.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In one case, use an address register that was computed earlier (and
with a simpler instruction), thus reducing the risk of a stall.
In the second case, eliminate a branch by using a conditional move (as
is already done in call_softirq and xen_do_hypervisor_callback).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4F4788A50200007800074A26@nat28.tlf.novell.com
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The saving and restoring of %rdx wasn't annotated at all, and the
jumping over sections where state gets partly restored wasn't handled
either.
Further, by folding the pushing of the previous frame in repeat_nmi
into that which so far was immediately preceding restart_nmi (after
moving the restore of %rdx ahead of that, since it doesn't get used
anymore when pushing prior frames), annotations of the replicated
frame creations can be made consistent too.
v2: Fully fold repeat_nmi into the normal code flow (adding a single
redundant instruction to the "normal" code path), thus retaining
the special protection of all instructions between repeat_nmi and
end_repeat_nmi.
Link: http://lkml.kernel.org/r/4F478B630200007800074A31@nat28.tlf.novell.com
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.
Typical usage scenarios:
#include <linux/static_key.h>
struct static_key key = STATIC_KEY_INIT_TRUE;
if (static_key_false(&key))
do unlikely code
else
do likely code
Or:
if (static_key_true(&key))
do likely code
else
do unlikely code
The static key is modified via:
static_key_slow_inc(&key);
...
static_key_slow_dec(&key);
The 'slow' prefix makes it abundantly clear that this is an
expensive operation.
I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.
On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Traditionally the kernel has refused to setup EFI at all if there's been
a mismatch in 32/64-bit mode between EFI and the kernel.
On some platforms that boot natively through EFI (Chrome OS being one),
we still need to get at least some of the static data such as memory
configuration out of EFI. Runtime services aren't as critical, and
it's a significant amount of work to implement switching between the
operating modes to call between kernel and firmware for thise cases. So
I'm ignoring it for now.
v5:
* Fixed some printk strings based on feedback
* Renamed 32/64-bit specific types to not have _ prefix
* Fixed bug in printout of efi runtime disablement
v4:
* Some of the earlier cleanup was accidentally reverted by this patch, fixed.
* Reworded some messages to not have to line wrap printk strings
v3:
* Reorganized to a series of patches to make it easier to review, and
do some of the cleanups I had left out before.
v2:
* Added graceful error handling for 32-bit kernel that gets passed
EFI data above 4GB.
* Removed some warnings that were missed in first version.
Signed-off-by: Olof Johansson <olof@lixom.net>
Link: http://lkml.kernel.org/r/1329081869-20779-6-git-send-email-olof@lixom.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch removes the x86-specific definition of irq_domain and replaces
it with the common implementation.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Current kernel MCE code reads ERST at the first reading of /dev/mcelog
(maybe in starting mcelogd,) even if the system does not support ERST,
which results in a fake "no such device" message (as described in [1].)
This problem is not critical, but can confuse system admins.
This patch fixes it by filtering the return value from lower (ACPI) layer.
[1] http://thread.gmane.org/gmane.linux.kernel/1060250
Reported by: Jon Masters <jonathan@jonmasters.org>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Link: https://lkml.org/lkml/2012/1/23/299
Signed-off-by: Tony Luck <tony.luck@intel.com>
When I previously fixed up the mce_device code, I used a static array of
the pointers. It was (rightfully) pointed out to me that I should be
using the per_cpu code instead.
This patch converts the code over to that structure, moving the variable
back into the per_cpu area, like it used to be for 3.2 and earlier.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: https://lkml.org/lkml/2012/1/27/165
Signed-off-by: Tony Luck <tony.luck@intel.com>
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.
Console log before:
Booting Node 0, Processors #1
smpboot cpu 1: start_ip = 96000
#2
smpboot cpu 2: start_ip = 96000
#3
smpboot cpu 3: start_ip = 96000
#4
smpboot cpu 4: start_ip = 96000
...
#31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs
Console log after:
Booting Node 0, Processors #1#2#3#4#5#6#7 Ok.
Booting Node 1, Processors #8#9#10#11#12#13#14#15 Ok.
Booting Node 0, Processors #16#17#18#19#20#21#22#23 Ok.
Booting Node 1, Processors #24#25#26#27#28#29#30#31
Brought up 32 CPUs
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
141168c36c ("x86: Simplify code by removing a !SMP #ifdefs
from 'struct cpuinfo_x86'") removed a bunch of CONFIG_SMP ifdefs
around code touching struct cpuinfo_x86 members but also caused
the following build error with Randy's randconfigs:
mce_amd.c:(.cpuinit.text+0x4723): undefined reference to `cpu_llc_shared_map'
Restore the #ifdef in threshold_create_bank() which creates
symlinks on the non-BSP CPUs.
There's a better patch series being worked on by Kevin Winchester
which will solve this in a cleaner fashion, but that series is
too ambitious for v3.3 merging - so we first queue up this trivial
fix and then do the rest for v3.4.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Acked-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Nick Bowler <nbowler@elliptictech.com>
Link: http://lkml.kernel.org/r/20120203191801.GA2846@x1.osrc.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For each logical CPU that is coming online, we spend 20msec for
checking the TSC synchronization. And as this is done
sequentially for each logical CPU boot, this time gets added up
depending on the number of logical CPU's supported by the
platform.
Minimize this by using the socket topology information.
If the target CPU coming online doesn't have any of its
core-siblings online, a timeout of 20msec will be used for the
TSC-warp measurement loop. Otherwise a smaller timeout of 2msec
will be used, as we have some information about this socket
already (and this information grows as we have more and more
logical-siblings in that socket).
Ideally we should be able to skip the TSC sync check on the
other core-siblings, if the first logical CPU in a socket passed
the sync test. But as the TSC is per-logical CPU and can
potentially be modified wrongly by the bios before the OS boot,
TSC sync test for smaller duration should be able to catch such
errors. Also this will catch the condition where all the cores
in the socket doesn't get reset at the same time.
For example, with this modification, time spent in TSC sync
checks on a 4 socket 10-core with HT system gets reduced from
1580msec to 212msec.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jack Steiner <steiner@sgi.com>
Cc: venki@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1328581940.29790.20.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Few cleanups suggested by Ingo Molnar.
- Rename struct uprobe_arch_info to struct arch_uprobe.
- Move insn from struct uprobe to struct arch_uprobe.
- Make arch specific uprobe functions to accept struct arch_uprobe
instead of struct uprobe.
- Move struct uprobe to kernel/uprobes.c from include/linux/uprobes.h
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Stone <jistone@redhat.com>
Link: http://lkml.kernel.org/r/20120222091602.15880.40249.sendpatchset@srdronam.in.ibm.com
[ Made various small improvements ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some versions of gcc spits a warning about the asm operand for
test_bit and also causes the first long of the instruction table
to be output.
Fix is similar to 7115e3fc on arch/x86/kernel/kprobes.c
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Josh Stone <jistone@redhat.com>
Link: http://lkml.kernel.org/r/20120222091535.15880.12502.sendpatchset@srdronam.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.
So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.
The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.
The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module. But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained. Even if it
isn't perhaps as contained as it _could_ be.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Instead of exporting the very low-level internals of the FPU state
save/restore code (ie things like 'fpu_owner_task'), we should export
the higher-level interfaces.
Inlining these things is pointless anyway: sure, sometimes the end
result is small, but while 'stts()' can result in just three x86
instructions, those are not cheap instructions (writing %cr0 is a
serializing instruction and a very slow one at that).
So the overhead of a function call is not noticeable, and we really
don't want random modules mucking about with our internal state save
logic anyway.
So this unexports 'fpu_owner_task', and instead uninlines and exports
the actual functions that modules can use: fpu_kernel_begin/end() and
unlazy_fpu().
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211339590.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
(And define it properly for x86-32, which had its 'current_task'
declaration in separate from x86-64)
Bitten by my dislike for modules on the machines I use, and the fact
that apparently nobody else actually wanted to test the patches I sent
out.
Snif. Nobody else cares.
Anyway, we probably should uninline the 'kernel_fpu_begin()' function
that is what modules actually use and that references this, but this is
the minimal fix for now.
Reported-by: Josh Boyer <jwboyer@gmail.com>
Reported-and-tested-by: Jongman Heo <jongman.heo@samsung.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus noticed that the cmp used to check if the code segment is
__KERNEL_CS or not did not specify a size. Perhaps it does not matter
as H. Peter Anvin noted that user space can not set the bottom two
bits of the %cs register. But it's best not to let the assembly choose
and change things between different versions of gas, but instead just
pick the size.
Four bytes are used to compare the saved code segment against
__KERNEL_CS. Perhaps this might mess up Xen, but we can fix that when
the time comes.
Also I noticed that there was another non-specified cmp that checks
the special stack variable if it is 1 or 0. This too probably doesn't
matter what cmp is used, but this patch uses cmpl just to make it non
ambiguous.
Link: http://lkml.kernel.org/r/CA+55aFxfAn9MWRgS3O5k2tqN5ys1XrhSFVO5_9ZAoZKDVgNfGA@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow an x32 process to be started.
Originally-by: H. J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
x32 uses the 64-bit signal frame format, obviously, but there are some
structures which mixes that with pointers or sizeof(long) types, as
such we have to create a handful of system calls specific to x32. By
and large these are a mixture of the 64-bit and the compat system
calls.
Originally-by: H. J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
x32 shares most system calls with x86-64, but unfortunately some
subsystem (the input subsystem is the chief offender) which require
is_compat() when operating with a 32-bit userspace. The input system
actually has text files in sysfs whose meaning is dependent on
sizeof(long) in userspace!
We could solve this by having two completely disjoint system call
tables; requiring that each system call be duplicated. This patch
takes a different approach: we add a flag to the system call number;
this flag doesn't affect the system call dispatch but requests compat
treatment from affected subsystems for the duration of the system call.
The change of cmpq to cmpl is safe since it immediately follows the
and.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Export setup_sigcontext() and restore_sigcontext() from signal.c, so
we can use the 64-bit versions verbatim for x32.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There are some definitions which are duplicated between
kernel/signal.c and ia32/ia32_signal.c; move them to a common header
file.
Rather than adding stuff to existing header files which contain data
structures, create a new header file; hence the slightly odd name
("all the good ones were taken.")
Note: nothing relied on signal_fault() being defined in
<asm/ptrace.h>.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Split the 64-bit system calls into "64" (64-bit only) and "common"
(64-bit or x32) and add the x32 system call numbers.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
An x32 process is *almost* the same thing as a 64-bit process with a
32-bit address limit, but there are a few minor differences -- in
particular core dumps are 32 bits and signal handling is different.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Factor out IA32 (compatibility instruction set) from 32-bit address
space in the thread_info flags; this is a precondition patch for x32
support.
Originally-by: H. J. Lu <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-4pr1xnnksprt7t0h3w5fw4rv@git.kernel.org
This makes us recognize when we try to restore FPU state that matches
what we already have in the FPU on this CPU, and avoids the restore
entirely if so.
To do this, we add two new data fields:
- a percpu 'fpu_owner_task' variable that gets written any time we
update the "has_fpu" field, and thus acts as a kind of back-pointer
to the task that owns the CPU. The exception is when we save the FPU
state as part of a context switch - if the save can keep the FPU
state around, we leave the 'fpu_owner_task' variable pointing at the
task whose FP state still remains on the CPU.
- a per-thread 'last_cpu' field, that indicates which CPU that thread
used its FPU on last. We update this on every context switch
(writing an invalid CPU number if the last context switch didn't
leave the FPU in a lazily usable state), so we know that *that*
thread has done nothing else with the FPU since.
These two fields together can be used when next switching back to the
task to see if the CPU still matches: if 'fpu_owner_task' matches the
task we are switching to, we know that no other task (or kernel FPU
usage) touched the FPU on this CPU in the meantime, and if the current
CPU number matches the 'last_cpu' field, we know that this thread did no
other FP work on any other CPU, so the FPU state on the CPU must match
what was saved on last context switch.
In that case, we can avoid the 'f[x]rstor' entirely, and just clear the
CR0.TS bit.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This inlines what is usually just a couple of instructions, but more
importantly it also fixes the theoretical error case (can that FPU
restore really ever fail? Maybe we should remove the checking).
We can't start sending signals from within the scheduler, we're much too
deep in the kernel and are holding the runqueue lock etc. So don't
bother even trying.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes sure we clear the FPU usage counter for newly created tasks,
just so that we start off in a known state (for example, don't try to
preload the FPU state on the first task switch etc).
It also fixes a thinko in when we increment the fpu_counter at task
switch time, introduced by commit 34ddc81a23 ("i387: re-introduce FPU
state preloading at context switch time"). We should increment the
*new* task fpu_counter, not the old task, and only if we decide to use
that state (whether lazily or preloaded).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the irq happens in user mode, our kernel stack is empty
(apart from the pt_regs themselves, of course), so there's no
need or advantage to switch.
And it really doesn't save any stack space, quite the reverse:
it means that a nested interrupt cannot switch irq stacks. So
instead of saving kernel stack space, it actually causes the
potential for *more* stack usage.
Also simplify the preemption count copy when we do switch
stacks: just copy the whole preemption count, rather than just
the softirq parts of it. There is no advantage to the partial
copy: it is more effort to get a less correct result.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202191139260.10000@i5.linux-foundation.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, the NMI handler tests if it is nested by checking the
special variable saved on the stack (set during NMI handling)
and whether the saved stack is the NMI stack as well (to prevent
the race when the variable is set to zero).
But userspace may set their %rsp to any value as long as they do
not derefence it, and it may make it point to the NMI stack,
which will prevent NMIs from triggering while the userspace app
is running. (I tested this, and it is indeed the case)
Add another check to determine nested NMIs by looking at the
saved %cs (code segment register) and making sure that it is the
kernel code segment.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1329687817.1561.27.camel@acer.local.home
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After all the FPU state cleanups and finally finding the problem that
caused all our FPU save/restore problems, this re-introduces the
preloading of FPU state that was removed in commit b3b0870ef3 ("i387:
do not preload FPU state at task switch time").
However, instead of simply reverting the removal, this reimplements
preloading with several fixes, most notably
- properly abstracted as a true FPU state switch, rather than as
open-coded save and restore with various hacks.
In particular, implementing it as a proper FPU state switch allows us
to optimize the CR0.TS flag accesses: there is no reason to set the
TS bit only to then almost immediately clear it again. CR0 accesses
are quite slow and expensive, don't flip the bit back and forth for
no good reason.
- Make sure that the same model works for both x86-32 and x86-64, so
that there are no gratuitous differences between the two due to the
way they save and restore segment state differently due to
architectural differences that really don't matter to the FPU state.
- Avoid exposing the "preload" state to the context switch routines,
and in particular allow the concept of lazy state restore: if nothing
else has used the FPU in the meantime, and the process is still on
the same CPU, we can avoid restoring state from memory entirely, just
re-expose the state that is still in the FPU unit.
That optimized lazy restore isn't actually implemented here, but the
infrastructure is set up for it. Of course, older CPU's that use
'fnsave' to save the state cannot take advantage of this, since the
state saving also trashes the state.
In other words, there is now an actual _design_ to the FPU state saving,
rather than just random historical baggage. Hopefully it's easier to
follow as a result.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This moves the bit that indicates whether a thread has ownership of the
FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
(called 'has_fpu') in task_struct->thread.has_fpu.
This fixes two independent bugs at the same time:
- changing 'thread_info->status' from the scheduler causes nasty
problems for the other users of that variable, since it is defined to
be thread-synchronous (that's what the "TS_" part of the naming was
supposed to indicate).
So perfectly valid code could (and did) do
ti->status |= TS_RESTORE_SIGMASK;
and the compiler was free to do that as separate load, or and store
instructions. Which can cause problems with preemption, since a task
switch could happen in between, and change the TS_USEDFPU bit. The
change to TS_USEDFPU would be overwritten by the final store.
In practice, this seldom happened, though, because the 'status' field
was seldom used more than once, so gcc would generally tend to
generate code that used a read-modify-write instruction and thus
happened to avoid this problem - RMW instructions are naturally low
fat and preemption-safe.
- On x86-32, the current_thread_info() pointer would, during interrupts
and softirqs, point to a *copy* of the real thread_info, because
x86-32 uses %esp to calculate the thread_info address, and thus the
separate irq (and softirq) stacks would cause these kinds of odd
thread_info copy aliases.
This is normally not a problem, since interrupts aren't supposed to
look at thread information anyway (what thread is running at
interrupt time really isn't very well-defined), but it confused the
heck out of irq_fpu_usable() and the code that tried to squirrel
away the FPU state.
(It also caused untold confusion for us poor kernel developers).
It also turns out that using 'task_struct' is actually much more natural
for most of the call sites that care about the FPU state, since they
tend to work with the task struct for other reasons anyway (ie
scheduling). And the FPU data that we are going to save/restore is
found there too.
Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to
the %esp issue.
Cc: Arjan van de Ven <arjan@linux.intel.com>
Reported-and-tested-by: Raphael Prevost <raphael@buro.asia>
Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the uprobes code readable to me:
- improve the Kconfig text so that a mere mortal gets some idea
what CONFIG_UPROBES=y is really about
- do trivial renames to standardize around the uprobes_*() namespace
- clean up and simplify various code flow details
- separate basic blocks of functionality
- line break artifact and white space related removal
- use standard local varible definition blocks
- use vertical spacing to make things more readable
- remove unnecessary volatile
- restructure comment blocks to make them more uniform and
more readable in general
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Link: http://lkml.kernel.org/n/tip-ewbwhb8o6navvllsauu7k07p@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add uprobes support to the core kernel, with x86 support.
This commit adds the kernel facilities, the actual uprobes
user-space ABI and perf probe support comes in later commits.
General design:
Uprobes are maintained in an rb-tree indexed by inode and offset
(the offset here is from the start of the mapping). For a unique
(inode, offset) tuple, there can be at most one uprobe in the
rb-tree.
Since the (inode, offset) tuple identifies a unique uprobe, more
than one user may be interested in the same uprobe. This provides
the ability to connect multiple 'consumers' to the same uprobe.
Each consumer defines a handler and a filter (optional). The
'handler' is run every time the uprobe is hit, if it matches the
'filter' criteria.
The first consumer of a uprobe causes the breakpoint to be
inserted at the specified address and subsequent consumers are
appended to this list. On subsequent probes, the consumer gets
appended to the existing list of consumers. The breakpoint is
removed when the last consumer unregisters. For all other
unregisterations, the consumer is removed from the list of
consumers.
Given a inode, we get a list of the mms that have mapped the
inode. Do the actual registration if mm maps the page where a
probe needs to be inserted/removed.
We use a temporary list to walk through the vmas that map the
inode.
- The number of maps that map the inode, is not known before we
walk the rmap and keeps changing.
- extending vm_area_struct wasn't recommended, it's a
size-critical data structure.
- There can be more than one maps of the inode in the same mm.
We add callbacks to the mmap methods to keep an eye on text vmas
that are of interest to uprobes. When a vma of interest is mapped,
we insert the breakpoint at the right address.
Uprobe works by replacing the instruction at the address defined
by (inode, offset) with the arch specific breakpoint
instruction. We save a copy of the original instruction at the
uprobed address.
This is needed for:
a. executing the instruction out-of-line (xol).
b. instruction analysis for any subsequent fixups.
c. restoring the instruction back when the uprobe is unregistered.
We insert or delete a breakpoint instruction, and this
breakpoint instruction is assumed to be the smallest instruction
available on the platform. For fixed size instruction platforms
this is trivially true, for variable size instruction platforms
the breakpoint instruction is typically the smallest (often a
single byte).
Writing the instruction is done by COWing the page and changing
the instruction during the copy, this even though most platforms
allow atomic writes of the breakpoint instruction. This also
mirrors the behaviour of a ptrace() memory write to a PRIVATE
file map.
The core worker is derived from KSM's replace_page() logic.
In essence, similar to KSM:
a. allocate a new page and copy over contents of the page that
has the uprobed vaddr
b. modify the copy and insert the breakpoint at the required
address
c. switch the original page with the copy containing the
breakpoint
d. flush page tables.
replace_page() is being replicated here because of some minor
changes in the type of pages and also because Hugh Dickins had
plans to improve replace_page() for KSM specific work.
Instruction analysis on x86 is based on instruction decoder and
determines if an instruction can be probed and determines the
necessary fixups after singlestep. Instruction analysis is done
at probe insertion time so that we avoid having to repeat the
same analysis every time a probe is hit.
A lot of code here is due to the improvement/suggestions/inputs
from Peter Zijlstra.
Changelog:
(v10):
- Add code to clear REX.B prefix as suggested by Denys Vlasenko
and Masami Hiramatsu.
(v9):
- Use insn_offset_modrm as suggested by Masami Hiramatsu.
(v7):
Handle comments from Peter Zijlstra:
- Dont take reference to inode. (expect inode to uprobe_register to be sane).
- Use PTR_ERR to set the return value.
- No need to take reference to inode.
- use PTR_ERR to return error value.
- register and uprobe_unregister share code.
(v5):
- Modified del_consumer as per comments from Peter.
- Drop reference to inode before dropping reference to uprobe.
- Use i_size_read(inode) instead of inode->i_size.
- Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called.
- Includes errno.h as recommended by Stephen Rothwell to fix a build issue
on sparc defconfig
- Remove restrictions while unregistering.
- Earlier code leaked inode references under some conditions while
registering/unregistering.
- Continue the vma-rmap walk even if the intermediate vma doesnt
meet the requirements.
- Validate the vma found by find_vma before inserting/removing the
breakpoint
- Call del_consumer under mutex_lock.
- Use hash locks.
- Handle mremap.
- Introduce find_least_offset_node() instead of close match logic in
find_uprobe
- Uprobes no more depends on MM_OWNER; No reference to task_structs
while inserting/removing a probe.
- Uses read_mapping_page instead of grab_cache_page so that the pages
have valid content.
- pass NULL to get_user_pages for the task parameter.
- call SetPageUptodate on the new page allocated in write_opcode.
- fix leaking a reference to the new page under certain conditions.
- Include Instruction Decoder if Uprobes gets defined.
- Remove const attributes for instruction prefix arrays.
- Uses mm_context to know if the application is 32 bit.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Also-written-by: Jim Keniston <jkenisto@us.ibm.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Anton Arapov <anton@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux-mm <linux-mm@kvack.org>
Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com
[ Made various small edits to the commit log ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
pending. In order to not leak FIP state from one process to another, we
need to do a floating point load after the fxsave of the old process,
and before the fxrstor of the new FPU state. That resets the state to
the (uninteresting) kernel load, rather than some potentially sensitive
user information.
We used to do this directly after the FPU state save, but that is
actually very inconvenient, since it
(a) corrupts what is potentially perfectly good FPU state that we might
want to lazy avoid restoring later and
(b) on x86-64 it resulted in a very annoying ordering constraint, where
"__unlazy_fpu()" in the task switch needs to be delayed until after
the DS segment has been reloaded just to get the new DS value.
Coupling it to the fxrstor instead of the fxsave automatically avoids
both of these issues, and also ensures that we only do it when actually
necessary (the FP state after a save may never actually get used). It's
simply a much more natural place for the leaked state cleanup.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yes, taking the trap to re-load the FPU/MMX state is expensive, but so
is spending several days looking for a bug in the state save/restore
code. And the preload code has some rather subtle interactions with
both paravirtualization support and segment state restore, so it's not
nearly as simple as it should be.
Also, now that we no longer necessarily depend on a single bit (ie
TS_USEDFPU) for keeping track of the state of the FPU, we migth be able
to do better. If we are really switching between two processes that
keep touching the FP state, save/restore is inevitable, but in the case
of having one process that does most of the FPU usage, we may actually
be able to do much better than the preloading.
In particular, we may be able to keep track of which CPU the process ran
on last, and also per CPU keep track of which process' FP state that CPU
has. For modern CPU's that don't destroy the FPU contents on save time,
that would allow us to do a lazy restore by just re-enabling the
existing FPU state - with no restore cost at all!
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This creates three helper functions that do the TS_USEDFPU accesses, and
makes everybody that used to do it by hand use those helpers instead.
In addition, there's a couple of helper functions for the "change both
CR0.TS and TS_USEDFPU at the same time" case, and the places that do
that together have been changed to use those. That means that we have
fewer random places that open-code this situation.
The intent is partly to clarify the code without actually changing any
semantics yet (since we clearly still have some hard to reproduce bug in
this area), but also to make it much easier to use another approach
entirely to caching the CR0.TS bit for software accesses.
Right now we use a bit in the thread-info 'status' variable (this patch
does not change that), but we might want to make it a full field of its
own or even make it a per-cpu variable.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5b1cbac377 ("i387: make irq_fpu_usable() tests more robust")
added a sanity check to the #NM handler to verify that we never cause
the "Device Not Available" exception in kernel mode.
However, that check actually pinpointed a (fundamental) race where we do
cause that exception as part of the signal stack FPU state save/restore
code.
Because we use the floating point instructions themselves to save and
restore state directly from user mode, we cannot do that atomically with
testing the TS_USEDFPU bit: the user mode access itself may cause a page
fault, which causes a task switch, which saves and restores the FP/MMX
state from the kernel buffers.
This kind of "recursive" FP state save is fine per se, but it means that
when the signal stack save/restore gets restarted, it will now take the
'#NM' exception we originally tried to avoid. With preemption this can
happen even without the page fault - but because of the user access, we
cannot just disable preemption around the save/restore instruction.
There are various ways to solve this, including using the
"enable/disable_page_fault()" helpers to not allow page faults at all
during the sequence, and fall back to copying things by hand without the
use of the native FP state save/restore instructions.
However, the simplest thing to do is to just allow the #NM from kernel
space, but fix the race in setting and clearing CR0.TS that this all
exposed: the TS bit changes and the TS_USEDFPU bit absolutely have to be
atomic wrt scheduling, so while the actual state save/restore can be
interrupted and restarted, the act of actually clearing/setting CR0.TS
and the TS_USEDFPU bit together must not.
Instead of just adding random "preempt_disable/enable()" calls to what
is already excessively ugly code, this introduces some helper functions
that mostly mirror the "kernel_fpu_begin/end()" functionality, just for
the user state instead.
Those helper functions should probably eventually replace the other
ad-hoc CR0.TS and TS_USEDFPU tests too, but I'll need to think about it
some more: the task switching functionality in particular needs to
expose the difference between the 'prev' and 'next' threads, while the
new helper functions intentionally were written to only work with
'current'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We currently include commas on both sides of the feature ID in a
modalias, but this prevents the lowest numbered feature of a CPU from
being matched. Since all feature IDs have the same length, we do not
need to worry about substring matches, so omit commas from the
modalias entirely.
Avoid generating multiple adjacent wildcards when there is no
feature ID to match.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
snprintf() does not return a negative value when truncating.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some code - especially the crypto layer - wants to use the x86
FP/MMX/AVX register set in what may be interrupt (typically softirq)
context.
That *can* be ok, but the tests for when it was ok were somewhat
suspect. We cannot touch the thread-specific status bits either, so
we'd better check that we're not going to try to save FP state or
anything like that.
Now, it may be that the TS bit is always cleared *before* we set the
USEDFPU bit (and only set when we had already cleared the USEDFP
before), so the TS bit test may actually have been sufficient, but it
certainly was not obviously so.
So this explicitly verifies that we will not touch the TS_USEDFPU bit,
and adds a few related sanity-checks. Because it seems that somehow
AES-NI is corrupting user FP state. The cause is not clear, and this
patch doesn't fix it, but while debugging it I really wanted the code to
be more obviously correct and robust.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was marked asmlinkage for some really old and stale legacy reasons.
Fix that and the equally stale comment.
Noticed when debugging the irq_fpu_usable() bugs.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The power and cpuidle tracepoints are called within a rcu_idle_exit()
section, and must be denoted with the _rcuidle() version of the tracepoint.
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Found out that show_msr=<cpus> is broken, when I asked a
user to use it to capture debug info about broken MTRR's
whose MTRR settings are probably different between CPUs.
Only the first CPUs MSRs are printed, but that is not
enough to track down the suspected bug.
For years we called print_cpu_msr from print_cpu_info(),
but this commit:
| commit 2eaad1fddd
| Author: Mike Travis <travis@sgi.com>
| Date: Thu Dec 10 17:19:36 2009 -0800
|
| x86: Limit the number of processor bootup messages
removed the print_cpu_info() call from all APs.
Put it back - it will only print MSRs when the user
specifically requests them via show_msr=<cpus>.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/1329069237-11483-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For L1 instruction cache and L2 cache the shared CPU information
is wrong. On current AMD family 15h CPUs those caches are shared
between both cores of a compute unit.
This fixes https://bugzilla.kernel.org/show_bug.cgi?id=42607
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Petkov Borislav <Borislav.Petkov@amd.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20120208195229.GA17523@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The following patch fixes a bug introduced by the following
commit:
e050e3f0a7 ("perf: Fix broken interrupt rate throttling")
The patch caused the following warning to pop up depending on
the sampling frequency adjustments:
------------[ cut here ]------------
WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()
It was caused by the following call sequence:
perf_adjust_freq_unthr_context.part() {
stop()
if (delta > 0) {
perf_adjust_period() {
if (period > 8*...) {
stop()
...
start()
}
}
}
start()
}
Which caused a double start and a double stop, thus triggering
the assert in x86_pmu_start().
The patch fixes the problem by avoiding the double calls. We
pass a new argument to perf_adjust_period() to indicate whether
or not the event is already stopped. We can't just remove the
start/stop from that function because it's called from
__perf_event_overflow where the event needs to be reloaded via a
stop/start back-toback call.
The patch reintroduces the assertion in x86_pmu_start() which
was removed by commit:
84f2b9b ("perf: Remove deprecated WARN_ON_ONCE()")
In this second version, we've added calls to disable/enable PMU
during unthrottling or frequency adjustment based on bug report
of spurious NMI interrupts from Eric Dumazet.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: markus@trippelsdorf.de
Cc: paulus@samba.org
Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
[ Minor edits to the changelog and to the code ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Stephane Eranian reported that doing a scheduler latency
measurements with perf on AMD doesn't work out as expected due
to the fact that the sched_clock() granularity is too coarse,
i.e. done in jiffies due to the sched_clock_stable not set,
which, if set, would mean that we get to use the TSC as sample
source which would give us much higher precision.
However, there's no reason not to set sched_clock_stable on AMD
because all families from F10h and upwards do have an invariant
TSC and have the CPUID flag to prove (CPUID_8000_0007_EDX[8]).
Make it so, #1.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Venki Pallipadi <venki@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/20120206132546.GA30854@quad
[ Should any non-standard system break the TSC, we should
mark them so explicitly, in their platform init handler, or
in a DMI quirk. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This commit simply cleans up the style used in this file to fall
in line with what's specified in CodingStyle. Mostly comment
changes, with a single removal of unneeded braces.
Note that the comments for all the DMI quirks in
reboot_dmi_table now all line up consistently using tabs instead
of spaces.
Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
Link: http://lkml.kernel.org/n/tip-lde9yy7qsomh0sdqevn7xp56@git.kernel.org
[ Fixed a few small details. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This commit reduces the X86_32 reboot_dmi_table and the X86_64
pci_reboot_dmi_table into a single table with a single set of
functions (e.g., only 1 call to core_initcall).
The table entries that use set_bios_reboot are grouped together
inside a #define CONFIG_X86_32 block.
Note that there's a single entry that uses set_kbd_reboot, which
used to be available only on X86_32. This commit moves that
entry outside the X86_32 block because it seems it never should
have been in there. There's multiple places in reboot.c that
assume KBD is valid regardless of X86_32/X86_64.
Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
Cc: Matthew Garrett <mjg@redhat.com>
Link: http://lkml.kernel.org/n/tip-lv3aliubas2l3aenq8v3uklk@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
AMD processors will never support /dev/cpu/microcode updating so
just silently fail instead of printing out a warning for every
cpu.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1328552935-965-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
So that we can get the perf bench exec stack fixes and then apply the
remaining fix for the files added after what is in perf/urgent.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
With the new throttling/unthrottling code introduced with
commit:
e050e3f0a7 ("perf: Fix broken interrupt rate throttling")
we occasionally hit two WARN_ON_ONCE() checks in:
- intel_pmu_pebs_enable()
- intel_pmu_lbr_enable()
- x86_pmu_start()
The assertions are no longer problematic. There is a valid
path where they can trigger but it is harmless.
The assertion can be triggered with:
$ perf record -e instructions:pp ....
Leading to paths:
intel_pmu_pebs_enable
intel_pmu_enable_event
x86_perf_event_set_period
x86_pmu_start
perf_adjust_freq_unthr_context
perf_event_task_tick
scheduler_tick
And:
intel_pmu_lbr_enable
intel_pmu_enable_event
x86_perf_event_set_period
x86_pmu_start
perf_adjust_freq_unthr_context.
perf_event_task_tick
scheduler_tick
cpuc->enabled is always on because when we get to
perf_adjust_freq_unthr_context() the PMU is not totally
disabled. Furthermore when we need to adjust a period,
we only stop the event we need to change and not the
entire PMU. Thus, when we re-enable, cpuc->enabled is
already set. Note that when we stop the event, both
pebs and lbr are stopped if necessary (and possible).
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/20120202110401.GA30911@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This was done to resolve a merge and build problem with the
drivers/acpi/processor_driver.c file.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
bugs, x86: Fix printk levels for panic, softlockups and stack dumps
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf top: Fix number of samples displayed
perf tools: Fix strlen() bug in perf_event__synthesize_event_type()
perf tools: Fix broken build by defining _GNU_SOURCE in Makefile
x86/dumpstack: Remove unneeded check in dump_trace()
perf: Fix broken interrupt rate throttling
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW
sched: Fix ancient race in do_exit()
sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug
sched/s390: Fix compile error in sched/core.c
sched: Fix rq->nr_uninterruptible update race
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/reboot: Remove VersaLogic Menlow reboot quirk
x86/reboot: Skip DMI checks if reboot set by user
x86: Properly parenthesize cmpxchg() macro arguments
This commit removes the reboot quirk originally added by commit
e19e074 ("x86: Fix reboot problem on VersaLogic Menlow boards").
Testing with a VersaLogic Ocelot (VL-EPMs-21a rev 1.00 w/ BIOS
6.5.102) revealed the following regarding the reboot hang
problem:
- v2.6.37 reboot=bios was needed.
- v2.6.38-rc1: behavior changed, reboot=acpi is needed,
reboot=kbd and reboot=bios results in system hang.
- v2.6.38: VersaLogic patch (e19e074 "x86: Fix reboot problem on
VersaLogic Menlow boards") was applied prior to v2.6.38-rc7. This
patch sets a quirk for VersaLogic Menlow boards that forces the use
of reboot=bios, which doesn't work anymore.
- v3.2: It seems that commit 660e34c ("x86: Reorder reboot method
preferences") changed the default reboot method to acpi prior to
v3.0-rc1, which means the default behavior is appropriate for the
Ocelot. No VersaLogic quirk is required.
The Ocelot board used for testing can successfully reboot w/out
having to pass any reboot= arguments for all 3 current versions
of the BIOS.
Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Michael D Labriola <mlabriol@gdeb.com>
Cc: Kushal Koolwal <kushalkoolwal@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/87vcnub9hu.fsf@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Skip DMI checks for vendor specific reboot quirks if the user
passed in a reboot= arg on the command line - we should never
override user choices.
Signed-off-by: Michael D Labriola <michael.d.labriola@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Michael D Labriola <mlabriol@gdeb.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/87wr8ab9od.fsf@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current device suspend/resume phases during system-wide power
transitions appear to be insufficient for some platforms that want
to use the same callback routines for saving device states and
related operations during runtime suspend/resume as well as during
system suspend/resume. In principle, they could point their
.suspend_noirq() and .resume_noirq() to the same callback routines
as their .runtime_suspend() and .runtime_resume(), respectively,
but at least some of them require device interrupts to be enabled
while the code in those routines is running.
It also makes sense to have device suspend-resume callbacks that will
be executed with runtime PM disabled and with device interrupts
enabled in case someone needs to run some special code in that
context during system-wide power transitions.
Apart from this, .suspend_noirq() and .resume_noirq() were introduced
as a workaround for drivers using shared interrupts and failing to
prevent their interrupt handlers from accessing suspended hardware.
It appears to be better not to use them for other porposes, or we may
have to deal with some serious confusion (which seems to be happening
already).
For the above reasons, introduce new device suspend/resume phases,
"late suspend" and "early resume" (and analogously for hibernation)
whose callback will be executed with runtime PM disabled and with
device interrupts enabled and whose callback pointers generally may
point to runtime suspend/resume routines.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
Smatch complains that we have some inconsistent NULL checking.
If "task" were NULL then it would lead to a NULL dereference
later. We can remove this test because earlier on in the
function we have:
if (!task)
task = current;
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Clemens Ladisch <clemens@ladisch.de>
Link: http://lkml.kernel.org/r/20120128105246.GA25092@elgon.mountain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch is based on Andi Kleen's work:
Implement autoprobing/loading of modules serving CPU
specific features (x86cpu autoloading).
And Kay Siever's work to get rid of sysdev cpu structures
and making use of struct device instead.
Before, the cpuid driver had to be loaded to get the x86cpu
autoloading feature. With this patch autoloading works through
the /sys/devices/system/cpu object
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Don't try to describe the actual models for now.
v2: Fix typo: X86_VENDOR_ANY -> X86_FAMILY_ANY (trenn)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It is rather similar to CPB (boot capability) feature
and exists since fam10h (can be looked up in AMD's BKDG).
The feature is needed for powernow-k8 to cleanup init functions and to
provide proper autoloading matching with the new x86cpu modalias
feature.
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Borislav Petkov <bp@amd64.org>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There's a growing number of drivers that support a specific x86 feature
or CPU. Currently loading these drivers currently on a generic
distribution requires various driver specific hacks and it often
doesn't work.
This patch adds auto probing for drivers based on the x86 cpuid
information, in particular based on vendor/family/model number
and also based on CPUID feature bits.
For example a common issue is not loading the SSE 4.2 accelerated
CRC module: this can significantly lower the performance of BTRFS
which relies on fast CRC.
Another issue is loading the right CPUFREQ driver for the current CPU.
Currently distributions often try all all possible driver until
one sticks, which is not really a good way to do this.
It works with existing udev without any changes. The code
exports the x86 information as a generic string in sysfs
that can be matched by udev's pattern matching.
This scheme does not support numeric ranges, so if you want to
handle e.g. ranges of model numbers they have to be encoded
in ASCII or simply all models or families listed. Fixing
that would require changing udev.
Another issue is that udev will happily load all drivers that match,
there is currently no nice way to stop a specific driver from
being loaded if it's not needed (e.g. if you don't need fast CRC)
But there are not that many cpu specific drivers around and they're
all not that bloated, so this isn't a particularly serious issue.
Originally this patch added the modalias to the normal cpu
sysdevs. However sysdevs don't have all the infrastructure
needed for udev, so it couldn't really autoload drivers.
This patch instead adds the CPU modaliases to the cpuid devices,
which are real devices with full support for udev. This implies
that the cpuid driver has to be loaded to use this.
This patch just adds infrastructure, some driver conversions
in followups.
Thanks to Kay for helping with some sysfs magic.
v2: Constifcation, some updates
v4: (trenn@suse.de):
- Use kzalloc instead of kmalloc to terminate modalias buffer
- Use uppercase hex values to match correctly against hex values containing
letters
Cc: Dave Jones <davej@redhat.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jen Axboe <axboe@kernel.dk>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Magic constants like 0x0134 in code just invite questions on
where they come from, what they mean, can they be changed.
Provide #defines for the architecturally defined MCACOD values
with a reference to the Intel Software Developers manual which
describes them.
Signed-off-by: Tony Luck <tony.luck@intel.com>
rsyslog will display KERN_EMERG messages on a connected
terminal. However, these messages are useless/undecipherable
for a general user.
For example, after a softlockup we get:
Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
kernel:Stack:
Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
kernel:Call Trace:
Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
kernel:Code: ff ff a8 08 75 25 31 d2 48 8d 86 38 e0 ff ff 48 89
d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <e8> ea 69 dd ff 4c 29 e8 48 89 c7 e8 0f bc da ff 49 89 c4 49 89
This happens because the printk levels for these messages are
incorrect. Only an informational message should be displayed on
a terminal.
I modified the printk levels for various messages in the kernel
and tested the output by using the drivers/misc/lkdtm.c kernel
modules (ie, softlockups, panics, hard lockups, etc.) and
confirmed that the console output was still the same and that
the output to the terminals was correct.
For example, in the case of a softlockup we now see the much
more informative:
Message from syslogd@intel-s3e37-04 at Jan 25 10:18:06 ...
BUG: soft lockup - CPU4 stuck for 60s!
instead of the above confusing messages.
AFAICT, the messages no longer have to be KERN_EMERG. In the
most important case of a panic we set console_verbose(). As for
the other less severe cases the correct data is output to the
console and /var/log/messages.
Successfully tested by me using the drivers/misc/lkdtm.c module.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: dzickus@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1327586134-11926-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We've decided to provide CPU family specific container files
(starting with CPU family 15h). E.g. for family 15h we have to
load microcode_amd_fam15h.bin instead of microcode_amd.bin
Rationale is that starting with family 15h patch size is larger
than 2KB which was hard coded as maximum patch size in various
microcode loaders (not just Linux).
Container files which include patches larger than 2KB cause
different kinds of trouble with such old patch loaders. Thus we
have to ensure that the default container file provides only
patches with size less than 2KB.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20120120164412.GD24508@alberich.amd.com
[ documented the naming convention and tidied the code a bit. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This includes initial support for the recently published ACPI 5.0 spec.
In particular, support for the "hardware-reduced" bit that eliminates
the dependency on legacy hardware.
APEI has patches resulting from testing on real hardware.
Plus other random fixes.
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (52 commits)
acpi/apei/einj: Add extensions to EINJ from rev 5.0 of acpi spec
intel_idle: Split up and provide per CPU initialization func
ACPI processor: Remove unneeded variable passed by acpi_processor_hotadd_init V2
ACPI processor: Remove unneeded cpuidle_unregister_driver call
intel idle: Make idle driver more robust
intel_idle: Fix a cast to pointer from integer of different size warning in intel_idle
ACPI: kernel-parameters.txt : Add intel_idle.max_cstate
intel_idle: remove redundant local_irq_disable() call
ACPI processor: Fix error path, also remove sysdev link
ACPI: processor: fix acpi_get_cpuid for UP processor
intel_idle: fix API misuse
ACPI APEI: Convert atomicio routines
ACPI: Export interfaces for ioremapping/iounmapping ACPI registers
ACPI: Fix possible alignment issues with GAS 'address' references
ACPI, ia64: Use SRAT table rev to use 8bit or 16/32bit PXM fields (ia64)
ACPI, x86: Use SRAT table rev to use 8bit or 32bit PXM fields (x86/x86-64)
ACPI: Store SRAT table revision
ACPI, APEI, Resolve false conflict between ACPI NVS and APEI
ACPI, Record ACPI NVS regions
ACPI, APEI, EINJ, Refine the fix of resource conflict
...
JONGMAN HEO reports:
With current linus git (commit a25a2b84), I got following build error,
arch/x86/kernel/vm86_32.c: In function 'do_sys_vm86':
arch/x86/kernel/vm86_32.c:340: error: implicit declaration of function '__audit_syscall_exit'
make[3]: *** [arch/x86/kernel/vm86_32.o] Error 1
OK, I can reproduce it (32bit allmodconfig with AUDIT=y, AUDITSYSCALL=n)
It's due to commit d7e7528bcd: "Audit: push audit success and retcode
into arch ptrace.h".
Reported-by: JONGMAN HEO <jongman.heo@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits)
audit: no leading space in audit_log_d_path prefix
audit: treat s_id as an untrusted string
audit: fix signedness bug in audit_log_execve_info()
audit: comparison on interprocess fields
audit: implement all object interfield comparisons
audit: allow interfield comparison between gid and ogid
audit: complex interfield comparison helper
audit: allow interfield comparison in audit rules
Kernel: Audit Support For The ARM Platform
audit: do not call audit_getname on error
audit: only allow tasks to set their loginuid if it is -1
audit: remove task argument to audit_set_loginuid
audit: allow audit matching on inode gid
audit: allow matching on obj_uid
audit: remove audit_finish_fork as it can't be called
audit: reject entry,always rules
audit: inline audit_free to simplify the look of generic code
audit: drop audit_set_macxattr as it doesn't do anything
audit: inline checks for not needing to collect aux records
audit: drop some potentially inadvisable likely notations
...
Use evil merge to fix up grammar mistakes in Kconfig file.
Bad speling and horrible grammar (and copious swearing) is to be
expected, but let's keep it to commit messages and comments, rather than
expose it to users in config help texts or printouts.
pit_expect_msb() returns success wrongly in the below SMI scenario:
a. pit_verify_msb() has not yet seen the MSB transition.
b. we are close to the MSB transition though and got a SMI immediately after
returning from pit_verify_msb() which didn't see the MSB transition. PIT MSB
transition has happened somewhere during SMI execution.
c. returned from SMI and we noted down the 'tsc', saw the pit MSB change now and
exited the loop to calculate 'deltatsc'. Instead of noting the TSC at the MSB
transition, we are way off because of the SMI. And as the SMI happened
between the pit_verify_msb() and before the 'tsc' is recorded in the
for loop, 'delattsc' (d1/d2 in quick_pit_calibrate()) will be small and
quick_pit_calibrate() will not notice this error.
Depending on whether SMI disturbance happens while computing d1 or d2, we will
see the TSC calibrated value smaller or bigger than the expected value. As a
result, in a cluster we were seeing a variation of approximately +/- 20MHz in
the calibrated values, resulting in NTP failures.
[ As far as the SMI source is concerned, this is a periodic SMI that gets
disabled after ACPI is enabled by the OS. But the TSC calibration happens
before the ACPI is enabled. ]
To address this, change pit_expect_msb() so that
- the 'tsc' is the TSC in between the two reads that read the MSB
change from the PIT (same as before)
- the 'delta' is the difference in TSC from *before* the MSB changed
to *after* the MSB changed.
Now the delta is twice as big as before (it covers four PIT accesses,
roughly 4us) and quick_pit_calibrate() will loop a bit longer to get
the calibrated value with in the 500ppm precision. As the delta (d1/d2)
covers four PIT accesses, actual calibrated result might be closer to
250ppm precision.
As the loop now takes longer to stabilize, double MAX_QUICK_PIT_MS to 50.
SMI disturbance will showup as much larger delta's and the loop will take
longer than usual for the result to be with in the accepted precision. Or will
fallback to slow PIT calibration if it takes more than 50msec.
Also while we are at this, remove the calibration correction that aims to
get the result to the middle of the error bars. We really don't know which
direction to correct into, so remove it.
Reported-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1326843337.5291.4.camel@sbsiddha-mobl2
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Every arch calls:
if (unlikely(current->audit_context))
audit_syscall_entry()
which requires knowledge about audit (the existance of audit_context) in
the arch code. Just do it all in static inline in audit.h so that arch's
can remain blissfully ignorant.
Signed-off-by: Eric Paris <eparis@redhat.com>
The audit system previously expected arches calling to audit_syscall_exit to
supply as arguments if the syscall was a success and what the return code was.
Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
by converting from negative retcodes to an audit internal magic value stating
success or failure. This helper was wrong and could indicate that a valid
pointer returned to userspace was a failed syscall. The fix is to fix the
layering foolishness. We now pass audit_syscall_exit a struct pt_reg and it
in turns calls back into arch code to collect the return value and to
determine if the syscall was a success or failure. We also define a generic
is_syscall_success() macro which determines success/failure based on if the
value is < -MAX_ERRNO. This works for arches like x86 which do not use a
separate mechanism to indicate syscall failure.
We make both the is_syscall_success() and regs_return_value() static inlines
instead of macros. The reason is because the audit function must take a void*
for the regs. (uml calls theirs struct uml_pt_regs instead of just struct
pt_regs so audit_syscall_exit can't take a struct pt_regs). Since the audit
function takes a void* we need to use static inlines to cast it back to the
arch correct structure to dereference it.
The other major change is that on some arches, like ia64, MIPS and ppc, we
change regs_return_value() to give us the negative value on syscall failure.
THE only other user of this macro, kretprobe_example.c, won't notice and it
makes the value signed consistently for the audit functions across all archs.
In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
audit code as the return value. But the ptrace_64.h code defined the macro
regs_return_value() as regs[3]. I have no idea which one is correct, but this
patch now uses the regs_return_value() function, so it now uses regs[3].
For powerpc we previously used regs->result but now use the
regs_return_value() function which uses regs->gprs[3]. regs->gprs[3] is
always positive so the regs_return_value(), much like ia64 makes it negative
before calling the audit code when appropriate.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion]
Acked-by: Tony Luck <tony.luck@intel.com> [for ia64]
Acked-by: Richard Weinberger <richard@nod.at> [for uml]
Acked-by: David S. Miller <davem@davemloft.net> [for sparc]
Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips]
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
Some firmware will access memory in ACPI NVS region via APEI. That
is, instructions in APEI ERST/EINJ table will read/write ACPI NVS
region. The original resource conflict checking in APEI code will
check memory/ioport accessed by APEI via general resource management
mechanism. But ACPI NVS region is marked as busy already, so that the
false resource conflict will prevent APEI ERST/EINJ to work.
To fix this, this patch record ACPI NVS regions, so that we can avoid
request resources for memory region inside it.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'x86-syscall-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Move <asm/asm-offsets.h> from trace_syscalls.c to asm/syscall.h
x86, um: Fix typo in 32-bit system call modifications
um: Use $(srctree) not $(KBUILD_SRC)
x86, um: Mark system call tables readonly
x86, um: Use the same style generated syscall tables as native
um: Generate headers before generating user-offsets.s
um: Run host archheaders, allow use of host generated headers
kbuild, headers.sh: Don't make archheaders explicitly
x86, syscall: Allow syscall offset to be symbolic
x86, syscall: Re-fix typo in comment
x86: Simplify syscallhdr.sh
x86: Generate system call tables and unistd_*.h from tables
checksyscalls: Use arch/x86/syscalls/syscall_32.tbl as source
x86: Machine-readable syscall tables and scripts to process them
trace: Include <asm/asm-offsets.h> in trace_syscalls.c
x86-64, ia32: Move compat_ni_syscall into C and its own file
x86-64, syscall: Adjust comment spacing and remove typo
kbuild: Add support for an "archheaders" target
kbuild: Add support for installing generated asm headers
When suspending, there was a large list of warnings going something like:
Device 'machinecheck1' does not have a release() function, it is broken and must be fixed
This patch turns the static mce_devices into dynamically allocated, and
properly frees them when they are removed from the system. It solves
the warning messages on my laptop here.
Reported-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Djalal Harouni <tixxdz@opendz.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@amd64.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
perf tools: Fix compile error on x86_64 Ubuntu
perf report: Fix --stdio output alignment when --showcpuutilization used
perf annotate: Get rid of field_sep check
perf annotate: Fix usage string
perf kmem: Fix a memory leak
perf kmem: Add missing closedir() calls
perf top: Add error message for EMFILE
perf test: Change type of '-v' option to INCR
perf script: Add missing closedir() calls
tracing: Fix compile error when static ftrace is enabled
recordmcount: Fix handling of elf64 big-endian objects.
perf tools: Add const.h to MANIFEST to make perf-tar-src-pkg work again
perf tools: Add support for guest/host-only profiling
perf kvm: Do guest-only counting by default
perf top: Don't update total_period on process_sample
perf hists: Stop using 'self' for struct hist_entry
perf hists: Rename total_session to total_period
x86: Add counter when debug stack is used with interrupts enabled
x86: Allow NMIs to hit breakpoints in i386
x86: Keep current stack in NMI breakpoints
...
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, atomic: atomic64_read() take a const pointer
x86, UV: Update Boot messages for SGI UV2 platform
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPD2aFAAoJENkgDmzRrbjxNzsQAIeYbbrXYLjr6kQzUSngj/eC
FzjaTEfYTQIeuQCFJHcHthyc5lXV4sQbo3jOezW+Bp5yuDJL2aWIHesSfWZe7imu
zQdM4VshOYdAmUR9Q0AW5zhB8Smbs7/AyABiF2jm4p0ZPOuyMDSlei9sjvE9Vjvt
B7g5ht7L6kz0JbDnwwy0u5gs+tEitwpXYId9Y4ysZIBzIbL0qkPX8veOddGTMy0N
8xhWXaKtufpjvxFD2ORLDsw3AkoF1xXSNuFd/5nzCNpbeE7TW931jfkPoqJumuAO
7GLxcU9kKYl+IICobC6wBtsj/RrB7w+cBXMvPGwdBliam1qaRhUcJZi5FLM/Ha5d
2A9QDYNUpoXiO8JbPXrV9Z+Y0+Co8RilsQj7R/rjZh6AbbYCWt9nxzx2Svl/RfTr
xfiimHuB2P3rHjOvpCXULwOOuE5c8MzPuWncpdjiD3uGXOY/aY+X1m+if/quJw9D
grPlKL0+YiRakEYUeGG4M77KCqyKFZaF7L7UQPbqfZcj8V/9AW3/7U5I/B9RlAjs
idsr4fcf5s0N+oKUyTCW1ncpUDQNiwbU2NyJQqeu1ZxaRGj72AgyvsaNeyIPDyK+
f6x95Bi7i8KLjXc9Z1KvJwh2Nxt25gNUiTYVha/9H2NpJGd1cfI15kTOGXrgddVv
1pvuGcJDZwYiwfiXr3FL
=HHrh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://github.com/rustyrussell/linux
Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1
* tag 'for-linus' of git://github.com/rustyrussell/linux:
module_param: check that bool parameters really are bool.
intelfbdrv.c: bailearly is an int module_param
paride/pcd: fix bool verbose module parameter.
module_param: make bool parameters really bool (drivers & misc)
module_param: make bool parameters really bool (arch)
module_param: make bool parameters really bool (core code)
kernel/async: remove redundant declaration.
printk: fix unnecessary module_param_name.
lirc_parallel: fix module parameter description.
module_param: avoid bool abuse, add bint for special cases.
module_param: check type correctness for module_param_array
modpost: use linker section to generate table.
modpost: use a table rather than a giant if/else statement.
modules: sysfs - export: taint, coresize, initsize
kernel/params: replace DEBUGP with pr_debug
module: replace DEBUGP with pr_debug
module: struct module_ref should contains long fields
module: Fix performance regression on modules with large symbol tables
module: Add comments describing how the "strmap" logic works
Fix up conflicts in scripts/mod/file2alias.c due to the new linker-
generated table approach to adding __mod_*_device_table entries. The
ARM sa11x0 mcp bus needed to be converted to that too.
Commit 8a25a2fd12 ("cpu: convert 'cpu' and 'machinecheck' sysdev_class
to a regular subsystem") changed how things are dealt with in the MCE
subsystem. Some of the things that got broken due to this are CPU
hotplug and suspend/hibernate.
MCE uses per_cpu allocations of struct device. So, when a CPU goes
offline and comes back online, in order to ensure that we start from a
clean slate with respect to the MCE subsystem, zero out the entire
per_cpu device structure to 0 before using it.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.
It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel config: Fix the APB_TIMER selection
x86/mrst: Add additional debug prints for pb_keys
x86/intel config: Revamp configuration to allow for Moorestown and Medfield
x86/intel/scu/ipc: Match the changes in the x86 configuration
x86/apb: Fix configuration constraints
x86: Fix INTEL_MID silly
x86/Kconfig: Cyclone-timer depends on x86-summit
x86: Reduce clock calibration time during slave cpu startup
x86/config: Revamp configuration for MID devices
x86/sfi: Kill the IRQ as id hack
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, reboot: Fix typo in nmi reboot path
x86, NMI: Add to_cpumask() to silence compile warning
x86, NMI: NMI selftest depends on the local apic
x86: Add stack top margin for stack overflow checking
x86, NMI: NMI-selftest should handle the UP case properly
x86: Fix the 32-bit stackoverflow-debug build
x86, NMI: Add knob to disable using NMI IPIs to stop cpus
x86, NMI: Add NMI IPI selftest
x86, reboot: Use NMI instead of REBOOT_VECTOR to stop cpus
x86: Clean up the range of stack overflow checking
x86: Panic on detection of stack overflow
x86: Check stack overflow in detail
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/numa: Add constraints check for nid parameters
mm, x86: Remove debug_pagealloc_enabled
x86/mm: Initialize high mem before free_all_bootmem()
arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer
arch/x86/kernel/e820.c: Eliminate bubble sort from sanitize_e820_map()
x86: Fix mmap random address range
x86, mm: Unify zone_sizes_init()
x86, mm: Prepare zone_sizes_init() for unification
x86, mm: Use max_low_pfn for ZONE_NORMAL on 64-bit
x86, mm: Wrap ZONE_DMA32 with CONFIG_ZONE_DMA32
x86, mm: Use max_pfn instead of highend_pfn
x86, mm: Move zone init from paging_init() on 64-bit
x86, mm: Use MAX_DMA_PFN for ZONE_DMA on 32-bit
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci: (80 commits)
x86/PCI: Expand the x86_msi_ops to have a restore MSIs.
PCI: Increase resource array mask bit size in pcim_iomap_regions()
PCI: DEVICE_COUNT_RESOURCE should be equal to PCI_NUM_RESOURCES
PCI: pci_ids: add device ids for STA2X11 device (aka ConneXT)
PNP: work around Dell 1536/1546 BIOS MMCONFIG bug that breaks USB
x86/PCI: amd: factor out MMCONFIG discovery
PCI: Enable ATS at the device state restore
PCI: msi: fix imbalanced refcount of msi irq sysfs objects
PCI: kconfig: English typo in pci/pcie/Kconfig
PCI/PM/Runtime: make PCI traces quieter
PCI: remove pci_create_bus()
xtensa/PCI: convert to pci_scan_root_bus() for correct root bus resources
x86/PCI: convert to pci_create_root_bus() and pci_scan_root_bus()
x86/PCI: use pci_scan_bus() instead of pci_scan_bus_parented()
x86/PCI: read Broadcom CNB20LE host bridge info before PCI scan
sparc32, leon/PCI: convert to pci_scan_root_bus() for correct root bus resources
sparc/PCI: convert to pci_create_root_bus()
sh/PCI: convert to pci_scan_root_bus() for correct root bus resources
powerpc/PCI: convert to pci_create_root_bus()
powerpc/PCI: split PHB part out of pcibios_map_io_space()
...
Fix up conflicts in drivers/pci/msi.c and include/linux/pci_regs.h due
to the same patches being applied in other branches.
Andrew elucidates:
- First installmeant of MM. We have a HUGE number of MM patches this
time. It's crazy.
- MAINTAINERS updates
- backlight updates
- leds
- checkpatch updates
- misc ELF stuff
- rtc updates
- reiserfs
- procfs
- some misc other bits
* akpm: (124 commits)
user namespace: make signal.c respect user namespaces
workqueue: make alloc_workqueue() take printf fmt and args for name
procfs: add hidepid= and gid= mount options
procfs: parse mount options
procfs: introduce the /proc/<pid>/map_files/ directory
procfs: make proc_get_link to use dentry instead of inode
signal: add block_sigmask() for adding sigmask to current->blocked
sparc: make SA_NOMASK a synonym of SA_NODEFER
reiserfs: don't lock root inode searching
reiserfs: don't lock journal_init()
reiserfs: delay reiserfs lock until journal initialization
reiserfs: delete comments referring to the BKL
drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range
drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030
drivers/rtc/: remove redundant spi driver bus initialization
drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static
drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static
rtc: convert drivers/rtc/* to use module_platform_driver()
drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc()
drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler
...
Abstract the code sequence for adding a signal handler's sa_mask to
current->blocked because the sequence is identical for all architectures.
Furthermore, in the past some architectures actually got this code wrong,
so introduce a wrapper that all architectures can use.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'kvm-updates/3.3' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (74 commits)
KVM: PPC: Whitespace fix for kvm.h
KVM: Fix whitespace in kvm_para.h
KVM: PPC: annotate kvm_rma_init as __init
KVM: x86 emulator: implement RDPMC (0F 33)
KVM: x86 emulator: fix RDPMC privilege check
KVM: Expose the architectural performance monitoring CPUID leaf
KVM: VMX: Intercept RDPMC
KVM: SVM: Intercept RDPMC
KVM: Add generic RDPMC support
KVM: Expose a version 2 architectural PMU to a guests
KVM: Expose kvm_lapic_local_deliver()
KVM: x86 emulator: Use opcode::execute for Group 9 instruction
KVM: x86 emulator: Use opcode::execute for Group 4/5 instructions
KVM: x86 emulator: Use opcode::execute for Group 1A instruction
KVM: ensure that debugfs entries have been created
KVM: drop bsp_vcpu pointer from kvm struct
KVM: x86: Consolidate PIT legacy test
KVM: x86: Do not rely on implicit inclusions
KVM: Make KVM_INTEL depend on CPU_SUP_INTEL
KVM: Use memdup_user instead of kmalloc/copy_from_user
...
* 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (65 commits)
tty: serial: imx: move del_timer_sync() to avoid potential deadlock
imx: add polled io uart methods
imx: Add save/restore functions for UART control regs
serial/imx: let probing fail for the dt case without a valid alias
serial/imx: propagate error from of_alias_get_id instead of using -ENODEV
tty: serial: imx: Allow UART to be a source for wakeup
serial: driver for m32 arch should not have DEC alpha errata
serial/documentation: fix documented name of DCD cpp symbol
atmel_serial: fix spinlock lockup in RS485 code
tty: Fix memory leak in virtual console when enable unicode translation
serial: use DIV_ROUND_CLOSEST instead of open coding it
serial: add support for 400 and 800 v3 series Titan cards
serial: bfin-uart: Remove ASYNC_CTS_FLOW flag for hardware automatic CTS.
serial: bfin-uart: Enable hardware automatic CTS only when CTS pin is available.
serial: make FSL errata depend on 8250_CONSOLE, not just 8250
serial: add irq handler for Freescale 16550 errata.
serial: manually inline serial8250_handle_port
serial: make 8250 timeout use the specified IRQ handler
serial: export the key functions for an 8250 IRQ handler
serial: clean up parameter passing for 8250 Rx IRQ handling
...
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
SGI UV systems print a message during boot:
UV: Found <num> blades
Due to packaging changes, the blade count is not accurate for
on the next generation of the platform. This patch corrects the
count.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20120106191900.GA19772@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (73 commits)
arm: fix up some samsung merge sysdev conversion problems
firmware: Fix an oops on reading fw_priv->fw in sysfs loading file
Drivers:hv: Fix a bug in vmbus_driver_unregister()
driver core: remove __must_check from device_create_file
debugfs: add missing #ifdef HAS_IOMEM
arm: time.h: remove device.h #include
driver-core: remove sysdev.h usage.
clockevents: remove sysdev.h
arm: convert sysdev_class to a regular subsystem
arm: leds: convert sysdev_class to a regular subsystem
kobject: remove kset_find_obj_hinted()
m86k: gpio - convert sysdev_class to a regular subsystem
mips: txx9_sram - convert sysdev_class to a regular subsystem
mips: 7segled - convert sysdev_class to a regular subsystem
sh: dma - convert sysdev_class to a regular subsystem
sh: intc - convert sysdev_class to a regular subsystem
power: suspend - convert sysdev_class to a regular subsystem
power: qe_ic - convert sysdev_class to a regular subsystem
power: cmm - convert sysdev_class to a regular subsystem
s390: time - convert sysdev_class to a regular subsystem
...
Fix up conflicts with 'struct sysdev' removal from various platform
drivers that got changed:
- arch/arm/mach-exynos/cpu.c
- arch/arm/mach-exynos/irq-eint.c
- arch/arm/mach-s3c64xx/common.c
- arch/arm/mach-s3c64xx/cpu.c
- arch/arm/mach-s5p64x0/cpu.c
- arch/arm/mach-s5pv210/common.c
- arch/arm/plat-samsung/include/plat/cpu.h
- arch/powerpc/kernel/sysfs.c
and fix up cpu_is_hotpluggable() as per Greg in include/linux/cpu.h
It was brought to my attention that my x86 change to use NMI in
the reboot path broke Intel Nehalem and Westmere boxes when
using kexec.
I realized I had mistyped the if statement in commit
3603a2512f and stuck the ')' in
the wrong spot. Putting it in the right spot fixes kexec again.
Doh.
Reported-by: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1325866671-9797-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The MSI restore function will become a function pointer in an
x86_msi_ops struct. It defaults to the implementation in the
io_apic.c and msi.c. We piggyback on the indirection mechanism
introduced by "x86: Introduce x86_msi_ops".
Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Use "do { } while(0)" for empty lock_cmos()/unlock_cmos() macros
x86: Use "do { } while(0)" for empty flush_tlb_fix_spurious_fault() macro
x86, CPU: Drop superfluous get_cpu_cap() prototype
arch/x86/mm/pageattr.c: Quiet sparse noise; local functions should be static
arch/x86/kernel/ptrace.c: Quiet sparse noise
x86: Use kmemdup() in copy_thread(), rather than duplicating its implementation
x86: Replace the EVT_TO_HPET_DEV() macro with an inline function
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86: Fix atomic64_xxx_cx8() functions
x86: Fix and improve cmpxchg_double{,_local}()
x86_64, asm: Optimise fls(), ffs() and fls64()
x86, bitops: Move fls64.h inside __KERNEL__
x86: Fix and improve percpu_cmpxchg{8,16}b_double()
x86: Report cpb and eff_freq_ro flags correctly
x86/i386: Use less assembly in strlen(), speed things up a bit
x86: Use the same node_distance for 32 and 64-bit
x86: Fix rflags in FAKE_STACK_FRAME
x86: Clean up and extend do_int3()
x86: Call do_notify_resume() with interrupts enabled
x86/div64: Add a micro-optimization shortcut if base is power of two
x86-64: Cleanup some assembly entry points
x86-64: Slightly shorten line system call entry and exit paths
x86-64: Reduce amount of redundant code generated for invalidate_interruptNN
x86-64: Slightly shorten int_ret_from_sys_call
x86, efi: Convert efi_phys_get_time() args to physical addresses
x86: Default to vsyscall=emulate
x86-64: Set siginfo and context on vsyscall emulation faults
x86: consolidate xchg and xadd macros
...
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Skip cpus with apic-ids >= 255 in !x2apic_mode
x86, x2apic: Allow "nox2apic" to disable x2apic mode setup by BIOS
x86, x2apic: Fallback to xapic when BIOS doesn't setup interrupt-remapping
x86, acpi: Skip acpi x2apic entries if the x2apic feature is not present
x86, apic: Add probe() for apic_flat
x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86'
x86: Convert per-cpu counter icr_read_retry_count into a member of irq_stat
x86: Add per-cpu stat counter for APIC ICR read tries
pci, x86/io-apic: Allow PCI_IOAPIC to be user configurable on x86
x86: Fix the !CONFIG_NUMA build of the new CPU ID fixup code support
x86: Add NumaChip support
x86: Add x86_init platform override to fix up NUMA core numbering
x86: Make flat_init_apic_ldr() available
This factors out the AMD native MMCONFIG discovery so we can use it
outside amd_bus.c.
amd_bus.c reads AMD MSRs so it can remove the MMCONFIG area from the
PCI resources. We may also need the MMCONFIG information to work
around BIOS defects in the ACPI MCFG table.
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: stable@kernel.org # 2.6.34+
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file,
and it fixes the build error in the arch/x86/kernel/microcode_core.c
file, that the merge did not catch.
The microcode_core.c patch was provided by Stephen Rothwell
<sfr@canb.auug.org.au> who was invaluable in the merge issues involved
with the large sysdev removal process in the driver-core tree.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (106 commits)
perf kvm: Fix copy & paste error in description
perf script: Kill script_spec__delete
perf top: Fix a memory leak
perf stat: Introduce get_ratio_color() helper
perf session: Remove impossible condition check
perf tools: Fix feature-bits rework fallout, remove unused variable
perf script: Add generic perl handler to process events
perf tools: Use for_each_set_bit() to iterate over feature flags
perf tools: Unify handling of features when writing feature section
perf report: Accept fifos as input file
perf tools: Moving code in some files
perf tools: Fix out-of-bound access to struct perf_session
perf tools: Continue processing header on unknown features
perf tools: Improve macros for struct feature_ops
perf: builtin-record: Document and check that mmap_pages must be a power of two.
perf: builtin-record: Provide advice if mmap'ing fails with EPERM.
perf tools: Fix truncated annotation
perf script: look up thread using tid instead of pid
perf tools: Look up thread names for system wide profiling
perf tools: Fix comm for processes with named threads
...
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
cpu: Export cpu_up()
rcu: Apply ACCESS_ONCE() to rcu_boost() return value
Revert "rcu: Permit rt_mutex_unlock() with irqs disabled"
docs: Additional LWN links to RCU API
rcu: Augment rcu_batch_end tracing for idle and callback state
rcu: Add rcutorture tests for srcu_read_lock_raw()
rcu: Make rcutorture test for hotpluggability before offlining CPUs
driver-core/cpu: Expose hotpluggability to the rest of the kernel
rcu: Remove redundant rcu_cpu_stall_suppress declaration
rcu: Adaptive dyntick-idle preparation
rcu: Keep invoking callbacks if CPU otherwise idle
rcu: Irq nesting is always 0 on rcu_enter_idle_common
rcu: Don't check irq nesting from rcu idle entry/exit
rcu: Permit dyntick-idle with callbacks pending
rcu: Document same-context read-side constraints
rcu: Identify dyntick-idle CPUs on first force_quiescent_state() pass
rcu: Remove dynticks false positives and RCU failures
rcu: Reduce latency of rcu_prepare_for_idle()
rcu: Eliminate RCU_FAST_NO_HZ grace-period hang
rcu: Avoid needlessly IPIing CPUs at GP end
...
* 'core-memblock-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
memblock: Reimplement memblock allocation using reverse free area iterator
memblock: Kill early_node_map[]
score: Use HAVE_MEMBLOCK_NODE_MAP
s390: Use HAVE_MEMBLOCK_NODE_MAP
mips: Use HAVE_MEMBLOCK_NODE_MAP
ia64: Use HAVE_MEMBLOCK_NODE_MAP
SuperH: Use HAVE_MEMBLOCK_NODE_MAP
sparc: Use HAVE_MEMBLOCK_NODE_MAP
powerpc: Use HAVE_MEMBLOCK_NODE_MAP
memblock: Implement memblock_add_node()
memblock: s/memblock_analyze()/memblock_allow_resize()/ and update users
memblock: Track total size of regions automatically
powerpc: Cleanup memblock usage
memblock: Reimplement memblock_enforce_memory_limit() using __memblock_remove()
memblock: Make memblock functions handle overflowing range @size
memblock: Reimplement __memblock_remove() using memblock_isolate_range()
memblock: Separate out memblock_isolate_range() from memblock_set_node()
memblock: Kill memblock_init()
memblock: Kill sentinel entries at the end of static region arrays
memblock: Add __memblock_dump_all()
...
both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Action required data path signature is defined in table 15-19 of SDM:
+-----------------------------------------------------------------------------+
| SRAR Error | Valid | OVER | UC | EN | MISCV | ADDRV | PCC | S | AR | MCACOD |
| Data Load | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0x134 |
+-----------------------------------------------------------------------------+
Recognise this, and pass MCE_AR_SEVERITY code back to do_machine_check() if
we have the action handler configured (CONFIG_MEMORY_FAILURE=y)
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
All non-urgent actions (reporting low severity errors and handling
"action-optional" errors) are now handled by a work queue. This
means that TIF_MCE_NOTIFY can be used to block execution for a
thread experiencing an "action-required" fault until we get all
cpus out of the machine check handler (and the thread that hit
the fault into mce_notify_process().
We use the new mce_{save,find,clear}_info() API to get information
from do_machine_check() to mce_notify_process(), and then use the
newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle
the error (possibly signalling the process).
Update some comments to make the new code flows clearer.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Machine checks on Intel cpus interrupt execution on all cpus, regardless
of interrupt masking. We have a need to save some data about the cause
of the machine check (physical address) in the machine check handler that
can be retrieved later to attempt recovery in a more flexible execution
state.
Signed-off-by: Tony Luck <tony.luck@intel.com>
The MCI_STATUS_MISCV and MCI_STATUS_ADDRV bits in the bank status
registers define whether the MISC and ADDR registers respectively
contain valid data - provide a helper function to check these bits
and read the registers when needed.
In addition, processors that support software error recovery (as
indicated by the MCG_SER_P bit in the MCG_CAP register) may include
some undefined bits in the ADDR register - mask these out.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
There is only one caller of memory_failure(), all other users call
__memory_failure() and pass in the flags argument explicitly. The
lone user of memory_failure() will soon need to pass flags too.
Add flags argument to the callsite in mce.c. Delete the old memory_failure()
function, and then rename __memory_failure() without the leading "__".
Provide clearer message when action optional memory errors are ignored.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
This has not been used for some years now. It's time to remove it.
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
If the x2apic mode is disabled for reasons like interrupt-remapping
not available etc, then we need to skip the logical cpu bringup of
apic-id's >= 255. Otherwise as the platform is in xapic mode, init/startup
IPI's will consider only the low 8-bits and there is a possibility of
re-sending init/startup IPI's to the logical cpu that is already online.
This will avoid potential reboots/unpredictable behavior etc.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/20111222014632.702932458@sbsiddha-desk.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Currently "nox2apic" boot parameter was not enabling x2apic mode if the cpu,
kernel are all capable of enabling x2apic mode and the OS handover
happened in xapic mode.
However If the bios enabled x2apic prior to OS handover, using "nox2apic"
boot parameter had no effect.
If the boot cpu's apicid is < 255, enable "nox2apic" boot parameter to
disable the x2apic mode setup by the bios. This will enable the kernel to
fallback to xapic mode and bringup only the cpu's which has apic-id < 255.
-v2: fix patch error and two compiling warning
make disable_x2apic to be __init
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/CAE9FiQUeB-3uxJAMiHsz=uPWoFv5Hg1pVepz7aU6YtqOxMC-=Q@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
On some of the recent Intel SNB platforms, by default bios is pre-enabling
x2apic mode in the cpu with out setting up interrupt-remapping.
This case was resulting in the kernel to panic as the cpu is already in
x2apic mode but the OS was not able to enable interrupt-remapping (which
is a pre-req for using x2apic capability).
On these platforms all the apic-ids are < 255 and the kernel can fallback to
xapic mode if the bios has not enabled interrupt-remapping (which is
mostly the case if the bios has not exported interrupt-remapping tables to the
OS).
Reported-by: Berck E. Nash <flyboy@gmail.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20111222014632.600418637@sbsiddha-desk.sc.intel.com
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If the x2apic feature is not present (either the cpu is not capable of it
or the user has disabled the feature using boot-parameter etc), ignore the
x2apic MADT and SRAT entries provided by the ACPI tables.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20111222014632.540896503@sbsiddha-desk.sc.intel.com
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Currently we start with the default apic_flat mode and switch to some other
apic model depending on the apic drivers acpi_madt_oem_check() routines and
later followed by the apic drivers probe() routines.
Once we selected non flat mode there was no case where we fall back to
flat mode again.
Upcoming changes allow bios-enabled x2apic mode to be disabled by the OS
if interrupt-remapping etc is not setup properly by the bios.
We now has a case for the apic to fall back to legacy flat mode during
apic driver probe() seqeuence. Add a simple flat_probe() which allows
the apic_flat mode to be the last fallback option.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20111222014632.484984298@sbsiddha-desk.sc.intel.com
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Use raw_spin_unlock_irqrestore() as equivalent to
raw_spin_lock_irqsave().
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1324646665-13334-1-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The sysdev.h file should not be needed by any in-kernel code, so remove
the .h file from these random files that seem to still want to include
it.
The sysdev code will be going away soon, so this include needs to be
removed no matter what.
Cc: Jiandong Zheng <jdzheng@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Kukjin Kim <kgene.kim@samsung.com>
Cc: David Brown <davidb@codeaurora.org>
Cc: Daniel Walker <dwalker@fifo99.com>
Cc: Bryan Huntsman <bryanh@codeaurora.org>
Cc: Ben Dooks <ben-linux@fluff.org>
Cc: Wan ZongShun <mcuos.com@gmail.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "Venkatesh Pallipadi
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.
Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Mathieu Desnoyers pointed out a case that can cause issues with
NMIs running on the debug stack:
int3 -> interrupt -> NMI -> int3
Because the interrupt changes the stack, the NMI will not see that
it preempted the debug stack. Looking deeper at this case,
interrupts only happen when the int3 is from userspace or in
an a location in the exception table (fixup).
userspace -> int3 -> interurpt -> NMI -> int3
All other int3s that happen in the kernel should be processed
without ever enabling interrupts, as the do_trap() call will
panic the kernel if it is called to process any other location
within the kernel.
Adding a counter around the sections that enable interrupts while
using the debug stack allows the NMI to also check that case.
If the NMI sees that it either interrupted a task using the debug
stack or the debug counter is non-zero, then it will have to
change the IDT table to make the int3 not change stacks (which will
corrupt the stack if it does).
Note, I had to move the debug_usage functions out of processor.h
and into debugreg.h because of the static inlined functions to
inc and dec the debug_usage counter. __get_cpu_var() requires
smp.h which includes processor.h, and would fail to build.
Link: http://lkml.kernel.org/r/1323976535.23971.112.camel@gandalf.stny.rr.com
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
With i386, NMIs and breakpoints use the current stack and they
do not reset the stack pointer to a fix point that might corrupt
a previous NMI or breakpoint (as it does in x86_64). But NMIs are
still not made to be re-entrant, and need to prevent the case that
an NMI hitting a breakpoint (which does an iret), doesn't allow
another NMI to run.
The fix is to let the NMI be in 3 different states:
1) not running
2) executing
3) latched
When no NMI is executing on a given CPU, the state is "not running".
When the first NMI comes in, the state is switched to "executing".
On exit of that NMI, a cmpxchg is performed to switch the state
back to "not running" and if that fails, the NMI is restarted.
If a breakpoint is hit and does an iret, which re-enables NMIs,
and another NMI comes in before the first NMI finished, it will
detect that the state is not in the "not running" state and the
current NMI is nested. In this case, the state is switched to "latched"
to let the interrupted NMI know to restart the NMI handler, and
the nested NMI exits without doing anything.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We want to allow NMI handlers to have breakpoints to be able to
remove stop_machine from ftrace, kprobes and jump_labels. But if
an NMI interrupts a current breakpoint, and then it triggers a
breakpoint itself, it will switch to the breakpoint stack and
corrupt the data on it for the breakpoint processing that it
interrupted.
Instead, have the NMI check if it interrupted breakpoint processing
by checking if the stack that is currently used is a breakpoint
stack. If it is, then load a special IDT that changes the IST
for the debug exception to keep the same stack in kernel context.
When the NMI is done, it puts it back.
This way, if the NMI does trigger a breakpoint, it will keep
using the same stack and not stomp on the breakpoint data for
the breakpoint it interrupted.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In x86, when an NMI goes off, the CPU goes into an NMI context that
prevents other NMIs to trigger on that CPU. If an NMI is suppose to
trigger, it has to wait till the previous NMI leaves NMI context.
At that time, the next NMI can trigger (note, only one more NMI will
trigger, as only one can be latched at a time).
The way x86 gets out of NMI context is by calling iret. The problem
with this is that this causes problems if the NMI handle either
triggers an exception, or a breakpoint. Both the exception and the
breakpoint handlers will finish with an iret. If this happens while
in NMI context, the CPU will leave NMI context and a new NMI may come
in. As NMI handlers are not made to be re-entrant, this can cause
havoc with the system, not to mention, the nested NMI will write
all over the previous NMI's stack.
Linus Torvalds proposed the following workaround to this problem:
https://lkml.org/lkml/2010/7/14/264
"In fact, I wonder if we couldn't just do a software NMI disable
instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like
NMI entry:
- load percpu NMI stack frame pointer
- if non-zero we know we're nested, and should ignore this NMI:
- we're returning to kernel mode, so return immediately by using
"popf/ret", which also keeps NMI's disabled in the hardware until the
"real" NMI iret happens.
- before the popf/iret, use the NMI stack pointer to make the NMI
return stack be invalid and cause a fault
- set the NMI stack pointer to the current stack pointer
NMI exit (not the above "immediate exit because we nested"):
clear the percpu NMI stack pointer
Just do the iret.
Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
we'll take a fault, and that re-does our "delayed" NMI - and NMI's
will stay masked.
And if we didn't have a nested NMI, that iret will now unmask NMI's,
and everything is happy."
I first tried to follow this advice but as I started implementing this
code, a few gotchas showed up.
One, is accessing per-cpu variables in the NMI handler.
The problem is that per-cpu variables use the %gs register to get the
variable for the given CPU. But as the NMI may happen in userspace,
we must first perform a SWAPGS to get to it. The NMI handler already
does this later in the code, but its too late as we have saved off
all the registers and we don't want to do that for a disabled NMI.
Peter Zijlstra suggested to keep all variables on the stack. This
simplifies things greatly and it has the added benefit of cache locality.
Two, faulting on the iret.
I really wanted to make this work, but it was becoming very hacky, and
I never got it to be stable. The iret already had a fault handler for
userspace faulting with bad segment registers, and getting NMI to trigger
a fault and detect it was very tricky. But for strange reasons, the system
would usually take a double fault and crash. I never figured out why
and decided to go with a simple "jmp" approach. The new approach I took
also simplified things.
Finally, the last problem with Linus's approach was to have the nested
NMI handler do a ret instead of an iret to give the first NMI NMI-context
again.
The problem is that ret is much more limited than an iret. I couldn't figure
out how to get the stack back where it belonged. I could have copied the
current stack, pushed the return onto it, but my fear here is that there
may be some place that writes data below the stack pointer. I know that
is not something code should depend on, but I don't want to chance it.
I may add this feature later, but for now, an NMI handler that loses NMI
context will not get it back.
Here's what is done:
When an NMI comes in, the HW pushes the interrupt stack frame onto the
per cpu NMI stack that is selected by the IST.
A special location on the NMI stack holds a variable that is set when
the first NMI handler runs. If this variable is set then we know that
this is a nested NMI and we process the nested NMI code.
There is still a race when this variable is cleared and an NMI comes
in just before the first NMI does the return. For this case, if the
variable is cleared, we also check if the interrupted stack is the
NMI stack. If it is, then we process the nested NMI code.
Why the two tests and not just test the interrupted stack?
If the first NMI hits a breakpoint and loses NMI context, and then it
hits another breakpoint and while processing that breakpoint we get a
nested NMI. When processing a breakpoint, the stack changes to the
breakpoint stack. If another NMI comes in here we can't rely on the
interrupted stack to be the NMI stack.
If the variable is not set and the interrupted task's stack is not the
NMI stack, then we know this is the first NMI and we can process things
normally. But in order to do so, we need to do a few things first.
1) Set the stack variable that tells us that we are in an NMI handler
2) Make two copies of the interrupt stack frame.
One copy is used to return on iret
The other is used to restore the first one if we have a nested NMI.
This is what the stack will look like:
+-------------------------+
| original SS |
| original Return RSP |
| original RFLAGS |
| original CS |
| original RIP |
+-------------------------+
| temp storage for rdx |
+-------------------------+
| NMI executing variable |
+-------------------------+
| Saved SS |
| Saved Return RSP |
| Saved RFLAGS |
| Saved CS |
| Saved RIP |
+-------------------------+
| copied SS |
| copied Return RSP |
| copied RFLAGS |
| copied CS |
| copied RIP |
+-------------------------+
| pt_regs |
+-------------------------+
The original stack frame contains what the HW put in when we entered
the NMI.
We store %rdx as a temp variable to use. Both the original HW stack
frame and this %rdx storage will be clobbered by nested NMIs so we
can not rely on them later in the first NMI handler.
The next item is the special stack variable that is set when we execute
the rest of the NMI handler.
Then we have two copies of the interrupt stack. The second copy is
modified by any nested NMIs to let the first NMI know that we triggered
a second NMI (latched) and that we should repeat the NMI handler.
If the first NMI hits an exception or breakpoint that takes it out of
NMI context, if a second NMI comes in before the first one finishes,
it will update the copied interrupt stack to point to a fix up location
to trigger another NMI.
When the first NMI calls iret, it will instead jump to the fix up
location. This fix up location will copy the saved interrupt stack back
to the copy and execute the nmi handler again.
Note, the nested NMI knows enough to check if it preempted a previous
NMI handler while it is in the fixup location. If it has, it will not
modify the copied interrupt stack and will just leave as if nothing
happened. As the NMI handle is about to execute again, there's no reason
to latch now.
To test all this, I forced the NMI handler to call iret and take itself
out of NMI context. I also added assemble code to write to the serial to
make sure that it hits the nested path as well as the fix up path.
Everything seems to be working fine.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Linus cleaned up the NMI handler but it still needs some comments to
explain why it uses save_paranoid but not paranoid_exit. Just to keep
others from adding that in the future, document why it's not used.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The NMI handler uses the paranoid_exit routine that checks the
NEED_RESCHED flag, and if it is set and the return is for userspace,
then interrupts are enabled, the stack is swapped to the thread's stack,
and schedule is called. The problem with this is that we are still in an
NMI context until an iret is executed. This means that any new NMIs are
now starved until an interrupt or exception occurs and does the iret.
As NMIs can not be masked and can interrupt any location, they are
treated as a special case. NEED_RESCHED should not be set in an NMI
handler. The interruption by the NMI should not disturb the work flow
for scheduling. Any IPI sent to a processor after sending the
NEED_RESCHED would have to wait for the NMI anyway, and after the IPI
finishes the schedule would be called as required.
There is no reason to do anything special leaving an NMI. Remove the
call to paranoid_exit and do a simple return. This not only fixes the
bug of starved NMIs, but it also cleans up the code.
Link: http://lkml.kernel.org/r/CA+55aFzgM55hXTs4griX5e9=v_O+=ue+7Rj0PTD=M7hFYpyULQ@mail.gmail.com
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Extend the mmap control page with fields so that userspace can compute
time deltas relative to the provided time fields.
Currently only implemented for x86 with constant and nonstop TSC.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Link: http://lkml.kernel.org/n/tip-3u1jucza77j3wuvs0x2bic0f@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement a correct pmu::event_idx for the x86 counter index rules and
set CR4.PCE on CPU_STARTING.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-mwxab34dibqgzk5zywutfnha@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add event maps for Intel x86 processors (with architected PMU v2 or later).
On AMD, there is frequency scaling but no Turbo. There is no core
cycle event not subject to frequency scaling, therefore we do not
provide a mapping.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323559734-3488-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the encoding and definitions necessary for the
unhalted_reference_cycles event avaialble since Intel Core 2 processors.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323559734-3488-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Several fields in struct cpuinfo_x86 were not defined for the
!SMP case, likely to save space. However, those fields still
have some meaning for UP, and keeping them allows some #ifdef
removal from other files. The additional size of the UP kernel
from this change is not significant enough to worry about
keeping up the distinction:
text data bss dec hex filename
4737168 506459 972040 6215667 5ed7f3 vmlinux.o.before
4737444 506459 972040 6215943 5ed907 vmlinux.o.after
for a difference of 276 bytes for an example UP config.
If someone wants those 276 bytes back badly then it should
be implemented in a cleaner way.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Steffen Persvold <sp@numascale.com>
Link: http://lkml.kernel.org/r/1324428742-12498-1-git-send-email-kjwinchester@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When printing the code bytes in show_registers(), the markers around the
byte at the fault address could make the printk() format string look
like a valid log level and facility code. This would prevent this byte
from being printed and result in a spurious newline:
[ 7555.765589] Code: 8b 32 e9 94 00 00 00 81 7d 00 ff 00 00 00 0f 87 96 00 00 00 48 8b 83 c0 00 00 00 44 89 e2 44 89 e6 48 89 df 48 8b 80 d8 02 00 00
[ 7555.765683] 8b 48 28 48 89 d0 81 e2 ff 0f 00 00 48 c1 e8 0c 48 c1 e0 04
Add KERN_CONT where needed, and elsewhere in show_registers() for
consistency.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Link: http://lkml.kernel.org/r/4EEFA7AE.9020407@ladisch.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LAPIC related statistics are grouped inside the per-cpu
structure irq_stat, so there is no need for icr_read_retry_count
to be a standalone per-cpu variable.
This patch moves icr_read_retry_count to where it belongs.
Suggested-y: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Cc: Jörn Engel <joern@logfs.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mce-inject provides a mechanism to simulate errors so that test
scripts can check for correct operation of the kernel without
requiring any specialized hardware to create rare events.
The existing code can simulate events in normal process context
and also in NMI context - but not in IRQ context. This patch
fills that gap.
Link: https://lkml.org/lkml/2011/12/7/537
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
My box with following cpuinfo needs the cx8 enabling still:
vendor_id : CentaurHauls
cpu family : 6
model : 13
model name : VIA Eden Processor 1200MHz
stepping : 0
cpu MHz : 1199.940
cache size : 128 KB
This fixes valgrind to work on my box (it requires and checks
cx8 from cpuinfo).
Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Link: http://lkml.kernel.org/r/1323961888-10223-1-git-send-email-timo.teras@iki.fi
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add the flags to get rid of the [9] and [10] feature names
in cpuinfo's 'power management' fields and replace them with
meaningful names.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1323875574-17881-1-git-send-email-joerg.roedel@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Thermal throttle and power limit events are not defined as MCE errors in x86
architecture and should not generate MCE errors in mcelog.
Current kernel generates fake software defined MCE errors for these events.
This may confuse users because they may think the machine has real MCE errors
while actually only thermal throttle or power limit events happen.
To make it worse, buggy firmware on some platforms may falsely generate
the events. Therefore, kernel reports MCE errors which users think as real
hardware errors. Although the firmware bugs should be fixed, on the other hand,
kernel should not report MCE errors either.
So mcelog is not a good mechanism to report these events. To report the events, we count them in respective counters (core_power_limit_count,
package_power_limit_count, core_throttle_count, and package_throttle_count) in
/sys/devices/system/cpu/cpu#/thermal_throttle/. Users can check the counters
for each event on each CPU. Please note that all CPU's on one package report
duplicate counters. It's user application's responsibity to retrieve a package
level counter for one package.
This patch doesn't report package level power limit, core level power limit, and
package level thermal throttle events in mcelog. When the events happen, only
report them in respective counters in sysfs.
Since core level thermal throttle has been legacy code in kernel for a while and
users accepted it as MCE error in mcelog, core level thermal throttle is still
reported in mcelog. In the mean time, the event is counted in a counter in sysfs
as well.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Borislav Petkov <bp@amd64.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20111215001945.GA21009@linux-os.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add a function which drains whatever MCEs were logged in already during
boot and before the decoder chains were registered.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
No functionality change, this is done so that in a follow-on patch all
queued-up MCEs can be decoded after registering on the chain.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Once we've found and validated the ucode patch for the current CPU,
there's no need to iterate over the remaining patches in the binary
image. Exit then and save us a bunch of cycles.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Basically, what we did until now is take out a chunk of the firmware
image, vmalloc space for it and inspect it before application. And
repeat.
This patch changes all that so that we look at each ucode patch from
the firmware image, check it for sanity and copy it to local buffer for
application only once and if it passes all checks. Thus, vmalloc-ing for
each piece is gone, we can do proper size checking only of the patch
which is destined for the CPU of the current machine instead of each
single patch, which is clearly wrong.
Oh yeah, simplify and cleanup the code while at it, along with adding
comments as to what actually happens.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add a simple 4K page which gets allocated on driver init and freed on
driver exit instead of vmalloc'ing small buffers for each ucode patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
In the IPI delivery slow path (NMI delivery) we retry the ICR
read to check for delivery completion a limited number of times.
[ The reason for the limited retries is that some of the places
where it is used (cpu boot, kdump, etc) IPI delivery might not
succeed (due to a firmware bug or system crash, for example)
and in such a case it is better to give up and resume
execution of other code. ]
This patch adds a new entry to /proc/interrupts, RTR, which
tells user space the number of times we retried the ICR read in
the IPI delivery slow path.
This should give some insight into how well the APIC
message delivery hardware is working - if the counts are way
too large then we are hitting a (very-) slow path way too
often.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Cc: Jörn Engel <joern@logfs.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/n/tip-vzsp20lo2xdzh5f70g0eis2s@git.kernel.org
[ extended the changelog ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is currently a large divide between kernel development and the
development of EFI boot loaders. The idea behind this patch is to give
the kernel developers full control over the EFI boot process. As
H. Peter Anvin put it,
"The 'kernel carries its own stub' approach been very successful in
dealing with BIOS, and would make a lot of sense to me for EFI as
well."
This patch introduces an EFI boot stub that allows an x86 bzImage to
be loaded and executed by EFI firmware. The bzImage appears to the
firmware as an EFI application. Luckily there are enough free bits
within the bzImage header so that it can masquerade as an EFI
application, thereby coercing the EFI firmware into loading it and
jumping to its entry point. The beauty of this masquerading approach
is that both BIOS and EFI boot loaders can still load and run the same
bzImage, thereby allowing a single kernel image to work in any boot
environment.
The EFI boot stub supports multiple initrds, but they must exist on
the same partition as the bzImage. Command-line arguments for the
kernel can be appended after the bzImage name when run from the EFI
shell, e.g.
Shell> bzImage console=ttyS0 root=/dev/sdb initrd=initrd.img
v7:
- Fix checkpatch warnings.
v6:
- Try to allocate initrd memory just below hdr->inird_addr_max.
v5:
- load_options_size is UTF-16, which needs dividing by 2 to convert
to the corresponding ASCII size.
v4:
- Don't read more than image->load_options_size
v3:
- Fix following warnings when compiling CONFIG_EFI_STUB=n
arch/x86/boot/tools/build.c: In function ‘main’:
arch/x86/boot/tools/build.c:138:24: warning: unused variable ‘pe_header’
arch/x86/boot/tools/build.c:138:15: warning: unused variable ‘file_sz’
- As reported by Matthew Garrett, some Apple machines have GOPs that
don't have hardware attached. We need to weed these out by
searching for ones that handle the PCIIO protocol.
- Don't allocate memory if no initrds are on cmdline
- Don't trust image->load_options_size
Maarten Lankhorst noted:
- Don't strip first argument when booted from efibootmgr
- Don't allocate too much memory for cmdline
- Don't update cmdline_size, the kernel considers it read-only
- Don't accept '\n' for initrd names
v2:
- File alignment was too large, was 8192 should be 512. Reported by
Maarten Lankhorst on LKML.
- Added UGA support for graphics
- Use VIDEO_TYPE_EFI instead of hard-coded number.
- Move linelength assignment until after we've assigned depth
- Dynamically fill out AddressOfEntryPoint in tools/build.c
- Don't use magic number for GDT/TSS stuff. Requested by Andi Kleen
- The bzImage may need to be relocated as it may have been loaded at
a high address address by the firmware. This was required to get my
macbook booting because the firmware loaded it at 0x7cxxxxxx, which
triggers this error in decompress_kernel(),
if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
error("Destination address too large");
Cc: Mike Waychison <mikew@google.com>
Cc: Matthew Garrett <mjg@redhat.com>
Tested-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1321383097.2657.9.camel@mfleming-mobl1.ger.corp.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This hangs my MacBook Air at boot time; I get no console
messages at all. I reverted this on top of -rc5 and my machine
boots again.
This reverts commit e8c7106280.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Keith Packard <keithp@keithp.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Those two APIs were provided to optimize the calls of
tick_nohz_idle_enter() and rcu_idle_enter() into a single
irq disabled section. This way no interrupt happening in-between would
needlessly process any RCU job.
Now we are talking about an optimization for which benefits
have yet to be measured. Let's start simple and completely decouple
idle rcu and dyntick idle logics to simplify.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The idle notifier, called by enter_idle(), enters into rcu read
side critical section but at that time we already switched into
the RCU-idle window (rcu_idle_enter() has been called). And it's
illegal to use rcu_read_lock() in that state.
This results in rcu reporting its bad mood:
[ 1.275635] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110()
[ 1.275635] Hardware name: AMD690VM-FMH
[ 1.275635] Modules linked in:
[ 1.275635] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #252
[ 1.275635] Call Trace:
[ 1.275635] [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0
[ 1.275635] [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20
[ 1.275635] [<ffffffff817d6f22>] __atomic_notifier_call_chain+0xd2/0x110
[ 1.275635] [<ffffffff817d6f71>] atomic_notifier_call_chain+0x11/0x20
[ 1.275635] [<ffffffff810018a0>] enter_idle+0x20/0x30
[ 1.275635] [<ffffffff81001995>] cpu_idle+0xa5/0x110
[ 1.275635] [<ffffffff817a7465>] rest_init+0xe5/0x140
[ 1.275635] [<ffffffff817a73c8>] ? rest_init+0x48/0x140
[ 1.275635] [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc
[ 1.275635] [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135
[ 1.275635] [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4
[ 1.275635] ---[ end trace a22d306b065d4a66 ]---
Fix this by entering rcu extended quiescent state later, just before
the CPU goes to sleep.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
It is assumed that rcu won't be used once we switch to tickless
mode and until we restart the tick. However this is not always
true, as in x86-64 where we dereference the idle notifiers after
the tick is stopped.
To prepare for fixing this, add two new APIs:
tick_nohz_idle_enter_norcu() and tick_nohz_idle_exit_norcu().
If no use of RCU is made in the idle loop between
tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the arch
must instead call the new *_norcu() version such that the arch doesn't
need to call rcu_idle_enter() and rcu_idle_exit().
Otherwise the arch must call tick_nohz_enter_idle() and
tick_nohz_exit_idle() and also call explicitly:
- rcu_idle_enter() after its last use of RCU before the CPU is put
to sleep.
- rcu_idle_exit() before the first use of RCU after the CPU is woken
up.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The tick_nohz_stop_sched_tick() function, which tries to delay
the next timer tick as long as possible, can be called from two
places:
- From the idle loop to start the dytick idle mode
- From interrupt exit if we have interrupted the dyntick
idle mode, so that we reprogram the next tick event in
case the irq changed some internal state that requires this
action.
There are only few minor differences between both that
are handled by that function, driven by the ts->inidle
cpu variable and the inidle parameter. The whole guarantees
that we only update the dyntick mode on irq exit if we actually
interrupted the dyntick idle mode, and that we enter in RCU extended
quiescent state from idle loop entry only.
Split this function into:
- tick_nohz_idle_enter(), which sets ts->inidle to 1, enters
dynticks idle mode unconditionally if it can, and enters into RCU
extended quiescent state.
- tick_nohz_irq_exit() which only updates the dynticks idle mode
when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called).
To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed
into tick_nohz_idle_exit().
This simplifies the code and micro-optimize the irq exit path (no need
for local_irq_save there). This also prepares for the split between
dynticks and rcu extended quiescent state logics. We'll need this split to
further fix illegal uses of RCU in extended quiescent states in the idle
loop.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Introduce a symbol, EFI_LOADER_SIGNATURE instead of using the magic
strings, which also helps to reduce the amount of ifdeffery.
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1318848017-12301-1-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
in ->attribute we currently call set_memory_uc(), which in turn
calls __pa() on a potentially ioremap'd address.
On CONFIG_X86_32 this is invalid, resulting in the following
oops on some machines:
BUG: unable to handle kernel paging request at f7f22280
IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
[...]
Call Trace:
[<c104f8ca>] ? page_is_ram+0x1a/0x40
[<c1025aff>] reserve_memtype+0xdf/0x2f0
[<c1024dc9>] set_memory_uc+0x49/0xa0
[<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
[<c19216d4>] start_kernel+0x291/0x2f2
[<c19211c7>] ? loglevel+0x1b/0x1b
[<c19210bf>] i386_start_kernel+0xbf/0xc8
A better approach to this problem is to map the memory region
with the correct attributes from the start, instead of modifying
it after the fact. The uncached case can be handled by
ioremap_nocache() and the cached by ioremap_cache().
Despite first impressions, it's not possible to use
ioremap_cache() to map all cached memory regions on
CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
don't like being mapped into the vmalloc space, as detailed in
the following bug report,
https://bugzilla.redhat.com/show_bug.cgi?id=748516
Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
regions are covered by the direct kernel mapping table on
CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
regions via the direct kernel mapping with the initial call to
init_memory_mapping() in setup_arch(), whereas previously these
regions wouldn't be mapped if they were after the last E820_RAM
region until efi_ioremap() was called. Doing it this way allows
us to delete efi_ioremap() completely.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The get_cpu_cap() external function prototype was declared twice
so lose one of them.
Clean up the header guard while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1322594083-14507-1-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When HPET is operating in RTC mode, the TN_ENABLE bit on timer1
controls whether the HPET or the RTC delivers interrupts to irq8. When
the system goes into suspend, the RTC driver sends a signal to the
HPET driver so that the HPET releases control of irq8, allowing the
RTC to wake the system from suspend. The switchover is accomplished by
a write to the HPET configuration registers which currently only
occurs while servicing the HPET interrupt.
On some systems, I have seen the system suspend before an HPET
interrupt occurs, preventing the write to the HPET configuration
register and leaving the HPET in control of the irq8. As the HPET is
not active during suspend, it does not generate a wake signal and RTC
alarms do not work.
This patch forces the HPET driver to immediately transfer control of
the irq8 channel to the RTC instead of waiting until the next
interrupt event.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
The only function of memblock_analyze() is now allowing resize of
memblock region arrays. Rename it to memblock_allow_resize() and
update its users.
* The following users remain the same other than renaming.
arm/mm/init.c::arm_memblock_init()
microblaze/kernel/prom.c::early_init_devtree()
powerpc/kernel/prom.c::early_init_devtree()
openrisc/kernel/prom.c::early_init_devtree()
sh/mm/init.c::paging_init()
sparc/mm/init_64.c::paging_init()
unicore32/mm/init.c::uc32_memblock_init()
* In the following users, analyze was used to update total size which
is no longer necessary.
powerpc/kernel/machine_kexec.c::reserve_crashkernel()
powerpc/kernel/prom.c::early_init_devtree()
powerpc/mm/init_32.c::MMU_init()
powerpc/mm/tlb_nohash.c::__early_init_mmu()
powerpc/platforms/ps3/mm.c::ps3_mm_add_memory()
powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups()
sh/kernel/machine_kexec.c::reserve_crashkernel()
* x86/kernel/e820.c::memblock_x86_fill() was directly setting
memblock_can_resize before populating memblock and calling analyze
afterwards. Call memblock_allow_resize() before start populating.
memblock_can_resize is now static inside memblock.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "H. Peter Anvin" <hpa@zytor.com>
memblock_init() initializes arrays for regions and memblock itself;
however, all these can be done with struct initializers and
memblock_init() can be removed. This patch kills memblock_init() and
initializes memblock with struct initializer.
The only difference is that the first dummy entries don't have .nid
set to MAX_NUMNODES initially. This doesn't cause any behavior
difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Gcc complains if we don't cast this to a struct cpumask pointer.
arch/x86/kernel/nmi_selftest.c:93:2:
warning: passing argument 1 of ‘cpumask_empty’ from
incompatible pointer type [enabled by default]
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/20111207110612.GA3437@mwanda
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It seems that a margin for stack overflow checking is added to
top of a kernel stack but is not added to IRQ and exception
stacks in stack_overflow_check(). Therefore, the overflows of
IRQ and exception stacks are always detected only after they
actually occurred and data corruption might occur due to them.
This patch adds the margin to top of IRQ and exception stacks
as well as a kernel stack to enhance reliability.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20111207082910.9847.3359.stgit@ltc219.sdl.hitachi.co.jp
[ removed the #undef - we typically don't do that for uncommon names ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
KVM needs to know perf capability to decide which PMU it can expose to a
guest.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-8-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement the disabling of arch events as a quirk so that we can print
a message along with it. This creates some visibility into the problem
space and could allow us to work on adding more work-around like the
AAJ80 one.
Requested-by: Ingo Molnar <mingo@elte.hu>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-wcja2z48wklzu1b0nkz0a5y7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel CPUs report non-available architectural events in cpuid leaf
0AH.EBX. Use it to disable events that are not available according
to CPU.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-7-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/kernel/built-in.o(.text+0x4c71): Section mismatch in
reference from the function arch_jump_label_transform_static() to the
function .init.text:text_poke_early()
The function arch_jump_label_transform_static() references
the function __init text_poke_early().
This is often because arch_jump_label_transform_static lacks a __init
annotation or the annotation of text_poke_early is wrong.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/n/tip-9lefe89mrvurrwpqw5h8xm8z@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86_64 kernel pushes the fake kernel stack in
arch/x86/kernel/entry_64.S:FAKE_STACK_FRAME, and
rflags register in it does not conform to the specification.
Although Intel's manual[1] says bit 1 of it shall be set to 1,
this bit is cleared to 0 on pushing the fake stack.
[1] Intel(R) 64 and IA-32 Architectures Software Developer's Manual
Vol.1 3-21 Figure 3-8. EFLAGS Register
If it is not on purpose, it is better to be fixed, because
it can lead some tools misunderstanding the stack frame. For example,
"crash" utility[2] actually detects it and warns you like
below:
RIP: ffffffff8005dfa2 RSP: ffff8104ce0c7f58 RFLAGS: 00000200
[...]
bt: WARNING: possibly bogus exception frame
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Tested-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This avoids a scheduling failure for cases like:
cycles, cycles, instructions, instructions (on Core2)
Which would end up being programmed like:
PMC0, PMC1, FP-instructions, fail
Because all events will have the same weight.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-8tnwb92asqj7xajqqoty4gel@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current x86 event scheduler fails to resolve scheduling problems
of certain combinations of events and constraints. This happens if the
counter mask of such an event is not a subset of any other counter
mask of a constraint with an equal or higher weight, e.g. constraints
of the AMD family 15h pmu:
counter mask weight
amd_f15_PMC30 0x09 2 <--- overlapping counters
amd_f15_PMC20 0x07 3
amd_f15_PMC53 0x38 3
The scheduler does not find then an existing solution. Here is an
example:
event code counter failure possible solution
0x02E PMC[3,0] 0 3
0x043 PMC[2:0] 1 0
0x045 PMC[2:0] 2 1
0x046 PMC[2:0] FAIL 2
The event scheduler may not select the correct counter in the first
cycle because it needs to know which subsequent events will be
scheduled. It may fail to schedule the events then.
To solve this, we now save the scheduler state of events with
overlapping counter counstraints. If we fail to schedule the events
we rollback to those states and try to use another free counter.
Constraints with overlapping counters are marked with a new introduced
overlap flag. We set the overlap flag for such constraints to give the
scheduler a hint which events to select for counter rescheduling. The
EVENT_CONSTRAINT_OVERLAP() macro can be used for this.
Care must be taken as the rescheduling algorithm is O(n!) which will
increase scheduling cycles for an over-commited system dramatically.
The number of such EVENT_CONSTRAINT_OVERLAP() macros and its counter
masks must be kept at a minimum. Thus, the current stack is limited to
2 states to limit the number of loops the algorithm takes in the worst
case.
On systems with no overlapping-counter constraints, this
implementation does not increase the loop count compared to the
previous algorithm.
V2:
* Renamed redo -> overlap.
* Reimplementation using perf scheduling helper functions.
V3:
* Added WARN_ON_ONCE() if out of save states.
* Changed function interface of perf_sched_restore_state() to use bool
as return value.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch introduces x86 perf scheduler code helper functions. We
need this to later add more complex functionality to support
overlapping counter constraints (next patch).
The algorithm is modified so that the range of weight values is now
generated from the constraints. There shouldn't be other functional
changes.
With the helper functions the scheduler is controlled. There are
functions to initialize, traverse the event list, find unused counters
etc. The scheduler keeps its own state.
V3:
* Added macro for_each_set_bit_cont().
* Changed functions interfaces of perf_sched_find_counter() and
perf_sched_next_event() to use bool as return value.
* Added some comments to make code better understandable.
V4:
* Fix broken event assignment if weight of the first event is not
wmin (perf_sched_init()).
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since there is a possibility of !KPROBES int3 listeners
(such as kgdb) and since DIE_TRAP is currently not being
used by anybody, notify all listeners with DIE_INT3.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20111025142159.GB21225@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do_notify_resume() gets called with interrupts disabled on x86_32. This
is different from the x86_64 behavior, where interrupts are enabled at
the time.
Queries on lkml on this issue hasn't yielded any clear answer. Lets make
x86_32 behave the same as x86_64, unless there is a real reason to
maintain status quo.
Please refer https://lkml.org/lkml/2011/9/27/130 for more
details.
A similar change was suggested in ARM:
https://lkml.org/lkml/2011/8/25/231
My 32-bit machine works fine (tm) with this patch.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20111025141812.GA21225@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I used "ifdef CONFIG_NUMA" simply because it doesn't make
sense in a non-numa configuration even with SMP enabled.
Besides, the only place where it is called right now is
in kernel/cpu/amd.c:srat_detect_node() within the
"CONFIG_NUMA" protected part.
Signed-off-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-2-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intr_remapping: Fix section mismatch in ir_dev_scope_init()
intel-iommu: Fix section mismatch in dmar_parse_rmrr_atsr_dev()
x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions
x86, AMD: Correct align_va_addr documentation
x86/rtc, mrst: Don't register a platform RTC device for for Intel MID platforms
x86/mrst: Battery fixes
x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode
x86: Fix "Acer Aspire 1" reboot hang
x86/mtrr: Resolve inconsistency with Intel processor manual
x86: Document rdmsr_safe restrictions
x86, microcode: Fix the failure path of microcode update driver init code
Add TAINT_FIRMWARE_WORKAROUND on MTRR fixup
x86/mpparse: Account for bus types other than ISA and PCI
x86, mrst: Change the pmic_gpio device type to IPC
mrst: Added some platform data for the SFI translations
x86,mrst: Power control commands update
x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot
x86, UV: Fix UV2 hub part number
x86: Add user_mode_vm check in stack_overflow_check
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix loss of notification with multi-event
perf, x86: Force IBS LVT offset assignment for family 10h
perf, x86: Disable PEBS on SandyBridge chips
trace_events_filter: Use rcu_assign_pointer() when setting ftrace_event_call->filter
perf session: Fix crash with invalid CPU list
perf python: Fix undefined symbol problem
perf/x86: Enable raw event access to Intel offcore events
perf: Don't use -ENOSPC for out of PMU resources
perf: Do not set task_ctx pointer in cpuctx if there are no events in the context
perf/x86: Fix PEBS instruction unwind
oprofile, x86: Fix crash when unloading module (nmi timer mode)
oprofile: Fix crash when unloading module (hr timer mode)
I've received complaints that the numa_node attribute for family
15h model 00-0fh (e.g. Interlagos) northbridge functions shows
-1 instead of the proper node ID.
Correct this with attached quirks (similar to quirks for other
AMD CPU families used in multi-socket systems).
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/20111202072143.GA31916@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tsc=reliable boot parameter is supposed to skip all the TSC
stablility checks during boot time.
On a 8-socket system where we want to run an experiment with the
"tsc=reliable" boot option, TSC synchronization checks are not
getting skipped and marking the TSC as not stable.
Check for tsc_clocksource_reliable (which is set via
tsc=reliable or for platforms supporting synthetic TSC_RELIABLE
feature bit etc) and when set, skip the TSC synchronization
tests during boot.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1320446537.15071.14.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
system_call_after_swapgs doesn't really benefit from forcing
alignment from it - quite the opposite, native code needlessly
so far got a big NOP instruction inserted in front of it. Xen
being the only user of the separate entry point can well live
with the branch going to three bytes into a cache line.
The compatibility mode ptregs entry points for one can make use
of the GLOBAL() macro, and should be suitably aligned. Their
shared continuation point (ia32_ptregs_common) otoh doesn't need
to be global at all, but should continue to be properly aligned.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/4ED4CEEA020000780006407D@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
GET_THREAD_INFO() involves a memory read immediately followed by
an "sub" on the value read, in turn (in several cases)
immediately followed by a use of the calculated value as the
base address of a memory access. This combination of
instructions has a non-negligible potential for stalls.
In the system call entry point code, however, the (fixed) offset
of the stack pointer from the end of the stack is generally
known, and hence we can instead avoid the memory load and
subtract, and instead do the memory reference using %rsp as the
base register. To do so in a legible fashion, introduce a
THREAD_INFO() macro which, provided a register (generally %rsp)
and the known offset from the end of the stack, produces a
suitable memory access operand.
The patch attempts to only touch the fast paths (no auditing and
alike), but manages to do so only in the 64-bit entry point
case; the compatibility mode entry points have so many
interdependencies between their various branch targets that it
was necessary to also adjust the slow paths to eliminate the
risk of having missed some register dependency during code
analysis.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/4ED4CD690200007800064075@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Previously these up to 32 entry points, consisting of all the
same code except for their very first instruction, consumed 0x70
bytes per instance. Just like for device interrupt entry points,
fold them together so that they all use a single instance of the
code after having pushed their vector indicator (resulting in
0x10 bytes per instance, to retain 16-byte alignment of the
individual entry points).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/4ED4CA230200007800064065@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Testing for a return to ring 0 was necessary here solely because
of the branch out of ret_from_fork. That branch, however, can be
directed to retint_restore_args, and thus the test-and-branch
can be eliminated here.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/4ED4C7EE0200007800064028@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Adds support for Numascale NumaChip large-SMP systems. It is
needed to enable the booting of more than ~168 cores.
v2:
- [Steffen] enumerate only accessible northbridges
- [Daniel] rediffed and validated against 3.1-rc10
v3:
- [Daniel] use x86_init core numbering override
- [Daniel] cleanups as per feedback
v4:
- [Daniel] use updated x86_cpuinit override
v5:
- drop disabling interrupts locally, as ISR write is atomic; drop delay
- added read-mostly annotations where appropriate
- require CONFIG_SMP, so drop conditional path
Workload tested on 96 cores/16 sockets.
Signed-off-by: Steffen Persvold <sp@numascale.com>
Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323101246-2400-1-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add an x86_init vector for handling inconsistent core numbering.
This is useful for multi-fabric platforms, such as Numascale
NumaConnect.
v2:
- use struct x86_cpuinit_ops
- provide default fall-back function to warn
Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Steffen Persvold <sp@numascale.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-2-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow flat_init_apic_ldr() to be used outside the compilation
unit for similar APIC implementations.
Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Steffen Persvold <sp@numascale.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-1-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reduce the startup time for slave cpus.
Adds hooks for an arch-specific function for clock calibration.
These hooks are used on x86. If a newly started cpu has the
same phys_proc_id as a core already active, uses the TSC for the
delay loop and has a CONSTANT_TSC, use the already-calculated
value of loops_per_jiffy.
This patch reduces the time required to start slave cpus on a
4096 cpu system from: 465 sec OLD 62 sec NEW
This reduces boot time on a 4096p system by almost 7 minutes.
Nice...
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Stultz <john.stultz@linaro.org>
[fix CONFIG_SMP=n build]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The last parameter to sort() is a pointer to the function used
to swap items. This parameter should be NULL, not 0, when not
used. This quiets the following sparse warning:
warning: Using plain integer as NULL pointer
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: hartleys@visionengravers.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel MID x86 platforms have a memory mapped virtual RTC
instead. No MID platform have the default ports (and
accessing them may do weird stuff).
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: feng.tang@intel.com
Cc: Feng Tang <feng.tang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace the bubble sort in sanitize_e820_map() with a call to
the generic kernel sort function to avoid pathological
performance with large maps.
On large (thousands of entries) E820 maps, the previous code
took minutes to run; with this change it's now milliseconds.
Signed-off-by: Mike Ditto <mditto@google.com>
Cc: sassmann@kpanic.de
Cc: yuenn@google.com
Cc: Stefan Assmann <sassmann@kpanic.de>
Cc: Nancy Yuen <yuenn@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ptrace_set_debugreg() is only used in this file and should be
static. This also quiets the following sparse warning:
warning: symbol 'ptrace_set_debugreg' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: hartleys@visionengravers.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Looks like on some Acer Aspire 1s with older bioses, reboot via bios
fails. It works on my machine, (with BIOS version 0.3310) but
not on some others (BIOS version 0.3309).
There's a log of problems at:
https://bbs.archlinux.org/viewtopic.php?id=124136
This patch adds a different callback to the reboot quirk table,
to allow rebooting via keybaord controller.
Reported-by: Uroš Vampl <mobile.leecher@gmail.com>
Tested-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Following is from Notes of section 11.5.3 of Intel processor
manual available at:
http://www.intel.com/Assets/PDF/manual/325384.pdf
For the Pentium 4 and Intel Xeon processors, after the sequence of
steps given above has been executed, the cache lines containing the
code between the end of the WBINVD instruction and before the
MTRRS have actually been disabled may be retained in the cache
hierarchy. Here, to remove code from the cache completely, a
second WBINVD instruction must be executed after the MTRRs have
been disabled.
This patch provides resolution for that.
Ideally, I will like to make changes only for Pentium 4 and Xeon
processors. But, I am not finding easier way to do it.
And, extra wbinvd() instruction does not hurt much for other
processors.
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Link: http://lkml.kernel.org/r/4EBD1CC5.3030008@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This follows on from the patch applied in 3.2rc1 which creates
an INTEL_MID configuration. We can now add the entry for
Medfield specific code. After this is merged the final patch
will be submitted which moves the rest of the device Kconfig
dependancies to MRST/MEDFIELD/INTEL_MID as appropriate.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The original macro worked only when applied to variables named
'evt'. While this could have been fixed by simply renaming the
macro argument, a more type-safe replacement is preferred.
Signed-off-by: Ferenc Wagner <wferi@niif.hu>
Cc: Venkatesh Pallipadi \(Venki\) <venki@google.com>
Link: http://lkml.kernel.org/r/8ed5c66c02041226e8cf8b4d5d6b41e543d90bd6.1321626272.git.wferi@niif.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In commit f8924e770e ("x86: unify mp_bus_info"), the 32-bit
and 64-bit versions of MP_bus_info were rearranged to match each
other better. Unfortunately it introduced a regression: prior
to that change we used to always set the mp_bus_not_pci bit,
then clear it if we found a PCI bus. After it, we set
mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave
it alone otherwise.
In the cases of ISA and PCI, there's not much difference. But
ISA is not the only non-PCI bus, so it's better to always set
mp_bus_not_pci and clear it only for PCI.
Without this change, Dan's Dell PowerEdge 4200 panics on boot
with a log indicating interrupt routing trouble unless the
"noapic" option is supplied. With this change, the machine
boots reliably without "noapic".
Fixes http://bugs.debian.org/586494
Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # 2.6.26+
Cc: Dan McGrath <troubledaemon@gmail.com>
Cc: Alexey Starikovskiy <aystarik@gmail.com>
[jrnieder@gmail.com: clarified commit message]
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The panic_on_stackoverflow variable needs to be avilable
on the 32-bit side as well ...
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/20111129060836.11076.12323.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dell OptiPlex 990 is known to require PCI reboot, so add it to
the reboot blacklist in pci_reboot_dmi_table[].
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/201111160019.51303.rjw@sisk.pl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This essentially reverts:
2b666859ec: x86: Default to vsyscall=native for now
The ABI breakage should now be fixed by:
commit 48c4206f5b02f28c4c78a1f5b491d3772fb64fb9
Author: Andy Lutomirski <luto@mit.edu>
Date: Thu Oct 20 08:48:19 2011 -0700
x86-64: Set siginfo and context on vsyscall emulation faults
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Adrian Bunk <bunk@stusta.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/93154af3b2b6d208906ae02d80d92cf60c6fa94f.1320712291.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To make this work, we teach the page fault handler how to send
signals on failed uaccess. This only works for user addresses
(kernel addresses will never hit the page fault handler in the
first place), so we need to generate signals for those
separately.
This gets the tricky case right: if the user buffer spans
multiple pages and only the second page is invalid, we set
cr2 and si_addr correctly. UML relies on this behavior to
"fault in" pages as needed.
We steal a bit from thread_info.uaccess_err to enable this.
Before this change, uaccess_err was a 32-bit boolean value.
This fixes issues with UML when vsyscall=emulate.
Reported-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4c8f91de7ec5cd2ef0f59521a04e1015f11e42b4.1320712291.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The previous patch modified the stop cpus path to use NMI
instead of IRQ as the way to communicate to the other cpus to
shutdown. There were some concerns that various machines may
have problems with using an NMI IPI.
This patch creates a selftest to check if NMI is working at
boot. The idea is to help catch any issues before the machine
panics and we learn the hard way.
Loosely based on the locking-selftest.c file, this separate file
runs a couple of simple tests and reports the results. The
output looks like:
...
Brought up 4 CPUs
----------------
| NMI testsuite:
--------------------
remote IPI: ok |
local IPI: ok |
--------------------
Good, all 2 testcases passed! |
---------------------------------
Total of 4 processors activated (21330.61 BogoMIPS).
...
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: seiji.aguchi@hds.com
Cc: vgoyal@redhat.com
Cc: mjg@redhat.com
Cc: tony.luck@intel.com
Cc: gong.chen@intel.com
Cc: satoru.moriya@hds.com
Cc: avi@redhat.com
Cc: Andi Kleen <andi@firstfloor.org>
Link: http://lkml.kernel.org/r/1318533267-18880-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A recent discussion started talking about the locking on the
pstore fs and how it relates to the kmsg infrastructure. We
noticed it was possible for userspace to r/w to the pstore fs
(grabbing the locks in the process) and block the panic path
from r/w to the same fs.
The reason was the cpu with the lock could be doing work while
the crashing cpu is panic'ing. Busting those spinlocks might
cause those cpus to step on each other's data. Fine, fair
enough.
It was suggested it would be nice to serialize the panic path
(ie stop the other cpus) and have only one cpu running. This
would allow us to bust the spinlocks and not worry about another
cpu stepping on the data.
Of course, smp_send_stop() does this in the panic case.
kmsg_dump() would have to be moved to be called after it. Easy
enough.
The only problem is on x86 the smp_send_stop() function calls
the REBOOT_VECTOR. Any cpu with irqs disabled (which pstore and
its backend ERST would do), block this IPI and thus do not stop.
This makes it difficult to reliably log data to the pstore fs.
The patch below switches from the REBOOT_VECTOR to NMI (and
mimics what kdump does). Switching to NMI allows us to deliver
the IPI when irqs are disabled, increasing the reliability of
this function.
However, Andi carefully noted that on some machines this
approach does not work because of broken BIOSes or whatever.
To help accomodate this, the next couple of patches will run a
selftest and provide a knob to disable.
V2:
uses atomic ops to serialize the cpu that shuts everyone down
V3:
comment cleanup
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: seiji.aguchi@hds.com
Cc: vgoyal@redhat.com
Cc: mjg@redhat.com
Cc: tony.luck@intel.com
Cc: gong.chen@intel.com
Cc: satoru.moriya@hds.com
Cc: avi@redhat.com
Cc: Andi Kleen <andi@firstfloor.org>
Link: http://lkml.kernel.org/r/1318533267-18880-2-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There was a mixup when the SGI UV2 hub chip was sent to be
fabricated, and it ended up with the wrong part number in the
HRP_NODE_ID mmr. Future versions of the chip will (may) have the
correct part number. Change the UV infrastructure to recognize
both part numbers as valid IDs of a UV2 hub chip.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/20111129210058.GA20452@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The overflow checking of kernel stack checks if the stack
pointer points to the available kernel stack range, which is
derived from the original overflow checking.
It is clear that curbase address is always less than low
boundary of available kernel stack. So, this patch removes the
first condition that checks if the pointer is higher than
curbase.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060845.11076.40916.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Currently, messages are just output on the detection of stack
overflow, which is not sufficient for systems that need a
high reliability. This is because in general the overflow may
corrupt data, and the additional corruption may occur due to
reading them unless systems stop.
This patch adds the sysctl parameter
kernel.panic_on_stackoverflow and causes a panic when detecting
the overflows of kernel, IRQ and exception stacks except user
stack according to the parameter. It is disabled by default.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/20111129060836.11076.12323.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, only kernel stack is checked for the overflow, which
is not sufficient for systems that need a high reliability. To
enhance it, it is required to check the IRQ and exception
stacks, as well.
This patch checks all the stack types and will cause messages of
stacks in detail when free stack space drops below a certain
limit except user stack.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060829.11076.51733.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
The kernel stack overflow is checked in stack_overflow_check(),
which may wrongly detect the overflow if the stack pointer in
user space points to the kernel stack intentionally or
accidentally. So, the actual overflow is never detected after
this misdetection because WARN_ONCE() is used on the detection
of it.
This patch adds user-mode-vm checking before it to avoid this
problem and bails out early if the user stack is used.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060821.11076.55315.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
On AMD family 10h we see firmware bug messages like the following:
[Firmware Bug]: cpu 6, try to use APIC500 (LVT offset 0) for vector 0x10400, but the register is already in use for vector 0xf9 on another cpu
[Firmware Bug]: cpu 6, IBS interrupt offset 0 not available (MSRC001103A=0x0000000000000100)
[Firmware Bug]: using offset 1 for IBS interrupts
[Firmware Bug]: workaround enabled for IBS LVT offset
perf: AMD IBS detected (0x00000007)
We always see this, since the offsets are not assigned by the BIOS for
this family. Force LVT offset assignment in this case. If the OS
assignment fails, fallback to BIOS settings and try to setup this.
The fallback to BIOS settings weakens the family check since
force_ibs_eilvt_setup() may fail e.g. in case of virtual machines.
But setup may still succeed if BIOS offsets are correct.
Other families don't have a workaround implemented that assigns LVT
offsets. It's ok, to drop calling force_ibs_eilvt_setup() for that
families.
With the patch the [Firmware Bug] messages vanish. We see now:
IBS: LVT offset 1 assigned
perf: AMD IBS detected (0x00000007)
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111109162225.GO12451@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
People with old AMD chips are getting hung boots, because commit
bcb80e5387 ("x86, microcode, AMD: Add microcode revision to
/proc/cpuinfo") moved the microcode detection too early into
"early_init_amd()".
At that point we are *so* early in the booth that the exception tables
haven't even been set up yet, so the whole
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
doesn't actually work: if the rdmsr does a GP fault (due to non-existant
MSR register on older CPU's), we can't fix it up yet, and the boot fails.
Fix it by simply moving the code to a slightly later point in the boot
(init_amd() instead of early_init_amd()), since the kernel itself
doesn't even really care about the microcode patchlevel at this point
(or really ever: it's made available to user space in /proc/cpuinfo, and
updated if you do a microcode load).
Reported-tested-and-bisected-by: Larry Finger <Larry.Finger@lwfinger.net>
Tested-by: Bob Tracy <rct@gherkin.frus.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The idea behind commit d91ee5863b ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code. It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).
But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle. This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.
In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:
Brought up 2 CPUs
invalid opcode: 0000 [#1] SMP
Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
RIP: e030:[<ffffffff81015d1d>] [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
...
Call Trace:
[<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
[<ffffffff8149ee78>] cpu_bringup_and_idle+0xe/0x10
RIP [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
RSP <ffff8801d28ddf10>
In the case of amd_e400_idle we don't get so spectacular crashes, but we
do end up making an MSR which is trapped in the hypervisor, and then
follow it up with a yield hypercall. Meaning we end up going to
hypervisor twice instead of just once.
The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.
We want to do that, but only for one specific case: Xen. This patch
does that.
Fixes RH BZ #739499 and Ubuntu #881076
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts & resolutions:
* arch/x86/xen/setup.c
dc91c728fd "xen: allow extra memory to be in multiple regions"
24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."
conflicted on xen_add_extra_mem() updates. The resolution is
trivial as the latter just want to replace
memblock_x86_reserve_range() with memblock_reserve().
* drivers/pci/intel-iommu.c
166e9278a3 "x86/ia64: intel-iommu: move to drivers/iommu/"
5dfe8660a3 "bootmem: Replace work_with_active_regions() with..."
conflicted as the former moved the file under drivers/iommu/.
Resolved by applying the chnages from the latter on the moved
file.
* mm/Kconfig
6661672053 "memblock: add NO_BOOTMEM config symbol"
c378ddd53f "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"
conflicted trivially. Both added config options. Just
letting both add their own options resolves the conflict.
* mm/memblock.c
d1f0ece6cd "mm/memblock.c: small function definition fixes"
ed7b56a799 "memblock: Remove memblock_memory_can_coalesce()"
confliected. The former updates function removed by the
latter. Resolution is trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
The tsc code uses CLOCK_TICK_RATE which on x86
is defined to just be the same as PIT_TICK_RATE.
This patch updates the code use the later
as we want to depecrate and remove the global
CLOCK_TICK_RATE symbol.
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Prevent tracing of preempt_disable() in get_cpu_var() in
kvm_clock_read(). When CONFIG_DEBUG_PREEMPT is enabled,
preempt_disable/enable() are traced and this causes the function_graph
tracer to go into an infinite recursion. By open coding the
preempt_disable() around the get_cpu_var(), we can use the notrace
version which prevents preempt_disable/enable() from being traced and
prevents the recursion.
Based on a similar patch for Xen from Jeremy Fitzhardinge.
Tested-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Generate system call tables and unistd_*.h automatically from the
tables in arch/x86/syscalls. All other information, like NR_syscalls,
is auto-generated, some of which is in asm-offsets_*.c.
This allows us to keep all the system call information in one place,
and allows for kernel space and user space to see different
information; this is currently used for the ia32 system call numbers
when building the 64-bit kernel, but will be used by the x32 ABI in
the near future.
This also removes some gratuitious differences between i386, x86-64
and ia32; in particular, now all system call tables are generated with
the same mechanism.
Cc: H. J. Lu <hjl.tools@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Adjust spacing for comment so that it matches the multiline comment
style used in the rest of the kernel, and remove word duplication.
It is not really clear what version of gcc this refers to, but the
extra & doesn't cause any harm, so there is no reason to remove it.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add support to specify which HSU port to use as an early console. This can
be selected by passing "earlyprintk=hsu<n>" on the kernel command line. By
default port 0 is still used.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The option iommu=group_mf indicates the that the iommu driver should
expose all functions of a multi-function PCI device as the same
iommu_device_group. This is useful for disallowing individual functions
being exposed as independent devices to userspace as there are often
hidden dependencies. Virtual functions are not affected by this option.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
It appears that stop_machine_text_poke() wants to be called on all CPUs,
like it's done from text_poke_smp(). Fix text_poke_smp_batch() to do
this.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1319702072-32676-1-git-send-email-rabin@rab.in
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the core offcore support is fixed up (thanks Stephane) and we
have sane generic events utilizing them, re-enable the raw access to
the feature as well.
Note that it doesn't matter if you use event 0x1b7 or 0x1bb to specify
an offcore event, either one works and neither guarantees you'll end
up on a particular offcore MSR.
Based on original patch from: Vince Weaver <vweaver1@eecs.utk.edu>.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>.
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1108031200390.703@cl320.eecs.utk.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
People (Linus) objected to using -ENOSPC to signal not having enough
resources on the PMU to satisfy the request. Use -EINVAL.
Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-xv8geaz2zpbjhlx0svmpp28n@git.kernel.org
[ merged to newer kernel, fixed up MIPS impact ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Masami spotted that we always try to decode the instruction stream as
64bit instructions when running a 64bit kernel, this doesn't work for
ia32-compat proglets.
Use TIF_IA32 to detect if we need to use the 32bit instruction
decoder.
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
with "apic=verbose" the print_IO_APIC() function tries to print
IRQ to pin mappings for every active irq. It assumes chip_data
is of type irq_cfg and may cause an oops if not.
As the print_IO_APIC() is called from a late_initcall other
chained irq chips may already be registered with custom
chip_data information, causing an oops. This is the case with
intel MID SoC devices with gpio demuxers registered as irq_chips.
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
[ -v2: fixed build failure ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Moorestown/Medfield platform does not have port 0x61 to report
NMI status, nor does it have external NMI sources. The only NMI
sources are from lapic, as results of perf counter overflow or
IPI, e.g. NMI watchdog or spin lock debug.
Reading port 0x61 on Moorestown will return 0xff which misled
NMI handlers to false critical errors such memory parity error.
The subsequent ioport access for NMI handling can also cause
undefined behavior on Moorestown.
This patch allows kernel process NMI due to watchdog or backrace
dump without unnecessary hangs.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[hand applied]
Signed-off-by: Alan Cox <alan@linux.intel.com>
lapic timer calibration can be combined with tsc in platform
specific calibration functions. if such calibration result is
obtained early, we can skip the redundant calibration loops.
Signed-off-by: Jacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
nr_legacy_irqs is set in probe_nr_irqs_gsi, we should not clear
it after that. Otherwise, the result is that MSI irqs will be
allocated from the wrong range for the systems without legacy
PIC.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some wall clock devices use MMIO based HW register, this new
function will give them a chance to do some initialization work
before their get/set_time service get called.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Arjan would like to make struct file_operations const, but
mce-inject directly writes to the mce_chrdev_ops to install its
write handler. In an ideal world mce-inject would have its own
character device, but we have a sizable legacy of test scripts
that hardwire "/dev/mcelog", so it would be painful to switch to
a separate device now. Instead, this patch switches to a stub
function in the mce code, with a registration helper that
mce-inject can call when it is loaded.
Note that this would also allow for a sane process to allow
mce-inject to be unloaded again (with an unregister function,
and appropriate module_{get,put}() calls), but that is left for
potential future patches.
Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4eb2e1971326651a3b@agluck-desktop.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'upstream/jump-label-noearly' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
jump-label: initialize jump-label subsystem much earlier
x86/jump_label: add arch_jump_label_transform_static()
s390/jump-label: add arch_jump_label_transform_static()
jump_label: add arch_jump_label_transform_static() to optimise non-live code updates
sparc/jump_label: drop arch_jump_label_text_poke_early()
x86/jump_label: drop arch_jump_label_text_poke_early()
jump_label: if a key has already been initialized, don't nop it out
stop_machine: make stop_machine safe and efficient to call early
jump_label: use proper atomic_t initializer
Conflicts:
- arch/x86/kernel/jump_label.c
Added __init_or_module to arch_jump_label_text_poke_early vs
removal of that function entirely
- kernel/stop_machine.c
same patch ("stop_machine: make stop_machine safe and efficient
to call early") merged twice, with whitespace fix in one version
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (21 commits)
MAINTAINERS: add an entry for Edac Sandy Bridge driver
edac: tag sb_edac as EXPERIMENTAL, as it requires more testing
EDAC: Fix incorrect edac mode reporting in sb_edac
edac: sb_edac: Add it to the building system
edac: Add an experimental new driver to support Sandy Bridge CPU's
i7300_edac: Fix error cleanup logic
i7core_edac: Initialize memory name with cpu, channel, bank
i7core_edac: Fix compilation on 32 bits arch
i7core_edac: scrubbing fixups
EDAC: Correct Kconfig dependencies
i7core_edac: return -ENODEV if no MC is found
i7core_edac: use edac's own way to print errors
MAINTAINERS: remove dropped edac_mce.* from the file
i7core_edac: Drop the edac_mce facility
x86, MCE: Use notifier chain only for MCE decoding
EDAC i7core: Use mce socketid for better compatibility
i7core_edac: Don't enable memory scrubbing for Xeon 35xx
i7core_edac: Add scrubbing support
edac: Move edac main structs to include/linux/edac.h
i7core_edac: Fix oops when trying to inject errors
...
Remove edac_mce pieces and use the normal MCE decoder notifier chain by
retaining the same functionality with considerably less code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.
The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)
As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.
There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.
Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.
There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:
http://marc.info/?l=linux-mm&m=130105930902915&w=2
There is a basic man page for the proposed interface here:
http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.
For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:
http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These files were implicitly getting EXPORT_SYMBOL via device.h
which was including module.h, but that will be fixed up shortly.
By fixing these now, we can avoid seeing things like:
arch/x86/kernel/rtc.c:29: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
arch/x86/kernel/pci-dma.c:20: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
arch/x86/kernel/e820.c:69: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL_GPL’
[ with input from Randy Dunlap <rdunlap@xenotime.net> and also
from Stephen Rothwell <sfr@canb.auug.org.au> ]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
In removing the presence of <linux/module.h> from some of the
more common <linux/something.h> files, this implict include
of <linux/topology.h> was uncovered.
CC arch/x86/kernel/vsyscall_64.o
arch/x86/kernel/vsyscall_64.c: In function ‘vsyscall_set_cpu’:
arch/x86/kernel/vsyscall_64.c:259: error: implicit declaration of function ‘cpu_to_node’
Explicitly call it out so the cleanup can take place.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Drop the edac_mce custom hook in favor of the generic notifier
mechanism. Also, do not log the error to mcelog if the notified agent
was able to decode it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
* 'x86-rdrand-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, random: Verify RDRAND functionality and allow it to be disabled
x86, random: Architectural inlines to get random integers with RDRAND
random: Add support for architectural random hooks
Fix up trivial conflicts in drivers/char/random.c: the architectural
random hooks touched "get_random_int()" that was simplified to use MD5
and not do the keyptr thing any more (see commit 6e5714eaf7: "net:
Compute protocol sequence numbers and fragment IDs using MD5").
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode, AMD: Add microcode revision to /proc/cpuinfo
x86, microcode: Correct microcode revision format
coretemp: Get microcode revision from cpu_data
x86, intel: Use c->microcode for Atom errata check
x86, intel: Output microcode revision in /proc/cpuinfo
x86, microcode: Don't request microcode from userspace unnecessarily
Fix up trivial conflicts in arch/x86/kernel/cpu/amd.c (conflict between
moving AMD BSP code to cpu_dev helper function and adding AMD microcode
revision to /proc/cpuinfo code)
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Hyper-V: Integrate the clocksource with Hyper-V detection code
Fix up conflicts in drivers/staging/hv/Makefile manually (some of the hv
code has moved out of staging to drivers/hv/)
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, amd: Include linux/elf.h since we use stuff from asm/elf.h
x86: cache_info: Update calculation of AMD L3 cache indices
x86: cache_info: Kill the atomic allocation in amd_init_l3_cache()
x86: cache_info: Kill the moronic shadow struct
x86: cache_info: Remove bogus free of amd_l3_cache data
x86, amd: Include elf.h explicitly, prepare the code for the module.h split
x86-32, amd: Move va_align definition to unbreak 32-bit build
x86, amd: Move BSP code to cpu_dev helper
x86: Add a BSP cpu_dev helper
x86, amd: Avoid cache aliasing penalties on AMD family 15h
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, unistd: Remove bogus __IGNORE_getcpu
x86, mm, trivial: Remove unnecessary get_order() in free_thread_info()
x86, cleanup: Remove unneeded version.h include from arch/x86/
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64: Fix CFI data for interrupt frames
x86-64: Don't apply destructive erratum workaround on unaffected CPUs
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Standardize on CONFIG_SPARSE_IRQ=y
x86, ioapic: Clean up ioapic/apic_id usage
x86, ioapic: Factor out print_IO_APIC() to only print one io apic
x86, ioapic: Print out irte with right ioapic index
x86, ioapic: Split up setup_ioapic_entry()
x86, ioapic: Pass struct irq_attr * to setup_ioapic_irq()
apic, i386/bigsmp: Fix false warnings regarding logical APIC ID mismatches
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (121 commits)
perf symbols: Increase symbol KSYM_NAME_LEN size
perf hists browser: Refuse 'a' hotkey on non symbolic views
perf ui browser: Use libslang to read keys
perf tools: Fix tracing info recording
perf hists browser: Elide DSO column when it is set to just one DSO, ditto for threads
perf hists: Don't consider filtered entries when calculating column widths
perf hists: Don't decay total_period for filtered entries
perf hists browser: Honour symbol_conf.show_{nr_samples,total_period}
perf hists browser: Do not exit on tab key with single event
perf annotate browser: Don't change selection line when returning from callq
perf tools: handle endianness of feature bitmap
perf tools: Add prelink suggestion to dso update message
perf script: Fix unknown feature comment
perf hists browser: Apply the dso and thread filters when merging new batches
perf hists: Move the dso and thread filters from hist_browser
perf ui browser: Honour the xterm colors
perf top tui: Give color hints just on the percentage, like on --stdio
perf ui browser: Make the colors configurable and change the defaults
perf tui: Remove unneeded call to newtCls on startup
perf hists: Don't format the percentage on hist_entry__snprintf
...
Fix up conflicts in arch/x86/kernel/kprobes.c manually.
Ingo's tree did the insane "add volatile to const array", which just
doesn't make sense ("volatile const"?). But we could remove the const
*and* make the array volatile to make doubly sure that gcc doesn't
optimize it away..
Also fix up kernel/trace/ring_buffer.c non-data-conflicts manually: the
reader_lock has been turned into a raw lock by the core locking merge,
and there was a new user of it introduced in this perf core merge. Make
sure that new use also uses the raw accessor functions.
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock()
lockdep: Comment all warnings
lib: atomic64: Change the type of local lock to raw_spinlock_t
locking, lib/atomic64: Annotate atomic64_lock::lock as raw
locking, x86, iommu: Annotate qi->q_lock as raw
locking, x86, iommu: Annotate irq_2_ir_lock as raw
locking, x86, iommu: Annotate iommu->register_lock as raw
locking, dma, ipu: Annotate bank_lock as raw
locking, ARM: Annotate low level hw locks as raw
locking, drivers/dca: Annotate dca_lock as raw
locking, powerpc: Annotate uic->lock as raw
locking, x86: mce: Annotate cmci_discover_lock as raw
locking, ACPI: Annotate c3_lock as raw
locking, oprofile: Annotate oprofilefs lock as raw
locking, video: Annotate vga console lock as raw
locking, latencytop: Annotate latency_lock as raw
locking, timer_stats: Annotate table_lock as raw
locking, rwsem: Annotate inner lock as raw
locking, semaphores: Annotate inner lock as raw
locking, sched: Annotate thread_group_cputimer as raw
...
Fix up conflicts in kernel/posix-cpu-timers.c manually: making
cputimer->cputime a raw lock conflicted with the ABBA fix in commit
bcd5cff721 ("cputimer: Cure lock inversion").
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, ioapic: Consolidate the explicit EOI code
x86, ioapic: Restore the mask bit correctly in eoi_ioapic_irq()
x86, kdump, ioapic: Reset remote-IRR in clear_IO_APIC
iommu: Rename the DMAR and INTR_REMAP config options
x86, ioapic: Define irq_remap_modify_chip_defaults()
x86, msi, intr-remap: Use the ioapic set affinity routine
iommu: Cleanup ifdefs in detect_intel_iommu()
iommu: No need to set dmar_disabled in check_zero_address()
iommu: Move IOMMU specific code to intel-iommu.c
intr_remap: Call dmar_dev_scope_init() explicitly
x86, x2apic: Enable the bios request for x2apic optout
This allows jump-label entries to be cheaply updated on code which is
not yet live.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
It is no longer used.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
MAINTAINERS: linux-m32r is moderated for non-subscribers
linux@lists.openrisc.net is moderated for non-subscribers
Drop default from "DM365 codec select" choice
parisc: Kconfig: cleanup Kernel page size default
Kconfig: remove redundant CONFIG_ prefix on two symbols
cris: remove arch/cris/arch-v32/lib/nand_init.S
microblaze: add missing CONFIG_ prefixes
h8300: drop puzzling Kconfig dependencies
MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
tty: drop superfluous dependency in Kconfig
ARM: mxc: fix Kconfig typo 'i.MX51'
Fix file references in Kconfig files
aic7xxx: fix Kconfig references to READMEs
Fix file references in drivers/ide/
thinkpad_acpi: Fix printk typo 'bluestooth'
bcmring: drop commented out line in Kconfig
btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
doc: raw1394: Trivial typo fix
CIFS: Don't free volume_info->UNC until we are entirely done with it.
treewide: Correct spelling of successfully in comments
...
When compiling an i386_defconfig kernel with gcc-4.6.1-9.fc15.i686, I
noticed a warning about the asm operand for test_bit in kprobes'
can_boost. I discovered that this caused only the first long of
twobyte_is_boostable[] to be output.
Jakub filed and fixed gcc PR50571 to correct the warning and this output
issue. But to solve it for less current gcc, we can make kprobes'
twobyte_is_boostable[] non-const, and it won't be optimized out.
Before:
CC arch/x86/kernel/kprobes.o
In file included from include/linux/bitops.h:22:0,
from include/linux/kernel.h:17,
from [...]/arch/x86/include/asm/percpu.h:44,
from [...]/arch/x86/include/asm/current.h:5,
from [...]/arch/x86/include/asm/processor.h:15,
from [...]/arch/x86/include/asm/atomic.h:6,
from include/linux/atomic.h:4,
from include/linux/mutex.h:18,
from include/linux/notifier.h:13,
from include/linux/kprobes.h:34,
from arch/x86/kernel/kprobes.c:43:
[...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
[...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input
without lvalue in asm operand 1 is deprecated [enabled by default]
$ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
551: 0f a3 05 00 00 00 00 bt %eax,0x0
554: R_386_32 .rodata.cst4
$ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
arch/x86/kernel/kprobes.o: file format elf32-i386
Contents of section .data:
0000 48000000 00000000 00000000 00000000 H...............
Contents of section .rodata.cst4:
0000 4c030000 L...
Only a single long of twobyte_is_boostable[] is in the object file.
After, without the const on twobyte_is_boostable:
$ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
551: 0f a3 05 20 00 00 00 bt %eax,0x20
554: R_386_32 .data
$ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
arch/x86/kernel/kprobes.o: file format elf32-i386
Contents of section .data:
0000 48000000 00000000 00000000 00000000 H...............
0010 00000000 00000000 00000000 00000000 ................
0020 4c030000 0f000200 ffff0000 ffcff0c0 L...............
0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77 ....;.......&..w
Now all 32 bytes are output into .data instead.
Signed-off-by: Josh Stone <jistone@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable microcode revision output for AMD after 506ed6b53e ("x86,
intel: Output microcode revision in /proc/cpuinfo") did it for Intel.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
506ed6b53e ("x86, intel: Output microcode revision in /proc/cpuinfo")
added microcode revision format to /proc/cpuinfo and the MCE handler in
decimal format but both AMD and Intel patch levels are handled as hex
numbers. Fix it.
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
When compiling an i386_defconfig kernel with
gcc-4.6.1-9.fc15.i686, I noticed a warning about the asm operand
for test_bit in kprobes' can_boost. I discovered that this
caused only the first long of twobyte_is_boostable[] to be
output.
Jakub filed and fixed gcc PR50571 to correct the warning and
this output issue. But to solve it for less current gcc, we can
make kprobes' twobyte_is_boostable[] volatile, and it won't be
optimized out.
Before:
CC arch/x86/kernel/kprobes.o
In file included from include/linux/bitops.h:22:0,
from include/linux/kernel.h:17,
from [...]/arch/x86/include/asm/percpu.h:44,
from [...]/arch/x86/include/asm/current.h:5,
from [...]/arch/x86/include/asm/processor.h:15,
from [...]/arch/x86/include/asm/atomic.h:6,
from include/linux/atomic.h:4,
from include/linux/mutex.h:18,
from include/linux/notifier.h:13,
from include/linux/kprobes.h:34,
from arch/x86/kernel/kprobes.c:43:
[...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
[...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default]
$ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
551: 0f a3 05 00 00 00 00 bt %eax,0x0
554: R_386_32 .rodata.cst4
$ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
arch/x86/kernel/kprobes.o: file format elf32-i386
Contents of section .data:
0000 48000000 00000000 00000000 00000000 H...............
Contents of section .rodata.cst4:
0000 4c030000 L...
Only a single long of twobyte_is_boostable[] is in the object
file.
After, with volatile:
$ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
551: 0f a3 05 20 00 00 00 bt %eax,0x20
554: R_386_32 .data
$ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
arch/x86/kernel/kprobes.o: file format elf32-i386
Contents of section .data:
0000 48000000 00000000 00000000 00000000 H...............
0010 00000000 00000000 00000000 00000000 ................
0020 4c030000 0f000200 ffff0000 ffcff0c0 L...............
0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77 ....;.......&..w
Now all 32 bytes are output into .data instead.
Signed-off-by: Josh Stone <jistone@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Link: http://lkml.kernel.org/r/1318899645-4068-1-git-send-email-jistone@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the cpu update level is available the Atom PSE errata
check can use it directly without reading the MSR again.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1318466795-7393-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I got a request to make it easier to determine the microcode
update level on Intel CPUs. This patch adds a new "microcode"
field to /proc/cpuinfo.
The microcode level is also outputed on fatal machine checks
together with the other CPUID model information.
I removed the respective code from the microcode update driver,
it just reads the field from cpu_data. Also when the microcode
is updated it fills in the new values too.
I had to add a memory barrier to native_cpuid to prevent it
being optimized away when the result is not used.
This turns out to clean up further code which already got this
information manually. This is done in followon patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1318466795-7393-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Requesting the microcode from userspace *every time* when onlining CPUs
(during a CPU hotplug operation) is unnecessary. Thus, ensure that
once the kernel gets the microcode after booting, it is not freed nor
invalidated when a CPU goes offline, so that it can be reused when that
CPU comes back online, without requesting userspace for it again. As a
result, the CPU hotplug operations become faster as well.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/4E91F908.5010006@linux.vnet.ibm.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Sparseirq got introduced in v2.6.28 and Thomas did a huge cleanup
around v2.6.38 that eliminated basically all disadvantages
of it.
So we can remove non-sparseirq support now and simplify
our IRQ degrees of freedom a bit.
Suggested-and-acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/4E95E21D.6090200@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While looking at the code, apic_id sometime is referred to index
of ioapic, but sometime is used for phys apic id. and some even
use apic for real apic id. It is very confusing.
So try to limit apic_id or ioapic_id to be real apic id for
ioapic, and use ioapic_idx for ioapic index in the array.
-v2: Suggested by Ingo, use ioapic_idx consistently, instead of ioapic
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/4E9542DC.3090509@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It is getting too big after the interrupt remaping entries debug
print out was added.
Original print_IO_APIC() becomes print_IO_APICs().
New print_IO_APIC() will only print one ioapic's registers
As a side-effect this clean-up also made checkpatch.pl happier.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/4E9542D3.5000008@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo pointed out that setup_ioapic_entry() is way too big now.
Split the intr-remap code out into setup_ir_ioapic_entry().
Also pass struct io_apic_irq_attr * instead of 5 parameters
in those two functions.
At last in setup_ir_ioapic_entry() we don't need to panic.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/4E9542BB.4070807@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do not expand that struct, and just pass pointer to reduce the
number of parameters in related functions.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/4E9542B1.7050800@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This UML breakage:
linux-2.6.30.1[3800] vsyscall fault (exploit attempt?) ip:ffffffffff600000 cs:33 sp:7fbfb9c498 ax:ffffffffff600000 si:0 di:606790
linux-2.6.30.1[3856] vsyscall fault (exploit attempt?) ip:ffffffffff600000 cs:33 sp:7fbfb13168 ax:ffffffffff600000 si:0 di:606790
Is caused by commit 3ae36655 ("x86-64: Rework vsyscall emulation and add
vsyscall= parameter") - the vsyscall emulation code is not fully cooked
yet as UML relies on some rather fragile SIGSEGV semantics.
Linus suggested in https://lkml.org/lkml/2011/8/9/376 to default
to vsyscall=native for now, this patch implements that.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Andrew Lutomirski <luto@mit.edu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20111005214047.GE14406@localhost.pp.htv.fi
Signed-off-by: Ingo Molnar <mingo@elte.hu>
nmi.c needs an #include <linux/mca.h>:
arch/x86/kernel/nmi.c: In function ‘unknown_nmi_error’:
arch/x86/kernel/nmi.c:286:6: error: ‘MCA_bus’ undeclared (first use in this function)
arch/x86/kernel/nmi.c:286:6: note: each undeclared identifier is reported only once for each function it appears in
Another one is the hpwdt driver:
drivers/watchdog/hpwdt.c:507:9: error: ‘NMI_DONE’ undeclared (first use in this function)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements IBS feature detection and initialzation. The
code is shared between perf and oprofile. If IBS is available on the
system for perf, a pmu is setup.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1316597423-25723-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Moving IBS macros from oprofile to <asm/perf_event.h> to make it
available to perf. No additional changes.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1316597423-25723-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the NMI handler are broken into lists, increment the appropriate
stats for each list. This allows us to see what is going on when they
get printed out in the next patch.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317409584-23662-6-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Previous patches allow the NMI subsystem to process multipe NMI events
in one NMI. As previously discussed this can cause issues when an event
triggered another NMI but is processed in the current NMI. This causes the
next NMI to go unprocessed and become an 'unknown' NMI.
To handle this, we first have to flag whether or not the NMI handler handled
more than one event or not. If it did, then there exists a chance that
the next NMI might be already processed. Once the NMI is flagged as a
candidate to be swallowed, we next look for a back-to-back NMI condition.
This is determined by looking at the %rip from pt_regs. If it is the same
as the previous NMI, it is assumed the cpu did not have a chance to jump
back into a non-NMI context and execute code and instead handled another NMI.
If both of those conditions are true then we will swallow any unknown NMI.
There still exists a chance that we accidentally swallow a real unknown NMI,
but for now things seem better.
An optimization has also been added to the nmi notifier rountine. Because x86
can latch up to one NMI while currently processing an NMI, we don't have to
worry about executing _all_ the handlers in a standalone NMI. The idea is
if multiple NMIs come in, the second NMI will represent them. For those
back-to-back NMI cases, we have the potentail to drop NMIs. Therefore only
execute all the handlers in the second half of a detected back-to-back NMI.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317409584-23662-5-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Just convert all the files that have an nmi handler to the new routines.
Most of it is straight forward conversion. A couple of places needed some
tweaking like kgdb which separates the debug notifier from the nmi handler
and mce removes a call to notify_die.
[Thanks to Ying for finding out the history behind that mce call
https://lkml.org/lkml/2010/5/27/114
And Boris responding that he would like to remove that call because of it
https://lkml.org/lkml/2011/9/21/163]
The things that get converted are the registeration/unregistration routines
and the nmi handler itself has its args changed along with code removal
to check which list it is on (most are on one NMI list except for kgdb
which has both an NMI routine and an NMI Unknown routine).
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Corey Minyard <minyard@acm.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/1317409584-23662-4-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The NMI handlers used to rely on the notifier infrastructure. This worked
great until we wanted to support handling multiple events better.
One of the key ideas to the nmi handling is to process _all_ the handlers for
each NMI. The reason behind this switch is because NMIs are edge triggered.
If enough NMIs are triggered, then they could be lost because the cpu can
only latch at most one NMI (besides the one currently being processed).
In order to deal with this we have decided to process all the NMI handlers
for each NMI. This allows the handlers to determine if they recieved an
event or not (the ones that can not determine this will be left to fend
for themselves on the unknown NMI list).
As a result of this change it is now possible to have an extra NMI that
was destined to be received for an already processed event. Because the
event was processed in the previous NMI, this NMI gets dropped and becomes
an 'unknown' NMI. This of course will cause printks that scare people.
However, we prefer to have extra NMIs as opposed to losing NMIs and as such
are have developed a basic mechanism to catch most of them. That will be
a later patch.
To accomplish this idea, I unhooked the nmi handlers from the notifier
routines and created a new mechanism loosely based on doIRQ. The reason
for this is the notifier routines have a couple of shortcomings. One we
could't guarantee all future NMI handlers used NOTIFY_OK instead of
NOTIFY_STOP. Second, we couldn't keep track of the number of events being
handled in each routine (most only handle one, perf can handle more than one).
Third, I wanted to eventually display which nmi handlers are registered in
the system in /proc/interrupts to help see who is generating NMIs.
The patch below just implements the new infrastructure but doesn't wire it up
yet (that is the next patch). Its design is based on doIRQ structs and the
atomic notifier routines. So the rcu stuff in the patch isn't entirely untested
(as the notifier routines have soaked it) but it should be double checked in
case I copied the code wrong.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317409584-23662-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The nmi stuff is changing a lot and adding more functionality. Split it
out from the traps.c file so it doesn't continue to pollute that file.
This makes it easier to find and expand all the future nmi related work.
No real functional changes here.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317409584-23662-2-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel does not have guest/host-only bit in perf counters like AMD
does. To support GO/HO bits KVM needs to switch EVENTSELn values
(or PERF_GLOBAL_CTRL if available) at a guest entry. If a counter is
configured to count only in a guest mode it stays disabled in a host,
but VMX is configured to switch it to enabled value during guest entry.
This patch adds GO/HO tracking to Intel perf code and provides interface
for KVM to get a list of MSRs that need to be switched on a guest entry.
Only cpus with architectural PMU (v1 or later) are supported with this
patch. To my knowledge there is not p6 models with VMX but without
architectural PMU and p4 with VMX are rare and the interface is general
enough to support them if need arise.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317816084-18026-7-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The AMD perf-counters support counting in guest or host-mode
only. Make use of that feature when user-space specified
guest/host-mode only counting.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317816084-18026-3-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The patch titled "x86: Don't use frame pointer to save old stack
on irq entry" did not properly adjust CFI directives, so this
patch is a follow-up to that one.
With the old stack pointer no longer stored in a callee-saved
register (plus some offset), we now have to use a CFA expression
to describe the memory location where it is being found. This
requires the use of .cfi_escape (allowing arbitrary byte streams
to be emitted into .eh_frame), as there is no
.cfi_def_cfa_expression (which also cannot reasonably be
expected, as it would require a full expression parser).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/4E8360200200007800058467@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These warnings (generally one per CPU) are a result of
initializing x86_cpu_to_logical_apicid while apic_default is
still in use, but the check in setup_local_APIC() being done
when apic_bigsmp was already used as an override in
default_setup_apic_routing():
Overriding APIC driver with bigsmp
Enabling APIC mode: Physflat. Using 5 I/O APICs
------------[ cut here ]------------
WARNING: at .../arch/x86/kernel/apic/apic.c:1239
...
CPU 1 irqstacks, hard=f1c9a000 soft=f1c9c000
Booting Node 0, Processors #1
smpboot cpu 1: start_ip = 9e000
Initializing CPU#1
------------[ cut here ]------------
WARNING: at .../arch/x86/kernel/apic/apic.c:1239
setup_local_APIC+0x137/0x46b() Hardware name: ...
CPU1 logical APIC ID: 2 != 8
...
Fix this (for the time being, i.e. until
x86_32_early_logical_apicid() will get removed again, as Tejun
says ought to be possible) by overriding the previously stored
values at the point where the APIC driver gets overridden.
v2: Move this and the pre-existing override logic into
arch/x86/kernel/apic/bigsmp_32.c.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org> (2.6.39 and onwards)
Link: http://lkml.kernel.org/r/4E835D16020000780005844C@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After merging the moduleh tree, today's linux-next build (x86_64
allmodconfig) failed like this:
arch/x86/kernel/sys_x86_64.c:28:10: warning: 'enum align_flags' declared inside parameter list
arch/x86/kernel/sys_x86_64.c:28:10: warning: its scope is only this definition or declaration, which is probably not what you
want arch/x86/kernel/sys_x86_64.c:28:22: error: parameter 3 ('flags') has incomplete type
[...]
Presumably caused by the module.h split interacting with a
new commit dfb09f9b7a ("x86, amd: Avoid cache aliasing penalties
on AMD family 15h") from the x8 tree.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/20110928174214.17a58be15d84d67c185930e1@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix (rare) build error by adding <asm/apicdef.h> header file:
arch/x86/kernel/cpu/perf_event_amd.c:350:2: error: 'BAD_APICID' undeclared (first use in this function)
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Andre Przywara <andre.przywara@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/4E820138.90301@xenotime.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are numerous broken references to Documentation files (in other
Documentation files, in comments, etc.). These broken references are
caused by typo's in the references, and by renames or removals of the
Documentation files. Some broken references are simply odd.
Fix these broken references, sometimes by dropping the irrelevant text
they were part of.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The CPU support for perf events on x86 was implemented via included C files
with #ifdefs. Clean this up by creating a new header file and compiling
the vendor-specific files as needed.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1314747665-2090-1-git-send-email-kjwinchester@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A deadlock was introduced on x86 in commit ef68c8f87e ("x86:
Serialize EFI time accesses on rtc_lock") because efi_get_time()
and friends can be called with rtc_lock already held by
read_persistent_time(), e.g.:
timekeeping_init()
read_persistent_clock() <-- acquire rtc_lock
efi_get_time()
phys_efi_get_time() <-- acquire rtc_lock <DEADLOCK>
To fix this let's push the locking down into the get_wallclock()
and set_wallclock() implementations. Only the clock
implementations that access the x86 RTC directly need to acquire
rtc_lock, so it makes sense to push the locking down into the
rtc, vrtc and efi code.
The virtualization implementations don't require rtc_lock to be
held because they provide their own serialization.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Acked-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Avi Kivity <avi@redhat.com> [for the virtualization aspect]
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a workaround for a UV2 hub bug that affects the format of system
global addresses.
The GRU API for UV2 was inadvertently broken by a hardware change. The
format of the physical address used for TLB dropins and for addresses used
with instructions running in unmapped mode has changed. This change was
not documented and became apparent only when diags failed running on
system simulators.
For UV1, TLB and GRU instruction physical addresses are identical to
socket physical addresses (although high NASID bits must be OR'ed into the
address).
For UV2, socket physical addresses need to be converted. The NODE portion
of the physical address needs to be shifted so that the low bit is in bit
39 or bit 40, depending on an MMR value.
It is not yet clear if this bug will be fixed in a silicon respin. If it
is fixed, the hub revision will be incremented & the workaround disabled.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For older IO-APIC's, we were clearing the remote-IRR by changing
the RTE trigger mode to edge and then back to level. We wanted
to mask the RTE during this process, so we were essentially
doing mask+edge and then to unmask+level.
As part of the commit ca64c47cec,
we moved this EOI process earlier where the IO-APIC RTE is
masked. So we were wrongly unmasking it in the eoi_ioapic_irq().
So change the remote-IRR clear sequence in eoi_ioapic_irq() to
mask + edge and then restore the previous RTE entry which will
restore the mask status as well as the level trigger.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Rafael Wysocki <rjw@novell.com>
Cc: lchiquitto@novell.com
Cc: jbeulich@novell.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20110825190657.210286410@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the kdump scenario mentioned below, we can have a case where
the device using level triggered interrupt will not generate any
interrupts in the kdump kernel.
1. IO-APIC sends a level triggered interrupt to the CPU's local APIC.
2. Kernel crashed before the CPU services this interrupt, leaving
the remote-IRR in the IO-APIC set.
3. kdump kernel boot sequence does clear_IO_APIC() as part of IO-APIC
initialization. But this fails to reset remote-IRR bit of the
IO-APIC RTE as the remote-IRR bit is read-only.
4. Device using that level triggered entry can't generate any
more interrupts because of the remote-IRR bit.
In clear_IO_APIC_pin(), check if the remote-IRR bit is set and if
so do an explicit attempt to clear it (by doing EOI write on
modern io-apic's and changing trigger mode to edge/level on
older io-apic's). Also before doing the explicit EOI to the
io-apic, ensure that the trigger mode is indeed set to level.
This will enable the explicit EOI to the io-apic to reset the
remote-IRR bit.
Tested-by: Leonardo Chiquitto <lchiquitto@novell.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Fixes: https://bugzilla.novell.com/show_bug.cgi?id=701686
Cc: Rafael Wysocki <rjw@novell.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Thomas Renninger <trenn@suse.de>
Cc: jbeulich@novell.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20110825190657.157502602@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On the platforms which are x2apic and interrupt-remapping
capable, Linux kernel is enabling x2apic even if the BIOS
doesn't. This is to take advantage of the features that x2apic
brings in.
Some of the OEM platforms are running into issues because of
this, as their bios is not x2apic aware. For example, this was
resulting in interrupt migration issues on one of the platforms.
Also if the BIOS SMI handling uses APIC interface to send SMI's,
then the BIOS need to be aware of x2apic mode that OS has
enabled.
On some of these platforms, BIOS doesn't have a HW mechanism to
turnoff the x2apic feature to prevent OS from enabling it.
To resolve this mess, recent changes to the VT-d2 specification:
http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf
includes a mechanism that provides BIOS a way to request system
software to opt out of enabling x2apic mode.
Look at the x2apic optout flag in the DMAR tables before
enabling the x2apic mode in the platform. Also print a warning
that we have disabled x2apic based on the BIOS request.
Kernel boot parameter "intremap=no_x2apic_optout" can be used to
override the BIOS x2apic optout request.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yinghai@kernel.org
Cc: joerg.roedel@amd.com
Cc: tony.luck@intel.com
Cc: dwmw2@infradead.org
Link: http://lkml.kernel.org/r/20110824001456.171766616@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes the typo in parameters passed to
x86_32 switch_to() description.
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
del_timer_sync() can cause a deadlock when called in interrupt context.
It is used with on_each_cpu() in some parts for sysfs files like bank*,
check_interval, cmci_disabled and ignore_ce.
However, use of on_each_cpu() results in calling the function passed
as the argument in interrupt context. This causes a flood of nested
warnings from del_timer_sync() (it runs on each CPU) caused even by a
simple file access like:
$ echo 300 > /sys/devices/system/machinecheck/machinecheck0/check_interval
Fortunately, these MCE-specific files are rarely used and AFAIK only few
MCE geeks experience this warning.
To remove the warning, move timer deletion outside of the interrupt
context.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The cmci_discover_lock can be taken in atomic context (cpu bring
up sequence) and therefore cannot be preempted on -rt.
In mainline this change documents the low level nature of
the lock - otherwise there's no functional difference. Lockdep
and Sparse checking will work as usual.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
L3 subcaches 0 and 1 of AMD Family 15h CPUs can have a size of 2MB.
Update the calculation routine for the number of L3 indices to
reflect that.
Signed-off-by: Frank Arnold <frank.arnold@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rosenfeld Hans <Hans.Rosenfeld@amd.com>
Cc: Herrmann3 Andreas <Andreas.Herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Frank Arnold <Frank.Arnold@amd.com>
Link: http://lkml.kernel.org/r/20110726170449.GB32536@aftab
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's not a good reason to allocate memory in the smp function call
just because someone thought it's the most conveniant place.
The AMD L3 data is coupled to the northbridge info by a pointer to the
corresponding north bridge data. So allocating it with the northbridge
data and referencing the northbridge in the cache_info code instead
uses less memory and gets rid of that atomic allocation hack in the
smp function call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/20110723212626.688229918@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit f9b90566c ("x86: reduce stack usage in init_intel_cacheinfo")
introduced a shadow structure to reduce the stack usage on large
machines instead of making the smaller structure embedded into the
large one. That's definitely a candidate for the bad taste award.
Move the small struct into the large one and get rid of the ugly type
casts.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/20110723212626.625651773@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
free_cache_attributes() kfree's:
per_cpu(ici_cpuid4_info, cpu)->l3
which is a pointer to memory which was allocated as a block in
amd_init_l3_cache(). l3 of a particular cpu points to a part of this
memory blob. The part and the rest of the blob are still referenced by
other cpus.
As far as I can tell from the git history this is a leftover from the
conversion from per cpu to node data with commit ba06edb63(x86,
cacheinfo: Make L3 cache info per node) and the following commit
f658bcfb2(x86, cacheinfo: Cleanup L3 cache index disable support)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/20110723212626.550539989@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The nfsservctl system call is now gone, so we should remove all
linkage for it.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
entry_32.S contained a hardcoded alternative instruction entry, and the
format changed in commit 59e97e4d6f ("x86: Make alternative
instruction pointers relative").
Replace the hardcoded entry with the altinstruction_entry macro. This
fixes the 32-bit boot with CONFIG_X86_INVD_BUG=y.
Reported-and-tested-by: Arnaud Lacombe <lacombar@gmail.com>
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While removing custom rendezvous code and switching to stop_machine,
commit 192d885742 ("x86, mtrr: use stop_machine APIs for doing MTRR
rendezvous") completely dropped mtrr setting code on !CONFIG_SMP
breaking MTRR settting on UP.
Fix it by removing the incorrect CONFIG_SMP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Anders Eriksson <aeriksson@fastmail.fm>
Tested-and-acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32, vdso: On system call restart after SYSENTER, use int $0x80
x86, UV: Remove UV delay in starting slave cpus
x86, olpc: Wait for last byte of EC command to be accepted
Because THREAD_SIZE is defined as PAGE_SIZE << THREAD_ORDER on x86, the
call of get_order(THREAD_SIZE) can be replaced with THREAD_ORDER.
Signed-off-by: Zhao Jin <cronozhj@gmail.com>
Link: http://lkml.kernel.org/r/4E4FB5A9.700@gmail.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On -rt kfree() can schedule, but CPU_STARTING is before the CPU is
fully up and running. These are contradictory, so avoid it. Instead
push the kfree() to CPU_ONLINE where we're free to schedule.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kwd4j6ayld5thrscvaxgjquv@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-tip:
x86-64: Rework vsyscall emulation and add vsyscall= parameter
x86-64: Wire up getcpu syscall
x86: Remove unnecessary compile flag tweaks for vsyscall code
x86-64: Add vsyscall:emulate_vsyscall trace event
x86-64: Add user_64bit_mode paravirt op
x86-64, xen: Enable the vvar mapping
x86-64: Work around gold bug 13023
x86-64: Move the "user" vsyscall segment out of the data segment.
x86-64: Pad vDSO to a page boundary
There are three choices:
vsyscall=native: Vsyscalls are native code that issues the
corresponding syscalls.
vsyscall=emulate (default): Vsyscalls are emulated by instruction
fault traps, tested in the bad_area path. The actual contents of
the vsyscall page is the same as the vsyscall=native case except
that it's marked NX. This way programs that make assumptions about
what the code in the page does will not be confused when they read
that code.
vsyscall=none: Trying to execute a vsyscall will segfault.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/8449fb3abf89851fd6b2260972666a6f82542284.1312988155.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
As of commit 98d0ac38ca
Author: Andy Lutomirski <luto@mit.edu>
Date: Thu Jul 14 06:47:22 2011 -0400
x86-64: Move vread_tsc and vread_hpet into the vDSO
user code no longer directly calls into code in arch/x86/kernel/, so
we don't need compile flag hacks to make it safe. All vdso code is
in the vdso directory now.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/835cd05a4c7740544d09723d6ba48f4406f9826c.1312988155.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
When the moduleu.h splitting tree is merged to the latest
tip:x86/cpu tree, the x86_64 allmodconfig build fails like this:
arch/x86/kernel/cpu/amd.c: In function 'bsp_init_amd':
arch/x86/kernel/cpu/amd.c:437:3: error: 'va_align' undeclared (first use in this function)
arch/x86/kernel/cpu/amd.c:438:23: error: 'ALIGN_VA_32' undeclared (first use in this function)
arch/x86/kernel/cpu/amd.c:438:37: error: 'ALIGN_VA_64' undeclared (first use in this function)
This is caused by the module.h split up intreacting with commit
dfb09f9b7a ("x86, amd: Avoid cache aliasing penalties on AMD
family 15h") from the tip:x86/cpu tree.
I have added the following patch for today (this, or something
similar, could be applied to the tip tree directly - the
export.h include below was added by the module.h splitup).
So include elf.h to use va_align and remove this implicit
dependency on module.h doing it for us.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/20110810114956.238d66772883636e3040d29f@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add support to Romely-EP SandyBridge.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Anhua Xu <anhua.xu@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312264895-2010-1-git-send-email-youquan.song@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
hpa reported that dfb09f9b7a breaks 32-bit
builds with the following error message:
/home/hpa/kernel/linux-tip.cpu/arch/x86/kernel/cpu/amd.c:437: undefined
reference to `va_align'
/home/hpa/kernel/linux-tip.cpu/arch/x86/kernel/cpu/amd.c:436: undefined
reference to `va_align'
This is due to the fact that va_align is a global in a 64-bit only
compilation unit. Move it to mmap.c where it is visible to both
subarches.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1312633899-1131-1-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Delete the 10 msec delay between the INIT and SIPI when starting
slave cpus. I can find no requirement for this delay. BIOS also
has similar code sequences without the delay.
Removing the delay reduces boot time by 40 sec. Every bit helps.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20110805140900.GA6774@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>