Thermal throttle and power limit events are not defined as MCE errors in x86
architecture and should not generate MCE errors in mcelog.
Current kernel generates fake software defined MCE errors for these events.
This may confuse users because they may think the machine has real MCE errors
while actually only thermal throttle or power limit events happen.
To make it worse, buggy firmware on some platforms may falsely generate
the events. Therefore, kernel reports MCE errors which users think as real
hardware errors. Although the firmware bugs should be fixed, on the other hand,
kernel should not report MCE errors either.
So mcelog is not a good mechanism to report these events. To report the events, we count them in respective counters (core_power_limit_count,
package_power_limit_count, core_throttle_count, and package_throttle_count) in
/sys/devices/system/cpu/cpu#/thermal_throttle/. Users can check the counters
for each event on each CPU. Please note that all CPU's on one package report
duplicate counters. It's user application's responsibity to retrieve a package
level counter for one package.
This patch doesn't report package level power limit, core level power limit, and
package level thermal throttle events in mcelog. When the events happen, only
report them in respective counters in sysfs.
Since core level thermal throttle has been legacy code in kernel for a while and
users accepted it as MCE error in mcelog, core level thermal throttle is still
reported in mcelog. In the mean time, the event is counted in a counter in sysfs
as well.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Acked-by: Borislav Petkov <bp@amd64.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/20111215001945.GA21009@linux-os.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add a function which drains whatever MCEs were logged in already during
boot and before the decoder chains were registered.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
No functionality change, this is done so that in a follow-on patch all
queued-up MCEs can be decoded after registering on the chain.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Once we've found and validated the ucode patch for the current CPU,
there's no need to iterate over the remaining patches in the binary
image. Exit then and save us a bunch of cycles.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Basically, what we did until now is take out a chunk of the firmware
image, vmalloc space for it and inspect it before application. And
repeat.
This patch changes all that so that we look at each ucode patch from
the firmware image, check it for sanity and copy it to local buffer for
application only once and if it passes all checks. Thus, vmalloc-ing for
each piece is gone, we can do proper size checking only of the patch
which is destined for the CPU of the current machine instead of each
single patch, which is clearly wrong.
Oh yeah, simplify and cleanup the code while at it, along with adding
comments as to what actually happens.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add a simple 4K page which gets allocated on driver init and freed on
driver exit instead of vmalloc'ing small buffers for each ucode patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
In the IPI delivery slow path (NMI delivery) we retry the ICR
read to check for delivery completion a limited number of times.
[ The reason for the limited retries is that some of the places
where it is used (cpu boot, kdump, etc) IPI delivery might not
succeed (due to a firmware bug or system crash, for example)
and in such a case it is better to give up and resume
execution of other code. ]
This patch adds a new entry to /proc/interrupts, RTR, which
tells user space the number of times we retried the ICR read in
the IPI delivery slow path.
This should give some insight into how well the APIC
message delivery hardware is working - if the counts are way
too large then we are hitting a (very-) slow path way too
often.
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Cc: Jörn Engel <joern@logfs.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/n/tip-vzsp20lo2xdzh5f70g0eis2s@git.kernel.org
[ extended the changelog ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is currently a large divide between kernel development and the
development of EFI boot loaders. The idea behind this patch is to give
the kernel developers full control over the EFI boot process. As
H. Peter Anvin put it,
"The 'kernel carries its own stub' approach been very successful in
dealing with BIOS, and would make a lot of sense to me for EFI as
well."
This patch introduces an EFI boot stub that allows an x86 bzImage to
be loaded and executed by EFI firmware. The bzImage appears to the
firmware as an EFI application. Luckily there are enough free bits
within the bzImage header so that it can masquerade as an EFI
application, thereby coercing the EFI firmware into loading it and
jumping to its entry point. The beauty of this masquerading approach
is that both BIOS and EFI boot loaders can still load and run the same
bzImage, thereby allowing a single kernel image to work in any boot
environment.
The EFI boot stub supports multiple initrds, but they must exist on
the same partition as the bzImage. Command-line arguments for the
kernel can be appended after the bzImage name when run from the EFI
shell, e.g.
Shell> bzImage console=ttyS0 root=/dev/sdb initrd=initrd.img
v7:
- Fix checkpatch warnings.
v6:
- Try to allocate initrd memory just below hdr->inird_addr_max.
v5:
- load_options_size is UTF-16, which needs dividing by 2 to convert
to the corresponding ASCII size.
v4:
- Don't read more than image->load_options_size
v3:
- Fix following warnings when compiling CONFIG_EFI_STUB=n
arch/x86/boot/tools/build.c: In function ‘main’:
arch/x86/boot/tools/build.c:138:24: warning: unused variable ‘pe_header’
arch/x86/boot/tools/build.c:138:15: warning: unused variable ‘file_sz’
- As reported by Matthew Garrett, some Apple machines have GOPs that
don't have hardware attached. We need to weed these out by
searching for ones that handle the PCIIO protocol.
- Don't allocate memory if no initrds are on cmdline
- Don't trust image->load_options_size
Maarten Lankhorst noted:
- Don't strip first argument when booted from efibootmgr
- Don't allocate too much memory for cmdline
- Don't update cmdline_size, the kernel considers it read-only
- Don't accept '\n' for initrd names
v2:
- File alignment was too large, was 8192 should be 512. Reported by
Maarten Lankhorst on LKML.
- Added UGA support for graphics
- Use VIDEO_TYPE_EFI instead of hard-coded number.
- Move linelength assignment until after we've assigned depth
- Dynamically fill out AddressOfEntryPoint in tools/build.c
- Don't use magic number for GDT/TSS stuff. Requested by Andi Kleen
- The bzImage may need to be relocated as it may have been loaded at
a high address address by the firmware. This was required to get my
macbook booting because the firmware loaded it at 0x7cxxxxxx, which
triggers this error in decompress_kernel(),
if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
error("Destination address too large");
Cc: Mike Waychison <mikew@google.com>
Cc: Matthew Garrett <mjg@redhat.com>
Tested-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1321383097.2657.9.camel@mfleming-mobl1.ger.corp.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This hangs my MacBook Air at boot time; I get no console
messages at all. I reverted this on top of -rc5 and my machine
boots again.
This reverts commit e8c7106280.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Keith Packard <keithp@keithp.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Those two APIs were provided to optimize the calls of
tick_nohz_idle_enter() and rcu_idle_enter() into a single
irq disabled section. This way no interrupt happening in-between would
needlessly process any RCU job.
Now we are talking about an optimization for which benefits
have yet to be measured. Let's start simple and completely decouple
idle rcu and dyntick idle logics to simplify.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The idle notifier, called by enter_idle(), enters into rcu read
side critical section but at that time we already switched into
the RCU-idle window (rcu_idle_enter() has been called). And it's
illegal to use rcu_read_lock() in that state.
This results in rcu reporting its bad mood:
[ 1.275635] WARNING: at include/linux/rcupdate.h:194 __atomic_notifier_call_chain+0xd2/0x110()
[ 1.275635] Hardware name: AMD690VM-FMH
[ 1.275635] Modules linked in:
[ 1.275635] Pid: 0, comm: swapper Not tainted 3.0.0-rc6+ #252
[ 1.275635] Call Trace:
[ 1.275635] [<ffffffff81051c8a>] warn_slowpath_common+0x7a/0xb0
[ 1.275635] [<ffffffff81051cd5>] warn_slowpath_null+0x15/0x20
[ 1.275635] [<ffffffff817d6f22>] __atomic_notifier_call_chain+0xd2/0x110
[ 1.275635] [<ffffffff817d6f71>] atomic_notifier_call_chain+0x11/0x20
[ 1.275635] [<ffffffff810018a0>] enter_idle+0x20/0x30
[ 1.275635] [<ffffffff81001995>] cpu_idle+0xa5/0x110
[ 1.275635] [<ffffffff817a7465>] rest_init+0xe5/0x140
[ 1.275635] [<ffffffff817a73c8>] ? rest_init+0x48/0x140
[ 1.275635] [<ffffffff81cc5ca3>] start_kernel+0x3d1/0x3dc
[ 1.275635] [<ffffffff81cc5321>] x86_64_start_reservations+0x131/0x135
[ 1.275635] [<ffffffff81cc5412>] x86_64_start_kernel+0xed/0xf4
[ 1.275635] ---[ end trace a22d306b065d4a66 ]---
Fix this by entering rcu extended quiescent state later, just before
the CPU goes to sleep.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
It is assumed that rcu won't be used once we switch to tickless
mode and until we restart the tick. However this is not always
true, as in x86-64 where we dereference the idle notifiers after
the tick is stopped.
To prepare for fixing this, add two new APIs:
tick_nohz_idle_enter_norcu() and tick_nohz_idle_exit_norcu().
If no use of RCU is made in the idle loop between
tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the arch
must instead call the new *_norcu() version such that the arch doesn't
need to call rcu_idle_enter() and rcu_idle_exit().
Otherwise the arch must call tick_nohz_enter_idle() and
tick_nohz_exit_idle() and also call explicitly:
- rcu_idle_enter() after its last use of RCU before the CPU is put
to sleep.
- rcu_idle_exit() before the first use of RCU after the CPU is woken
up.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The tick_nohz_stop_sched_tick() function, which tries to delay
the next timer tick as long as possible, can be called from two
places:
- From the idle loop to start the dytick idle mode
- From interrupt exit if we have interrupted the dyntick
idle mode, so that we reprogram the next tick event in
case the irq changed some internal state that requires this
action.
There are only few minor differences between both that
are handled by that function, driven by the ts->inidle
cpu variable and the inidle parameter. The whole guarantees
that we only update the dyntick mode on irq exit if we actually
interrupted the dyntick idle mode, and that we enter in RCU extended
quiescent state from idle loop entry only.
Split this function into:
- tick_nohz_idle_enter(), which sets ts->inidle to 1, enters
dynticks idle mode unconditionally if it can, and enters into RCU
extended quiescent state.
- tick_nohz_irq_exit() which only updates the dynticks idle mode
when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called).
To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed
into tick_nohz_idle_exit().
This simplifies the code and micro-optimize the irq exit path (no need
for local_irq_save there). This also prepares for the split between
dynticks and rcu extended quiescent state logics. We'll need this split to
further fix illegal uses of RCU in extended quiescent states in the idle
loop.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Introduce a symbol, EFI_LOADER_SIGNATURE instead of using the magic
strings, which also helps to reduce the amount of ifdeffery.
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1318848017-12301-1-git-send-email-matt@console-pimps.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
in ->attribute we currently call set_memory_uc(), which in turn
calls __pa() on a potentially ioremap'd address.
On CONFIG_X86_32 this is invalid, resulting in the following
oops on some machines:
BUG: unable to handle kernel paging request at f7f22280
IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
[...]
Call Trace:
[<c104f8ca>] ? page_is_ram+0x1a/0x40
[<c1025aff>] reserve_memtype+0xdf/0x2f0
[<c1024dc9>] set_memory_uc+0x49/0xa0
[<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
[<c19216d4>] start_kernel+0x291/0x2f2
[<c19211c7>] ? loglevel+0x1b/0x1b
[<c19210bf>] i386_start_kernel+0xbf/0xc8
A better approach to this problem is to map the memory region
with the correct attributes from the start, instead of modifying
it after the fact. The uncached case can be handled by
ioremap_nocache() and the cached by ioremap_cache().
Despite first impressions, it's not possible to use
ioremap_cache() to map all cached memory regions on
CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
don't like being mapped into the vmalloc space, as detailed in
the following bug report,
https://bugzilla.redhat.com/show_bug.cgi?id=748516
Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
regions are covered by the direct kernel mapping table on
CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
regions via the direct kernel mapping with the initial call to
init_memory_mapping() in setup_arch(), whereas previously these
regions wouldn't be mapped if they were after the last E820_RAM
region until efi_ioremap() was called. Doing it this way allows
us to delete efi_ioremap() completely.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The get_cpu_cap() external function prototype was declared twice
so lose one of them.
Clean up the header guard while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1322594083-14507-1-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When HPET is operating in RTC mode, the TN_ENABLE bit on timer1
controls whether the HPET or the RTC delivers interrupts to irq8. When
the system goes into suspend, the RTC driver sends a signal to the
HPET driver so that the HPET releases control of irq8, allowing the
RTC to wake the system from suspend. The switchover is accomplished by
a write to the HPET configuration registers which currently only
occurs while servicing the HPET interrupt.
On some systems, I have seen the system suspend before an HPET
interrupt occurs, preventing the write to the HPET configuration
register and leaving the HPET in control of the irq8. As the HPET is
not active during suspend, it does not generate a wake signal and RTC
alarms do not work.
This patch forces the HPET driver to immediately transfer control of
the irq8 channel to the RTC instead of waiting until the next
interrupt event.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
The only function of memblock_analyze() is now allowing resize of
memblock region arrays. Rename it to memblock_allow_resize() and
update its users.
* The following users remain the same other than renaming.
arm/mm/init.c::arm_memblock_init()
microblaze/kernel/prom.c::early_init_devtree()
powerpc/kernel/prom.c::early_init_devtree()
openrisc/kernel/prom.c::early_init_devtree()
sh/mm/init.c::paging_init()
sparc/mm/init_64.c::paging_init()
unicore32/mm/init.c::uc32_memblock_init()
* In the following users, analyze was used to update total size which
is no longer necessary.
powerpc/kernel/machine_kexec.c::reserve_crashkernel()
powerpc/kernel/prom.c::early_init_devtree()
powerpc/mm/init_32.c::MMU_init()
powerpc/mm/tlb_nohash.c::__early_init_mmu()
powerpc/platforms/ps3/mm.c::ps3_mm_add_memory()
powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups()
sh/kernel/machine_kexec.c::reserve_crashkernel()
* x86/kernel/e820.c::memblock_x86_fill() was directly setting
memblock_can_resize before populating memblock and calling analyze
afterwards. Call memblock_allow_resize() before start populating.
memblock_can_resize is now static inside memblock.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "H. Peter Anvin" <hpa@zytor.com>
memblock_init() initializes arrays for regions and memblock itself;
however, all these can be done with struct initializers and
memblock_init() can be removed. This patch kills memblock_init() and
initializes memblock with struct initializer.
The only difference is that the first dummy entries don't have .nid
set to MAX_NUMNODES initially. This doesn't cause any behavior
difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Gcc complains if we don't cast this to a struct cpumask pointer.
arch/x86/kernel/nmi_selftest.c:93:2:
warning: passing argument 1 of ‘cpumask_empty’ from
incompatible pointer type [enabled by default]
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/20111207110612.GA3437@mwanda
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It seems that a margin for stack overflow checking is added to
top of a kernel stack but is not added to IRQ and exception
stacks in stack_overflow_check(). Therefore, the overflows of
IRQ and exception stacks are always detected only after they
actually occurred and data corruption might occur due to them.
This patch adds the margin to top of IRQ and exception stacks
as well as a kernel stack to enhance reliability.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20111207082910.9847.3359.stgit@ltc219.sdl.hitachi.co.jp
[ removed the #undef - we typically don't do that for uncommon names ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
KVM needs to know perf capability to decide which PMU it can expose to a
guest.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-8-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement the disabling of arch events as a quirk so that we can print
a message along with it. This creates some visibility into the problem
space and could allow us to work on adding more work-around like the
AAJ80 one.
Requested-by: Ingo Molnar <mingo@elte.hu>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-wcja2z48wklzu1b0nkz0a5y7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel CPUs report non-available architectural events in cpuid leaf
0AH.EBX. Use it to disable events that are not available according
to CPU.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-7-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/kernel/built-in.o(.text+0x4c71): Section mismatch in
reference from the function arch_jump_label_transform_static() to the
function .init.text:text_poke_early()
The function arch_jump_label_transform_static() references
the function __init text_poke_early().
This is often because arch_jump_label_transform_static lacks a __init
annotation or the annotation of text_poke_early is wrong.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/n/tip-9lefe89mrvurrwpqw5h8xm8z@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86_64 kernel pushes the fake kernel stack in
arch/x86/kernel/entry_64.S:FAKE_STACK_FRAME, and
rflags register in it does not conform to the specification.
Although Intel's manual[1] says bit 1 of it shall be set to 1,
this bit is cleared to 0 on pushing the fake stack.
[1] Intel(R) 64 and IA-32 Architectures Software Developer's Manual
Vol.1 3-21 Figure 3-8. EFLAGS Register
If it is not on purpose, it is better to be fixed, because
it can lead some tools misunderstanding the stack frame. For example,
"crash" utility[2] actually detects it and warns you like
below:
RIP: ffffffff8005dfa2 RSP: ffff8104ce0c7f58 RFLAGS: 00000200
[...]
bt: WARNING: possibly bogus exception frame
Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
Tested-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This avoids a scheduling failure for cases like:
cycles, cycles, instructions, instructions (on Core2)
Which would end up being programmed like:
PMC0, PMC1, FP-instructions, fail
Because all events will have the same weight.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-8tnwb92asqj7xajqqoty4gel@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current x86 event scheduler fails to resolve scheduling problems
of certain combinations of events and constraints. This happens if the
counter mask of such an event is not a subset of any other counter
mask of a constraint with an equal or higher weight, e.g. constraints
of the AMD family 15h pmu:
counter mask weight
amd_f15_PMC30 0x09 2 <--- overlapping counters
amd_f15_PMC20 0x07 3
amd_f15_PMC53 0x38 3
The scheduler does not find then an existing solution. Here is an
example:
event code counter failure possible solution
0x02E PMC[3,0] 0 3
0x043 PMC[2:0] 1 0
0x045 PMC[2:0] 2 1
0x046 PMC[2:0] FAIL 2
The event scheduler may not select the correct counter in the first
cycle because it needs to know which subsequent events will be
scheduled. It may fail to schedule the events then.
To solve this, we now save the scheduler state of events with
overlapping counter counstraints. If we fail to schedule the events
we rollback to those states and try to use another free counter.
Constraints with overlapping counters are marked with a new introduced
overlap flag. We set the overlap flag for such constraints to give the
scheduler a hint which events to select for counter rescheduling. The
EVENT_CONSTRAINT_OVERLAP() macro can be used for this.
Care must be taken as the rescheduling algorithm is O(n!) which will
increase scheduling cycles for an over-commited system dramatically.
The number of such EVENT_CONSTRAINT_OVERLAP() macros and its counter
masks must be kept at a minimum. Thus, the current stack is limited to
2 states to limit the number of loops the algorithm takes in the worst
case.
On systems with no overlapping-counter constraints, this
implementation does not increase the loop count compared to the
previous algorithm.
V2:
* Renamed redo -> overlap.
* Reimplementation using perf scheduling helper functions.
V3:
* Added WARN_ON_ONCE() if out of save states.
* Changed function interface of perf_sched_restore_state() to use bool
as return value.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch introduces x86 perf scheduler code helper functions. We
need this to later add more complex functionality to support
overlapping counter constraints (next patch).
The algorithm is modified so that the range of weight values is now
generated from the constraints. There shouldn't be other functional
changes.
With the helper functions the scheduler is controlled. There are
functions to initialize, traverse the event list, find unused counters
etc. The scheduler keeps its own state.
V3:
* Added macro for_each_set_bit_cont().
* Changed functions interfaces of perf_sched_find_counter() and
perf_sched_next_event() to use bool as return value.
* Added some comments to make code better understandable.
V4:
* Fix broken event assignment if weight of the first event is not
wmin (perf_sched_init()).
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since there is a possibility of !KPROBES int3 listeners
(such as kgdb) and since DIE_TRAP is currently not being
used by anybody, notify all listeners with DIE_INT3.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20111025142159.GB21225@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do_notify_resume() gets called with interrupts disabled on x86_32. This
is different from the x86_64 behavior, where interrupts are enabled at
the time.
Queries on lkml on this issue hasn't yielded any clear answer. Lets make
x86_32 behave the same as x86_64, unless there is a real reason to
maintain status quo.
Please refer https://lkml.org/lkml/2011/9/27/130 for more
details.
A similar change was suggested in ARM:
https://lkml.org/lkml/2011/8/25/231
My 32-bit machine works fine (tm) with this patch.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20111025141812.GA21225@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I used "ifdef CONFIG_NUMA" simply because it doesn't make
sense in a non-numa configuration even with SMP enabled.
Besides, the only place where it is called right now is
in kernel/cpu/amd.c:srat_detect_node() within the
"CONFIG_NUMA" protected part.
Signed-off-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-2-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intr_remapping: Fix section mismatch in ir_dev_scope_init()
intel-iommu: Fix section mismatch in dmar_parse_rmrr_atsr_dev()
x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions
x86, AMD: Correct align_va_addr documentation
x86/rtc, mrst: Don't register a platform RTC device for for Intel MID platforms
x86/mrst: Battery fixes
x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode
x86: Fix "Acer Aspire 1" reboot hang
x86/mtrr: Resolve inconsistency with Intel processor manual
x86: Document rdmsr_safe restrictions
x86, microcode: Fix the failure path of microcode update driver init code
Add TAINT_FIRMWARE_WORKAROUND on MTRR fixup
x86/mpparse: Account for bus types other than ISA and PCI
x86, mrst: Change the pmic_gpio device type to IPC
mrst: Added some platform data for the SFI translations
x86,mrst: Power control commands update
x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot
x86, UV: Fix UV2 hub part number
x86: Add user_mode_vm check in stack_overflow_check
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix loss of notification with multi-event
perf, x86: Force IBS LVT offset assignment for family 10h
perf, x86: Disable PEBS on SandyBridge chips
trace_events_filter: Use rcu_assign_pointer() when setting ftrace_event_call->filter
perf session: Fix crash with invalid CPU list
perf python: Fix undefined symbol problem
perf/x86: Enable raw event access to Intel offcore events
perf: Don't use -ENOSPC for out of PMU resources
perf: Do not set task_ctx pointer in cpuctx if there are no events in the context
perf/x86: Fix PEBS instruction unwind
oprofile, x86: Fix crash when unloading module (nmi timer mode)
oprofile: Fix crash when unloading module (hr timer mode)
I've received complaints that the numa_node attribute for family
15h model 00-0fh (e.g. Interlagos) northbridge functions shows
-1 instead of the proper node ID.
Correct this with attached quirks (similar to quirks for other
AMD CPU families used in multi-socket systems).
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/20111202072143.GA31916@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tsc=reliable boot parameter is supposed to skip all the TSC
stablility checks during boot time.
On a 8-socket system where we want to run an experiment with the
"tsc=reliable" boot option, TSC synchronization checks are not
getting skipped and marking the TSC as not stable.
Check for tsc_clocksource_reliable (which is set via
tsc=reliable or for platforms supporting synthetic TSC_RELIABLE
feature bit etc) and when set, skip the TSC synchronization
tests during boot.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1320446537.15071.14.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
system_call_after_swapgs doesn't really benefit from forcing
alignment from it - quite the opposite, native code needlessly
so far got a big NOP instruction inserted in front of it. Xen
being the only user of the separate entry point can well live
with the branch going to three bytes into a cache line.
The compatibility mode ptregs entry points for one can make use
of the GLOBAL() macro, and should be suitably aligned. Their
shared continuation point (ia32_ptregs_common) otoh doesn't need
to be global at all, but should continue to be properly aligned.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/4ED4CEEA020000780006407D@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
GET_THREAD_INFO() involves a memory read immediately followed by
an "sub" on the value read, in turn (in several cases)
immediately followed by a use of the calculated value as the
base address of a memory access. This combination of
instructions has a non-negligible potential for stalls.
In the system call entry point code, however, the (fixed) offset
of the stack pointer from the end of the stack is generally
known, and hence we can instead avoid the memory load and
subtract, and instead do the memory reference using %rsp as the
base register. To do so in a legible fashion, introduce a
THREAD_INFO() macro which, provided a register (generally %rsp)
and the known offset from the end of the stack, produces a
suitable memory access operand.
The patch attempts to only touch the fast paths (no auditing and
alike), but manages to do so only in the 64-bit entry point
case; the compatibility mode entry points have so many
interdependencies between their various branch targets that it
was necessary to also adjust the slow paths to eliminate the
risk of having missed some register dependency during code
analysis.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/4ED4CD690200007800064075@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Previously these up to 32 entry points, consisting of all the
same code except for their very first instruction, consumed 0x70
bytes per instance. Just like for device interrupt entry points,
fold them together so that they all use a single instance of the
code after having pushed their vector indicator (resulting in
0x10 bytes per instance, to retain 16-byte alignment of the
individual entry points).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/4ED4CA230200007800064065@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Testing for a return to ring 0 was necessary here solely because
of the branch out of ret_from_fork. That branch, however, can be
directed to retint_restore_args, and thus the test-and-branch
can be eliminated here.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/4ED4C7EE0200007800064028@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Adds support for Numascale NumaChip large-SMP systems. It is
needed to enable the booting of more than ~168 cores.
v2:
- [Steffen] enumerate only accessible northbridges
- [Daniel] rediffed and validated against 3.1-rc10
v3:
- [Daniel] use x86_init core numbering override
- [Daniel] cleanups as per feedback
v4:
- [Daniel] use updated x86_cpuinit override
v5:
- drop disabling interrupts locally, as ISR write is atomic; drop delay
- added read-mostly annotations where appropriate
- require CONFIG_SMP, so drop conditional path
Workload tested on 96 cores/16 sockets.
Signed-off-by: Steffen Persvold <sp@numascale.com>
Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323101246-2400-1-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add an x86_init vector for handling inconsistent core numbering.
This is useful for multi-fabric platforms, such as Numascale
NumaConnect.
v2:
- use struct x86_cpuinit_ops
- provide default fall-back function to warn
Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Steffen Persvold <sp@numascale.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-2-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow flat_init_apic_ldr() to be used outside the compilation
unit for similar APIC implementations.
Signed-off-by: Daniel J Blueman <daniel@numascale-asia.com>
Cc: Steffen Persvold <sp@numascale.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Link: http://lkml.kernel.org/r/1323073238-32686-1-git-send-email-daniel@numascale-asia.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reduce the startup time for slave cpus.
Adds hooks for an arch-specific function for clock calibration.
These hooks are used on x86. If a newly started cpu has the
same phys_proc_id as a core already active, uses the TSC for the
delay loop and has a CONSTANT_TSC, use the already-calculated
value of loops_per_jiffy.
This patch reduces the time required to start slave cpus on a
4096 cpu system from: 465 sec OLD 62 sec NEW
This reduces boot time on a 4096p system by almost 7 minutes.
Nice...
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: John Stultz <john.stultz@linaro.org>
[fix CONFIG_SMP=n build]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The last parameter to sort() is a pointer to the function used
to swap items. This parameter should be NULL, not 0, when not
used. This quiets the following sparse warning:
warning: Using plain integer as NULL pointer
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: hartleys@visionengravers.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel MID x86 platforms have a memory mapped virtual RTC
instead. No MID platform have the default ports (and
accessing them may do weird stuff).
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: feng.tang@intel.com
Cc: Feng Tang <feng.tang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace the bubble sort in sanitize_e820_map() with a call to
the generic kernel sort function to avoid pathological
performance with large maps.
On large (thousands of entries) E820 maps, the previous code
took minutes to run; with this change it's now milliseconds.
Signed-off-by: Mike Ditto <mditto@google.com>
Cc: sassmann@kpanic.de
Cc: yuenn@google.com
Cc: Stefan Assmann <sassmann@kpanic.de>
Cc: Nancy Yuen <yuenn@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ptrace_set_debugreg() is only used in this file and should be
static. This also quiets the following sparse warning:
warning: symbol 'ptrace_set_debugreg' was not declared. Should it be static?
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: hartleys@visionengravers.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Looks like on some Acer Aspire 1s with older bioses, reboot via bios
fails. It works on my machine, (with BIOS version 0.3310) but
not on some others (BIOS version 0.3309).
There's a log of problems at:
https://bbs.archlinux.org/viewtopic.php?id=124136
This patch adds a different callback to the reboot quirk table,
to allow rebooting via keybaord controller.
Reported-by: Uroš Vampl <mobile.leecher@gmail.com>
Tested-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Following is from Notes of section 11.5.3 of Intel processor
manual available at:
http://www.intel.com/Assets/PDF/manual/325384.pdf
For the Pentium 4 and Intel Xeon processors, after the sequence of
steps given above has been executed, the cache lines containing the
code between the end of the WBINVD instruction and before the
MTRRS have actually been disabled may be retained in the cache
hierarchy. Here, to remove code from the cache completely, a
second WBINVD instruction must be executed after the MTRRs have
been disabled.
This patch provides resolution for that.
Ideally, I will like to make changes only for Pentium 4 and Xeon
processors. But, I am not finding easier way to do it.
And, extra wbinvd() instruction does not hurt much for other
processors.
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Link: http://lkml.kernel.org/r/4EBD1CC5.3030008@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This follows on from the patch applied in 3.2rc1 which creates
an INTEL_MID configuration. We can now add the entry for
Medfield specific code. After this is merged the final patch
will be submitted which moves the rest of the device Kconfig
dependancies to MRST/MEDFIELD/INTEL_MID as appropriate.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The original macro worked only when applied to variables named
'evt'. While this could have been fixed by simply renaming the
macro argument, a more type-safe replacement is preferred.
Signed-off-by: Ferenc Wagner <wferi@niif.hu>
Cc: Venkatesh Pallipadi \(Venki\) <venki@google.com>
Link: http://lkml.kernel.org/r/8ed5c66c02041226e8cf8b4d5d6b41e543d90bd6.1321626272.git.wferi@niif.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In commit f8924e770e ("x86: unify mp_bus_info"), the 32-bit
and 64-bit versions of MP_bus_info were rearranged to match each
other better. Unfortunately it introduced a regression: prior
to that change we used to always set the mp_bus_not_pci bit,
then clear it if we found a PCI bus. After it, we set
mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave
it alone otherwise.
In the cases of ISA and PCI, there's not much difference. But
ISA is not the only non-PCI bus, so it's better to always set
mp_bus_not_pci and clear it only for PCI.
Without this change, Dan's Dell PowerEdge 4200 panics on boot
with a log indicating interrupt routing trouble unless the
"noapic" option is supplied. With this change, the machine
boots reliably without "noapic".
Fixes http://bugs.debian.org/586494
Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # 2.6.26+
Cc: Dan McGrath <troubledaemon@gmail.com>
Cc: Alexey Starikovskiy <aystarik@gmail.com>
[jrnieder@gmail.com: clarified commit message]
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The panic_on_stackoverflow variable needs to be avilable
on the 32-bit side as well ...
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/20111129060836.11076.12323.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dell OptiPlex 990 is known to require PCI reboot, so add it to
the reboot blacklist in pci_reboot_dmi_table[].
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/201111160019.51303.rjw@sisk.pl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This essentially reverts:
2b666859ec: x86: Default to vsyscall=native for now
The ABI breakage should now be fixed by:
commit 48c4206f5b02f28c4c78a1f5b491d3772fb64fb9
Author: Andy Lutomirski <luto@mit.edu>
Date: Thu Oct 20 08:48:19 2011 -0700
x86-64: Set siginfo and context on vsyscall emulation faults
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Adrian Bunk <bunk@stusta.de>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/93154af3b2b6d208906ae02d80d92cf60c6fa94f.1320712291.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To make this work, we teach the page fault handler how to send
signals on failed uaccess. This only works for user addresses
(kernel addresses will never hit the page fault handler in the
first place), so we need to generate signals for those
separately.
This gets the tricky case right: if the user buffer spans
multiple pages and only the second page is invalid, we set
cr2 and si_addr correctly. UML relies on this behavior to
"fault in" pages as needed.
We steal a bit from thread_info.uaccess_err to enable this.
Before this change, uaccess_err was a 32-bit boolean value.
This fixes issues with UML when vsyscall=emulate.
Reported-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4c8f91de7ec5cd2ef0f59521a04e1015f11e42b4.1320712291.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The previous patch modified the stop cpus path to use NMI
instead of IRQ as the way to communicate to the other cpus to
shutdown. There were some concerns that various machines may
have problems with using an NMI IPI.
This patch creates a selftest to check if NMI is working at
boot. The idea is to help catch any issues before the machine
panics and we learn the hard way.
Loosely based on the locking-selftest.c file, this separate file
runs a couple of simple tests and reports the results. The
output looks like:
...
Brought up 4 CPUs
----------------
| NMI testsuite:
--------------------
remote IPI: ok |
local IPI: ok |
--------------------
Good, all 2 testcases passed! |
---------------------------------
Total of 4 processors activated (21330.61 BogoMIPS).
...
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: seiji.aguchi@hds.com
Cc: vgoyal@redhat.com
Cc: mjg@redhat.com
Cc: tony.luck@intel.com
Cc: gong.chen@intel.com
Cc: satoru.moriya@hds.com
Cc: avi@redhat.com
Cc: Andi Kleen <andi@firstfloor.org>
Link: http://lkml.kernel.org/r/1318533267-18880-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A recent discussion started talking about the locking on the
pstore fs and how it relates to the kmsg infrastructure. We
noticed it was possible for userspace to r/w to the pstore fs
(grabbing the locks in the process) and block the panic path
from r/w to the same fs.
The reason was the cpu with the lock could be doing work while
the crashing cpu is panic'ing. Busting those spinlocks might
cause those cpus to step on each other's data. Fine, fair
enough.
It was suggested it would be nice to serialize the panic path
(ie stop the other cpus) and have only one cpu running. This
would allow us to bust the spinlocks and not worry about another
cpu stepping on the data.
Of course, smp_send_stop() does this in the panic case.
kmsg_dump() would have to be moved to be called after it. Easy
enough.
The only problem is on x86 the smp_send_stop() function calls
the REBOOT_VECTOR. Any cpu with irqs disabled (which pstore and
its backend ERST would do), block this IPI and thus do not stop.
This makes it difficult to reliably log data to the pstore fs.
The patch below switches from the REBOOT_VECTOR to NMI (and
mimics what kdump does). Switching to NMI allows us to deliver
the IPI when irqs are disabled, increasing the reliability of
this function.
However, Andi carefully noted that on some machines this
approach does not work because of broken BIOSes or whatever.
To help accomodate this, the next couple of patches will run a
selftest and provide a knob to disable.
V2:
uses atomic ops to serialize the cpu that shuts everyone down
V3:
comment cleanup
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robert Richter <robert.richter@amd.com>
Cc: seiji.aguchi@hds.com
Cc: vgoyal@redhat.com
Cc: mjg@redhat.com
Cc: tony.luck@intel.com
Cc: gong.chen@intel.com
Cc: satoru.moriya@hds.com
Cc: avi@redhat.com
Cc: Andi Kleen <andi@firstfloor.org>
Link: http://lkml.kernel.org/r/1318533267-18880-2-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There was a mixup when the SGI UV2 hub chip was sent to be
fabricated, and it ended up with the wrong part number in the
HRP_NODE_ID mmr. Future versions of the chip will (may) have the
correct part number. Change the UV infrastructure to recognize
both part numbers as valid IDs of a UV2 hub chip.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/20111129210058.GA20452@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The overflow checking of kernel stack checks if the stack
pointer points to the available kernel stack range, which is
derived from the original overflow checking.
It is clear that curbase address is always less than low
boundary of available kernel stack. So, this patch removes the
first condition that checks if the pointer is higher than
curbase.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060845.11076.40916.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Currently, messages are just output on the detection of stack
overflow, which is not sufficient for systems that need a
high reliability. This is because in general the overflow may
corrupt data, and the additional corruption may occur due to
reading them unless systems stop.
This patch adds the sysctl parameter
kernel.panic_on_stackoverflow and causes a panic when detecting
the overflows of kernel, IRQ and exception stacks except user
stack according to the parameter. It is disabled by default.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/20111129060836.11076.12323.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, only kernel stack is checked for the overflow, which
is not sufficient for systems that need a high reliability. To
enhance it, it is required to check the IRQ and exception
stacks, as well.
This patch checks all the stack types and will cause messages of
stacks in detail when free stack space drops below a certain
limit except user stack.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060829.11076.51733.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
The kernel stack overflow is checked in stack_overflow_check(),
which may wrongly detect the overflow if the stack pointer in
user space points to the kernel stack intentionally or
accidentally. So, the actual overflow is never detected after
this misdetection because WARN_ONCE() is used on the detection
of it.
This patch adds user-mode-vm checking before it to avoid this
problem and bails out early if the user stack is used.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060821.11076.55315.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
On AMD family 10h we see firmware bug messages like the following:
[Firmware Bug]: cpu 6, try to use APIC500 (LVT offset 0) for vector 0x10400, but the register is already in use for vector 0xf9 on another cpu
[Firmware Bug]: cpu 6, IBS interrupt offset 0 not available (MSRC001103A=0x0000000000000100)
[Firmware Bug]: using offset 1 for IBS interrupts
[Firmware Bug]: workaround enabled for IBS LVT offset
perf: AMD IBS detected (0x00000007)
We always see this, since the offsets are not assigned by the BIOS for
this family. Force LVT offset assignment in this case. If the OS
assignment fails, fallback to BIOS settings and try to setup this.
The fallback to BIOS settings weakens the family check since
force_ibs_eilvt_setup() may fail e.g. in case of virtual machines.
But setup may still succeed if BIOS offsets are correct.
Other families don't have a workaround implemented that assigns LVT
offsets. It's ok, to drop calling force_ibs_eilvt_setup() for that
families.
With the patch the [Firmware Bug] messages vanish. We see now:
IBS: LVT offset 1 assigned
perf: AMD IBS detected (0x00000007)
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111109162225.GO12451@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
People with old AMD chips are getting hung boots, because commit
bcb80e5387 ("x86, microcode, AMD: Add microcode revision to
/proc/cpuinfo") moved the microcode detection too early into
"early_init_amd()".
At that point we are *so* early in the booth that the exception tables
haven't even been set up yet, so the whole
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
doesn't actually work: if the rdmsr does a GP fault (due to non-existant
MSR register on older CPU's), we can't fix it up yet, and the boot fails.
Fix it by simply moving the code to a slightly later point in the boot
(init_amd() instead of early_init_amd()), since the kernel itself
doesn't even really care about the microcode patchlevel at this point
(or really ever: it's made available to user space in /proc/cpuinfo, and
updated if you do a microcode load).
Reported-tested-and-bisected-by: Larry Finger <Larry.Finger@lwfinger.net>
Tested-by: Bob Tracy <rct@gherkin.frus.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The idea behind commit d91ee5863b ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code. It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).
But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle. This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.
In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:
Brought up 2 CPUs
invalid opcode: 0000 [#1] SMP
Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
RIP: e030:[<ffffffff81015d1d>] [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
...
Call Trace:
[<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
[<ffffffff8149ee78>] cpu_bringup_and_idle+0xe/0x10
RIP [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
RSP <ffff8801d28ddf10>
In the case of amd_e400_idle we don't get so spectacular crashes, but we
do end up making an MSR which is trapped in the hypervisor, and then
follow it up with a yield hypercall. Meaning we end up going to
hypervisor twice instead of just once.
The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.
We want to do that, but only for one specific case: Xen. This patch
does that.
Fixes RH BZ #739499 and Ubuntu #881076
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts & resolutions:
* arch/x86/xen/setup.c
dc91c728fd "xen: allow extra memory to be in multiple regions"
24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."
conflicted on xen_add_extra_mem() updates. The resolution is
trivial as the latter just want to replace
memblock_x86_reserve_range() with memblock_reserve().
* drivers/pci/intel-iommu.c
166e9278a3 "x86/ia64: intel-iommu: move to drivers/iommu/"
5dfe8660a3 "bootmem: Replace work_with_active_regions() with..."
conflicted as the former moved the file under drivers/iommu/.
Resolved by applying the chnages from the latter on the moved
file.
* mm/Kconfig
6661672053 "memblock: add NO_BOOTMEM config symbol"
c378ddd53f "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"
conflicted trivially. Both added config options. Just
letting both add their own options resolves the conflict.
* mm/memblock.c
d1f0ece6cd "mm/memblock.c: small function definition fixes"
ed7b56a799 "memblock: Remove memblock_memory_can_coalesce()"
confliected. The former updates function removed by the
latter. Resolution is trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
The tsc code uses CLOCK_TICK_RATE which on x86
is defined to just be the same as PIT_TICK_RATE.
This patch updates the code use the later
as we want to depecrate and remove the global
CLOCK_TICK_RATE symbol.
Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Prevent tracing of preempt_disable() in get_cpu_var() in
kvm_clock_read(). When CONFIG_DEBUG_PREEMPT is enabled,
preempt_disable/enable() are traced and this causes the function_graph
tracer to go into an infinite recursion. By open coding the
preempt_disable() around the get_cpu_var(), we can use the notrace
version which prevents preempt_disable/enable() from being traced and
prevents the recursion.
Based on a similar patch for Xen from Jeremy Fitzhardinge.
Tested-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Generate system call tables and unistd_*.h automatically from the
tables in arch/x86/syscalls. All other information, like NR_syscalls,
is auto-generated, some of which is in asm-offsets_*.c.
This allows us to keep all the system call information in one place,
and allows for kernel space and user space to see different
information; this is currently used for the ia32 system call numbers
when building the 64-bit kernel, but will be used by the x32 ABI in
the near future.
This also removes some gratuitious differences between i386, x86-64
and ia32; in particular, now all system call tables are generated with
the same mechanism.
Cc: H. J. Lu <hjl.tools@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Adjust spacing for comment so that it matches the multiline comment
style used in the rest of the kernel, and remove word duplication.
It is not really clear what version of gcc this refers to, but the
extra & doesn't cause any harm, so there is no reason to remove it.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add support to specify which HSU port to use as an early console. This can
be selected by passing "earlyprintk=hsu<n>" on the kernel command line. By
default port 0 is still used.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The option iommu=group_mf indicates the that the iommu driver should
expose all functions of a multi-function PCI device as the same
iommu_device_group. This is useful for disallowing individual functions
being exposed as independent devices to userspace as there are often
hidden dependencies. Virtual functions are not affected by this option.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
It appears that stop_machine_text_poke() wants to be called on all CPUs,
like it's done from text_poke_smp(). Fix text_poke_smp_batch() to do
this.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1319702072-32676-1-git-send-email-rabin@rab.in
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the core offcore support is fixed up (thanks Stephane) and we
have sane generic events utilizing them, re-enable the raw access to
the feature as well.
Note that it doesn't matter if you use event 0x1b7 or 0x1bb to specify
an offcore event, either one works and neither guarantees you'll end
up on a particular offcore MSR.
Based on original patch from: Vince Weaver <vweaver1@eecs.utk.edu>.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>.
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1108031200390.703@cl320.eecs.utk.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
People (Linus) objected to using -ENOSPC to signal not having enough
resources on the PMU to satisfy the request. Use -EINVAL.
Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-xv8geaz2zpbjhlx0svmpp28n@git.kernel.org
[ merged to newer kernel, fixed up MIPS impact ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Masami spotted that we always try to decode the instruction stream as
64bit instructions when running a 64bit kernel, this doesn't work for
ia32-compat proglets.
Use TIF_IA32 to detect if we need to use the 32bit instruction
decoder.
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
with "apic=verbose" the print_IO_APIC() function tries to print
IRQ to pin mappings for every active irq. It assumes chip_data
is of type irq_cfg and may cause an oops if not.
As the print_IO_APIC() is called from a late_initcall other
chained irq chips may already be registered with custom
chip_data information, causing an oops. This is the case with
intel MID SoC devices with gpio demuxers registered as irq_chips.
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
[ -v2: fixed build failure ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Moorestown/Medfield platform does not have port 0x61 to report
NMI status, nor does it have external NMI sources. The only NMI
sources are from lapic, as results of perf counter overflow or
IPI, e.g. NMI watchdog or spin lock debug.
Reading port 0x61 on Moorestown will return 0xff which misled
NMI handlers to false critical errors such memory parity error.
The subsequent ioport access for NMI handling can also cause
undefined behavior on Moorestown.
This patch allows kernel process NMI due to watchdog or backrace
dump without unnecessary hangs.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[hand applied]
Signed-off-by: Alan Cox <alan@linux.intel.com>
lapic timer calibration can be combined with tsc in platform
specific calibration functions. if such calibration result is
obtained early, we can skip the redundant calibration loops.
Signed-off-by: Jacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
nr_legacy_irqs is set in probe_nr_irqs_gsi, we should not clear
it after that. Otherwise, the result is that MSI irqs will be
allocated from the wrong range for the systems without legacy
PIC.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some wall clock devices use MMIO based HW register, this new
function will give them a chance to do some initialization work
before their get/set_time service get called.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Arjan would like to make struct file_operations const, but
mce-inject directly writes to the mce_chrdev_ops to install its
write handler. In an ideal world mce-inject would have its own
character device, but we have a sizable legacy of test scripts
that hardwire "/dev/mcelog", so it would be painful to switch to
a separate device now. Instead, this patch switches to a stub
function in the mce code, with a registration helper that
mce-inject can call when it is loaded.
Note that this would also allow for a sane process to allow
mce-inject to be unloaded again (with an unregister function,
and appropriate module_{get,put}() calls), but that is left for
potential future patches.
Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4eb2e1971326651a3b@agluck-desktop.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'upstream/jump-label-noearly' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
jump-label: initialize jump-label subsystem much earlier
x86/jump_label: add arch_jump_label_transform_static()
s390/jump-label: add arch_jump_label_transform_static()
jump_label: add arch_jump_label_transform_static() to optimise non-live code updates
sparc/jump_label: drop arch_jump_label_text_poke_early()
x86/jump_label: drop arch_jump_label_text_poke_early()
jump_label: if a key has already been initialized, don't nop it out
stop_machine: make stop_machine safe and efficient to call early
jump_label: use proper atomic_t initializer
Conflicts:
- arch/x86/kernel/jump_label.c
Added __init_or_module to arch_jump_label_text_poke_early vs
removal of that function entirely
- kernel/stop_machine.c
same patch ("stop_machine: make stop_machine safe and efficient
to call early") merged twice, with whitespace fix in one version
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (21 commits)
MAINTAINERS: add an entry for Edac Sandy Bridge driver
edac: tag sb_edac as EXPERIMENTAL, as it requires more testing
EDAC: Fix incorrect edac mode reporting in sb_edac
edac: sb_edac: Add it to the building system
edac: Add an experimental new driver to support Sandy Bridge CPU's
i7300_edac: Fix error cleanup logic
i7core_edac: Initialize memory name with cpu, channel, bank
i7core_edac: Fix compilation on 32 bits arch
i7core_edac: scrubbing fixups
EDAC: Correct Kconfig dependencies
i7core_edac: return -ENODEV if no MC is found
i7core_edac: use edac's own way to print errors
MAINTAINERS: remove dropped edac_mce.* from the file
i7core_edac: Drop the edac_mce facility
x86, MCE: Use notifier chain only for MCE decoding
EDAC i7core: Use mce socketid for better compatibility
i7core_edac: Don't enable memory scrubbing for Xeon 35xx
i7core_edac: Add scrubbing support
edac: Move edac main structs to include/linux/edac.h
i7core_edac: Fix oops when trying to inject errors
...
Remove edac_mce pieces and use the normal MCE decoder notifier chain by
retaining the same functionality with considerably less code.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.
The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)
As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.
There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.
Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.
There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:
http://marc.info/?l=linux-mm&m=130105930902915&w=2
There is a basic man page for the proposed interface here:
http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.
For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:
http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These files were implicitly getting EXPORT_SYMBOL via device.h
which was including module.h, but that will be fixed up shortly.
By fixing these now, we can avoid seeing things like:
arch/x86/kernel/rtc.c:29: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
arch/x86/kernel/pci-dma.c:20: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
arch/x86/kernel/e820.c:69: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL_GPL’
[ with input from Randy Dunlap <rdunlap@xenotime.net> and also
from Stephen Rothwell <sfr@canb.auug.org.au> ]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
In removing the presence of <linux/module.h> from some of the
more common <linux/something.h> files, this implict include
of <linux/topology.h> was uncovered.
CC arch/x86/kernel/vsyscall_64.o
arch/x86/kernel/vsyscall_64.c: In function ‘vsyscall_set_cpu’:
arch/x86/kernel/vsyscall_64.c:259: error: implicit declaration of function ‘cpu_to_node’
Explicitly call it out so the cleanup can take place.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Drop the edac_mce custom hook in favor of the generic notifier
mechanism. Also, do not log the error to mcelog if the notified agent
was able to decode it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
* 'x86-rdrand-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, random: Verify RDRAND functionality and allow it to be disabled
x86, random: Architectural inlines to get random integers with RDRAND
random: Add support for architectural random hooks
Fix up trivial conflicts in drivers/char/random.c: the architectural
random hooks touched "get_random_int()" that was simplified to use MD5
and not do the keyptr thing any more (see commit 6e5714eaf7: "net:
Compute protocol sequence numbers and fragment IDs using MD5").
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode, AMD: Add microcode revision to /proc/cpuinfo
x86, microcode: Correct microcode revision format
coretemp: Get microcode revision from cpu_data
x86, intel: Use c->microcode for Atom errata check
x86, intel: Output microcode revision in /proc/cpuinfo
x86, microcode: Don't request microcode from userspace unnecessarily
Fix up trivial conflicts in arch/x86/kernel/cpu/amd.c (conflict between
moving AMD BSP code to cpu_dev helper function and adding AMD microcode
revision to /proc/cpuinfo code)
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Hyper-V: Integrate the clocksource with Hyper-V detection code
Fix up conflicts in drivers/staging/hv/Makefile manually (some of the hv
code has moved out of staging to drivers/hv/)
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, amd: Include linux/elf.h since we use stuff from asm/elf.h
x86: cache_info: Update calculation of AMD L3 cache indices
x86: cache_info: Kill the atomic allocation in amd_init_l3_cache()
x86: cache_info: Kill the moronic shadow struct
x86: cache_info: Remove bogus free of amd_l3_cache data
x86, amd: Include elf.h explicitly, prepare the code for the module.h split
x86-32, amd: Move va_align definition to unbreak 32-bit build
x86, amd: Move BSP code to cpu_dev helper
x86: Add a BSP cpu_dev helper
x86, amd: Avoid cache aliasing penalties on AMD family 15h
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, unistd: Remove bogus __IGNORE_getcpu
x86, mm, trivial: Remove unnecessary get_order() in free_thread_info()
x86, cleanup: Remove unneeded version.h include from arch/x86/
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64: Fix CFI data for interrupt frames
x86-64: Don't apply destructive erratum workaround on unaffected CPUs
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Standardize on CONFIG_SPARSE_IRQ=y
x86, ioapic: Clean up ioapic/apic_id usage
x86, ioapic: Factor out print_IO_APIC() to only print one io apic
x86, ioapic: Print out irte with right ioapic index
x86, ioapic: Split up setup_ioapic_entry()
x86, ioapic: Pass struct irq_attr * to setup_ioapic_irq()
apic, i386/bigsmp: Fix false warnings regarding logical APIC ID mismatches
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (121 commits)
perf symbols: Increase symbol KSYM_NAME_LEN size
perf hists browser: Refuse 'a' hotkey on non symbolic views
perf ui browser: Use libslang to read keys
perf tools: Fix tracing info recording
perf hists browser: Elide DSO column when it is set to just one DSO, ditto for threads
perf hists: Don't consider filtered entries when calculating column widths
perf hists: Don't decay total_period for filtered entries
perf hists browser: Honour symbol_conf.show_{nr_samples,total_period}
perf hists browser: Do not exit on tab key with single event
perf annotate browser: Don't change selection line when returning from callq
perf tools: handle endianness of feature bitmap
perf tools: Add prelink suggestion to dso update message
perf script: Fix unknown feature comment
perf hists browser: Apply the dso and thread filters when merging new batches
perf hists: Move the dso and thread filters from hist_browser
perf ui browser: Honour the xterm colors
perf top tui: Give color hints just on the percentage, like on --stdio
perf ui browser: Make the colors configurable and change the defaults
perf tui: Remove unneeded call to newtCls on startup
perf hists: Don't format the percentage on hist_entry__snprintf
...
Fix up conflicts in arch/x86/kernel/kprobes.c manually.
Ingo's tree did the insane "add volatile to const array", which just
doesn't make sense ("volatile const"?). But we could remove the const
*and* make the array volatile to make doubly sure that gcc doesn't
optimize it away..
Also fix up kernel/trace/ring_buffer.c non-data-conflicts manually: the
reader_lock has been turned into a raw lock by the core locking merge,
and there was a new user of it introduced in this perf core merge. Make
sure that new use also uses the raw accessor functions.
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock()
lockdep: Comment all warnings
lib: atomic64: Change the type of local lock to raw_spinlock_t
locking, lib/atomic64: Annotate atomic64_lock::lock as raw
locking, x86, iommu: Annotate qi->q_lock as raw
locking, x86, iommu: Annotate irq_2_ir_lock as raw
locking, x86, iommu: Annotate iommu->register_lock as raw
locking, dma, ipu: Annotate bank_lock as raw
locking, ARM: Annotate low level hw locks as raw
locking, drivers/dca: Annotate dca_lock as raw
locking, powerpc: Annotate uic->lock as raw
locking, x86: mce: Annotate cmci_discover_lock as raw
locking, ACPI: Annotate c3_lock as raw
locking, oprofile: Annotate oprofilefs lock as raw
locking, video: Annotate vga console lock as raw
locking, latencytop: Annotate latency_lock as raw
locking, timer_stats: Annotate table_lock as raw
locking, rwsem: Annotate inner lock as raw
locking, semaphores: Annotate inner lock as raw
locking, sched: Annotate thread_group_cputimer as raw
...
Fix up conflicts in kernel/posix-cpu-timers.c manually: making
cputimer->cputime a raw lock conflicted with the ABBA fix in commit
bcd5cff721 ("cputimer: Cure lock inversion").
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, ioapic: Consolidate the explicit EOI code
x86, ioapic: Restore the mask bit correctly in eoi_ioapic_irq()
x86, kdump, ioapic: Reset remote-IRR in clear_IO_APIC
iommu: Rename the DMAR and INTR_REMAP config options
x86, ioapic: Define irq_remap_modify_chip_defaults()
x86, msi, intr-remap: Use the ioapic set affinity routine
iommu: Cleanup ifdefs in detect_intel_iommu()
iommu: No need to set dmar_disabled in check_zero_address()
iommu: Move IOMMU specific code to intel-iommu.c
intr_remap: Call dmar_dev_scope_init() explicitly
x86, x2apic: Enable the bios request for x2apic optout
This allows jump-label entries to be cheaply updated on code which is
not yet live.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
It is no longer used.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
MAINTAINERS: linux-m32r is moderated for non-subscribers
linux@lists.openrisc.net is moderated for non-subscribers
Drop default from "DM365 codec select" choice
parisc: Kconfig: cleanup Kernel page size default
Kconfig: remove redundant CONFIG_ prefix on two symbols
cris: remove arch/cris/arch-v32/lib/nand_init.S
microblaze: add missing CONFIG_ prefixes
h8300: drop puzzling Kconfig dependencies
MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
tty: drop superfluous dependency in Kconfig
ARM: mxc: fix Kconfig typo 'i.MX51'
Fix file references in Kconfig files
aic7xxx: fix Kconfig references to READMEs
Fix file references in drivers/ide/
thinkpad_acpi: Fix printk typo 'bluestooth'
bcmring: drop commented out line in Kconfig
btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
doc: raw1394: Trivial typo fix
CIFS: Don't free volume_info->UNC until we are entirely done with it.
treewide: Correct spelling of successfully in comments
...
When compiling an i386_defconfig kernel with gcc-4.6.1-9.fc15.i686, I
noticed a warning about the asm operand for test_bit in kprobes'
can_boost. I discovered that this caused only the first long of
twobyte_is_boostable[] to be output.
Jakub filed and fixed gcc PR50571 to correct the warning and this output
issue. But to solve it for less current gcc, we can make kprobes'
twobyte_is_boostable[] non-const, and it won't be optimized out.
Before:
CC arch/x86/kernel/kprobes.o
In file included from include/linux/bitops.h:22:0,
from include/linux/kernel.h:17,
from [...]/arch/x86/include/asm/percpu.h:44,
from [...]/arch/x86/include/asm/current.h:5,
from [...]/arch/x86/include/asm/processor.h:15,
from [...]/arch/x86/include/asm/atomic.h:6,
from include/linux/atomic.h:4,
from include/linux/mutex.h:18,
from include/linux/notifier.h:13,
from include/linux/kprobes.h:34,
from arch/x86/kernel/kprobes.c:43:
[...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
[...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input
without lvalue in asm operand 1 is deprecated [enabled by default]
$ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
551: 0f a3 05 00 00 00 00 bt %eax,0x0
554: R_386_32 .rodata.cst4
$ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
arch/x86/kernel/kprobes.o: file format elf32-i386
Contents of section .data:
0000 48000000 00000000 00000000 00000000 H...............
Contents of section .rodata.cst4:
0000 4c030000 L...
Only a single long of twobyte_is_boostable[] is in the object file.
After, without the const on twobyte_is_boostable:
$ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
551: 0f a3 05 20 00 00 00 bt %eax,0x20
554: R_386_32 .data
$ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
arch/x86/kernel/kprobes.o: file format elf32-i386
Contents of section .data:
0000 48000000 00000000 00000000 00000000 H...............
0010 00000000 00000000 00000000 00000000 ................
0020 4c030000 0f000200 ffff0000 ffcff0c0 L...............
0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77 ....;.......&..w
Now all 32 bytes are output into .data instead.
Signed-off-by: Josh Stone <jistone@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable microcode revision output for AMD after 506ed6b53e ("x86,
intel: Output microcode revision in /proc/cpuinfo") did it for Intel.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
506ed6b53e ("x86, intel: Output microcode revision in /proc/cpuinfo")
added microcode revision format to /proc/cpuinfo and the MCE handler in
decimal format but both AMD and Intel patch levels are handled as hex
numbers. Fix it.
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
When compiling an i386_defconfig kernel with
gcc-4.6.1-9.fc15.i686, I noticed a warning about the asm operand
for test_bit in kprobes' can_boost. I discovered that this
caused only the first long of twobyte_is_boostable[] to be
output.
Jakub filed and fixed gcc PR50571 to correct the warning and
this output issue. But to solve it for less current gcc, we can
make kprobes' twobyte_is_boostable[] volatile, and it won't be
optimized out.
Before:
CC arch/x86/kernel/kprobes.o
In file included from include/linux/bitops.h:22:0,
from include/linux/kernel.h:17,
from [...]/arch/x86/include/asm/percpu.h:44,
from [...]/arch/x86/include/asm/current.h:5,
from [...]/arch/x86/include/asm/processor.h:15,
from [...]/arch/x86/include/asm/atomic.h:6,
from include/linux/atomic.h:4,
from include/linux/mutex.h:18,
from include/linux/notifier.h:13,
from include/linux/kprobes.h:34,
from arch/x86/kernel/kprobes.c:43:
[...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
[...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default]
$ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
551: 0f a3 05 00 00 00 00 bt %eax,0x0
554: R_386_32 .rodata.cst4
$ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
arch/x86/kernel/kprobes.o: file format elf32-i386
Contents of section .data:
0000 48000000 00000000 00000000 00000000 H...............
Contents of section .rodata.cst4:
0000 4c030000 L...
Only a single long of twobyte_is_boostable[] is in the object
file.
After, with volatile:
$ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
551: 0f a3 05 20 00 00 00 bt %eax,0x20
554: R_386_32 .data
$ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
arch/x86/kernel/kprobes.o: file format elf32-i386
Contents of section .data:
0000 48000000 00000000 00000000 00000000 H...............
0010 00000000 00000000 00000000 00000000 ................
0020 4c030000 0f000200 ffff0000 ffcff0c0 L...............
0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77 ....;.......&..w
Now all 32 bytes are output into .data instead.
Signed-off-by: Josh Stone <jistone@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Link: http://lkml.kernel.org/r/1318899645-4068-1-git-send-email-jistone@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the cpu update level is available the Atom PSE errata
check can use it directly without reading the MSR again.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1318466795-7393-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I got a request to make it easier to determine the microcode
update level on Intel CPUs. This patch adds a new "microcode"
field to /proc/cpuinfo.
The microcode level is also outputed on fatal machine checks
together with the other CPUID model information.
I removed the respective code from the microcode update driver,
it just reads the field from cpu_data. Also when the microcode
is updated it fills in the new values too.
I had to add a memory barrier to native_cpuid to prevent it
being optimized away when the result is not used.
This turns out to clean up further code which already got this
information manually. This is done in followon patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1318466795-7393-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Requesting the microcode from userspace *every time* when onlining CPUs
(during a CPU hotplug operation) is unnecessary. Thus, ensure that
once the kernel gets the microcode after booting, it is not freed nor
invalidated when a CPU goes offline, so that it can be reused when that
CPU comes back online, without requesting userspace for it again. As a
result, the CPU hotplug operations become faster as well.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/4E91F908.5010006@linux.vnet.ibm.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Sparseirq got introduced in v2.6.28 and Thomas did a huge cleanup
around v2.6.38 that eliminated basically all disadvantages
of it.
So we can remove non-sparseirq support now and simplify
our IRQ degrees of freedom a bit.
Suggested-and-acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/4E95E21D.6090200@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While looking at the code, apic_id sometime is referred to index
of ioapic, but sometime is used for phys apic id. and some even
use apic for real apic id. It is very confusing.
So try to limit apic_id or ioapic_id to be real apic id for
ioapic, and use ioapic_idx for ioapic index in the array.
-v2: Suggested by Ingo, use ioapic_idx consistently, instead of ioapic
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/4E9542DC.3090509@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It is getting too big after the interrupt remaping entries debug
print out was added.
Original print_IO_APIC() becomes print_IO_APICs().
New print_IO_APIC() will only print one ioapic's registers
As a side-effect this clean-up also made checkpatch.pl happier.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/4E9542D3.5000008@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo pointed out that setup_ioapic_entry() is way too big now.
Split the intr-remap code out into setup_ir_ioapic_entry().
Also pass struct io_apic_irq_attr * instead of 5 parameters
in those two functions.
At last in setup_ir_ioapic_entry() we don't need to panic.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/4E9542BB.4070807@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do not expand that struct, and just pass pointer to reduce the
number of parameters in related functions.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/4E9542B1.7050800@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This UML breakage:
linux-2.6.30.1[3800] vsyscall fault (exploit attempt?) ip:ffffffffff600000 cs:33 sp:7fbfb9c498 ax:ffffffffff600000 si:0 di:606790
linux-2.6.30.1[3856] vsyscall fault (exploit attempt?) ip:ffffffffff600000 cs:33 sp:7fbfb13168 ax:ffffffffff600000 si:0 di:606790
Is caused by commit 3ae36655 ("x86-64: Rework vsyscall emulation and add
vsyscall= parameter") - the vsyscall emulation code is not fully cooked
yet as UML relies on some rather fragile SIGSEGV semantics.
Linus suggested in https://lkml.org/lkml/2011/8/9/376 to default
to vsyscall=native for now, this patch implements that.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Andrew Lutomirski <luto@mit.edu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20111005214047.GE14406@localhost.pp.htv.fi
Signed-off-by: Ingo Molnar <mingo@elte.hu>
nmi.c needs an #include <linux/mca.h>:
arch/x86/kernel/nmi.c: In function ‘unknown_nmi_error’:
arch/x86/kernel/nmi.c:286:6: error: ‘MCA_bus’ undeclared (first use in this function)
arch/x86/kernel/nmi.c:286:6: note: each undeclared identifier is reported only once for each function it appears in
Another one is the hpwdt driver:
drivers/watchdog/hpwdt.c:507:9: error: ‘NMI_DONE’ undeclared (first use in this function)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch implements IBS feature detection and initialzation. The
code is shared between perf and oprofile. If IBS is available on the
system for perf, a pmu is setup.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1316597423-25723-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Moving IBS macros from oprofile to <asm/perf_event.h> to make it
available to perf. No additional changes.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1316597423-25723-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the NMI handler are broken into lists, increment the appropriate
stats for each list. This allows us to see what is going on when they
get printed out in the next patch.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317409584-23662-6-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Previous patches allow the NMI subsystem to process multipe NMI events
in one NMI. As previously discussed this can cause issues when an event
triggered another NMI but is processed in the current NMI. This causes the
next NMI to go unprocessed and become an 'unknown' NMI.
To handle this, we first have to flag whether or not the NMI handler handled
more than one event or not. If it did, then there exists a chance that
the next NMI might be already processed. Once the NMI is flagged as a
candidate to be swallowed, we next look for a back-to-back NMI condition.
This is determined by looking at the %rip from pt_regs. If it is the same
as the previous NMI, it is assumed the cpu did not have a chance to jump
back into a non-NMI context and execute code and instead handled another NMI.
If both of those conditions are true then we will swallow any unknown NMI.
There still exists a chance that we accidentally swallow a real unknown NMI,
but for now things seem better.
An optimization has also been added to the nmi notifier rountine. Because x86
can latch up to one NMI while currently processing an NMI, we don't have to
worry about executing _all_ the handlers in a standalone NMI. The idea is
if multiple NMIs come in, the second NMI will represent them. For those
back-to-back NMI cases, we have the potentail to drop NMIs. Therefore only
execute all the handlers in the second half of a detected back-to-back NMI.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317409584-23662-5-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Just convert all the files that have an nmi handler to the new routines.
Most of it is straight forward conversion. A couple of places needed some
tweaking like kgdb which separates the debug notifier from the nmi handler
and mce removes a call to notify_die.
[Thanks to Ying for finding out the history behind that mce call
https://lkml.org/lkml/2010/5/27/114
And Boris responding that he would like to remove that call because of it
https://lkml.org/lkml/2011/9/21/163]
The things that get converted are the registeration/unregistration routines
and the nmi handler itself has its args changed along with code removal
to check which list it is on (most are on one NMI list except for kgdb
which has both an NMI routine and an NMI Unknown routine).
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Corey Minyard <minyard@acm.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/1317409584-23662-4-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The NMI handlers used to rely on the notifier infrastructure. This worked
great until we wanted to support handling multiple events better.
One of the key ideas to the nmi handling is to process _all_ the handlers for
each NMI. The reason behind this switch is because NMIs are edge triggered.
If enough NMIs are triggered, then they could be lost because the cpu can
only latch at most one NMI (besides the one currently being processed).
In order to deal with this we have decided to process all the NMI handlers
for each NMI. This allows the handlers to determine if they recieved an
event or not (the ones that can not determine this will be left to fend
for themselves on the unknown NMI list).
As a result of this change it is now possible to have an extra NMI that
was destined to be received for an already processed event. Because the
event was processed in the previous NMI, this NMI gets dropped and becomes
an 'unknown' NMI. This of course will cause printks that scare people.
However, we prefer to have extra NMIs as opposed to losing NMIs and as such
are have developed a basic mechanism to catch most of them. That will be
a later patch.
To accomplish this idea, I unhooked the nmi handlers from the notifier
routines and created a new mechanism loosely based on doIRQ. The reason
for this is the notifier routines have a couple of shortcomings. One we
could't guarantee all future NMI handlers used NOTIFY_OK instead of
NOTIFY_STOP. Second, we couldn't keep track of the number of events being
handled in each routine (most only handle one, perf can handle more than one).
Third, I wanted to eventually display which nmi handlers are registered in
the system in /proc/interrupts to help see who is generating NMIs.
The patch below just implements the new infrastructure but doesn't wire it up
yet (that is the next patch). Its design is based on doIRQ structs and the
atomic notifier routines. So the rcu stuff in the patch isn't entirely untested
(as the notifier routines have soaked it) but it should be double checked in
case I copied the code wrong.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317409584-23662-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The nmi stuff is changing a lot and adding more functionality. Split it
out from the traps.c file so it doesn't continue to pollute that file.
This makes it easier to find and expand all the future nmi related work.
No real functional changes here.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317409584-23662-2-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel does not have guest/host-only bit in perf counters like AMD
does. To support GO/HO bits KVM needs to switch EVENTSELn values
(or PERF_GLOBAL_CTRL if available) at a guest entry. If a counter is
configured to count only in a guest mode it stays disabled in a host,
but VMX is configured to switch it to enabled value during guest entry.
This patch adds GO/HO tracking to Intel perf code and provides interface
for KVM to get a list of MSRs that need to be switched on a guest entry.
Only cpus with architectural PMU (v1 or later) are supported with this
patch. To my knowledge there is not p6 models with VMX but without
architectural PMU and p4 with VMX are rare and the interface is general
enough to support them if need arise.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317816084-18026-7-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The AMD perf-counters support counting in guest or host-mode
only. Make use of that feature when user-space specified
guest/host-mode only counting.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1317816084-18026-3-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The patch titled "x86: Don't use frame pointer to save old stack
on irq entry" did not properly adjust CFI directives, so this
patch is a follow-up to that one.
With the old stack pointer no longer stored in a callee-saved
register (plus some offset), we now have to use a CFA expression
to describe the memory location where it is being found. This
requires the use of .cfi_escape (allowing arbitrary byte streams
to be emitted into .eh_frame), as there is no
.cfi_def_cfa_expression (which also cannot reasonably be
expected, as it would require a full expression parser).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/4E8360200200007800058467@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These warnings (generally one per CPU) are a result of
initializing x86_cpu_to_logical_apicid while apic_default is
still in use, but the check in setup_local_APIC() being done
when apic_bigsmp was already used as an override in
default_setup_apic_routing():
Overriding APIC driver with bigsmp
Enabling APIC mode: Physflat. Using 5 I/O APICs
------------[ cut here ]------------
WARNING: at .../arch/x86/kernel/apic/apic.c:1239
...
CPU 1 irqstacks, hard=f1c9a000 soft=f1c9c000
Booting Node 0, Processors #1
smpboot cpu 1: start_ip = 9e000
Initializing CPU#1
------------[ cut here ]------------
WARNING: at .../arch/x86/kernel/apic/apic.c:1239
setup_local_APIC+0x137/0x46b() Hardware name: ...
CPU1 logical APIC ID: 2 != 8
...
Fix this (for the time being, i.e. until
x86_32_early_logical_apicid() will get removed again, as Tejun
says ought to be possible) by overriding the previously stored
values at the point where the APIC driver gets overridden.
v2: Move this and the pre-existing override logic into
arch/x86/kernel/apic/bigsmp_32.c.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@kernel.org> (2.6.39 and onwards)
Link: http://lkml.kernel.org/r/4E835D16020000780005844C@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After merging the moduleh tree, today's linux-next build (x86_64
allmodconfig) failed like this:
arch/x86/kernel/sys_x86_64.c:28:10: warning: 'enum align_flags' declared inside parameter list
arch/x86/kernel/sys_x86_64.c:28:10: warning: its scope is only this definition or declaration, which is probably not what you
want arch/x86/kernel/sys_x86_64.c:28:22: error: parameter 3 ('flags') has incomplete type
[...]
Presumably caused by the module.h split interacting with a
new commit dfb09f9b7a ("x86, amd: Avoid cache aliasing penalties
on AMD family 15h") from the x8 tree.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/20110928174214.17a58be15d84d67c185930e1@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix (rare) build error by adding <asm/apicdef.h> header file:
arch/x86/kernel/cpu/perf_event_amd.c:350:2: error: 'BAD_APICID' undeclared (first use in this function)
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Andre Przywara <andre.przywara@amd.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/4E820138.90301@xenotime.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are numerous broken references to Documentation files (in other
Documentation files, in comments, etc.). These broken references are
caused by typo's in the references, and by renames or removals of the
Documentation files. Some broken references are simply odd.
Fix these broken references, sometimes by dropping the irrelevant text
they were part of.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The CPU support for perf events on x86 was implemented via included C files
with #ifdefs. Clean this up by creating a new header file and compiling
the vendor-specific files as needed.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1314747665-2090-1-git-send-email-kjwinchester@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A deadlock was introduced on x86 in commit ef68c8f87e ("x86:
Serialize EFI time accesses on rtc_lock") because efi_get_time()
and friends can be called with rtc_lock already held by
read_persistent_time(), e.g.:
timekeeping_init()
read_persistent_clock() <-- acquire rtc_lock
efi_get_time()
phys_efi_get_time() <-- acquire rtc_lock <DEADLOCK>
To fix this let's push the locking down into the get_wallclock()
and set_wallclock() implementations. Only the clock
implementations that access the x86 RTC directly need to acquire
rtc_lock, so it makes sense to push the locking down into the
rtc, vrtc and efi code.
The virtualization implementations don't require rtc_lock to be
held because they provide their own serialization.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Acked-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Avi Kivity <avi@redhat.com> [for the virtualization aspect]
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Josh Boyer <jwboyer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a workaround for a UV2 hub bug that affects the format of system
global addresses.
The GRU API for UV2 was inadvertently broken by a hardware change. The
format of the physical address used for TLB dropins and for addresses used
with instructions running in unmapped mode has changed. This change was
not documented and became apparent only when diags failed running on
system simulators.
For UV1, TLB and GRU instruction physical addresses are identical to
socket physical addresses (although high NASID bits must be OR'ed into the
address).
For UV2, socket physical addresses need to be converted. The NODE portion
of the physical address needs to be shifted so that the low bit is in bit
39 or bit 40, depending on an MMR value.
It is not yet clear if this bug will be fixed in a silicon respin. If it
is fixed, the hub revision will be incremented & the workaround disabled.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For older IO-APIC's, we were clearing the remote-IRR by changing
the RTE trigger mode to edge and then back to level. We wanted
to mask the RTE during this process, so we were essentially
doing mask+edge and then to unmask+level.
As part of the commit ca64c47cec,
we moved this EOI process earlier where the IO-APIC RTE is
masked. So we were wrongly unmasking it in the eoi_ioapic_irq().
So change the remote-IRR clear sequence in eoi_ioapic_irq() to
mask + edge and then restore the previous RTE entry which will
restore the mask status as well as the level trigger.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Rafael Wysocki <rjw@novell.com>
Cc: lchiquitto@novell.com
Cc: jbeulich@novell.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20110825190657.210286410@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the kdump scenario mentioned below, we can have a case where
the device using level triggered interrupt will not generate any
interrupts in the kdump kernel.
1. IO-APIC sends a level triggered interrupt to the CPU's local APIC.
2. Kernel crashed before the CPU services this interrupt, leaving
the remote-IRR in the IO-APIC set.
3. kdump kernel boot sequence does clear_IO_APIC() as part of IO-APIC
initialization. But this fails to reset remote-IRR bit of the
IO-APIC RTE as the remote-IRR bit is read-only.
4. Device using that level triggered entry can't generate any
more interrupts because of the remote-IRR bit.
In clear_IO_APIC_pin(), check if the remote-IRR bit is set and if
so do an explicit attempt to clear it (by doing EOI write on
modern io-apic's and changing trigger mode to edge/level on
older io-apic's). Also before doing the explicit EOI to the
io-apic, ensure that the trigger mode is indeed set to level.
This will enable the explicit EOI to the io-apic to reset the
remote-IRR bit.
Tested-by: Leonardo Chiquitto <lchiquitto@novell.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Fixes: https://bugzilla.novell.com/show_bug.cgi?id=701686
Cc: Rafael Wysocki <rjw@novell.com>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Thomas Renninger <trenn@suse.de>
Cc: jbeulich@novell.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20110825190657.157502602@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On the platforms which are x2apic and interrupt-remapping
capable, Linux kernel is enabling x2apic even if the BIOS
doesn't. This is to take advantage of the features that x2apic
brings in.
Some of the OEM platforms are running into issues because of
this, as their bios is not x2apic aware. For example, this was
resulting in interrupt migration issues on one of the platforms.
Also if the BIOS SMI handling uses APIC interface to send SMI's,
then the BIOS need to be aware of x2apic mode that OS has
enabled.
On some of these platforms, BIOS doesn't have a HW mechanism to
turnoff the x2apic feature to prevent OS from enabling it.
To resolve this mess, recent changes to the VT-d2 specification:
http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf
includes a mechanism that provides BIOS a way to request system
software to opt out of enabling x2apic mode.
Look at the x2apic optout flag in the DMAR tables before
enabling the x2apic mode in the platform. Also print a warning
that we have disabled x2apic based on the BIOS request.
Kernel boot parameter "intremap=no_x2apic_optout" can be used to
override the BIOS x2apic optout request.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yinghai@kernel.org
Cc: joerg.roedel@amd.com
Cc: tony.luck@intel.com
Cc: dwmw2@infradead.org
Link: http://lkml.kernel.org/r/20110824001456.171766616@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch fixes the typo in parameters passed to
x86_32 switch_to() description.
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
del_timer_sync() can cause a deadlock when called in interrupt context.
It is used with on_each_cpu() in some parts for sysfs files like bank*,
check_interval, cmci_disabled and ignore_ce.
However, use of on_each_cpu() results in calling the function passed
as the argument in interrupt context. This causes a flood of nested
warnings from del_timer_sync() (it runs on each CPU) caused even by a
simple file access like:
$ echo 300 > /sys/devices/system/machinecheck/machinecheck0/check_interval
Fortunately, these MCE-specific files are rarely used and AFAIK only few
MCE geeks experience this warning.
To remove the warning, move timer deletion outside of the interrupt
context.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The cmci_discover_lock can be taken in atomic context (cpu bring
up sequence) and therefore cannot be preempted on -rt.
In mainline this change documents the low level nature of
the lock - otherwise there's no functional difference. Lockdep
and Sparse checking will work as usual.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
L3 subcaches 0 and 1 of AMD Family 15h CPUs can have a size of 2MB.
Update the calculation routine for the number of L3 indices to
reflect that.
Signed-off-by: Frank Arnold <frank.arnold@amd.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Rosenfeld Hans <Hans.Rosenfeld@amd.com>
Cc: Herrmann3 Andreas <Andreas.Herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Frank Arnold <Frank.Arnold@amd.com>
Link: http://lkml.kernel.org/r/20110726170449.GB32536@aftab
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's not a good reason to allocate memory in the smp function call
just because someone thought it's the most conveniant place.
The AMD L3 data is coupled to the northbridge info by a pointer to the
corresponding north bridge data. So allocating it with the northbridge
data and referencing the northbridge in the cache_info code instead
uses less memory and gets rid of that atomic allocation hack in the
smp function call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/20110723212626.688229918@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit f9b90566c ("x86: reduce stack usage in init_intel_cacheinfo")
introduced a shadow structure to reduce the stack usage on large
machines instead of making the smaller structure embedded into the
large one. That's definitely a candidate for the bad taste award.
Move the small struct into the large one and get rid of the ugly type
casts.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/20110723212626.625651773@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
free_cache_attributes() kfree's:
per_cpu(ici_cpuid4_info, cpu)->l3
which is a pointer to memory which was allocated as a block in
amd_init_l3_cache(). l3 of a particular cpu points to a part of this
memory blob. The part and the rest of the blob are still referenced by
other cpus.
As far as I can tell from the git history this is a leftover from the
conversion from per cpu to node data with commit ba06edb63(x86,
cacheinfo: Make L3 cache info per node) and the following commit
f658bcfb2(x86, cacheinfo: Cleanup L3 cache index disable support)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Mike Travis <travis@sgi.com>
Link: http://lkml.kernel.org/r/20110723212626.550539989@linutronix.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The nfsservctl system call is now gone, so we should remove all
linkage for it.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
entry_32.S contained a hardcoded alternative instruction entry, and the
format changed in commit 59e97e4d6f ("x86: Make alternative
instruction pointers relative").
Replace the hardcoded entry with the altinstruction_entry macro. This
fixes the 32-bit boot with CONFIG_X86_INVD_BUG=y.
Reported-and-tested-by: Arnaud Lacombe <lacombar@gmail.com>
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While removing custom rendezvous code and switching to stop_machine,
commit 192d885742 ("x86, mtrr: use stop_machine APIs for doing MTRR
rendezvous") completely dropped mtrr setting code on !CONFIG_SMP
breaking MTRR settting on UP.
Fix it by removing the incorrect CONFIG_SMP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Anders Eriksson <aeriksson@fastmail.fm>
Tested-and-acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32, vdso: On system call restart after SYSENTER, use int $0x80
x86, UV: Remove UV delay in starting slave cpus
x86, olpc: Wait for last byte of EC command to be accepted
Because THREAD_SIZE is defined as PAGE_SIZE << THREAD_ORDER on x86, the
call of get_order(THREAD_SIZE) can be replaced with THREAD_ORDER.
Signed-off-by: Zhao Jin <cronozhj@gmail.com>
Link: http://lkml.kernel.org/r/4E4FB5A9.700@gmail.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On -rt kfree() can schedule, but CPU_STARTING is before the CPU is
fully up and running. These are contradictory, so avoid it. Instead
push the kfree() to CPU_ONLINE where we're free to schedule.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kwd4j6ayld5thrscvaxgjquv@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-tip:
x86-64: Rework vsyscall emulation and add vsyscall= parameter
x86-64: Wire up getcpu syscall
x86: Remove unnecessary compile flag tweaks for vsyscall code
x86-64: Add vsyscall:emulate_vsyscall trace event
x86-64: Add user_64bit_mode paravirt op
x86-64, xen: Enable the vvar mapping
x86-64: Work around gold bug 13023
x86-64: Move the "user" vsyscall segment out of the data segment.
x86-64: Pad vDSO to a page boundary
There are three choices:
vsyscall=native: Vsyscalls are native code that issues the
corresponding syscalls.
vsyscall=emulate (default): Vsyscalls are emulated by instruction
fault traps, tested in the bad_area path. The actual contents of
the vsyscall page is the same as the vsyscall=native case except
that it's marked NX. This way programs that make assumptions about
what the code in the page does will not be confused when they read
that code.
vsyscall=none: Trying to execute a vsyscall will segfault.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/8449fb3abf89851fd6b2260972666a6f82542284.1312988155.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
As of commit 98d0ac38ca
Author: Andy Lutomirski <luto@mit.edu>
Date: Thu Jul 14 06:47:22 2011 -0400
x86-64: Move vread_tsc and vread_hpet into the vDSO
user code no longer directly calls into code in arch/x86/kernel/, so
we don't need compile flag hacks to make it safe. All vdso code is
in the vdso directory now.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/835cd05a4c7740544d09723d6ba48f4406f9826c.1312988155.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
When the moduleu.h splitting tree is merged to the latest
tip:x86/cpu tree, the x86_64 allmodconfig build fails like this:
arch/x86/kernel/cpu/amd.c: In function 'bsp_init_amd':
arch/x86/kernel/cpu/amd.c:437:3: error: 'va_align' undeclared (first use in this function)
arch/x86/kernel/cpu/amd.c:438:23: error: 'ALIGN_VA_32' undeclared (first use in this function)
arch/x86/kernel/cpu/amd.c:438:37: error: 'ALIGN_VA_64' undeclared (first use in this function)
This is caused by the module.h split up intreacting with commit
dfb09f9b7a ("x86, amd: Avoid cache aliasing penalties on AMD
family 15h") from the tip:x86/cpu tree.
I have added the following patch for today (this, or something
similar, could be applied to the tip tree directly - the
export.h include below was added by the module.h splitup).
So include elf.h to use va_align and remove this implicit
dependency on module.h doing it for us.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/20110810114956.238d66772883636e3040d29f@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add support to Romely-EP SandyBridge.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Anhua Xu <anhua.xu@intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312264895-2010-1-git-send-email-youquan.song@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
hpa reported that dfb09f9b7a breaks 32-bit
builds with the following error message:
/home/hpa/kernel/linux-tip.cpu/arch/x86/kernel/cpu/amd.c:437: undefined
reference to `va_align'
/home/hpa/kernel/linux-tip.cpu/arch/x86/kernel/cpu/amd.c:436: undefined
reference to `va_align'
This is due to the fact that va_align is a global in a 64-bit only
compilation unit. Move it to mmap.c where it is visible to both
subarches.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1312633899-1131-1-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Delete the 10 msec delay between the INIT and SIPI when starting
slave cpus. I can find no requirement for this delay. BIOS also
has similar code sequences without the delay.
Removing the delay reduces boot time by 40 sec. Every bit helps.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20110805140900.GA6774@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move code which is run once on the BSP during boot into the cpu_dev
helper.
[ hpa: removed bogus cpu_has -> static_cpu_has conversion ]
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/20110805180409.GC26217@aftab
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add a function ptr to struct cpu_dev which is destined to be run only
once on the BSP during boot.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/20110805180116.GB26217@aftab
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch provides performance tuning for the "Bulldozer" CPU. With its
shared instruction cache there is a chance of generating an excessive
number of cache cross-invalidates when running specific workloads on the
cores of a compute module.
This excessive amount of cross-invalidations can be observed if cache
lines backed by shared physical memory alias in bits [14:12] of their
virtual addresses, as those bits are used for the index generation.
This patch addresses the issue by clearing all the bits in the [14:12]
slice of the file mapping's virtual address at generation time, thus
forcing those bits the same for all mappings of a single shared library
across processes and, in doing so, avoids instruction cache aliases.
It also adds the command line option "align_va_addr=(32|64|on|off)" with
which virtual address alignment can be enabled for 32-bit or 64-bit x86
individually, or both, or be completely disabled.
This change leaves virtual region address allocation on other families
and/or vendors unaffected.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1312550110-24160-2-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Three places in the kernel assume that the only long mode CPL 3
selector is __USER_CS. This is not true on Xen -- Xen's sysretq
changes cs to the magic value 0xe033.
Two of the places are corner cases, but as of "x86-64: Improve
vsyscall emulation CS and RIP handling"
(c9712944b2), vsyscalls will segfault
if called with Xen's extra CS selector. This causes a panic when
older init builds die.
It seems impossible to make Xen use __USER_CS reliably without
taking a performance hit on every system call, so this fixes the
tests instead with a new paravirt op. It's a little ugly because
ptrace.h can't include paravirt.h.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/f4fcb3947340d9e96ce1054a432f183f9da9db83.1312378163.git.luto@mit.edu
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Gold has trouble assigning numbers to the location counter inside of
an output section description. The bug was triggered by
9fd67b4ed0, which consolidated all of
the vsyscall sections into a single section. The workaround is IMO
still nicer than the old way of doing it.
This produces an apparently valid kernel image and passes my vdso
tests on both GNU ld version 2.21.51.0.6-2.fc15 20110118 and GNU
gold (version 2.21.51.0.6-2.fc15 20110118) 1.10 as distributed by
Fedora 15.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/0b260cb806f1f9a25c00ce8377a5f035d57f557a.1312378163.git.luto@mit.edu
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
cpuidle: stop depending on pm_idle
x86 idle: move mwait_idle_with_hints() to where it is used
cpuidle: replace xen access to x86 pm_idle and default_idle
cpuidle: create bootparam "cpuidle.off=1"
mrst_pmu: driver for Intel Moorestown Power Management Unit
cpuidle users should call cpuidle_call_idle() directly
rather than via (pm_idle)() function pointer.
Architecture may choose to continue using (pm_idle)(),
but cpuidle need not depend on it:
my_arch_cpu_idle()
...
if(cpuidle_call_idle())
pm_idle();
cc: Kevin Hilman <khilman@deeprootsystems.com>
cc: Paul Mundt <lethal@linux-sh.org>
cc: x86@kernel.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
...and make it static
no functional change
cc: x86@kernel.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
If the CPU declares that RDRAND is available, go through a guranteed
reseed sequence, and make sure that it is actually working (producing
data.) If it does not, disable the CPU feature flag.
Allow RDRAND to be disabled on the command line (as opposed to at
compile time) for a user who has special requirements with regards to
random numbers.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "Theodore Ts'o" <tytso@mit.edu>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
fs: Merge split strings
treewide: fix potentially dangerous trailing ';' in #defined values/expressions
uwb: Fix misspelling of neighbourhood in comment
net, netfilter: Remove redundant goto in ebt_ulog_packet
trivial: don't touch files that are removed in the staging tree
lib/vsprintf: replace link to Draft by final RFC number
doc: Kconfig: `to be' -> `be'
doc: Kconfig: Typo: square -> squared
doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
drivers/net: static should be at beginning of declaration
drivers/media: static should be at beginning of declaration
drivers/i2c: static should be at beginning of declaration
XTENSA: static should be at beginning of declaration
SH: static should be at beginning of declaration
MIPS: static should be at beginning of declaration
ARM: static should be at beginning of declaration
rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
Update my e-mail address
PCIe ASPM: forcedly -> forcibly
gma500: push through device driver tree
...
Fix up trivial conflicts:
- arch/arm/mach-ep93xx/dma-m2p.c (deleted)
- drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
- drivers/net/r8169.c (just context changes)
This patch removes all the module loader hook implementations in the
architecture specific code where the functionality is the same as that
now provided by the recently added default hooks.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This patch implements the kvm bits of the steal time infrastructure.
The most important part of it, is the steal time clock. It is an
continuous clock that shows the accumulated amount of steal time
since vcpu creation. It is supposed to survive cpu offlining/onlining.
[marcelo: fix build with CONFIG_KVM_GUEST=n]
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Eric B Munson <emunson@mgebm.net>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Avi Kivity <avi@redhat.com>
CC: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* 'x86-detect-hyper-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, hyper: Change hypervisor detection order
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32, fpu: Fix DNA exception during check_fpu()
* 'x86-kexec-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
kexec, x86: Fix incorrect jump back address if not preserving context
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, config: Introduce an INTEL_MID configuration
* 'x86-quirks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, quirks: Use pci_dev->revision
* 'x86-tsc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: tsc: Remove unneeded DMI-based blacklisting
* 'x86-smpboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, boot: Wait for boot cpu to show up if nr_cpus limit is about to hit
* 'timers-clocksource-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clocksource: apb: Share APB timer code with other platforms
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64, vdso: Do not allocate memory for the vDSO
clocksource: Change __ARCH_HAS_CLOCKSOURCE_DATA to a CONFIG option
x86, vdso: Drop now wrong comment
Document the vDSO and add a reference parser
ia64: Replace clocksource.fsys_mmio with generic arch data
x86-64: Move vread_tsc and vread_hpet into the vDSO
clocksource: Replace vread with generic arch data
x86-64: Add --no-undefined to vDSO build
x86-64: Allow alternative patching in the vDSO
x86: Make alternative instruction pointers relative
x86-64: Improve vsyscall emulation CS and RIP handling
x86-64: Emulate legacy vsyscalls
x86-64: Fill unused parts of the vsyscall page with 0xcc
x86-64: Remove vsyscall number 3 (venosys)
x86-64: Map the HPET NX
x86-64: Remove kernel.vsyscall64 sysctl
x86-64: Give vvars their own page
x86-64: Document some of entry_64.S
x86-64: Fix alignment of jiffies variable
* 'x86-signal-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Kill handle_signal()->set_fs()
x86, do_signal: Simplify the TS_RESTORE_SIGMASK logic
x86, signals: Convert the X86_32 code to use set_current_blocked()
x86, signals: Convert the IA32_EMULATION code to use set_current_blocked()
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mce: Use mce_sysdev_ prefix to group functions
x86, mce: Use mce_chrdev_ prefix to group functions
x86, mce: Cleanup mce_read()
x86, mce: Cleanup mce_create()/remove_device()
x86, mce: Check the result of ancient_init()
x86, mce: Introduce mce_gather_info()
x86, mce: Replace MCM_ with MCI_MISC_
x86, mce: Replace MCE_SELF_VECTOR by irq_work
x86, mce, severity: Clean up trivial coding style problems
x86, mce, severity: Cleanup severity table
x86, mce, severity: Make formatting a bit more readable
x86, mce, severity: Fix two severities table signatures
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, smpboot: Mark the names[] array in __inquire_remote_apic() as const
x86: Convert vmalloc()+memset() to vzalloc()
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, ioapic: Print IR_IO_APIC_route_entry when IR is enabled
x86, ioapic: Print IRTE when IR is enabled
x86, x2apic: Preserve high 32-bits of IA32_APIC_BASE MSR
x86, ioapic: Also print Dest field
x86, ioapic: Format clean up for IOAPIC output
x86: print APIC data a little later during boot
* 'timers-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
mips: Fix i8253 clockevent fallout
i8253: Cleanup outb/inb magic
arm: Footbridge: Use common i8253 clockevent
mips: Use common i8253 clockevent
x86: Use common i8253 clockevent
i8253: Create common clockevent implementation
i8253: Export i8253_lock unconditionally
pcpskr: MIPS: Make config dependencies finer grained
pcspkr: Cleanup Kconfig dependencies
i8253: Move remaining content and delete asm/i8253.h
i8253: Consolidate definitions of PIT_LATCH
x86: i8253: Consolidate definitions of global_clock_event
i8253: Alpha, PowerPC: Remove unused asm/8253pit.h
alpha: i8253: Cleanup remaining users of i8253pit.h
i8253: Remove I8253_LOCK config
i8253: Make pcsp sound driver use the shared i8253_lock
i8253: Make pcspkr input driver use the shared i8253_lock
i8253: Consolidate all kernel definitions of i8253_lock
i8253: Unify all kernel declarations of i8253_lock
i8253: Create linux/i8253.h and use it in all 8253 related files
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (123 commits)
perf: Remove the nmi parameter from the oprofile_perf backend
x86, perf: Make copy_from_user_nmi() a library function
perf: Remove perf_event_attr::type check
x86, perf: P4 PMU - Fix typos in comments and style cleanup
perf tools: Make test use the preset debugfs path
perf tools: Add automated tests for events parsing
perf tools: De-opt the parse_events function
perf script: Fix display of IP address for non-callchain path
perf tools: Fix endian conversion reading event attr from file header
perf tools: Add missing 'node' alias to the hw_cache[] array
perf probe: Support adding probes on offline kernel modules
perf probe: Add probed module in front of function
perf probe: Introduce debuginfo to encapsulate dwarf information
perf-probe: Move dwarf library routines to dwarf-aux.{c, h}
perf probe: Remove redundant dwarf functions
perf probe: Move strtailcmp to string.c
perf probe: Rename DIE_FIND_CB_FOUND to DIE_FIND_CB_END
tracing/kprobe: Update symbol reference when loading module
tracing/kprobes: Support module init function probing
kprobes: Return -ENOENT if probe point doesn't exist
...
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
iommu/core: Fix build with INTR_REMAP=y && CONFIG_DMAR=n
iommu/amd: Don't use MSI address range for DMA addresses
iommu/amd: Move missing parts to drivers/iommu
iommu: Move iommu Kconfig entries to submenu
x86/ia64: intel-iommu: move to drivers/iommu/
x86: amd_iommu: move to drivers/iommu/
msm: iommu: move to drivers/iommu/
drivers: iommu: move to a dedicated folder
x86/amd-iommu: Store device alias as dev_data pointer
x86/amd-iommu: Search for existind dev_data before allocting a new one
x86/amd-iommu: Allow dev_data->alias to be NULL
x86/amd-iommu: Use only dev_data in low-level domain attach/detach functions
x86/amd-iommu: Use only dev_data for dte and iotlb flushing routines
x86/amd-iommu: Store ATS state in dev_data
x86/amd-iommu: Store devid in dev_data
x86/amd-iommu: Introduce global dev_data_list
x86/amd-iommu: Remove redundant device_flush_dte() calls
iommu-api: Add missing header file
Fix up trivial conflicts (independent additions close to each other) in
drivers/Makefile and include/linux/pci.h
* 'of-pci' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
pci/of: Consolidate pci_bus_to_OF_node()
pci/of: Consolidate pci_device_to_OF_node()
x86/devicetree: Use generic PCI <-> OF matching
microblaze/pci: Move the remains of pci_32.c to pci-common.c
microblaze/pci: Remove powermac originated cruft
pci/of: Match PCI devices to OF nodes dynamically
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1287 commits)
icmp: Fix regression in nexthop resolution during replies.
net: Fix ppc64 BPF JIT dependencies.
acenic: include NET_SKB_PAD headroom to incoming skbs
ixgbe: convert to ndo_fix_features
ixgbe: only enable WoL for magic packet by default
ixgbe: remove ifdef check for non-existent define
ixgbe: Pass staterr instead of re-reading status and error bits from descriptor
ixgbe: Move interrupt related values out of ring and into q_vector
ixgbe: add structure for containing RX/TX rings to q_vector
ixgbe: inline the ixgbe_maybe_stop_tx function
ixgbe: Update ATR to use recorded TX queues instead of CPU for routing
igb: Fix for DH89xxCC near end loopback test
e1000: always call e1000_check_for_link() on e1000_ce4100 MACs.
netxen: add fw version compatibility check
be2net: request native mode each time the card is reset
ipv4: Constrain UFO fragment sizes to multiples of 8 bytes
virtio_net: Fix panic in virtnet_remove
ipv6: make fragment identifications less predictable
ipv6: unshare inetpeers
can: make function can_get_bittiming static
...
The Host used to create some page tables for the Guest to use at the
top of Guest memory; it would then tell the Guest where this was. In
particular, it created linear mappings for 0 and 0xC0000000 addresses
because lguest used to switch to its real page tables quite late in
boot.
However, since d50d8fe19 Linux initialized boot page tables in
head_32.S even before the "are we lguest?" boot jump. So, now we can
simplify things: the Host pagetable code assumes 1:1 linear mapping
until it first calls the LHCALL_NEW_PGTABLE hypercall, which we now do
before we reach C code.
This also means that the Host doesn't need to know anything about the
Guest's PAGE_OFFSET. (Non-Linux guests might not even have such a
thing).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Yet another variant of the Dell Latitude series which requires
reboot=pci.
From the E5420 bug report by Daniel J Blueman:
> The E6420 is affected also (same platform, different casing and
> features), which provides an external confirmation of the issue; I can
> submit a patch for that later or include it if you prefer:
> http://linux.koolsolutions.com/2009/08/04/howto-fix-linux-hangfreeze-during-reboots-and-restarts/
Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@kernel.org>
Rebooting on the Dell E5420 often hangs with the keyboard or ACPI
methods, but is reliable via the PCI method.
[ hpa: this was deferred because we believed for a long time that the
recent reshuffling of the boot priorities in commit
660e34cebf fixed this platform.
Unfortunately that turned out to be incorrect. ]
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Link: http://lkml.kernel.org/r/1305248699-2347-1-git-send-email-daniel.blueman@gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@kernel.org>
copy_from_user_nmi() is used in oprofile and perf. Moving it to other
library functions like copy_from_user(). As this is x86 code for 32
and 64 bits, create a new file usercopy.c for unified code.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110607172413.GJ20052@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch:
- fixes typos in comments and clarifies the text
- renames obscure p4_event_alias::original and ::alter members to
::original and ::alternative as appropriate
- drops parenthesis from the return of p4_get_alias_event()
No functional changes.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/20110721160625.GX7492@sun
Signed-off-by: Ingo Molnar <mingo@elte.hu>
All these are instances of
#define NAME value;
or
#define NAME(params_opt) value;
These of course fail to build when used in contexts like
if(foo $OP NAME)
while(bar $OP NAME)
and may silently generate the wrong code in contexts such as
foo = NAME + 1; /* foo = value; + 1; */
bar = NAME - 1; /* bar = value; - 1; */
baz = NAME & quux; /* baz = value; & quux; */
Reported on comp.lang.c,
Message-ID: <ab0d55fe-25e5-482b-811e-c475aa6065c3@c29g2000yqd.googlegroups.com>
Initial analysis of the dangers provided by Keith Thompson in that thread.
There are many more instances of more complicated macros having unnecessary
trailing semicolons, but this pile seems to be all of the cases of simple
values suffering from the problem. (Thus things that are likely to be found
in one of the contexts above, more complicated ones aren't.)
Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
In kexec jump support, jump back address passed to the kexeced
kernel via function calling ABI, that is, the function call
return address is the jump back entry.
Furthermore, jump back entry == 0 should be used to signal that
the jump back or preserve context is not enabled in the original
kernel.
But in the current implementation the stack position used for
function call return address is not cleared context
preservation is disabled. The patch fixes this bug.
Reported-and-tested-by: Yin Kangkai <kangkai.yin@intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1310607277-25029-1-git-send-email-ying.huang@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This code uses PCI_CLASS_REVISION instead of PCI_REVISION_ID, so
it wasn't converted by commit 44c10138fd ("PCI: Change all
drivers to use pci_device->revision") before being moved to arch/x86/...
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/r/201107111901.39281.sshtylyov@ru.mvista.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix the printk_once() so that it actually prints (didn't print before
due to a stray comma.)
[ hpa: changed to an incremental patch and adjusted the description
accordingly. ]
Signed-off-by: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1107151732480.18606@x980
Cc: <table@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
handle_signal()->set_fs() has a nice comment which explains what
set_fs() is, but it doesn't explain why it is needed and why it
depends on CONFIG_X86_64.
Afaics, the history of this confusion is:
1. I guess today nobody can explain why it was needed
in arch/i386/kernel/signal.c, perhaps it was always
wrong. This predates 2.4.0 kernel.
2. then it was copy-and-past'ed to the new x86_64 arch.
3. then it was removed from i386 (but not from x86_64)
by b93b6ca3 "i386: remove unnecessary code".
4. then it was reintroduced under CONFIG_X86_64 when x86
unified i386 and x86_64, because the patch above didn't
touch x86_64.
Remove it. ->addr_limit should be correct. Even if it was possible
that it is wrong, it is too late to fix it after setup_rt_frame().
Linus commented in:
http://lkml.kernel.org/r/alpine.LFD.0.999.0707170902570.19166@woody.linux-foundation.org
... about the equivalent bit from i386:
Heh. I think it's entirely historical.
Please realize that the whole reason that function is called "set_fs()" is
that it literally used to set the %fs segment register, not
"->addr_limit".
So I think the "set_fs(USER_DS)" is there _only_ to match the other
regs->xds = __USER_DS;
regs->xes = __USER_DS;
regs->xss = __USER_DS;
regs->xcs = __USER_CS;
things, and never mattered. And now it matters even less, and has been
copied to all other architectures where it is just totally insane.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20110710164424.GA20261@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
1. do_signal() looks at TS_RESTORE_SIGMASK and calculates the
mask which should be stored in the signal frame, then it
passes "oldset" to the callees, down to setup_rt_frame().
This is ugly, setup_rt_frame() can do this itself and nobody
else needs this sigset_t. Move this code into setup_rt_frame.
2. do_signal() also clears TS_RESTORE_SIGMASK if handle_signal()
succeeds.
We can move this to setup_rt_frame() as well, this avoids the
unnecessary checks and makes the logic more clear.
3. use set_current_blocked() instead of sigprocmask(SIG_SETMASK),
sigprocmask() should be avoided.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20110710182203.GA27979@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
sys_sigsuspend() and sys_sigreturn() change ->blocked directly.
This is not correct, see the changelog in e6fa16ab
"signal: sigprocmask() should do retarget_shared_pending()"
Change them to use set_current_blocked().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20110710192727.GA31759@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Instead of hw_nmi_watchdog_set_attr() weak function
and appropriate x86_pmu::hw_watchdog_set_attr() call
we introduce even alias mechanism which allow us
to drop this routines completely and isolate quirks
of Netburst architecture inside P4 PMU code only.
The main idea remains the same though -- to allow
nmi-watchdog and perf top run simultaneously.
Note the aliasing mechanism applies to generic
PERF_COUNT_HW_CPU_CYCLES event only because arbitrary
event (say passed as RAW initially) might have some
additional bits set inside ESCR register changing
the behaviour of event and we can't guarantee anymore
that alias event will give the same result.
P.S. Thanks a huge to Don and Steven for for testing
and early review.
Acked-by: Don Zickus <dzickus@redhat.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Stephane Eranian <eranian@google.com>
CC: Lin Ming <ming.m.lin@intel.com>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20110708201712.GS23657@sun
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Since 2.6.36 (23016bf0d2), Linux prints the existence of "epb" in /proc/cpuinfo,
Since 2.6.38 (d5532ee7b4), the x86_energy_perf_policy(8) utility has
been available in-tree to update MSR_IA32_ENERGY_PERF_BIAS.
However, the typical BIOS fails to initialize the MSR, presumably
because this is handled by high-volume shrink-wrap operating systems...
Linux distros, on the other hand, do not yet invoke x86_energy_perf_policy(8).
As a result, WSM-EP, SNB, and later hardware from Intel will run in its
default hardware power-on state (performance), which assumes that users
care for performance at all costs and not for energy efficiency.
While that is fine for performance benchmarks, the hardware's intended default
operating point is "normal" mode...
Initialize the MSR to the "normal" by default during kernel boot.
x86_energy_perf_policy(8) is available to change the default after boot,
should the user have a different preference.
Signed-off-by: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1107140051020.18606@x980
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@kernel.org>
Other than sanity check and debug message, the x86 specific version of
memblock reserve/free functions are simple wrappers around the generic
versions - memblock_reserve/free().
This patch adds debug messages with caller identification to the
generic versions and replaces x86 specific ones and kills them.
arch/x86/include/asm/memblock.h and arch/x86/mm/memblock.c are empty
after this change and removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310462166-31469-14-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
memblock_find_dma_reserve() wants to find out how much memory is
reserved under MAX_DMA_PFN. memblock_x86_memory_[free_]in_range() are
used to find out the amounts of all available and free memory in the
area, which are then subtracted to find out the amount of reservation.
memblock_x86_memblock_[free_]in_range() are implemented using
__memblock_x86_memory_in_range() which builds ranges from memblock and
then count them, which is rather unnecessarily complex.
This patch open codes the counting logic directly in
memblock_find_dma_reserve() using memblock iterators and removes now
unused __memblock_x86_memory_in_range() and find_range_array().
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310462166-31469-11-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
setup_bios_corruption_check() and memtest do_one_pass() open code
memblock free area iteration using memblock_x86_find_in_range_size().
Convert them to use for_each_free_mem_range() instead.
This leaves memblock_x86_find_in_range_size() and
memblock_x86_check_reserved_size() unused. Kill them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310462166-31469-8-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
early_reserve_e820() implements its own ad-hoc early allocator using
memblock_x86_find_in_range_size(). Use __memblock_alloc_base()
instead and remove the unnecessary @startt parameter (it's top-down
allocation anyway).
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310462166-31469-6-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function. Besides
a steal time function, we also declare two jump_labels. They will be
used to allow the steal time code to be easily bypassed when not
in use.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Eric B Munson <emunson@mgebm.net>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
25818f0f28 (memblock: Make MEMBLOCK_ERROR be 0) thankfully made
MEMBLOCK_ERROR 0 and there already are codes which expect error return
to be 0. There's no point in keeping MEMBLOCK_ERROR around. End its
misery.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1310457490-3356-6-git-send-email-tj@kernel.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The vread field was bloating struct clocksource everywhere except
x86_64, and I want to change the way this works on x86_64, so let's
split it out into per-arch data.
Cc: x86@kernel.org
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: linux-ia64@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/3ae5ec76a168eaaae63f08a2a1060b91aa0b7759.1310563276.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Three fixes here:
- Send SIGSEGV if called from compat code or with a funny CS.
- Don't BUG on impossible addresses.
- Add a missing local_irq_disable.
This patch also removes an unused variable.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/6fb2b13ab39b743d1e4f466eef13425854912f7f.1310563276.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
When "apic=debug" is used as a boot parameter, Linux prints the IOAPIC routing
entries in "dmesg". Below is output from IOAPIC whose apic_id is 8:
# dmesg | grep "routing entry"
IOAPIC[8]: Set routing entry (8-1 -> 0x31 -> IRQ 1 Mode:0 Active:0 Dest:0)
IOAPIC[8]: Set routing entry (8-2 -> 0x30 -> IRQ 0 Mode:0 Active:0 Dest:0)
IOAPIC[8]: Set routing entry (8-3 -> 0x33 -> IRQ 3 Mode:0 Active:0 Dest:0)
...
Similarly, when IR (interrupt remapping) is enabled, and the IRTE
(interrupt remapping table entry) is set up we should display it.
After the fix:
# dmesg | grep IRTE
IOAPIC[8]: Set IRTE entry (P:1 FPD:0 Dst_Mode:0 Redir_hint:1 Trig_Mode:0 Dlvry_Mode:0 Avail:0 Vector:31 Dest:00000000 SID:00F1 SQ:0 SVT:1)
IOAPIC[8]: Set IRTE entry (P:1 FPD:0 Dst_Mode:0 Redir_hint:1 Trig_Mode:0 Dlvry_Mode:0 Avail:0 Vector:30 Dest:00000000 SID:00F1 SQ:0 SVT:1)
IOAPIC[8]: Set IRTE entry (P:1 FPD:0 Dst_Mode:0 Redir_hint:1 Trig_Mode:0 Dlvry_Mode:0 Avail:0 Vector:33 Dest:00000000 SID:00F1 SQ:0 SVT:1)
...
The IRTE is defined in Sec 9.5 of the Intel VT-d Specification.
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Link: http://lkml.kernel.org/r/20110712211704.2939.71291.sendpatchset@nchumbalkar.americas.cpqcorp.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The code in setup_ioapic_irq() determines the Destination Field,
so why not also include it in the debug printk output that gets
displayed when the boot parameter "apic=debug" is used.
Before the change, "dmesg" will show:
IOAPIC[0]: Set routing entry (8-1 -> 0x31 -> IRQ 1 Mode:0 Active:0)
IOAPIC[0]: Set routing entry (8-2 -> 0x30 -> IRQ 0 Mode:0 Active:0)
IOAPIC[0]: Set routing entry (8-3 -> 0x33 -> IRQ 3 Mode:0 Active:0) ...
After the change, you will see:
IOAPIC[0]: Set routing entry (8-1 -> 0x31 -> IRQ 1 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (8-2 -> 0x30 -> IRQ 0 Mode:0 Active:0 Dest:0)
IOAPIC[0]: Set routing entry (8-3 -> 0x33 -> IRQ 3 Mode:0 Active:0 Dest:0) ...
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Link: http://lkml.kernel.org/r/20110708184603.2734.91071.sendpatchset@nchumbalkar.americas.cpqcorp.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When IOAPIC data is displayed in "dmesg" with the help of the
boot parameter "apic=debug" certain values are not formatted
correctly wrt their size.
In the "dmesg" snippet below, note that the output for "max
redirection entries", and "IO APIC version" which are each
defined to be just 8-bits long are displayed as 2 bytes in
length. Similarly, "Dst" under the "IRQ redirection table"
should only be 8-bits long.
IO APIC #0......
...
...
.... register #01: 00170020
....... : max redirection entries: 0017
....... : PRQ implemented: 0
....... : IO APIC version: 0020
...
...
.... IRQ redirection table:
NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
00 000 1 0 0 0 0 0 0 00
01 000 0 0 0 0 0 0 0 31
02 000 0 0 0 0 0 0 0 30
03 000 1 0 0 0 0 0 0 33
...
...
Do some formatting clean up, so you will see output like below:
IO APIC #0......
...
...
.... register #01: 00170020
....... : max redirection entries: 17
....... : PRQ implemented: 0
....... : IO APIC version: 20
...
...
.... IRQ redirection table:
NR Dst Mask Trig IRR Pol Stat Dmod Deli Vect:
00 00 1 0 0 0 0 0 0 00
01 00 0 0 0 0 0 0 0 31
02 00 0 0 0 0 0 0 0 30
03 00 1 0 0 0 0 0 0 33
...
...
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Link: http://lkml.kernel.org/r/20110708184557.2734.61830.sendpatchset@nchumbalkar.americas.cpqcorp.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Detect Xen before HyperV because in Viridian compatibility mode Xen
presents itself as HyperV. Move Xen to the top since it seems more
likely that Xen would emulate VMware than vice versa.
Signed-off-by: Anupam Chanda <achanda@nicira.com>
Link: http://lkml.kernel.org/r/1310150570-26810-1-git-send-email-achanda@nicira.com
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
nr_cpus allows one to specify number of possible cpus in the system.
Current assumption seems to be that first cpu to show up is boot cpu
and this assumption will be broken in kdump scenario where we can be
booting on a non boot cpu with nr_cpus=1.
It might happen that first cpu we parse is not the cpu we boot on and
later we ignore boot cpu. Though code later seems to recognize this
anomaly and forcibly sets boot cpu in physical cpu map with following
warning.
if (!physid_isset(hard_smp_processor_id(), phys_cpu_present_map)) {
printk(KERN_WARNING
"weird, boot CPU (#%d) not listed by the BIOS.\n",
hard_smp_processor_id());
physid_set(hard_smp_processor_id(), phys_cpu_present_map);
}
This patch waits for boot cpu to show up and starts ignoring the cpus
once we have hit (nr_cpus - 1) number of cpus. So effectively we are
reserving one slot out of nr_cpus for boot cpu explicitly.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20110708171926.GF2930@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Some BIOSes will reset the Intel MISC_ENABLE MSR (specifically the
XD_DISABLE bit) when resuming from S3, which can interact poorly with
ebba638ae7. In 32bit PAE mode, this can
lead to a fault when EFER is restored by the kernel wakeup routines,
due to it setting the NX bit for a CPU that (thanks to the BIOS reset)
now incorrectly thinks it lacks the NX feature. (64bit is not affected
because it uses a common CPU bring-up that specifically handles the
XD_DISABLE bit.)
The need for MISC_ENABLE being restored so early is specific to the S3
resume path. Normally, MISC_ENABLE is saved in save_processor_state(),
but this happens after the resume header is created, so just reproduce
the logic here. (acpi_suspend_lowlevel() creates the header, calls
do_suspend_lowlevel, which calls save_processor_state(), so the saved
processor context isn't available during resume header creation.)
[ hpa: Consider for stable if OK in mainline ]
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Link: http://lkml.kernel.org/r/20110707011034.GA8523@outflux.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: <stable@kernel.org> 2.6.38+
Since git commit
660e34cebf x86: reorder reboot method
preferences,
my Acer Aspire One hangs on reboot. It appears that its ACPI method
for rebooting is broken. The attached patch adds a quirk so that the
machine will reboot via the BIOS.
[ hpa: verified that the ACPI control on this machine is just plain broken. ]
Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au>
Link: http://lkml.kernel.org/r/w439iki5vl.wl%25peter@chubb.wattle.id.au
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
rbp is used in SAVE_ARGS_IRQ to save the old stack pointer
in order to restore it later in ret_from_intr.
It is convenient because we save its value in the irq regs
and it's easily restored using the leave instruction.
However this is a kind of abuse of the frame pointer which
role is to help unwinding the kernel by chaining frames
together, each node following the return address to the
previous frame.
But although we are breaking the frame by changing the stack
pointer, there is no preceding return address before the new
frame. Hence using the frame pointer to link the two stacks
breaks the stack unwinders that find a random value instead of
a return address here.
There is no workaround that can work in every case. We are using
the fixup_bp_irq_link() function to dereference that abused frame
pointer in the case of non nesting interrupt (which means stack
changed).
But that doesn't fix the case of interrupts that don't change the
stack (but we still have the unconditional frame link), which is
the case of hardirq interrupting softirq. We have no way to detect
this transition so the frame irq link is considered as a real frame
pointer and the return address is dereferenced but it is still a
spurious one.
There are two possible results of this: either the spurious return
address, a random stack value, luckily belongs to the kernel text
and then the unwinding can continue and we just have a weird entry
in the stack trace. Or it doesn't belong to the kernel text and
unwinding stops there.
This is the reason why stacktraces (including perf callchains) on
irqs that interrupted softirqs don't work very well.
To solve this, we don't save the old stack pointer on rbp anymore
but we save it to a scratch register that we push on the new
stack and that we pop back later on irq return.
This preserves the whole frame chain without spurious return addresses
in the middle and drops the need for the horrid fixup_bp_irq_link()
workaround.
And finally irqs that interrupt softirq are sanely unwinded.
Before:
99.81% perf [kernel.kallsyms] [k] perf_pending_event
|
--- perf_pending_event
irq_work_run
smp_irq_work_interrupt
irq_work_interrupt
|
|--41.60%-- __read
| |
| |--99.90%-- create_worker
| | bench_sched_messaging
| | cmd_bench
| | run_builtin
| | main
| | __libc_start_main
| --0.10%-- [...]
After:
1.64% swapper [kernel.kallsyms] [k] perf_pending_event
|
--- perf_pending_event
irq_work_run
smp_irq_work_interrupt
irq_work_interrupt
|
|--95.00%-- arch_irq_work_raise
| irq_work_queue
| __perf_event_overflow
| perf_swevent_overflow
| perf_swevent_event
| perf_tp_event
| perf_trace_softirq
| __do_softirq
| call_softirq
| do_softirq
| irq_exit
| |
| |--73.68%-- smp_apic_timer_interrupt
| | apic_timer_interrupt
| | |
| | |--96.43%-- amd_e400_idle
| | | cpu_idle
| | | start_secondary
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
The unwinder backlink in interrupt entry is very useless.
It's actually not part of the stack frame chain and thus is
never used.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Just for clarity in the code. Have a first block that handles
the frame pointer and a separate one that handles pt_regs
pointer and its use.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
The save_regs function that saves the regs on low level
irq entry is complicated because of the fact it changes
its stack in the middle and also because it manipulates
data allocated in the caller frame and accesses there
are directly calculated from callee rsp value with the
return address in the middle of the way.
This complicates the static stack offsets calculation and
require more dynamic ones. It also needs a save/restore
of the function's return address.
To simplify and optimize this, turn save_regs() into a
macro.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
When regs are passed to dump_stack(), we fetch the frame
pointer from the regs but the stack pointer is taken from
the current frame.
Thus the frame and stack pointers may not come from the same
context. For example this can result in the unwinder to
think the context is in irq, due to the current value of
the stack, but the frame pointer coming from the regs points
to a frame from another place. It then tries to fix up
the irq link but ends up dereferencing a random frame
pointer that doesn't belong to the irq stack:
[ 9131.706906] ------------[ cut here ]------------
[ 9131.707003] WARNING: at arch/x86/kernel/dumpstack_64.c:129 dump_trace+0x2aa/0x330()
[ 9131.707003] Hardware name: AMD690VM-FMH
[ 9131.707003] Perf: bad frame pointer = 0000000000000005 in callchain
[ 9131.707003] Modules linked in:
[ 9131.707003] Pid: 1050, comm: perf Not tainted 3.0.0-rc3+ #181
[ 9131.707003] Call Trace:
[ 9131.707003] <IRQ> [<ffffffff8104bd4a>] warn_slowpath_common+0x7a/0xb0
[ 9131.707003] [<ffffffff8104be21>] warn_slowpath_fmt+0x41/0x50
[ 9131.707003] [<ffffffff8178b873>] ? bad_to_user+0x6d/0x10be
[ 9131.707003] [<ffffffff8100c2da>] dump_trace+0x2aa/0x330
[ 9131.707003] [<ffffffff810107d3>] ? native_sched_clock+0x13/0x50
[ 9131.707003] [<ffffffff8101b164>] perf_callchain_kernel+0x54/0x70
[ 9131.707003] [<ffffffff810d391f>] perf_prepare_sample+0x19f/0x2a0
[ 9131.707003] [<ffffffff810d546c>] __perf_event_overflow+0x16c/0x290
[ 9131.707003] [<ffffffff810d5430>] ? __perf_event_overflow+0x130/0x290
[ 9131.707003] [<ffffffff810107d3>] ? native_sched_clock+0x13/0x50
[ 9131.707003] [<ffffffff8100fbb9>] ? sched_clock+0x9/0x10
[ 9131.707003] [<ffffffff810752e5>] ? T.375+0x15/0x90
[ 9131.707003] [<ffffffff81084da4>] ? trace_hardirqs_on_caller+0x64/0x180
[ 9131.707003] [<ffffffff810817bd>] ? trace_hardirqs_off+0xd/0x10
[ 9131.707003] [<ffffffff810d5764>] perf_event_overflow+0x14/0x20
[ 9131.707003] [<ffffffff810d588c>] perf_swevent_hrtimer+0x11c/0x130
[ 9131.707003] [<ffffffff817821a1>] ? error_exit+0x51/0xb0
[ 9131.707003] [<ffffffff81072e93>] __run_hrtimer+0x83/0x1e0
[ 9131.707003] [<ffffffff810d5770>] ? perf_event_overflow+0x20/0x20
[ 9131.707003] [<ffffffff81073256>] hrtimer_interrupt+0x106/0x250
[ 9131.707003] [<ffffffff812a3bfd>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 9131.707003] [<ffffffff81024833>] smp_apic_timer_interrupt+0x53/0x90
[ 9131.707003] [<ffffffff81789053>] apic_timer_interrupt+0x13/0x20
[ 9131.707003] <EOI> [<ffffffff817821a1>] ? error_exit+0x51/0xb0
[ 9131.707003] [<ffffffff8178219c>] ? error_exit+0x4c/0xb0
[ 9131.707003] ---[ end trace b2560d4876709347 ]---
Fix this by simply taking the stack pointer from regs->sp
when regs are provided.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
This code uses PCI_CLASS_REVISION instead of PCI_REVISION_ID, so
it wasn't converted by commit 44c10138fd ("PCI: Change all
drivers to use pci_device->revision") before being moved to
arch/x86/...
Do it now at last -- and save one level of indentation...
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/201107012242.08347.sshtylyov@ru.mvista.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The v1 PMU does not have any fixed counters. Using the v2 constraints,
which do have fixed counters, causes an additional choice to be present
in the weight calculation, but not when actually scheduling the event,
leading to an event being not scheduled at all.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1309362157-6596-3-git-send-email-avi@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The perf_event overflow handler does not receive any caller-derived
argument, so many callers need to resort to looking up the perf_event
in their local data structure. This is ugly and doesn't scale if a
single callback services many perf_events.
Fix by adding a context parameter to perf_event_create_kernel_counter()
(and derived hardware breakpoints APIs) and storing it in the perf_event.
The field can be accessed from the callback as event->overflow_handler_context.
All callers are updated.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a NODE level to the generic cache events which is used to measure
local vs remote memory accesses. Like all other cache events, an
ACCESS is HIT+MISS, if there is no way to distinguish between reads
and writes do reads only etc..
The below needs filling out for !x86 (which I filled out with
unsupported events).
I'm fairly sure ARM can leave it like that since it doesn't strike me as
an architecture that even has NUMA support. SH might have something since
it does appear to have some NUMA bits.
Sparc64, PowerPC and MIPS certainly want a good look there since they
clearly are NUMA capable.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Miller <davem@davemloft.net>
Cc: Anton Blanchard <anton@samba.org>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1303508226.4865.8.camel@laptop
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the OFFCORE registers are fully symmetric, try the other one
when the specified one is already in use.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1306141897.18455.8.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds Intel Sandy Bridge offcore_response support by
providing the low-level constraint table for those events.
On Sandy Bridge, there are two offcore_response events. Each uses
its own dedictated extra register. But those registers are NOT shared
between sibling CPUs when HT is on unlike Nehalem/Westmere. They are
always private to each CPU. But they still need to be controlled within
an event group. All events within an event group must use the same
value for the extra MSR. That's not controlled by the second patch in
this series.
Furthermore on Sandy Bridge, the offcore_response events have NO
counter constraints contrary to what the official documentation
indicates, so drop the events from the contraint table.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110606145712.GA7304@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The validate_group() function needs to validate events with
extra shared regs. Within an event group, only events with
the same value for the extra reg can co-exist. This was not
checked by validate_group() because it was missing the
shared_regs logic.
This patch changes the allocation of the fake cpuc used for
validation to also point to a fake shared_regs structure such
that group events be properly testing.
It modifies __intel_shared_reg_get_constraints() to use
spin_lock_irqsave() to avoid lockdep issues.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110606145708.GA7279@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch improves the code managing the extra shared registers
used for offcore_response events on Intel Nehalem/Westmere. The
idea is to use static allocation instead of dynamic allocation.
This simplifies greatly the get and put constraint routines for
those events.
The patch also renames per_core to shared_regs because the same
data structure gets used whether or not HT is on. When HT is
off, those events still need to coordination because they use
a extra MSR that has to be shared within an event group.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110606145703.GA7258@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since only samples call perf_output_sample() its much saner (and more
correct) to put the sample logic in there than in the
perf_output_begin()/perf_output_end() pair.
Saves a useless argument, reduces conditionals and shrinks
struct perf_output_handle, win!
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The nmi parameter indicated if we could do wakeups from the current
context, if not, we would set some state and self-IPI and let the
resulting interrupt do the wakeup.
For the various event classes:
- hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
the PMI-tail (ARM etc.)
- tracepoint: nmi=0; since tracepoint could be from NMI context.
- software: nmi=[0,1]; some, like the schedule thing cannot
perform wakeups, and hence need 0.
As one can see, there is very little nmi=1 usage, and the down-side of
not using it is that on some platforms some software events can have a
jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).
The up-side however is that we can remove the nmi parameter and save a
bunch of conditionals in fast paths.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Due to restriction and specifics of Netburst PMU we need a separated
event for NMI watchdog. In particular every Netburst event
consumes not just a counter and a config register, but also an
additional ESCR register.
Since ESCR registers are grouped upon counters (i.e. if ESCR is occupied
for some event there is no room for another event to enter until its
released) we need to pick up the "least" used ESCR (or the most available
one) for nmi-watchdog purposes -- so MSR_P4_CRU_ESCR2/3 was chosen.
With this patch nmi-watchdog and perf top should be able to run simultaneously.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Lin Ming <ming.m.lin@intel.com>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
Tested-and-reviewed-by: Don Zickus <dzickus@redhat.com>
Tested-and-reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110623124918.GC13050@sun
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the hysterical outb/inb_pit defines and use outb_p/inb_p in the
code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20110609130622.348437125@linutronix.de
Before check_fpu() is called, we have cr0.TS bit set and hence the floating
point code to check the FDIV bug was generating a DNA exception.
Use kernel_fpu_begin()/kernel_fpu_end() around the floating point
code to avoid this unnecessary device not available exception during
boot.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1309479572.2665.1372.camel@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
MTRR rendezvous sequence is not implemened using stop_machine() before, as this
gets called both from the process context aswell as the cpu online paths
(where the cpu has not come online and the interrupts are disabled etc).
Now that we have a new stop_machine_from_inactive_cpu() API, use it for
rendezvous during mtrr init of a logical processor that is coming online.
For the rest (runtime MTRR modification, system boot, resume paths), use
stop_machine() to implement the rendezvous sequence. This will consolidate and
cleanup the code.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/20110623182057.076997177@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The APB timers are an IP block from Synopsys (DesignWare APB timers)
and are also found in other systems including ARM SoC's. This patch
adds functions for creating clock_event_devices and clocksources from
APB timers but does not do the resource allocation. This is handled
in a higher layer to allow the timers to be created from multiple
methods such as platform_devices.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
MTRR rendezvous sequence using stop_one_cpu_nowait() can potentially
happen in parallel with another system wide rendezvous using
stop_machine(). This can lead to deadlock (The order in which
works are queued can be different on different cpu's. Some cpu's
will be running the first rendezvous handler and others will be running
the second rendezvous handler. Each set waiting for the other set to join
for the system wide rendezvous, leading to a deadlock).
MTRR rendezvous sequence is not implemented using stop_machine() as this
gets called both from the process context aswell as the cpu online paths
(where the cpu has not come online and the interrupts are disabled etc).
stop_machine() works with only online cpus.
For now, take the stop_machine mutex in the MTRR rendezvous sequence that
gets called from an online cpu (here we are in the process context
and can potentially sleep while taking the mutex). And the MTRR rendezvous
that gets triggered during cpu online doesn't need to take this stop_machine
lock (as the stop_machine() already ensures that there is no cpu hotplug
going on in parallel by doing get_online_cpus())
TBD: Pursue a cleaner solution of extending the stop_machine()
infrastructure to handle the case where the calling cpu is
still not online and use this for MTRR rendezvous sequence.
fixes: https://bugzilla.novell.com/show_bug.cgi?id=672008
Reported-by: Vadim Kotelnikov <vadimuzzz@inbox.ru>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/20110623182056.807230326@sbsiddha-MOBL3.sc.intel.com
Cc: stable@kernel.org # 2.6.35+, backport a week or two after this gets more testing in mainline
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Remove linux/mm.h inclusion from netdevice.h -- it's unused (I've checked manually).
To prevent mm.h inclusion via other channels also extract "enum dma_data_direction"
definition into separate header. This tiny piece is what gluing netdevice.h with mm.h
via "netdevice.h => dmaengine.h => dma-mapping.h => scatterlist.h => mm.h".
Removal of mm.h from scatterlist.h was tried and was found not feasible
on most archs, so the link was cutoff earlier.
Hope people are OK with tiny include file.
Note, that mm_types.h is still dragged in, but it is a separate story.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A few parts of the driver were missing in drivers/iommu.
Move them there to have the complete driver in that
directory.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This should ease finding similarities with different platforms,
with the intention of solving problems once in a generic framework
which everyone can use.
Compile-tested on x86_64.
Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The ucode size check has to take the section header size into account
too when sanity checking the section length. Shorten and clarify define
names, while at it.
Caught-by: Ben Hutchings <ben@decadent.org.uk>
Link: http://lkml.kernel.org/r/1302752223.5282.674.camel@localhost
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
There are many functions named mce_* so use a new prefix for the subset
of functions related to sysfs support.
And since f3c6ea1b06 introduces
syscore_ops, use the prefix mce_syscore for some functions related to
power management which were in sysdev_class before.
Before: After:
mce_device mce_sysdev
mce_sysclass mce_sysdev_class
mce_attrs mce_sysdev_attrs
mce_dev_initialized mce_sysdev_initialized
mce_create_device mce_sysdev_create
mce_remove_device mce_sysdev_remove
mce_suspend mce_syscore_suspend
mce_shutdown mce_syscore_shutdown
mce_resume mce_syscore_resume
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED81B.8020506@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
There are many functions named mce_* so use a new prefix for the subset
of functions dealing with the character device /dev/mcelog.
This change doesn't impact the mce-inject module because the exported
symbol mce_chrdev_ops already has the prefix, therefore it is left
unchanged.
Before: After:
mce_wait mce_chrdev_wait
mce_state_lock mce_chrdev_state_lock
open_count mce_chrdev_open_count
open_exclu mce_chrdev_open_exclu
mce_open mce_chrdev_open
mce_release mce_chrdev_release
mce_read_mutex mce_chrdev_read_mutex
mce_read mce_chrdev_read
mce_poll mce_chrdev_poll
mce_ioctl mce_chrdev_ioctl
mce_log_device mce_chrdev_device
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED7CD.3040500@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Use a temporary local variable m to simplify the code. No change in
logic.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED7A8.8020307@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Use temporary local variable sysdev to simplify the code. No change in
logic.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED777.7080205@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Because "ancient CPUs" like p5 and winchip don't have X86_FEATURE_MCA
(I suppose so), mcheck_cpu_init() on such CPUs will return at check of
mce_available() after __mcheck_cpu_ancient_init().
It is hard to know this implicit behavior without knowing the CPUs
well. So make it clear that we leave mcheck_cpu_init() when the CPU is
initialized in __mcheck_cpu_ancient_init().
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED74B.20502@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This patch introduces mce_gather_info() which is to be called at the
beginning of error handling and gathers minimum error information from
proper error registers (and saved registers).
As the result of mce_get_rip() is integrated, unnecessary zeroing
is removed. This also takes care of saving RIP which is required to
make some decision about error severity for SRAR errors, instead of
retrieving it later in the handler.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED71A.1060906@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The MCE handler uses a special vector for self IPI to invoke
post-emergency processing in an interrupt context, e.g. call an
NMI-unsafe function, wakeup loggers, schedule time-consuming work for
recovery, etc.
This mechanism is now generalized by the following commit:
> e360adbe29
> Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Thu Oct 14 14:01:34 2010 +0800
>
> irq_work: Add generic hardirq context callbacks
>
> Provide a mechanism that allows running code in IRQ context. It is
> most useful for NMI code that needs to interact with the rest of the
> system -- like wakeup a task to drain buffers.
:
So change to use provided generic mechanism.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED6B2.6080005@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
More specifically:
- sort bits in the macros
- use BITCLR/BITSET
- coordinate message pattern
- use m for struct mce
- cleanup for severities_debugfs_init()
No functional change.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED679.9090503@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The current format of an item in this table is:
condition(param, ..., level, message [, condition2 ...])
So we have to check both an item's head and tail to find the conditions
which match the item.
Format them in a more straight forward manner:
item(level, message, condition [, condition2 ...])
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED61F.5010502@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The table looks very complicated and hard to read for people other than
skilled developers. So let's clean it up a bit. At first, change format
to ease reading elements in the table.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4DEED5EB.6050400@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The "Spurious not enabled" entry is redundant: the "Not enabled" entry
earlier in the table will cover this case.
The "Action required; unknown MCACOD" entry shouldn't specify MCACOD in
the .mask field. Current code will only match for mcacod==0 rather than
all AR=1 entries.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/4DEED5BC.8030703@jp.fujitsu.com
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Both the equivalence table and the microcode patch types are u32. Access
them properly through the buf-ptr.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Swap the 1st and 2nd parameters of save_stack_trace_regs()
as same as the parameters of save_stack_trace_tsk().
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Namhyung Kim <namhyung@gmail.com>
Link: http://lkml.kernel.org/r/20110608070921.17777.31103.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
[ Also from Ben Hutchings <ben@decadent.org.uk> and Vitaliy Ivanov
<vitalivanov@gmail.com> ]
Commit 06ae40ce07 ("x86 idle: EXPORT_SYMBOL(default_idle, pm_idle)
only when APM demands it") removed the export for pm_idle/default_idle
unless the apm module was modularised and CONFIG_APM_CPU_IDLE was set.
But the apm module uses pm_idle/default_idle unconditionally,
CONFIG_APM_CPU_IDLE only affects the bios idle threshold. Adjust the
export accordingly.
[ Used #ifdef instead of #if defined() as it's shorter, and what both
Ben and Vitaliy used.. Andy, you're out-voted ;) - Linus ]
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm: Compare only lower 32 bits of framebuffer map offsets
drm/i915: Don't leak in i915_gem_shmem_pread_slow()
drm/radeon/kms: do bounds checking for 3D_LOAD_VBPNTR and bump array limit
drm/radeon/kms: fix mac g5 quirk
x86/uv/x2apic: update for change in pci bridge handling.
alpha, drm: Remove obsolete Alpha support in MGA DRM code
alpha/drm: Cleanup Alpha support in DRM generic code
savage: remove unnecessary if statement
drm/radeon: fix GUI idle IH debug statements
drm/radeon/kms: check modes against max pixel clock
drm: fix fbs in DRM_IOCTL_MODE_GETRESOURCES ioctl
This finally allows PCI-Device-IDs to be handled by the
IOMMU driver that have no corresponding struct device
present in the system.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Search for existing dev_data first will allow to switch
dev_data->alias to just store dev_data instead of struct
device.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
With this patch the low-level attach/detach functions only
work on dev_data structures. This allows to remove the
dev_data->dev pointer.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch make the functions flushing the DTE and IOTLBs
only take the dev_data structure instead of the struct
device directly.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This list keeps all allocated iommu_dev_data structs in a
list together. This is needed for instances that have no
associated device.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Remove these function calls from places where the function
has already been called by another function.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When I added 3448a19da4
I forgot about the special uv handling code for this, so this
patch fixes it up.
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Ingo Molnar
Signed-off-by: Dave Airlie <airlied@redhat.com>
Several fixes as well where the +1 was missing.
Done via coccinelle scripts like:
@@
struct resource *ptr;
@@
- ptr->end - ptr->start + 1
+ resource_size(ptr)
and some grep and typing.
Mostly uncompiled, no cross-compilers.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Unconditionally changing the address limit to USER_DS and not restoring
it to its old value in the error path is wrong because it prevents us
using kernel memory on repeated calls to this function. This, in fact,
breaks the fallback of hard coded paths to the init program from being
ever successful if the first candidate fails to load.
With this patch applied switching to USER_DS is delayed until the point
of no return is reached which makes it possible to have a multi-arch
rootfs with one arch specific init binary for each of the (hard coded)
probed paths.
Since the address limit is already set to USER_DS when start_thread()
will be invoked, this redundancy can be safely removed.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch fixes the following build failure:
drivers/built-in.o: In function `early_init_dt_check_for_initrd':
/home/florian/dev/kernel/x86/linux-2.6-x86/drivers/of/fdt.c:571:
undefined reference to `early_init_dt_setup_initrd_arch'
make: *** [.tmp_vmlinux1] Error 1
which happens as soon as we enable initrd support on a x86 devicetree
platform such as Intel CE4100.
Signed-off-by: Florian Fainelli <ffainelli@freebox.fr>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Maxime Bizon <mbizon@freebox.fr>
Acked-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: stable@kernel.org # 2.6.39
Link: http://lkml.kernel.org/r/201106061015.50039.ffainelli@freebox.fr
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There are multiple declarations of global_clock_event in header files
specific to particular clock event implementations. Consolidate them
in <asm/time.h> and make sure all users include that header.
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Venkatesh Pallipadi (Venki) <venki@google.com>
Link: http://lkml.kernel.org/r/20110601180610.762763451@duck.linux-mips.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move them to drivers/clocksource/i8253.c and remove the
implementations in arch/
[ tglx: Avoid the extra file in lib - folded arch patches in. The
export will become conditional in a later step ]
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Link: http://lkml.kernel.org/r/20110601180610.221426078@duck.linux-mips.net
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
After a newly plugged CPU sets the cpu_online bit it enables
interrupts and goes idle. The cpu which brought up the new cpu waits
for the cpu_online bit and when it observes it, it sets the cpu_active
bit for this cpu. The cpu_active bit is the relevant one for the
scheduler to consider the cpu as a viable target.
With forced threaded interrupt handlers which imply forced threaded
softirqs we observed the following race:
cpu 0 cpu 1
bringup(cpu1);
set_cpu_online(smp_processor_id(), true);
local_irq_enable();
while (!cpu_online(cpu1));
timer_interrupt()
-> wake_up(softirq_thread_cpu1);
-> enqueue_on(softirq_thread_cpu1, cpu0);
^^^^
cpu_notify(CPU_ONLINE, cpu1);
-> sched_cpu_active(cpu1)
-> set_cpu_active((cpu1, true);
When an interrupt happens before the cpu_active bit is set by the cpu
which brought up the newly onlined cpu, then the scheduler refuses to
enqueue the woken thread which is bound to that newly onlined cpu on
that newly onlined cpu due to the not yet set cpu_active bit and
selects a fallback runqueue. Not really an expected and desirable
behaviour.
So far this has only been observed with forced hard/softirq threading,
but in theory this could happen without forced threaded hard/softirqs
as well. It's probably unobservable as it would take a massive
interrupt storm on the newly onlined cpu which causes the softirq loop
to wake up the softirq thread and an even longer delay of the cpu
which waits for the cpu_online bit.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@kernel.org # 2.6.39
Instead of walking the whole PCI tree to update the of_node's for
PCI busses and devices after the fact, enable the new generic core
code for doing so by providing the proper device nodes for the
PCI host bridges
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Some PCIe cards ship with a PCI-PCIe bridge which is not
visible as a PCI device in Linux. But the device-id of the
bridge is present in the IOMMU tables which causes a boot
crash in the IOMMU driver.
This patch fixes by removing these cards from the IOMMU
handling. This is a pure -stable fix, a real fix to handle
this situation appriatly will follow for the next merge
window.
Cc: stable@kernel.org # > 2.6.32
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
There's a fair amount of code in the vsyscall page. It contains
a syscall instruction (in the gettimeofday fallback) and who
knows what will happen if an exploit jumps into the middle of
some other code.
Reduce the risk by replacing the vsyscalls with short magic
incantations that cause the kernel to emulate the real
vsyscalls. These incantations are useless if entered in the
middle.
This causes vsyscalls to be a little more expensive than real
syscalls. Fortunately sensible programs don't use them.
The only exception is time() which is still called by glibc
through the vsyscall - but calling time() millions of times
per second is not sensible. glibc has this fixed in the
development tree.
This patch is not perfect: the vread_tsc and vread_hpet
functions are still at a fixed address. Fixing that might
involve making alternative patching work in the vDSO.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Mikael Pettersson <mikpe@it.uu.se>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: pageexec@freemail.hu
Link: http://lkml.kernel.org/r/e64e1b3c64858820d12c48fa739efbd1485e79d5.1307292171.git.luto@mit.edu
[ Removed the CONFIG option - it's simpler to just do it unconditionally. Tidied up the code as well. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unfortunatly there are systems where the AMD IOMMU does not
cover all devices. This breaks with the current driver as it
initializes the global dma_ops variable. This patch limits
the AMD IOMMU to the devices listed in the IVRS table fixing
DMA for devices not covered by the IOMMU.
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The driver contains several loops counting on an u16 value
where the exit-condition is checked against variables that
can have values up to 0xffff. In this case the loops will
never exit. This patch fixed 3 such loops.
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Jumping to 0x00 might do something depending on the following
bytes. Jumping to 0xcc is a trap. So fill the unused parts of
the vsyscall page with 0xcc to make it useless for exploits to
jump there.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Mikael Pettersson <mikpe@it.uu.se>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: pageexec@freemail.hu
Link: http://lkml.kernel.org/r/ed54bfcfbe50a9070d20ec1edbe0d149e22a4568.1307292171.git.luto@mit.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It just segfaults since April 2008 (a4928cff), so I'm pretty
sure that nothing uses it. And having an empty section makes
the linker script a bit fragile.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Mikael Pettersson <mikpe@it.uu.se>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: pageexec@freemail.hu
Link: http://lkml.kernel.org/r/4a4abcf47ecadc269f2391a313576fe6d06acef7.1307292171.git.luto@mit.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the HPET mapping is a user-accessible syscall
instruction at a fixed address some of the time.
A sufficiently determined hacker might be able to guess when.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Mikael Pettersson <mikpe@it.uu.se>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: pageexec@freemail.hu
Link: http://lkml.kernel.org/r/ab41b525a4ca346b1ca1145d16fb8d181861a8aa.1307292171.git.luto@mit.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It's unnecessary overhead in code that's supposed to be highly
optimized. Removing it allows us to remove one of the two
syscall instructions in the vsyscall page.
The only sensible use for it is for UML users, and it doesn't
fully address inconsistent vsyscall results on UML. The real
fix for UML is to stop using vsyscalls entirely.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Mikael Pettersson <mikpe@it.uu.se>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: pageexec@freemail.hu
Link: http://lkml.kernel.org/r/973ae803fe76f712da4b2740e66dccf452d3b1e4.1307292171.git.luto@mit.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move vvars out of the vsyscall page into their own page and mark
it NX.
Without this patch, an attacker who can force a daemon to call
some fixed address could wait until the time contains, say,
0xCD80, and then execute the current time.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Jesper Juhl <jj@chaosbits.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: Mikael Pettersson <mikpe@it.uu.se>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Louis Rilling <Louis.Rilling@kerlabs.com>
Cc: Valdis.Kletnieks@vt.edu
Cc: pageexec@freemail.hu
Link: http://lkml.kernel.org/r/b1460f81dc4463d66ea3f2b5ce240f58d48effec.1307292171.git.luto@mit.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The blacklist was added in response to my bug report
(http://lkml.org/lkml/2006/1/19/362) and has never
contained more than the one entry describing my old
now dead ThinkPad 380XD laptop. As found out later
(http://lkml.org/lkml/2007/11/29/50), this special
treatment has been unnecessary for a long time, so
it can be removed.
Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix mwait_play_dead() faulting on mwait-incapable cpus
x86 idle: Fix mwait deprecation warning message
Evil merge to remove extra quote noticed by Joe Perches
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Put back -pg to tsc.o and add no GCOV to vread_tsc_64.o
A logic error in mwait_play_dead() causes the kernel to use
mwait even on cpus which don't support it, such as KVM virtual
cpus.
Introduced by:
349c004e3d: x86: A fast way to check capabilities of the current cpu
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=36222
Reported-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1306758237-9327-1-git-send-email-avi@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
x86 idle: deprecate "no-hlt" cmdline param
x86 idle APM: deprecate CONFIG_APM_CPU_IDLE
x86 idle floppy: deprecate disable_hlt()
x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
x86 idle: clarify AMD erratum 400 workaround
idle governor: Avoid lock acquisition to read pm_qos before entering idle
cpuidle: menu: fixed wrapping timers at 4.294 seconds
mwait_idle() is a C1-only idle loop intended to be more efficient
than HLT on SMP hardware that supports it.
But mwait_idle() has been replaced by the more general
mwait_idle_with_hints(), which handles both C1 and deeper C-states.
ACPI uses only mwait_idle_with_hints(), and never uses mwait_idle().
Deprecate mwait_idle() and the "idle=mwait" cmdline param
to simplify the x86 idle code.
After this change, kernels configured with
(!CONFIG_ACPI=n && !CONFIG_INTEL_IDLE=n) when run on hardware
that support MWAIT will simply use HLT. If MWAIT is desired
on those systems, cpuidle and the cpuidle drivers above
can be used.
cc: x86@kernel.org
cc: stable@kernel.org # .39.x
Signed-off-by: Len Brown <len.brown@intel.com>
We'd rather that modern machines not check if HLT works on
every entry into idle, for the benefit of machines that had
marginal electricals 15-years ago. If those machines are still running
the upstream kernel, they can use "idle=poll". The only difference
will be that they'll now invoke HLT in machine_hlt().
cc: x86@kernel.org # .39.x
Signed-off-by: Len Brown <len.brown@intel.com>
We don't want to export the pm_idle function pointer to modules.
Currently CONFIG_APM_CPU_IDLE w/ CONFIG_APM_MODULE forces us to.
CONFIG_APM_CPU_IDLE is of dubious value, it runs only on 32-bit
uniprocessor laptops that are over 10 years old. It calls into
the BIOS during idle, and is known to cause a number of machines
to fail.
Removing CONFIG_APM_CPU_IDLE and will allow us to stop exporting
pm_idle. Any systems that were calling into the APM BIOS
at run-time will simply use HLT instead.
cc: x86@kernel.org
cc: Jiri Kosina <jkosina@suse.cz>
cc: stable@kernel.org # .39.x
Signed-off-by: Len Brown <len.brown@intel.com>
In the long run, we don't want default_idle() or (pm_idle)() to
be exported outside of process.c. Start by not exporting them
to modules, unless the APM build demands it.
cc: x86@kernel.org
cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
The workaround for AMD erratum 400 uses the term "c1e" falsely suggesting:
1. Intel C1E is somehow involved
2. All AMD processors with C1E are involved
Use the string "amd_c1e" instead of simply "c1e" to clarify that
this workaround is specific to AMD's version of C1E.
Use the string "e400" to clarify that the workaround is specific
to AMD processors with Erratum 400.
This patch is text-substitution only, with no functional change.
cc: x86@kernel.org
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, asm: Clean up desc.h a bit
x86, amd: Do not enable ARAT feature on AMD processors below family 0x12
x86: Move do_page_fault()'s error path under unlikely()
x86, efi: Retain boot service code until after switching to virtual mode
x86: Remove unnecessary check in detect_ht()
x86: Reorder mm_context_t to remove x86_64 alignment padding and thus shrink mm_struct
x86, UV: Clean up uv_tlb.c
x86, UV: Add support for SGI UV2 hub chip
x86, cpufeature: Update CPU feature RDRND to RDRAND
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
perf: Fix SIGIO handling
perf top: Don't stop if no kernel symtab is found
perf top: Handle kptr_restrict
perf top: Remove unused macro
perf events: initialize fd array to -1 instead of 0
perf tools: Make sure kptr_restrict warnings fit 80 col terms
perf tools: Fix build on older systems
perf symbols: Handle /proc/sys/kernel/kptr_restrict
perf: Remove duplicate headers
ftrace: Add internal recursive checks
tracing: Update btrfs's tracepoints to use u64 interface
tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine
ftrace: Set ops->flag to enabled even on static function tracing
tracing: Have event with function tracer check error return
ftrace: Have ftrace_startup() return failure code
jump_label: Check entries limit in __jump_label_update
ftrace/recordmcount: Avoid STT_FUNC symbols as base on ARM
scripts/tags.sh: Add magic for trace-events for etags too
scripts/tags.sh: Fix ctags for DEFINE_EVENT()
x86/ftrace: Fix compiler warning in ftrace.c
...
32bit and 64bit on x86 are tested and working. The rest I have looked
at closely and I can't find any problems.
setns is an easy system call to wire up. It just takes two ints so I
don't expect any weird architecture porting problems.
While doing this I have noticed that we have some architectures that are
very slow to get new system calls. cris seems to be the slowest where
the last system calls wired up were preadv and pwritev. avr32 is weird
in that recvmmsg was wired up but never declared in unistd.h. frv is
behind with perf_event_open being the last syscall wired up. On h8300
the last system call wired up was epoll_wait. On m32r the last system
call wired up was fallocate. mn10300 has recvmmsg as the last system
call wired up. The rest seem to at least have syncfs wired up which was
new in the 2.6.39.
v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch
v5: ported to v2.6.39-rc6
v6: rebased onto parisc-next and net-next to avoid syscall conflicts.
v7: ported to Linus's latest post 2.6.39 tree.
> arch/blackfin/include/asm/unistd.h | 3 ++-
> arch/blackfin/mach-common/entry.S | 1 +
Acked-by: Mike Frysinger <vapier@gentoo.org>
Oh - ia64 wiring looks good.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The commit 44259b1abf
Author: Andy Lutomirski <luto@MIT.EDU>
x86-64: Move vread_tsc into a new file with sensible options
Removed the -pg from tsc.o which caused the function graph tracer
to go into an infinite function call recursion as it uses the tsc
internally outside its recursion protection, thus tracing the tsc
breaks the function graph tracer.
This commit also added the file vread_tsc_64.c that gets used
by vdso but failed to prevent GCOV from monkeying with it,
causing userspace to try to access kernel data when GCOV was
enabled.
Thanks to Thomas Gleixner for pointing out GCOV as the likely
culprit that added strange kernel accesses into the vread_tsc()
call.
Cc: Author: Andy Lutomirski <luto@MIT.EDU>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: vdso: Remove unused variable
x86-64: Optimize vDSO time()
x86-64: Add time to vDSO
x86-64: Turn off -pg and turn on -foptimize-sibling-calls for vDSO
x86-64: Move vread_tsc into a new file with sensible options
x86-64: Vclock_gettime(CLOCK_MONOTONIC) can't ever see nsec < 0
x86-64: Don't generate cmov in vread_tsc
x86-64: Remove unnecessary barrier in vread_tsc
x86-64: Clean up vdso/kernel shared variables
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
seqlock: Get rid of SEQLOCK_UNLOCKED
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
irq: Remove smp_affinity_list when unregister irq proc
Commit b87cf80af3 added support for
ARAT (Always Running APIC timer) on AMD processors that are not
affected by erratum 400. This erratum is present on certain processor
families and prevents APIC timer from waking up the CPU when it
is in a deep C state, including C1E state.
Determining whether a processor is affected by this erratum may
have some corner cases and handling these cases is somewhat
complicated. In the interest of simplicity we won't claim ARAT
support on processor families below 0x12 and will go back to
broadcasting timer when going idle.
Signed-off-by: Boris Ostrovsky <ostr@amd64.org>
Link: http://lkml.kernel.org/r/1306423192-19774-1-git-send-email-ostr@amd64.org
Tested-by: Boris Petkov <borislav.petkov@amd.com>
Cc: Hans Rosenfeld <Hans.Rosenfeld@amd.com>
Cc: Andreas Herrmann <Andreas.Herrmann3@amd.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Cc: stable@kernel.org # 32.x, 38.x, 39.x
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
UEFI stands for "Unified Extensible Firmware Interface", where "Firmware"
is an ancient African word meaning "Why do something right when you can
do it so wrong that children will weep and brave adults will cower before
you", and "UEI" is Celtic for "We missed DOS so we burned it into your
ROMs". The UEFI specification provides for runtime services (ie, another
way for the operating system to be forced to depend on the firmware) and
we rely on these for certain trivial tasks such as setting up the
bootloader. But some hardware fails to work if we attempt to use these
runtime services from physical mode, and so we have to switch into virtual
mode. So far so dreadful.
The specification makes it clear that the operating system is free to do
whatever it wants with boot services code after ExitBootServices() has been
called. SetVirtualAddressMap() can't be called until ExitBootServices() has
been. So, obviously, a whole bunch of EFI implementations call into boot
services code when we do that. Since we've been charmingly naive and
trusted that the specification may be somehow relevant to the real world,
we've already stuffed a picture of a penguin or something in that address
space. And just to make things more entertaining, we've also marked it
non-executable.
This patch allocates the boot services regions during EFI init and makes
sure that they're executable. Then, after SetVirtualAddressMap(), it
discards them and everyone lives happily ever after. Except for the ones
who have to work on EFI, who live sad lives haunted by the knowledge that
someone's eventually going to write yet another firmware specification.
[ hpa: adding this to urgent with a stable tag since it fixes currently-broken
hardware. However, I do not know what the dependencies are and so I do
not know which -stable versions this may be a candidate for. ]
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Link: http://lkml.kernel.org/r/1306331593-28715-1-git-send-email-mjg@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <stable@kernel.org>
Due to commit dc326fca2b (x86, cpu: Clean up and unify the NOP selection infrastructure), we get the following warning:
arch/x86/kernel/ftrace.c: In function ‘ftrace_make_nop’:
arch/x86/kernel/ftrace.c:308:6: warning: assignment discards qualifiers from pointer target type
arch/x86/kernel/ftrace.c: In function ‘ftrace_make_call’:
arch/x86/kernel/ftrace.c:318:6: warning: assignment discards qualifiers from pointer target type
ftrace_nop_replace() now returns const unsigned char *, so change its associated function/variable to its compatible type to keep compiler clam.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Link: http://lkml.kernel.org/r/1305221620.7986.4.camel@localhost.localdomain
[ updated for change of const void *src in probe_kernel_write() ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This patch removes a check that causes incorrect scheduler
domain setup (SMP instead of SMT) and bootlog warning messages
when cpuid extensions for topology enumeration are not supported
and the number of processors reported to the OS is smaller than
smp_num_siblings.
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Nikhil P Rao <nikhil.rao@intel.com>
Link: http://lkml.kernel.org/r/1306343921.19325.1.camel@fedora13
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On larger systems, because of the numerous ACPI, Bootmem and EFI messages,
the static log buffer overflows before the larger one specified by the
log_buf_len param is allocated. Minimize the overflow by allocating the
new log buffer as soon as possible.
On kernels without memblock, a later call to setup_log_buf from
kernel/init.c is the fallback.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix CONFIG_PRINTK=n build]
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cpumask_t is very big struct and cpu_vm_mask is placed wrong position.
It might lead to reduce cache hit ratio.
This patch has two change.
1) Move the place of cpumask into last of mm_struct. Because usually cpumask
is accessed only front bits when the system has cpu-hotplug capability
2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce memory
footprint if cpumask_size() will use nr_cpumask_bits properly in future.
In addition, this patch change the name of cpu_vm_mask with cpu_vm_mask_var.
It may help to detect out of tree cpu_vm_mask users.
This patch has no functional change.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds support for a new version of the SGI UV hub
chip. The hub chip is the node controller that connects multiple
blades into a larger coherent SSI.
For the most part, UV2 is compatible with UV1. The majority of
the changes are in the addresses of MMRs and in a few cases, the
contents of MMRs. These changes are the result in changes in the
system topology such as node configuration, processor types,
maximum nodes, physical address sizes, etc.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/20110511175028.GA18006@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: Unify input section names
percpu: Avoid extra NOP in percpu_cmpxchg16b_double
percpu: Cast away printk format warning
percpu: Always align percpu output section to PAGE_SIZE
Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun
In mask/restore_ioapic_entries() we should be restoring ioapic
entries when ioapics[apic].saved_registers is not NULL.
Fix the typo and address the resume hang regression reported by
Linus.
This was not found sooner because the systems where these
changes were tested on kept the IO-APIC entries intact over
resume.
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Daniel J Blueman <daniel.blueman@gmail.com>
Link: http://lkml.kernel.org/r/1306259131.7171.7.camel@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
All static seqlock should be initialized with the lockdep friendly
__SEQLOCK_UNLOCKED() macro.
Remove legacy SEQLOCK_UNLOCKED() macro.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/%3C1306238888.3026.31.camel%40edumazet-laptop%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
vread_tsc is short and hot, and it's userspace code so the usual
reasons to enable -pg and turn off sibling calls don't apply.
(OK, turning off sibling calls has no effect. But it might
someday...)
As an added benefit, tsc.c is profilable now.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Borislav Petkov <bp@amd64.org>
Link: http://lkml.kernel.org/r/%3C99c6d7f5efa3ccb65b4ac6eb443e1ab7bad47d7b.1306156808.git.luto%40mit.edu%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
vread_tsc checks whether rdtsc returns something less than
cycle_last, which is an extremely predictable branch. GCC likes
to generate a cmov anyway, which is several cycles slower than
a predicted branch. This saves a couple of nanoseconds.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Borislav Petkov <bp@amd64.org>
Link: http://lkml.kernel.org/r/%3C561280649519de41352fcb620684dfb22bad6bac.1306156808.git.luto%40mit.edu%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
RDTSC is completely unordered on modern Intel and AMD CPUs. The
Intel manual says that lfence;rdtsc causes all previous instructions
to complete before the tsc is read, and the AMD manual says to use
mfence;rdtsc to do the same thing.
From a decent amount of testing [1] this is enough to make rdtsc
be ordered with respect to subsequent loads across a wide variety
of CPUs.
On Sandy Bridge (i7-2600), this improves a loop of
clock_gettime(CLOCK_MONOTONIC) by more than 5 ns/iter.
[1] https://lkml.org/lkml/2011/4/18/350
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Borislav Petkov <bp@amd64.org>
Link: http://lkml.kernel.org/r/%3C1c158b9d74338aa5361f96dd473d0e6a58235302.1306156808.git.luto%40mit.edu%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Variables that are shared between the vdso and the kernel are
currently a bit of a mess. They are each defined with their own
magic, they are accessed differently in the kernel, the vsyscall page,
and the vdso, and one of them (vsyscall_clock) doesn't even really
exist.
This changes them all to use a common mechanism. All of them are
delcared in vvar.h with a fixed address (validated by the linker
script). In the kernel (as before), they look like ordinary
read-write variables. In the vsyscall page and the vdso, they are
accessed through a new macro VVAR, which gives read-only access.
The vdso is now loaded verbatim into memory without any fixups. As a
side bonus, access from the vdso is faster because a level of
indirection is removed.
While we're at it, pack jiffies and vgetcpu_mode into the same
cacheline.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Borislav Petkov <bp@amd64.org>
Link: http://lkml.kernel.org/r/%3C7357882fbb51fa30491636a7b6528747301b7ee9.1306156808.git.luto%40mit.edu%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (27 commits)
PCI: Don't use dmi_name_in_vendors in quirk
PCI: remove unused AER functions
PCI/sysfs: move bus cpuaffinity to class dev_attrs
PCI: add rescan to /sys/.../pci_bus/.../
PCI: update bridge resources to get more big ranges when allocating space (again)
KVM: Use pci_store/load_saved_state() around VM device usage
PCI: Add interfaces to store and load the device saved state
PCI: Track the size of each saved capability data area
PCI/e1000e: Add and use pci_disable_link_state_locked()
x86/PCI: derive pcibios_last_bus from ACPI MCFG
PCI: add latency tolerance reporting enable/disable support
PCI: add OBFF enable/disable support
PCI: add ID-based ordering enable/disable support
PCI hotplug: acpiphp: assume device is in state D0 after powering on a slot.
PCI: Set PCIE maxpayload for card during hotplug insertion
PCI/ACPI: Report _OSC control mask returned on failure to get control
x86/PCI: irq and pci_ids patch for Intel Panther Point DeviceIDs
PCI: handle positive error codes
PCI: check pci_vpd_pci22_wait() return
PCI: Use ICH6_GPIO_EN in ich6_lpc_acpi_gpio
...
Fix up trivial conflicts in include/linux/pci_ids.h: commit a6e5e2be44
moved the intel SMBUS ID definitons to the i2c-i801.c driver.
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, apic: Include module.h header in apic_flat_64.c
x86, apic: Make apic drivers static
x86, apic: Clean up bigsmp apic selection code
x86, apic: Use .apicdrivers section for the apic drivers list
x86, apic: Introduce .apicdrivers section to find the list of apic drivers
x86, x2apic: Move the common bits to x2apic.h
x86, x2apic: Minimize IPI register writes using cluster groups
x86, x2apic: Track the x2apic cluster sibling map
x86, x2apic: Remove duplicate code for IPI mask routines
x86, apic: Use probe routines to simplify apic selection
x86, ioapic: Consolidate mp_ioapic_routing[] into 'struct ioapic'
x86, ioapic: Consolidate gsi routing info into 'struct ioapic'
x86, ioapic: Consolidate mp_ioapics[] into 'struct ioapic'
x86, ioapic: Consolidate ioapic_saved_data[] into 'struct ioapic'
x86, ioapic: Add struct ioapic
x86, ioapic: Remove duplicate code for saving/restoring RTEs
x86, ioapic: Use ioapic_saved_data while enabling intr-remapping
x86, ioapic: Allocate ioapic_saved_data early
x86, ioapic: Fix potential resume deadlock
apic_flat_64.c needs to include module.h because it uses
EXPORT_SYMBOL_GPL().
This fixes these warnings on some !SMP randconfigs:
arch/x86/kernel/apic/apic_flat_64.c:31: warning: data definition has no type or storage class
arch/x86/kernel/apic/apic_flat_64.c:31: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
arch/x86/kernel/apic/apic_flat_64.c:31: warning: parameter names (without types) in function declaration
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20110523104300.dd532a99.randy.dunlap@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf tools: Fix sample size bit operations
perf tools: Fix ommitted mmap data update on remap
watchdog: Change the default timeout and configure nmi watchdog period based on watchdog_thresh
watchdog: Disable watchdog when thresh is zero
watchdog: Only disable/enable watchdog if neccessary
watchdog: Fix rounding bug in get_sample_period()
perf tools: Propagate event parse error handling
perf tools: Robustify dynamic sample content fetch
perf tools: Pre-check sample size before parsing
perf tools: Move evlist sample helpers to evlist area
perf tools: Remove junk code in mmap size handling
perf tools: Check we are able to read the event size on mmap
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Eliminate various 'set but not used' warnings
x86, SMEP: Fix section mismatch warnings
x86, amd: Use _safe() msr access for GartTlbWlk disable code
Before the conversion of the NMI watchdog to perf event, the
watchdog timeout was 5 seconds. Now it is 60 seconds. For my
particular application, netbooks, 5 seconds was a better
timeout. With a short timeout, we catch faults earlier and are
able to send back a panic. With a 60 second timeout, the user is
unlikely to wait and will instead hit the power button, causing
us to lose the panic info.
This change configures the NMI period to watchdog_thresh and
sets the softlockup_thresh to watchdog_thresh * 2. In addition,
watchdog_thresh was reduced to 10 seconds as suggested by Ingo
Molnar.
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1306127423-3347-4-git-send-email-msb@chromium.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20110517071642.GF22305@elte.hu>
The setup_smep function gets calle at resume time too, and is thus not a
pure __init function. When marked as __init, it gets thrown out after
the kernel has initialized, and when the kernel is suspended and
resumed, the code will no longer be around, and we'll get a nice "kernel
tried to execute NX-protected page" oops because the page is no longer
marked executable.
Reported-and-tested-by: Parag Warudkar <parag.lkml@gmail.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This will pave the way for each apic driver to be self-contained
and eliminate the need for apic_probe[].
Order in which apic drivers are listed in the .apicdrivers
section is important, as this determines the apic probe order.
And this is enforced by the ordering of apic driver files in the
Makefile and the macros apic_driver()/apic_drivers().
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: steiner@sgi.com
Cc: gorcunov@openvz.org
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20110521005526.068775085@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1446 commits)
macvlan: fix panic if lowerdev in a bond
tg3: Add braces around 5906 workaround.
tg3: Fix NETIF_F_LOOPBACK error
macvlan: remove one synchronize_rcu() call
networking: NET_CLS_ROUTE4 depends on INET
irda: Fix error propagation in ircomm_lmp_connect_response()
irda: Kill set but unused variable 'bytes' in irlan_check_command_param()
irda: Kill set but unused variable 'clen' in ircomm_connect_indication()
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_transport()
be2net: Kill set but unused variable 'req' in lancer_fw_download()
irda: Kill set but unused vars 'saddr' and 'daddr' in irlan_provider_connect_indication()
atl1c: atl1c_resume() is only used when CONFIG_PM_SLEEP is defined.
rxrpc: Fix set but unused variable 'usage' in rxrpc_get_peer().
rxrpc: Kill set but unused variable 'local' in rxrpc_UDP_error_handler()
rxrpc: Kill set but unused variable 'sp' in rxrpc_process_connection()
rxrpc: Kill set but unused variable 'sp' in rxrpc_rotate_tx_window()
pkt_sched: Kill set but unused variable 'protocol' in tc_classify()
isdn: capi: Use pr_debug() instead of ifdefs.
tg3: Update version to 3.119
tg3: Apply rx_discards fix to 5719/5720
...
Fix up trivial conflicts in arch/x86/Kconfig and net/mac80211/agg-tx.c
as per Davem.
In the case of x2apic cluster mode we can group IPI register
writes based on the cluster group instead of individual per-cpu
destination messages.
This reduces the apic register writes and reduces the amount of
IPI messages (in the best case we can reduce it by a factor of
16).
With this change, the cost of flush_tlb_others(), with the flush
tlb IPI being sent from a cpu in the socket-1 to all the logical
cpus in socket-2 (on a Westmere-EX system that has 20 logical
cpus in a socket) is 3x times better now (compared to the former
'send one-by-one' algorithm).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: steiner@sgi.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20110519234637.512271057@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In the case of x2apic cluster mode, we can group IPI register
writes based on the cluster group instead of individual per-cpu
destination messages.
For this purpose, track the cpu's that belong to the same x2apic
cluster.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: steiner@sgi.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20110519234637.421800999@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the unused probe routine in the apic driver to finalize the
apic model selection. This cleans up the
default_setup_apic_routing() and this probe routine in future
can also be used for doing any apic model specific
initialisation.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: steiner@sgi.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/20110519234637.247458931@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Code flow for enabling interrupt-remapping has its own routines
for saving and restoring io-apic RTE's. ioapic suspend/resume
code flow also has similar routines. Remove the duplicate code.
Tested-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/20110518233157.673130611@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Code flow for enabling interrupt-remapping was
allocating/freeing buffers for saving/restoring io-apic RTE's.
ioapic suspend/resume code uses boot time allocated
ioapic_saved_data that is a perfect match for reuse here.
This will remove the unnecessary allocation/free of the
temporary buffers during suspend/resume of interrupt-remapping
enabled platforms aswell as paving the way for further code
consolidation.
Tested-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/20110518233157.574469296@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This allows re-using this buffer for enabling
interrupt-remapping during boot and resume. And thus allow for
consolidating the code between ioapic suspend/resume and
interrupt-remapping.
Tested-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/20110518233157.481404505@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a potential deadlock when resuming; here the calling
function has disabled interrupts, so we cannot sleep.
Change the memory allocation flag from GFP_KERNEL to GFP_ATOMIC.
TODO: We can do away with this memory allocation during resume
by reusing the ioapic suspend/resume code that uses boot time
allocated buffers, but we want to keep this -stable patch
simple.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: <stable@kernel.org> # v2.6.38/39
Link: http://lkml.kernel.org/r/20110518233157.385970138@sbsiddha-MOBL3.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The workaround for Bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=33012
introduced a read and a write to the MC4 mask msr.
Unfortunatly this MSR is not emulated by the KVM hypervisor
so that the kernel will get a #GP and crashes when applying
this workaround when running inside KVM.
This issue was reported as:
https://bugzilla.kernel.org/show_bug.cgi?id=35132
and is fixed with this patch. The change just let the kernel
ignore any #GP it gets while accessing this MSR by using the
_safe msr access methods.
Reported-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org> # .39.x
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (44 commits)
debugfs: Silence DEBUG_STRICT_USER_COPY_CHECKS=y warning
sysfs: remove "last sysfs file:" line from the oops messages
drivers/base/memory.c: fix warning due to "memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION"
memory hotplug: Speed up add/remove when blocks are larger than PAGES_PER_SECTION
SYSFS: Fix erroneous comments for sysfs_update_group().
driver core: remove the driver-model structures from the documentation
driver core: Add the device driver-model structures to kerneldoc
Translated Documentation/email-clients.txt
RAW driver: Remove call to kobject_put().
reboot: disable usermodehelper to prevent fs access
efivars: prevent oops on unload when efi is not enabled
Allow setting of number of raw devices as a module parameter
Introduce CONFIG_GOOGLE_FIRMWARE
driver: Google Memory Console
driver: Google EFI SMI
x86: Better comments for get_bios_ebda()
x86: get_bios_ebda_length()
misc: fix ti-st build issues
params.c: Use new strtobool function to process boolean inputs
debugfs: move to new strtobool
...
Fix up trivial conflicts in fs/debugfs/file.c due to the same patch
being applied twice, and an unrelated cleanup nearby.
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Introduce pci_map_biosrom()
x86, olpc: Use device tree for platform identification
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (50 commits)
x86, mm: Allow ZONE_DMA to be configurable
x86, NUMA: Trim numa meminfo with max_pfn in a separate loop
x86, NUMA: Rename setup_node_bootmem() to setup_node_data()
x86, NUMA: Enable emulation on 32bit too
x86, NUMA: Enable CONFIG_AMD_NUMA on 32bit too
x86, NUMA: Rename amdtopology_64.c to amdtopology.c
x86, NUMA: Make numa_init_array() static
x86, NUMA: Make 32bit use common NUMA init path
x86, NUMA: Initialize and use remap allocator from setup_node_bootmem()
x86-32, NUMA: Add @start and @end to init_alloc_remap()
x86, NUMA: Remove long 64bit assumption from numa.c
x86, NUMA: Enable build of generic NUMA init code on 32bit
x86, NUMA: Move NUMA init logic from numa_64.c to numa.c
x86-32, NUMA: Update numaq to use new NUMA init protocol
x86-32, NUMA: Replace srat_32.c with srat.c
x86-32, NUMA: implement temporary NUMA init shims
x86, NUMA: Move numa_nodes_parsed to numa.[hc]
x86-32, NUMA: Move get_memcfg_numa() into numa_32.c
x86, NUMA: make srat.c 32bit safe
x86, NUMA: rename srat_64.c to srat.c
...
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, efi: Ensure that the entirity of a region is mapped
x86, efi: Pass a minimal map to SetVirtualAddressMap()
x86, efi: Merge contiguous memory regions of the same type and attribute
x86, efi: Consolidate EFI nx control
x86, efi: Remove virtual-mode SetVirtualAddressMap call
* 'x86-gart-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, gart: Don't enforce GART aperture lower-bound by alignment
* 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Don't unmask disabled irqs when migrating them
x86: Skip migrating IRQF_PER_CPU irqs in fixup_irqs()
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mce: Drop the default decoding notifier
x86, MCE: Do not taint when handling correctable errors
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, cpu: Fix detection of Celeron Covington stepping A1 and B0
Documentation, ABI: Update L3 cache index disable text
x86, AMD, cacheinfo: Fix L3 cache index disable checks
x86, AMD, cacheinfo: Fix fallout caused by max3 conversion
x86, cpu: Change NOP selection for certain Intel CPUs
x86, cpu: Clean up and unify the NOP selection infrastructure
x86, percpu: Use ASM_NOP4 instead of hardcoding P6_NOP4
x86, cpu: Move AMD Elan Kconfig under "Processor family"
Fix up trivial conflicts in alternative handling (commit dc326fca2b
"x86, cpu: Clean up and unify the NOP selection infrastructure" removed
some hacky 5-byte instruction stuff, while commit d430d3d7e6 "jump
label: Introduce static_branch() interface" renamed HAVE_JUMP_LABEL to
CONFIG_JUMP_LABEL in the code that went away)
* 'timers-clockevents-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: hpet: Cleanup the clockevents init and register code
x86: Convert PIT to clockevents_config_and_register()
clockevents: Provide interface to reconfigure an active clock event device
clockevents: Provide combined configure and register function
clockevents: Restructure clock_event_device members
clocksource: Get rid of the hardcoded 5 seconds sleep time limit
clocksource: Restructure clocksource struct members
* 'timers-clocksource-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clocksource: convert mips to generic i8253 clocksource
clocksource: convert x86 to generic i8253 clocksource
clocksource: convert footbridge to generic i8253 clocksource
clocksource: add common i8253 PIT clocksource
blackfin: convert to clocksource_register_hz
mips: convert to clocksource_register_hz/khz
sparc: convert to clocksource_register_hz/khz
alpha: convert to clocksource_register_hz
microblaze: convert to clocksource_register_hz/khz
ia64: convert to clocksource_register_hz/khz
x86: Convert remaining x86 clocksources to clocksource_register_hz/khz
Make clocksource name const
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (60 commits)
sched: Fix and optimise calculation of the weight-inverse
sched: Avoid going ahead if ->cpus_allowed is not changed
sched, rt: Update rq clock when unthrottling of an otherwise idle CPU
sched: Remove unused parameters from sched_fork() and wake_up_new_task()
sched: Shorten the construction of the span cpu mask of sched domain
sched: Wrap the 'cfs_rq->nr_spread_over' field with CONFIG_SCHED_DEBUG
sched: Remove unused 'this_best_prio arg' from balance_tasks()
sched: Remove noop in alloc_rt_sched_group()
sched: Get rid of lock_depth
sched: Remove obsolete comment from scheduler_tick()
sched: Fix sched_domain iterations vs. RCU
sched: Next buddy hint on sleep and preempt path
sched: Make set_*_buddy() work on non-task entities
sched: Remove need_migrate_task()
sched: Move the second half of ttwu() to the remote cpu
sched: Restructure ttwu() some more
sched: Rename ttwu_post_activation() to ttwu_do_wakeup()
sched: Remove rq argument from ttwu_stat()
sched: Remove rq->lock from the first half of ttwu()
sched: Drop rq->lock from sched_exec()
...
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix rt_rq runtime leakage bug
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (107 commits)
perf stat: Add more cache-miss percentage printouts
perf stat: Add -d -d and -d -d -d options to show more CPU events
ftrace/kbuild: Add recordmcount files to force full build
ftrace: Add self-tests for multiple function trace users
ftrace: Modify ftrace_set_filter/notrace to take ops
ftrace: Allow dynamically allocated function tracers
ftrace: Implement separate user function filtering
ftrace: Free hash with call_rcu_sched()
ftrace: Have global_ops store the functions that are to be traced
ftrace: Add ops parameter to ftrace_startup/shutdown functions
ftrace: Add enabled_functions file
ftrace: Use counters to enable functions to trace
ftrace: Separate hash allocation and assignment
ftrace: Create a global_ops to hold the filter and notrace hashes
ftrace: Use hash instead for FTRACE_FL_FILTER
ftrace: Replace FTRACE_FL_NOTRACE flag with a hash of ignored functions
perf bench, x86: Add alternatives-asm.h wrapper
x86, 64-bit: Fix copy_[to/from]_user() checks for the userspace address limit
x86, mem: memset_64.S: Optimize memset by enhanced REP MOVSB/STOSB
x86, mem: memmove_64.S: Optimize memmove by enhanced REP MOVSB/STOSB
...
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, gart: Rename pci-gart_64.c to amd_gart_64.c
x86/amd-iommu: Use threaded interupt handler
arch/x86/kernel/pci-iommu_table.c: Convert sprintf_symbol to %pS
x86/amd-iommu: Add support for invalidate_all command
x86/amd-iommu: Add extended feature detection
x86/amd-iommu: Add ATS enable/disable code
x86/amd-iommu: Add flag to indicate IOTLB support
x86/amd-iommu: Flush device IOTLB if ATS is enabled
x86/amd-iommu: Select PCI_IOV with AMD IOMMU driver
PCI: Move ATS declarations in seperate header file
dma-debug: print information about leaked entry
x86/amd-iommu: Flush all internal TLBs when IOMMUs are enabled
x86/amd-iommu: Rename iommu_flush_device
x86/amd-iommu: Improve handling of full command buffer
x86/amd-iommu: Rename iommu_flush* to domain_flush*
x86/amd-iommu: Remove command buffer resetting logic
x86/amd-iommu: Cleanup completion-wait handling
x86/amd-iommu: Cleanup inv_pages command handling
x86/amd-iommu: Move inv-dte command building to own function
x86/amd-iommu: Move compl-wait command building to own function
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: (34 commits)
PM: Introduce generic prepare and complete callbacks for subsystems
PM: Allow drivers to allocate memory from .prepare() callbacks safely
PM: Remove CONFIG_PM_VERBOSE
Revert "PM / Hibernate: Reduce autotuned default image size"
PM / Hibernate: Add sysfs knob to control size of memory for drivers
PM / Wakeup: Remove useless synchronize_rcu() call
kmod: always provide usermodehelper_disable()
PM / ACPI: Remove acpi_sleep=s4_nonvs
PM / Wakeup: Fix build warning related to the "wakeup" sysfs file
PM: Print a warning if firmware is requested when tasks are frozen
PM / Runtime: Rework runtime PM handling during driver removal
Freezer: Use SMP barriers
PM / Suspend: Do not ignore error codes returned by suspend_enter()
PM: Fix build issue in clock_ops.c for CONFIG_PM_RUNTIME unset
PM: Revert "driver core: platform_bus: allow runtime override of dev_pm_ops"
OMAP1 / PM: Use generic clock manipulation routines for runtime PM
PM: Remove sysdev suspend, resume and shutdown operations
PM / PowerPC: Use struct syscore_ops instead of sysdevs for PM
PM / UNICORE32: Use struct syscore_ops instead of sysdevs for PM
PM / AVR32: Use struct syscore_ops instead of sysdevs for PM
...
It doesn't make sense to unconditionally unmask a disabled irq when
migrating it from offlined cpu to another. If the irq triggers then it
will be disabled in the interrupt handler anyway. So we can just avoid
unmasking it.
[ tglx: Made masking unconditional again and fixed the changelog ]
Signed-off-by: Fengzhe Zhang <fengzhe.zhang@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/%3C625BA99ED14B2D499DC4E29D8138F1505C8ED7F7E3%40shsmsx502.ccr.corp.intel.com%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
IRQF_PER_CPU means that the irq cannot be moved away from a given
cpu. So it must not be migrated when the cpu goes offline.
[ tglx: massaged changelog ]
Signed-off-by: Fengzhe Zhang <fengzhe.zhang@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/%3C625BA99ED14B2D499DC4E29D8138F1505C8ED7F7E2%40shsmsx502.ccr.corp.intel.com%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No need to recalculate the frequency and the conversion factors over
and over. Calculate the frequency once and use the new config/register
interface and let the core code do the math.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Link: http://lkml.kernel.org/r/%3C20110518210136.646482357%40linutronix.de%3E
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf evlist: Fix per thread mmap setup
perf tools: Honour the cpu list parameter when also monitoring a thread list
kprobes, x86: Disable irqs during optimized callback
Enable/disable newly documented SMEP (Supervisor Mode Execution Protection) CPU
feature in kernel. CR4.SMEP (bit 20) is 0 at power-on. If the feature is
supported by CPU (X86_FEATURE_SMEP), enable SMEP by setting CR4.SMEP. New kernel
option nosmep disables the feature even if the feature is supported by CPU.
[ hpa: moved the call to setup_smep() until after the vendor-specific
initialization; that ensures that CPUID features are unmasked. We
will still run it before we have userspace (never mind uncontrolled
userspace). ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
LKML-Reference: <1305157865-31727-1-git-send-email-fenghua.yu@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Some string operation functions may be patched twice, e.g. on enhanced REP MOVSB
/STOSB processors, memcpy is patched first by fast string alternative function,
then it is patched by enhanced REP MOVSB/STOSB alternative function.
Add comment for applying alternatives order to warn people who may change the
applying alternatives order for any reason.
[ Documentation-only patch ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305671358-14478-4-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If kernel intends to use enhanced REP MOVSB/STOSB, it must ensure
IA32_MISC_ENABLE.Fast_String_Enable (bit 0) is set and CPUID.(EAX=07H, ECX=0H):
EBX[bit 9] also reports 1.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305671358-14478-3-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* syscore:
PM: Remove sysdev suspend, resume and shutdown operations
PM / PowerPC: Use struct syscore_ops instead of sysdevs for PM
PM / UNICORE32: Use struct syscore_ops instead of sysdevs for PM
PM / AVR32: Use struct syscore_ops instead of sysdevs for PM
PM / Blackfin: Use struct syscore_ops instead of sysdevs for PM
ARM / Samsung: Use struct syscore_ops for "core" power management
ARM / PXA: Use struct syscore_ops for "core" power management
ARM / SA1100: Use struct syscore_ops for "core" power management
ARM / Integrator: Use struct syscore_ops for core PM
ARM / OMAP: Use struct syscore_ops for "core" power management
ARM: Use struct syscore_ops instead of sysdevs for PM in common code
acpi_sleep=s4_nonvs is superseded by acpi_sleep=nonvs, so remove it.
Signed-off-by: WANG Cong <amwang@redhat.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
CPUID leaf 7, subleaf 0 returns the maximum subleaf in EAX, not the
number of subleaves. Since so far only subleaf 0 is defined (and only
the EBX bitfield) we do not need to qualify the test.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1305660806-17519-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@kernel.org> 2.6.36..39
Trying to enable the local APIC timer on early K8 revisions
uncovers a number of other issues with it, in conjunction with
the C1E enter path on AMD. Fixing those causes much more churn
and troubles than the benefit of using that timer brings so
don't enable it on K8 at all, falling back to the original
functionality the kernel had wrt to that.
Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Cc: Boris Ostrovsky <Boris.Ostrovsky@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
Cc: Nick Bowler <nbowler@elliptictech.com>
Cc: Joerg-Volker-Peetz <jvpeetz@web.de>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1305636919-31165-3-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit e20a2d205c, as it crashes
certain boxes with specific AMD CPU models.
Moving the lower endpoint of the Erratum 400 check to accomodate
earlier K8 revisions (A-E) opens a can of worms which is simply
not worth to fix properly by tweaking the errata checking
framework:
* missing IntPenging MSR on revisions < CG cause #GP:
http://marc.info/?l=linux-kernel&m=130541471818831
* makes earlier revisions use the LAPIC timer instead of the C1E
idle routine which switches to HPET, thus not waking up in
deeper C-states:
http://lkml.org/lkml/2011/4/24/20
Therefore, leave the original boundary starting with K8-revF.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steppings A1 and B0 of Celeron Covington are currently misdetected as
Pentium II (Dixon). Fix it by removing the stepping check.
[ hpa: this fixes this specific bug... the CPUID documentation
specifies that the L2 cache size can disambiguate additional CPUs;
this patch does not fix that. ]
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Link: http://lkml.kernel.org/r/201105162138.15416.linux@rainbow-software.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We provide two slots to disable cache indices, and have a check to
prevent both slots to be used for the same index.
If the user disables the same index on different subcaches, both slots
will hold the same index, e.g.
$ echo 2047 > /sys/devices/system/cpu/cpu0/cache/index3/cache_disable_0
$ cat /sys/devices/system/cpu/cpu0/cache/index3/cache_disable_0
2047
$ echo 1050623 > /sys/devices/system/cpu/cpu0/cache/index3/cache_disable_1
$ cat /sys/devices/system/cpu/cpu0/cache/index3/cache_disable_1
2047
due to the fact that the check was looking only at index bits [11:0]
and was ignoring writes to bits outside that range. The more correct
fix is to simply check whether the index is within the bounds of
[0..l3->indices].
While at it, cleanup comments and drop now-unused local macros.
Signed-off-by: Frank Arnold <frank.arnold@amd.com>
Link: http://lkml.kernel.org/r/1305553188-21061-3-git-send-email-bp@amd64.org
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
732eacc054 converted code around the
kernel using nested max() macros to use the new max3 macro but forgot to
remove the old line in intel_cacheinfo.c. Fix it.
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Frank Arnold <farnold@amd64.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1305553188-21061-2-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch fixes a bug reported by a customer, who found
that many unreasonable error interrupts reported on all
non-boot CPUs (APs) during the system boot stage.
According to Chapter 10 of Intel Software Developer Manual
Volume 3A, Local APIC may signal an illegal vector error when
an LVT entry is set as an illegal vector value (0~15) under
FIXED delivery mode (bits 8-11 is 0), regardless of whether
the mask bit is set or an interrupt actually happen. These
errors are seen as error interrupts.
The initial value of thermal LVT entries on all APs always reads
0x10000 because APs are woken up by BSP issuing INIT-SIPI-SIPI
sequence to them and LVT registers are reset to 0s except for
the mask bits which are set to 1s when APs receive INIT IPI.
When the BIOS takes over the thermal throttling interrupt,
the LVT thermal deliver mode should be SMI and it is required
from the kernel to keep AP's LVT thermal monitoring register
programmed as such as well.
This issue happens when BIOS does not take over thermal throttling
interrupt, AP's LVT thermal monitor register will be restored to
0x10000 which means vector 0 and fixed deliver mode, so all APs will
signal illegal vector error interrupts.
This patch check if interrupt delivery mode is not fixed mode before
restoring AP's LVT thermal monitor register.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Yong Wang <yong.y.wang@intel.com>
Cc: hpa@linux.intel.com
Cc: joe@perches.com
Cc: jbaron@redhat.com
Cc: trenn@suse.de
Cc: kent.liu@intel.com
Cc: chaohong.guo@intel.com
Cc: <stable@kernel.org> # As far back as possible
Link: http://lkml.kernel.org/r/1303402963-17738-1-git-send-email-youquan.song@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/ia64/kernel/cyclone.c
arch/mips/kernel/i8253.c
arch/x86/kernel/i8253.c
Reason: Resolve conflicts so further cleanups do not conflict further
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Convert x86 i8253 clocksource code to use generic i8253 clocksource.
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
On some arches (x86, sh, arm, unicore, powerpc) the oops message would
print out the last sysfs file accessed.
This was very useful in finding a number of sysfs and driver core bugs
in the 2.5 and early 2.6 development days, but it has been a number of
years since this file has actually helped in debugging anything that
couldn't also be trivially determined from the stack traceback.
So it's time to delete the line. This is good as we need all the space
we can get for oops messages at times on consoles.
Acked-by: Phil Carmody <ext-phil.2.carmody@nokia.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
b may be added to a list, but is not removed before being freed
in the case of an error. This is done in the corresponding
deallocation function, so the code here has been changed to
follow that.
The sematic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression E,E1,E2;
identifier l;
@@
*list_add(&E->l,E1);
... when != E1
when != list_del(&E->l)
when != list_del_init(&E->l)
when != E = E2
*kfree(E);// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1305294731-12127-1-git-send-email-julia@diku.dk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a new x86_init hook called pagetable_reserve that at the end
of init_memory_mapping is used to reserve a range of memory addresses for
the kernel pagetable pages we used and free the other ones.
On native it just calls memblock_x86_reserve_range while on xen it also
takes care of setting the spare memory previously allocated
for kernel pagetable pages from RO to RW, so that it can be used for
other purposes.
A detailed explanation of the reason why this hook is needed follows.
As a consequence of the commit:
commit 4b239f458c
Author: Yinghai Lu <yinghai@kernel.org>
Date: Fri Dec 17 16:58:28 2010 -0800
x86-64, mm: Put early page table high
at some point init_memory_mapping is going to reach the pagetable pages
area and map those pages too (mapping them as normal memory that falls
in the range of addresses passed to init_memory_mapping as argument).
Some of those pages are already pagetable pages (they are in the range
pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
everything is fine.
Some of these pages are not pagetable pages yet (they fall in the range
pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
are going to be mapped RW. When these pages become pagetable pages and
are hooked into the pagetable, xen will find that the guest has already
a RW mapping of them somewhere and fail the operation.
The reason Xen requires pagetables to be RO is that the hypervisor needs
to verify that the pagetables are valid before using them. The validation
operations are called "pinning" (more details in arch/x86/xen/mmu.c).
In order to fix the issue we mark all the pages in the entire range
pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
is completed only the range pgt_buf_start-pgt_buf_end is reserved by
init_memory_mapping. Hence the kernel is going to crash as soon as one
of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
ranges are RO).
For this reason we need a hook to reserve the kernel pagetable pages we
used and free the other ones so that they can be reused for other
purposes.
On native it just means calling memblock_x86_reserve_range, on Xen it
also means marking RW the pagetable pages that we allocated before but
that haven't been used before.
Another way to fix this is without using the hook is by adding a 'if
(xen_pv_domain)' in the 'init_memory_mapping' code and calling the Xen
counterpart, but that is just nasty.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Both warning and warning_symbol are nowhere used.
Let's get rid of them.
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Soeren Sandmann Pedersen <ssp@redhat.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: x86 <x86@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Link: http://lkml.kernel.org/r/1305205872-10321-2-git-send-email-richard@nod.at
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Since suspend, resume and shutdown operations in struct sysdev_class
and struct sysdev_driver are not used any more, remove them. Also
drop sysdev_suspend(), sysdev_resume() and sysdev_shutdown() used
for executing those operations and modify all of their users
accordingly. This reduces kernel code size quite a bit and reduces
its complexity.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Disable irqs during optimized callback, so we dont miss any in-irq kprobes.
The following commands:
# cd /debug/tracing/
# echo "p mutex_unlock" >> kprobe_events
# echo "p _raw_spin_lock" >> kprobe_events
# echo "p smp_apic_timer_interrupt" >> ./kprobe_events
# echo 1 > events/enable
Cause the optimized kprobes to be missed. None is missed
with the fix applied.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20110511110613.GB2390@jolsa.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This workaround holds a dma32 buffer at early boot to prevent later
bootmem allocations from stealing it in the case of large RAM configs.
Now that x86 is using memblock, and the nobootmem wrapper does top-down
allocation, it's no longer necessary, so remove it.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This file only contains code relevant for the northbridge
gart in AMD processors. This patch renames the file to
represent this fact in the filename.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Move the interupt handling for the iommu into the interupt
thread to reduce latencies and prepare interupt handling for
pri handling.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This fixes problems seen on UV systems handling NMIs from the
node controller.
I isolated the "dazed..." messages that I saw earlier to a bug in
the BMC on our platform. It was sending NMIs w/o properly setting
a register that indicated the source of NMI.
So rather than _assuming_ any unhandled NMI came from the UV system
maintenance console (SMC), add a check to verify that the SMC actually
sent the NMI.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Cc: gorcunov@gmail.com
Cc: dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Intel Nehalem offcore bits implemented in:
e994d7d23a: perf: Fix LLC-* events on Intel Nehalem/Westmere
... are wrong: they implemented _ACCESS as _HIT and counted OTHER_CORE_HIT* as
MISS even though its clearly documented as an L3 hit ...
Fix them and the Westmere definitions as well.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1299119690-13991-3-git-send-email-ming.m.lin@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Extend the Intel SandyBridge PMU driver with definitions
for generic front-end and back-end stall events.
( As commit 3011203 "perf events, x86: Add Westmere stalled-cycles-frontend/backend
events" says, these are only approximations. )
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1304666042-17577-1-git-send-email-ming.m.lin@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a multiple message send syscall and is the send
version of the existing recvmmsg syscall. This is heavily
based on the patch by Arnaldo that added recvmmsg.
I wrote a microbenchmark to test the performance gains of using
this new syscall:
http://ozlabs.org/~anton/junkcode/sendmmsg_test.c
The test was run on a ppc64 box with a 10 Gbit network card. The
benchmark can send both UDP and RAW ethernet packets.
64B UDP
batch pkts/sec
1 804570
2 872800 (+ 8 %)
4 916556 (+14 %)
8 939712 (+17 %)
16 952688 (+18 %)
32 956448 (+19 %)
64 964800 (+20 %)
64B raw socket
batch pkts/sec
1 1201449
2 1350028 (+12 %)
4 1461416 (+22 %)
8 1513080 (+26 %)
16 1541216 (+28 %)
32 1553440 (+29 %)
64 1557888 (+30 %)
We see a 20% improvement in throughput on UDP send and 30%
on raw socket send.
[ Add sparc syscall entries. -DaveM ]
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With dynamic debug having gained the capability to report debug messages
also during the boot process, it offers a far superior interface for
debug messages than the custom cpufreq infrastructure. As a first step,
remove the old cpufreq_debug_printk() function and replace it with a call
to the generic pr_debug() function.
How can dynamic debug be used on cpufreq? You need a kernel which has
CONFIG_DYNAMIC_DEBUG enabled.
To enabled debugging during runtime, mount debugfs and
$ echo -n 'module cpufreq +p' > /sys/kernel/debug/dynamic_debug/control
for debugging the complete "cpufreq" module. To achieve the same goal during
boot, append
ddebug_query="module cpufreq +p"
as a boot parameter to the kernel of your choice.
For more detailled instructions, please see
Documentation/dynamic-debug-howto.txt
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Dave Jones <davej@redhat.com>
UUID needs to be written out the way it is described in
Sec 18.5.124 of ACPI 4.0a Specification.
Platform firmware's use of this UUID/_OSC is optional, which is
why we didn't notice this bug earlier.
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Cc: stable@kernel.org
The use of base for %ebx in this file is arbitrary, *except* that we
also use it to compute the real-mode segment. Therefore, make it so
that r_base really is the true address to which %ebx points.
This resolves kernel bugzilla 33302.
Reported-and-tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-08os5wi3yq1no0y4i5m4z7he@git.kernel.org
Update numaq such that it calls numa_add_memblk() and sets
numa_nodes_parsed instead of directly diddling with NUMA states. The
original get_memcfg_numaq() is renamed to numaq_numa_init() and new
get_memcfg_numaq() is created in numa_32.c.
The shim numa_add_memblk() implementation handles node_start/end_pfn[]
and node_set_online() for nodes with memory. The new
get_memcfg_numaq() exactly the same with get_memcfg_from_srat() other
than calling the numaq init function. Things get_memcfgs_numaq() do
are not strictly necessary for numaq but added for consistency and to
help unifying NUMA init handling.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Instead of calling memory_present() for each region from NUMA init,
call sparse_memory_present_with_active_regions() from paging_init()
similarly to x86-64.
For flat and numaq, this results in exactly the same memory_present()
calls. For srat, if there are multiple memory chunks for a node,
after this change, memory_present() will be called separately for each
chunk instead of being called once to encompass the whole range, which
doesn't cause any harm and actually is the better behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
NUMAQ is the only meaningful user of this callback and
setup_local_APIC() the only callsite. Stop torturing everyone else by
making the callback optional and removing all the boilerplate
implementations and assignments.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Some x86-32 NUMA implementations (NUMAQ) don't initialize apicid ->
node mapping using set_apicid_to_node() during NUMA init but implement
custom apic->x86_32_numa_cpu_node() instead.
This patch automatically initializes the default apic -> node mapping
table from apic->x86_32_numa_cpu_node() from setup_local_APIC() such
that the mapping table is in sync with the actual mapping.
As the table isn't used by custom implementations, this doesn't make
any difference at this point. This is in preparation of unifying
numa_cpu_node() between x86-32 and 64.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Merge reason: Pick up the following two fix commits.
2be19102b7: x86, NUMA: Fix empty memblk detection in numa_cleanup_meminfo()
765af22da8: x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
Scheduled NUMA init 32/64bit unification changes depend on these.
Signed-off-by: Tejun Heo <tj@kernel.org>
Older AMD K8 processors (Revisions A-E) are affected by erratum
400 (APIC timer interrupts don't occur in C states greater than
C1). This, for example, means that X86_FEATURE_ARAT flag should
not be set for these parts.
This addresses regression introduced by commit
b87cf80af3 ("x86, AMD: Set ARAT
feature on AMD processors") where the system may become
unresponsive until external interrupt (such as keyboard input)
occurs. This results, for example, in time not being reported
correctly, lack of progress on the system and other lockups.
Reported-by: Joerg-Volker Peetz <jvpeetz@web.de>
Tested-by: Joerg-Volker Peetz <jvpeetz@web.de>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Boris Ostrovsky <Boris.Ostrovsky@amd.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1304113663-6586-1-git-send-email-ostr@amd64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, x86, nmi: Move LVT un-masking into irq handlers
perf events, x86: Work around the Nehalem AAJ80 erratum
perf, x86: Fix BTS condition
ftrace: Build without frame pointers on Microblaze
Extend the Intel Westmere PMU driver with definitions for generic front-end and
back-end stall events.
( These are only approximations. )
Reported-by: David Ahern <dsahern@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-7y40wib8n008io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Extend the Intel and AMD event definitions with generic front-end and
back-end stall events.
( These are only approximations - suggestions are welcome for better events. )
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-7y40wib8n001io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add two generic hardware events: front-end and back-end stalled cycles.
These events measure conditions when the CPU is executing code but its
capabilities are not fully utilized. Understanding such situations and
analyzing them is an important sub-task of code optimization workflows.
Both events limit performance: most front end stalls tend to be caused
by branch misprediction or instruction fetch cachemisses, backend
stalls can be caused by various resource shortages or inefficient
instruction scheduling.
Front-end stalls are the more important ones: code cannot run fast
if the instruction stream is not being kept up.
An over-utilized back-end can cause front-end stalls and thus
has to be kept an eye on as well.
The exact composition is very program logic and instruction mix
dependent.
We use the terms 'stall', 'front-end' and 'back-end' loosely and
try to use the best available events from specific CPUs that
approximate these concepts.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-7y40wib8n000io7hjpn1dsrm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Normally sys_rt_sigreturn() restores the old current->blocked which was
changed by handle_signal(), and unblocking is always fine.
But the debugger or application itself can change frame->uc_sigmask and
thus we need set_current_blocked()->retarget_shared_pending().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
This is ugly, but if sigprocmask() needs retarget_shared_pending() then
handle signal should follow this logic. In theory it is newer correct to
add the new signals to current->blocked, the signal handler can sleep/etc
so we should notify other threads in case we block the pending signal and
nobody else has TIF_SIGPENDING.
Of course, this change doesn't make signals faster :/
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
We use io_apic_setup_irq_pin() in order to configure pin's interrupt
number polarity and type. This is done on every irq_create_of_mapping()
which happens for instance during pci enable calls. Level typed
interrupts are masked by default, edge are unmasked.
On the first ->xlate() call the level interrupt is configured and
masked. The driver calls request_irq() and the line is unmasked. Lets
assume the interrupt line is shared with another device and we call
pci_enable_device() for this device. The ->xlate() configures the pin
again and it is masked. request_irq() does not unmask the line because
it _is_ already unmasked according to its internal state. So the
interrupt will never be unmasked again.
This patch is based on an earlier work by Torben Hohn and solves the
problem by configuring the pin only once. Since all devices must agree
on the same type and polarity there is no point in configuring the pin
more than once.
[ tglx: Split out the ce4100 part into a separate patch ]
Cc: Torben Hohn <torbenh@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/%3C20110427143052.GA15211%40linutronix.de%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the UOPS_EXECUTED.*,c=1,i=1 event on Intel CPUs - it is a rather
good indicator of CPU execution stalls, more sensitive and more inclusive
than the 0xa2 resource stalls event (which does not count nearly as many
stall types).
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-7y40wib8n1eqio7hjpn2dsrm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It was noticed that P4 machines were generating double NMIs for
each perf event. These extra NMIs lead to 'Dazed and confused'
messages on the screen.
I tracked this down to a P4 quirk that said the overflow bit had
to be cleared before re-enabling the apic LVT mask. My first
attempt was to move the un-masking inside the perf nmi handler
from before the chipset NMI handler to after.
This broke Nehalem boxes that seem to like the unmasking before
the counters themselves are re-enabled.
In order to keep this change simple for 2.6.39, I decided to
just simply move the apic LVT un-masking to the beginning of all
the chipset NMI handlers, with the exception of Pentium4's to
fix the double NMI issue.
Later on we can move the un-masking to later in the handlers to
save a number of 'extra' NMIs on those particular chipsets.
I tested this change on a P4 machine, an AMD machine, a Nehalem
box, and a core2quad box. 'perf top' worked correctly along
with various other small 'perf record' runs. Anything high
stress breaks all the machines but that is a different problem.
Thanks to various people for testing different versions of this
patch.
Reported-and-tested-by: Shaun Ruffell <sruffell@digium.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Link: http://lkml.kernel.org/r/1303900353-10242-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
CC: Cyrill Gorcunov <gorcunov@gmail.com>
Conflicts:
include/linux/perf_event.h
Merge reason: pick up the latest jump-label enhancements, they are cooked ready.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The new PERF_COUNT_HW_STALLED_CYCLES event tries to approximate
cycles the CPU does nothing useful, because it is stalled on a
cache-miss or some other condition.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-fue11vymwqsoo5to72jxxjyl@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Nehalem CPUs the retired branch-misses event can be completely bogus,
when there are no branch-misses occuring. When there are a lot of branch
misses then the count is pretty accurate. Still, this leaves us with an
event that over-counts a lot.
Detect this erratum and work it around by using BR_MISP_EXEC.ANY events.
These will also count speculated branches but still it's a lot more
precise in practice than the architectural event.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-yyfg0bxo9jsqxd6a0ovfny27@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the x86 backend incorrectly assumes that any BRANCH_INSN
with sample_period==1 is a BTS request. This is not true when we do
frequency driven profiling such as 'perf record -e branches'.
Solves this error:
$ perf record -e branches ./array
Error: sys_perf_event_open() syscall returned with 95 (Operation not supported).
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: Ingo Molnar <mingo@elte.hu>
Cc: "Metzger, Markus T" <markus.t.metzger@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-rd2y4ct71hjawzz6fpvsy9hg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While the tracer accesses ptrace breakpoints, the child task may
concurrently exit due to a SIGKILL and thus release its breakpoints
at the same time. We can then dereference some freed pointers.
To fix this, hold a reference on the child breakpoints before
manipulating them.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: v2.6.33.. <stable@kernel.org>
Link: http://lkml.kernel.org/r/1302284067-7860-3-git-send-email-fweisbec@gmail.com
It's not enough to simply disable event on overflow the
cpuc->active_mask should be cleared as well otherwise counter
may stall in "active" even in real being already disabled (which
potentially may lead to the situation that user may not use this
counter further).
Don pointed out that:
" I also noticed this patch fixed some unknown NMIs
on a P4 when I stressed the box".
Tested-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Link: http://lkml.kernel.org/r/1303398203-2918-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andi Kleen pointed out that the Intel offcore support patches were merged
without user-space tool support to the functionality:
|
| The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
| user space bits were not. This made it impossible to set the extra mask
| and actually do the OFFCORE profiling
|
Andi submitted a preliminary patch for user-space support, as an
extension to perf's raw event syntax:
|
| Some raw events -- like the Intel OFFCORE events -- support additional
| parameters. These can be appended after a ':'.
|
| For example on a multi socket Intel Nehalem:
|
| perf stat -e r1b7:20ff -a sleep 1
|
| Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
| that measures any access to DRAM on another socket.
|
But this kind of usability is absolutely unacceptable - users should not
be expected to type in magic, CPU and model specific incantations to get
access to useful hardware functionality.
The proper solution is to expose useful offcore functionality via
generalized events - that way users do not have to care which specific
CPU model they are using, they can use the conceptual event and not some
model specific quirky hexa number.
We already have such generalization in place for CPU cache events,
and it's all very extensible.
"Offcore" events measure general DRAM access patters along various
parameters. They are particularly useful in NUMA systems.
We want to support them via generalized DRAM events: either as the
fourth level of cache (after the last-level cache), or as a separate
generalization category.
That way user-space support would be very obvious, memory access
profiling could be done via self-explanatory commands like:
perf record -e dram ./myapp
perf record -e dram-remote ./myapp
... to measure DRAM accesses or more expensive cross-node NUMA DRAM
accesses.
These generalized events would work on all CPUs and architectures that
have comparable PMU features.
( Note, these are just examples: actual implementation could have more
sophistication and more parameter - as long as they center around
similarly simple usecases. )
Now we do not want to revert *all* of the current offcore bits, as they
are still somewhat useful for generic last-level-cache events, implemented
in this commit:
e994d7d23a: perf: Fix LLC-* events on Intel Nehalem/Westmere
But we definitely do not yet want to expose the unstructured raw events
to user-space, until better generalization and usability is implemented
for these hardware event features.
( Note: after generalization has been implemented raw offcore events can be
supported as well: there can always be an odd event that is marginally
useful but not useful enough to generalize. DRAM profiling is definitely
*not* such a category so generalization must be done first. )
Furthermore, PERF_TYPE_RAW access to these registers was not intended
to go upstream without proper support - it was a side-effect of the above
e994d7d23a commit, not mentioned in the changelog.
As v2.6.39 is nearing release we go for the simplest approach: disable
the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
kernel and becomes an ABI.
Once proper structure is implemented for these hardware events and users
are offered usable solutions we can revisit this issue.
Reported-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1302658203-4239-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding
hint only in the uncorrectable errors case and when no notifier
is registered. Also, limit issuing the "mcelog --ascii" message
in the rare case when we dump unreported CEs before panicking.
While at it, remove unused old x86_mce_decode_callback from the
header.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Nagananda Chumbalkar <Nagananda.Chumbalkar@hp.com>
Cc: Russ Anderson <rja@sgi.com>
Link: http://lkml.kernel.org/r/20110420102349.GB1361@aftab
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andreas Herrmann reported that 7d6b46707f ("x86, NUMA: Fix fakenuma
boot failure") causes certain physical NUMA topologies (for example
AMD Magny-Cours) to move sibling cpus to a single node when in reality
they are in separate domains.
This may result in some nodes being completely void of cpus, which
doesn't accurately represent the correct topology. The system will
boot, but will have suboptimal NUMA performance.
This commit was intended as a fix for NUMA emulation, but should
not cause a regression for real NUMA machines as a side effect.
( There will be a separate fix for the numa-debug code, which
will not affect physical topologies. )
Reported-by: Andreas Herrmann <herrmann.der.user@googlemail.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1104201918110.12634@chino.kir.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Device suspend/resume infrastructure is used not only by the suspend
and hibernate code in kernel/power, but also by APM, Xen and the
kexec jump feature. However, commit 40dc166cb5
(PM / Core: Introduce struct syscore_ops for core subsystems PM)
failed to add syscore_suspend() and syscore_resume() calls to that
code, which generally leads to breakage when the features in question
are used.
To fix this problem, add the missing syscore_suspend() and
syscore_resume() calls to arch/x86/kernel/apm_32.c, kernel/kexec.c
and drivers/xen/manage.c.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, gart: Make sure GART does not map physmem above 1TB
x86, gart: Set DISTLBWALKPRB bit always
x86, gart: Convert spaces to tabs in enable_gart_translation
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, x86: Fix AMD family 15h FPU event constraints
perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus
perf evsel: Fix use of inherit
perf hists browser: Fix seg fault when annotate null symbol
Correctable errors are considered something rather normal on
modern hardware these days. Even more importantly, correctable
errors mean exactly that - they've been corrected by the
hardware - and there's no need to taint the kernel since
execution hasn't been compromised so far.
Also, drop tainting in the thermal throttling code for a similar
reason: crossing a thermal threshold does not mean corruption.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Acked-by: Nagananda Chumbalkar <Nagananda.Chumbalkar@hp.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1303135222-17118-1-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
End users worry about the error interrupt printout we generate
currently:
pr_debug("APIC error on CPU%d: %02x(%02x)\n",
smp_processor_id(), v , v1);
... and would like to know the reason why error interrupts are generated.
This patch prints out more detailed debug information.
Another practical problem is that dynamic debug is not initialized yet
when the APIC initializes, so the pr_debug() will not output the error
interrupt debug information on bootup. In this patch, we use
apic_printk(APIC_DEBUG, ...), so the apic=debug boot option will print
verbose error interupts during bootup.
Signed-off-by: Youquan Song <youquan.song@intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: hpa@linux.intel.com
Cc: suresh.b.siddha@intel.com
Cc: yong.y.wang@linux.intel.com
Cc: jbaron@redhat.com
Cc: trenn@suse.de
Cc: kent.liu@intel.com
Cc: chaohong.guo@intel.com
Link: http://lkml.kernel.org/r/1302762968-24380-2-git-send-email-youquan.song@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
an extra pointer chase and data cache hit.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1302913676-14352-4-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Depending on the unit mask settings some FPU events may be scheduled
only on cpu counter #3. This patch fixes this.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@googlemail.com>
Link: http://lkml.kernel.org/r/1302913676-14352-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With AMD cpu family 15h a unit mask was introduced for the Data Cache
Miss event (0x041/L1-dcache-load-misses). We need to enable bit 0
(first data cache miss or streaming store to a 64 B cache line) of
this mask to proper count data cache misses.
Now we set this bit for all families and models. In case a PMU does
not implement a unit mask for event 0x041 the bit is ignored.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1302913676-14352-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Due to a decoder implementation quirk, some specific Intel CPUs
actually perform better with the "k8_nops" than with the
SDM-recommended NOPs. For runtime-selected NOPs, if we detect those
specific CPUs then use the k8_nops instead of the ones we would
normally use.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1303166160-10315-4-git-send-email-hpa@linux.intel.com
Clean up and unify the NOP selection infrastructure:
- Make the atomic 5-byte NOP a part of the selection system.
- Pick NOPs once during early boot and then be done with it.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1303166160-10315-3-git-send-email-hpa@linux.intel.com
This patch changes the allocation of the GART aperture to
enforce only natural alignment instead of aligning it on
512MB. This big alignment was used to force the GART
aperture to be over 512MB. This is enforced by using 512MB
as the lower-bound address in the allocation range.
[ hpa: The actual number 512 MiB needs to be revisited, too. ]
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1303134346-5805-2-git-send-email-joerg.roedel@amd.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The GART can only map physical memory below 1TB. Make sure
the gart driver in the kernel does not try to map memory
above 1TB.
Cc: <stable@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1303134346-5805-5-git-send-email-joerg.roedel@amd.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The DISTLBWALKPRB bit must be set for the GART because the
gatt table is mapped UC. But the current code does not set
the bit at boot when the BIOS setup the aperture correctly.
Fix that by setting this bit when enabling the GART instead
of the other places.
Cc: <stable@kernel.org>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/1303134346-5805-4-git-send-email-joerg.roedel@amd.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
the BIOS forgets to do is (or is just too old). Letting
these errors enabled can cause a sync-flood on the CPU
causing a reboot.
The AMD BKDG recommends disabling GART TLB Wlk Error completely.
This patch is the fix for
https://bugzilla.kernel.org/show_bug.cgi?id=33012
on my machine.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Link: http://lkml.kernel.org/r/20110415131152.GJ18463@8bytes.org
Tested-by: Alexandre Demers <alexandre.f.demers@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Currently, numa=fake boot parameter is broken. If it's used,
kernel may panic due to devide by zero error depending on CPU
configuration
Call Trace:
[<ffffffff8104ad4c>] find_busiest_group+0x38c/0xd30
[<ffffffff81086aff>] ? local_clock+0x6f/0x80
[<ffffffff81050533>] load_balance+0xa3/0x600
[<ffffffff81050f53>] idle_balance+0xf3/0x180
[<ffffffff81550092>] schedule+0x722/0x7d0
[<ffffffff81550538>] ? wait_for_common+0x128/0x190
[<ffffffff81550a65>] schedule_timeout+0x265/0x320
[<ffffffff81095815>] ? lock_release_holdtime+0x35/0x1a0
[<ffffffff81550538>] ? wait_for_common+0x128/0x190
[<ffffffff8109bb6c>] ? __lock_release+0x9c/0x1d0
[<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff815534e0>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff81550540>] wait_for_common+0x130/0x190
[<ffffffff81051920>] ? try_to_wake_up+0x510/0x510
[<ffffffff8155067d>] wait_for_completion+0x1d/0x20
[<ffffffff8107f36c>] kthread_create_on_node+0xac/0x150
[<ffffffff81077bb0>] ? process_scheduled_works+0x40/0x40
[<ffffffff8155045f>] ? wait_for_common+0x4f/0x190
[<ffffffff8107a283>] __alloc_workqueue_key+0x1a3/0x590
[<ffffffff81e0cce2>] cpuset_init_smp+0x6b/0x7b
[<ffffffff81df3d07>] kernel_init+0xc3/0x182
[<ffffffff8155d5e4>] kernel_thread_helper+0x4/0x10
[<ffffffff81553cd4>] ? retint_restore_args+0x13/0x13
[<ffffffff81df3c44>] ? start_kernel+0x400/0x400
[<ffffffff8155d5e0>] ? gs_change+0x13/0x13
The divede by zero is caused by the following line,
group->cpu_power==0:
kernel/sched_fair.c::update_sg_lb_stats()
/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
This regression was caused by commit e23bba6044 ("x86-64, NUMA: Unify
emulated distance mapping") because it changes cpu -> node
mapping in the process of dropping fake_physnodes().
old) all cpus are assinged node 0
now) cpus are assigned round robin
(the logic is implemented by numa_init_array())
Note: The change in behavior only happens if the system doesn't
have neither ACPI SRAT table nor AMD northbridge NUMA
information.
Round robin assignment doesn't work because init_numa_sched_groups_power()
assumes all logical cpus in the same physical cpu share the same node
(then it only accounts for group_first_cpu()), and the simple round robin
breaks the above assumption.
Thus, this patch implements a reassignment of node-ids if buggy firmware
or numa emulation makes wrong cpu node map. Tt enforce all logical cpus
in the same physical cpu share the same node.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shaohui Zheng <shaohui.zheng@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20110415203928.1303.A69D9226@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For future rework of try_to_wake_up() we'd like to push part of that
function onto the CPU the task is actually going to run on.
In order to do so we need a generic callback from the existing scheduler IPI.
This patch introduces such a generic callback: scheduler_ipi() and
implements it as a NOP.
BenH notes: PowerPC might use this IPI on offline CPUs under rare conditions!
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110405152728.744338123@chello.nl
This patch adds detection of the extended features of an
AMD IOMMU. The available features are printed to dmesg on
boot.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch adds the necessary code to the AMD IOMMU driver
for enabling and disabling the ATS capability on a device
and to setup the IOMMU data structures correctly.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch adds a flag to the AMD IOMMU driver to indicate
that all IOMMUs present in the system support device IOTLBs.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch implements a function to flush the IOTLB on
devices supporting ATS and makes sure that this TLB is also
flushed if necessary.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Currently the option resides under X86_EXTENDED_PLATFORM due to historical
nonstandard A20M# handling. However that is no longer the case and so Elan can
be treated as part of the standard processor choice Kconfig option.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Link: http://lkml.kernel.org/r/1302245177.31620.47.camel@localhost.localdomain
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32, fpu: Fix FPU exception handling on non-SSE systems
x86, hibernate: Initialize mmu_cr4_features during boot
x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
x86: visws: Fixup irq overhaul fallout
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Clean up rebalance_domains() load-balance interval calculation
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
genirq: Fix cpumask leak in __setup_irq()
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf probe: Fix listing incorrect line number with inline function
perf probe: Fix to find recursively inlined function
perf probe: Fix multiple --vars options behavior
perf probe: Fix to remove redundant close
perf probe: Fix to ensure function declared file
The old code only flushed a DTE or a domain TLB before it is
actually used by the IOMMU driver. While this is efficient
and works when done right it is more likely to introduce new
bugs when changing code (which happened in the past).
This patch adds code to flush all DTEs and all domain TLBs
in each IOMMU right after it is enabled (at boot and after
resume). This reduces the complexity of the driver and makes
it less likely to introduce stale-TLB bugs in the future.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This function operates on a struct device, so give it a name
that represents that. As a side effect a new function is
introduced which operates on am iommu and a device-id. It
will be used again in a later patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch improved the handling of commands when the IOMMU
command buffer is nearly full. In this case it issues an
completion wait command and waits until the IOMMU has
processed it before continuing queuing new commands.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
These functions all operate on protection domains and not on
singe IOMMUs. Represent that in their name.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The logic to reset the command buffer caused more problems
than it actually helped. The logic jumped in when the IOMMU
hardware doesn't execute commands anymore but the reasons
for this are usually not fixed by just resetting the command
buffer. So the code can be removed to reduce complexity.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch cleans up the implementation of completion-wait
command sending. It also switches the completion indicator
from the MMIO bit to a memory store which can be checked
without IOMMU locking.
As a side effect this patch makes the __iommu_queue_command
function obsolete and so it is removed too.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Only pgdat and memmap use remap area and there isn't much benefit in
allowing per-node override. In addition, the use of node_remap_size[]
is confusing in that it contains number of bytes before remap
initialization and then number of pages afterwards.
Move remap size calculation for memap from specific NUMA config
implementations to init_alloc_remap() and make node_remap_size[]
static.
The only behavior difference is that, before this patch, numaq_32
didn't consider max_pfn when calculating the memmap size but it's
enforced after this patch, which is the right thing to do.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/1301955840-7246-8-git-send-email-tj@kernel.org
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Restore the initialization of mmu_cr4_features during boot, which was
removed without comment in checkin e5f15b45dd
x86: Cleanup highmap after brk is concluded
thereby breaking resume from hibernate. This restores previous
functionality in approximately the same place, and corrects the
reading of %cr4 on pre-CPUID hardware (%cr4 exists if and only if
CPUID is supported.)
However, part of the problem is that the hibernate suspend/resume
sequence should manage the save/restore of %cr4 explicitly.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <201104020154.57136.rjw@sisk.pl>
This patch reworks the processing of invalidate-pages
commands to the IOMMU. The function building the the command
is extended so we can get rid of another function. It was
also renamed to match with the other function names.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
We have a never ending stream of 'reboot quirks' for new boxes
that will not reboot properly under Linux (they will hang on
reboot).
The reason is widespread 'Windows compatible' assumption of modern
x86 hardware, which expects the following reboot sequence:
- hitting the ACPI reboot vector (if available)
- trying the keyboard controller
- hitting the ACPI reboot vector again
- then giving the keyboard controller one last go
This sequence expectation gets more and more embedded in modern
hardware, which often lacks a keyboard controller and may even
lock up if the legacy io ports are hit - and which hardware is
often not tested with Linux during development.
The end result is that reboot works under Windows-alike OSs but not
under Linux.
Rework our reboot process to meet this hardware externality a little
better and match this assumption of newer x86 hardware.
In addition to the ACPI,kbd,ACPI,kbd sequence we'll still fall
through to attempting a legacy triple fault if nothing else
works - and keep trying that and the kbd reset.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
[ this commit will also save special casing Oaktrail boards ]
Acked-by: Alan Cox <alan@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Leann Ogasawara <leann.ogasawara@canonical.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Len Brown <len.brown@intel.com>
LKML-Reference: <1301939705-2404-1-git-send-email-mjg@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce:
static __always_inline bool static_branch(struct jump_label_key *key);
instead of the old JUMP_LABEL(key, label) macro.
In this way, jump labels become really easy to use:
Define:
struct jump_label_key jump_key;
Can be used as:
if (static_branch(&jump_key))
do unlikely code
enable/disale via:
jump_label_inc(&jump_key);
jump_label_dec(&jump_key);
that's it!
For the jump labels disabled case, the static_branch() becomes an
atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(),
atomic_dec() operations. We show testing results for this change below.
Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct.
Since we now require a 'struct jump_label_key *key', we can store a pointer into
the jump table addresses. In this way, we can enable/disable jump labels, in
basically constant time. This change allows us to completely remove the previous
hashtable scheme. Thanks to Peter Zijlstra for this re-write.
Testing:
I ran a series of 'tbench 20' runs 5 times (with reboots) for 3
configurations, where tracepoints were disabled.
jump label configured in
avg: 815.6
jump label *not* configured in (using atomic reads)
avg: 800.1
jump label *not* configured in (regular reads)
avg: 803.4
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110316212947.GA8792@redhat.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Suggested-by: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: David Daney <ddaney@caviumnetworks.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, UV: Fix kdump reboot
x86, amd-nb: Rename CPU PCI id define for F4
sound: Add delay.h to sound/soc/codecs/sn95031.c
x86, mtrr, pat: Fix one cpu getting out of sync during resume
x86, microcode: Unregister syscore_ops after microcode unloaded
x86: Stop including <linux/delay.h> in two asm header files
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: create new rcu_access_index() and use in mce
WARN_ON_SMP(): Add comment to explain ({0;})
check_slot() is only called from replace_intsrc_all() - which is
in the .init section.
So, put check_slot into the .init section as well, so it can be freed
after system boot.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
LKML-Reference: <AANLkTing52ntzRcHkODCWDKOfRF=0uhXw5-cCUhx6M54@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The MCE subsystem needs to sample an RCU-protected index outside of
any protection for that index. If this was a pointer, we would use
rcu_access_pointer(), but there is no corresponding rcu_access_index().
This commit therefore creates an rcu_access_index() and applies it
to MCE.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Zdenek Kabelac <zkabelac@redhat.com>
After a crash dump on an SGI Altix UV system the crash kernel
fails to cause a reboot. EFI mode is disabled in the kdump
kernel, so only the reboot_type of BOOT_ACPI works.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: rja@sgi.com
LKML-Reference: <E1Q5Iuo-00013b-UK@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With increasing number of PCI function ids, add the PCI function
id in the define name instead of its symbolic name in the BKDG
for more clarity. This renames function 4 define.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
LKML-Reference: <20110330183447.GA3668@aftab>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On laptops with core i5/i7, there were reports that after resume
graphics workloads were performing poorly on a specific AP, while
the other cpu's were ok. This was observed on a 32bit kernel
specifically.
Debug showed that the PAT init was not happening on that AP
during resume and hence it contributing to the poor workload
performance on that cpu.
On this system, resume flow looked like this:
1. BP starts the resume sequence and we reinit BP's MTRR's/PAT
early on using mtrr_bp_restore()
2. Resume sequence brings all AP's online
3. Resume sequence now kicks off the MTRR reinit on all the AP's.
4. For some reason, between point 2 and 3, we moved from BP
to one of the AP's. My guess is that printk() during resume
sequence is contributing to this. We don't see similar
behavior with the 64bit kernel but there is no guarantee that
at this point the remaining resume sequence (after AP's bringup)
has to happen on BP.
5. set_mtrr() was assuming that we are still on BP and skipped the
MTRR/PAT init on that cpu (because of 1 above)
6. But we were on an AP and this led to not reprogramming PAT
on this cpu leading to bad performance.
Fix this by doing unconditional mtrr_if->set_all() in set_mtrr()
during MTRR/PAT init. This might be unnecessary if we are still
running on BP. But it is of no harm and will guarantee that after
resume, all the cpu's will be in sync with respect to the
MTRR/PAT registers.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <1301438292-28370-1-git-send-email-eric@anholt.net>
Signed-off-by: Eric Anholt <eric@anholt.net>
Tested-by: Keith Packard <keithp@keithp.com>
Cc: stable@kernel.org [v2.6.32+]
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The lonely user of the internal interface was not in the coccinelle
script.
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently, microcode doesn't unregister syscore_ops after it's
unloaded. So if we modprobe then rmmod microcode, the stale
microcode syscore_ops info will stay on syscore_ops_list.
Later when we're trying to reboot/halt/shutdown the machine, kernel
will panic on syscore_shutdown().
With the patch applied, I can reboot/halt/shutdown my machine successfully.
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
LKML-Reference: <1301387672-23661-1-git-send-email-dfeng@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It is more effective to use a segment prefix instead of calculating the
address of the current cpu area amd then testing flags.
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add this_cpu_has() which determines if the current cpu has a certain
ability using a segment prefix and a bit test operation.
For that we need to add bit operations to x86s percpu.h.
Many uses of cpu_has use a pointer passed to a function to determine
the current flags. That is no longer necessary after this patch.
However, this patch only converts the straightforward cases where
cpu_has is used with this_cpu_ptr. The rest is work for later.
-tj: Rolled up patch to add x86_ prefix and use percpu_read() instead
of percpu_read_stable().
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Stop including <linux/delay.h> in x86 header files which don't
need it. This will let the compiler complain when this header is
not included by source files when it should, so that
contributors can fix the problem before building on other
architectures starts to fail.
Credits go to Geert for the idea.
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: James E.J. Bottomley <James.Bottomley@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
LKML-Reference: <20110325152014.297890ec@endymion.delvare>
[ this also fixes an upstream build bug in drivers/media/rc/ite-cir.c ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
No change on the functional level, just align the table properly.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Lin Ming <ming.m.lin@intel.com>
LKML-Reference: <4D8FA213.5050108@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'syscore' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
Introduce ARCH_NO_SYSDEV_OPS config option (v2)
cpufreq: Use syscore_ops for boot CPU suspend/resume (v2)
KVM: Use syscore_ops instead of sysdev class and sysdev
PCI / Intel IOMMU: Use syscore_ops instead of sysdev class and sysdev
timekeeping: Use syscore_ops instead of sysdev class and sysdev
x86: Use syscore_ops instead of sysdev classes and sysdevs
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
kdb: add usage string of 'per_cpu' command
kgdb,x86_64: fix compile warning found with sparse
kdb: code cleanup to use macro instead of value
kgdboc,kgdbts: strlen() doesn't count the terminator
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, x86: Complain louder about BIOSen corrupting CPU/PMU state and continue
perf, x86: P4 PMU - Read proper MSR register to catch unflagged overflows
perf symbols: Look at .dynsym again if .symtab not found
perf build-id: Add quirk to deal with perf.data file format breakage
perf session: Pass evsel in event_ops->sample()
perf: Better fit max unprivileged mlock pages for tools needs
perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()
perf top: Fix uninitialized 'counter' variable
tracing: Fix set_ftrace_filter probe function display
perf, x86: Fix Intel fixed counters base initialization
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futex: Fix WARN_ON() test for UP
WARN_ON_SMP(): Allow use in if() statements on UP
x86, dumpstack: Use %pB format specifier for stack trace
vsprintf: Introduce %pB format specifier
lockdep: Remove unused 'factor' variable from lockdep_stats_show()
Fix sparse warning:
arch/x86/kernel/kgdb.c:123:9: warning: switch with no cases
Reported-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Eric Dumazet reported that hardware PMU events do not work on his
system, due to the BIOS corrupting PMU state:
Performance Events: PEBS fmt0+, Core2 events, Broken BIOS detected, using software events only.
[Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 186 is 43003c)
Linus suggested that we continue in the face of such BIOS-induced CPU
state corruption:
http://lkml.org/lkml/2011/3/24/608
Such BIOSes will have to be fixed - Linux developers rely on a working and
fully capable PMU and the BIOS interfering with the CPU's PMU state is simply
not acceptable.
So this patch changes perf to continue when it detects such BIOS
interaction, some hardware events may be unreliable due to the BIOS
writing and re-writing them - there's not much the kernel can do
about that but to detect the corruption and report it.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
That call escaped the name space cleanup. Fix it up.
We really want to call there. The chip might have changed since the
irq was setup initially. So let the core code and the chip decide what
to do. The status is just an unreliable snapshot.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
The xlate() function returns 0 or a negative error code. Returning the
error code blindly will be seen as an huge irq number by the calling
function because irq_create_of_mapping() returns an unsigned value.
Return 0 (NO_IRQ) as required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
The read of a proper MSR register was missed and instead of
counter the configration register was tested (it has
ARCH_P4_UNFLAGGED_BIT always cleared) leading to unknown NMI
hitting the system. As result the user may obtain "Dazed and
confused, but trying to continue" message. Fix it by reading a
proper MSR register.
When an NMI happens on a P4, the perf nmi handler checks the
configuration register to see if the overflow bit is set or not
before taking appropriate action. Unfortunately, various P4
machines had a broken overflow bit, so a backup mechanism was
implemented. This mechanism checked to see if the counter
rolled over or not.
A previous commit that implemented this backup mechanism was
broken. Instead of reading the counter register, it used the
configuration register to determine if the counter rolled over
or not. Reading that bit would give incorrect results.
This would lead to 'Dazed and confused' messages for the end
user when using the perf tool (or if the nmi watchdog is
running).
The fix is to read the counter register before determining if
the counter rolled over or not.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Lin Ming <ming.m.lin@intel.com>
LKML-Reference: <4D8BAB49.3080701@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Percpu allocator honors alignment request upto PAGE_SIZE and both the
percpu addresses in the percpu address space and the translated kernel
addresses should be aligned accordingly. The calculation of the
former depends on the alignment of percpu output section in the kernel
image.
The linker script macros PERCPU_VADDR() and PERCPU() are used to
define this output section and the latter takes @align parameter.
Several architectures are using @align smaller than PAGE_SIZE breaking
percpu memory alignment.
This patch removes @align parameter from PERCPU(), renames it to
PERCPU_SECTION() and makes it always align to PAGE_SIZE. While at it,
add PCPU_SETUP_BUG_ON() checks such that alignment problems are
reliably detected and remove percpu alignment comment recently added
in workqueue.c as the condition would trigger BUG way before reaching
there.
For um, this patch raises the alignment of percpu area. As the area
is in .init, there shouldn't be any noticeable difference.
This problem was discovered by David Howells while debugging boot
failure on mn10300.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: uclinux-dist-devel@blackfin.uclinux.org
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (42 commits)
ACPI: minor printk format change in acpi_pad
ACPI: make acpi_pad /sys output more readable
ACPICA: Update version to 20110316
ACPICA: Header support for SLIC table
ACPI: Make sure the FADT is at least rev 2 before using the reset register
ACPI: Bug compatibility for Windows on the ACPI reboot vector
ACPICA: Fix access width for reset vector
ACPI battery: fribble sysfs files from a resume notifier
ACPI button: remove unused procfs I/F
ACPI, APEI, Add PCIe AER error information printing support
PCIe, AER, use pre-generated prefix in error information printing
ACPI, APEI, Add ERST record ID cache
ACPI: Use syscore_ops instead of sysdev class and sysdev
ACPI: Remove the unused EC sysdev class
ACPI: use __cpuinit for the acpi_processor_set_pdc() call tree
ACPI: use __init where possible in processor driver
Thermal_Framework-Fix_crash_during_hwmon_unregister
ACPICA: Update version to 20110211.
ACPICA: Add mechanism to defer _REG methods for some installed handlers
ACPICA: Add support for FunctionalFixedHW in acpi_ut_get_region_name
...
'ret' isn't used by check_slot(), gets initialized but has no real use,
so remove it.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
LKML-Reference: <AANLkTikh3y+is3xixKBdyHhr_cHxzPFJF729Fcvt8+Ns@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
deal with races in /proc/*/{syscall,stack,personality}
proc: enable writing to /proc/pid/mem
proc: make check_mem_permission() return an mm_struct on success
proc: hold cred_guard_mutex in check_mem_permission()
proc: disable mem_write after exec
mm: implement access_remote_vm
mm: factor out main logic of access_process_vm
mm: use mm_struct to resolve gate vma's in __get_user_pages
mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
mm: arch: make in_gate_area take an mm_struct instead of a task_struct
mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
x86: mark associated mm when running a task in 32 bit compatibility mode
x86: add context tag to mark mm when running a task in 32-bit compatibility mode
auxv: require the target to be tracable (or yourself)
close race in /proc/*/environ
report errors in /proc/*/*map* sanely
pagemap: close races with suid execve
make sessionid permissions in /proc/*/task/* match those in /proc/*
fix leaks in path_lookupat()
Fix up trivial conflicts in fs/proc/base.c
The Xen PV drivers in a crashed HVM guest can not connect to the dom0
backend drivers because both frontend and backend drivers are still in
connected state. To run the connection reset function only in case of a
crashdump, the is_kdump_kernel() function needs to be available for the PV
driver modules.
Consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn into
kernel/crash_dump.c Also export elfcorehdr_addr to make is_kdump_kernel()
usable for modules.
Leave 'elfcorehdr' as early_param(). This changes powerpc from __setup()
to early_param(). It adds an address range check from x86 also on ia64
and powerpc.
[akpm@linux-foundation.org: additional #includes]
[akpm@linux-foundation.org: remove elfcorehdr_addr export]
[akpm@linux-foundation.org: fix for Tejun's mm/nobootmem.c changes]
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some subsystems in the x86 tree need to carry out suspend/resume and
shutdown operations with one CPU on-line and interrupts disabled and
they define sysdev classes and sysdevs or sysdev drivers for this
purpose. This leads to unnecessarily complicated code and excessive
memory usage, so switch them to using struct syscore_ops objects for
this purpose instead.
Generally, there are three categories of subsystems that use
sysdevs for implementing PM operations: (1) subsystems whose
suspend/resume callbacks ignore their arguments entirely (the
majority), (2) subsystems whose suspend/resume callbacks use their
struct sys_device argument, but don't really need to do that,
because they can be implemented differently in an arguably simpler
way (io_apic.c), and (3) subsystems whose suspend/resume callbacks
use their struct sys_device argument, but the value of that argument
is always the same and could be ignored (microcode_core.c). In all
of these cases the subsystems in question may be readily converted to
using struct syscore_ops objects for power management and shutdown.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
This patch simply follows the same practice as for setting the TIF_IA32 flag.
In particular, an mm is marked as holding 32-bit tasks when a 32-bit binary is
exec'ed. Both ELF and a.out formats are updated.
Signed-off-by: Stephen Wilson <wilsons@start.ca>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
8237A utilizes the interface provided by CONFIG_ISA_DMA_API, specifically
claim_dma_lock() and release_dma_lock(). Thus, there's a strict
dependency on the config option and the module should only be loaded if
the kernel supports ISA-style DMA.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oops=panic cmdline option is not x86 specific, move it to generic code.
Update documentation.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
xen: update mask_rw_pte after kernel page tables init changes
xen: set max_pfn_mapped to the last pfn mapped
x86: Cleanup highmap after brk is concluded
Fix up trivial onflict (added header file includes) in
arch/x86/mm/init_64.c
When CONFIG_X86_MPPARSE=y and CONFIG_X86_IO_APIC=n, then we get
the following warning:
arch/x86/kernel/mpparse.c:723: warning: 'check_slot' defined but not used
So, put check_slot into CONFIG_X86_IO_APIC context. Its only
called from CONFIG_X86_IO_APIC=y context.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
LKML-Reference: <AANLkTinsUfGc=NG_GeH_B+zFVu+DXJzZbJKdQLscqfuH@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
APEI ERST firmware interface and implementation has no multiple users
in mind. For example, if there is four records in storage with ID: 1,
2, 3 and 4, if two ERST readers enumerate the records via
GET_NEXT_RECORD_ID as follow,
reader 1 reader 2
1
2
3
4
-1
-1
where -1 signals there is no more record ID.
Reader 1 has no chance to check record 2 and 4, while reader 2 has no
chance to check record 1 and 3. And any other GET_NEXT_RECORD_ID will
return -1, that is, other readers will has no chance to check any
record even they are not cleared by anyone.
This makes raw GET_NEXT_RECORD_ID not suitable for used by multiple
users.
To solve the issue, an in-memory ERST record ID cache is designed and
implemented. When enumerating record ID, the ID returned by
GET_NEXT_RECORD_ID is added into cache in addition to be returned to
caller. So other readers can check the cache to get all record ID
available.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
It is frequently useful to sync a single file system, instead of all
mounted file systems via sync(2):
- On machines with many mounts, it is not at all uncommon for some of
them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on
those and may never get to the one you do care about (e.g., /).
- Some applications write lots of data to the file system and then
want to make sure it is flushed to disk. Calling fsync(2) on each
file introduces unnecessary ordering constraints that result in a large
amount of sub-optimal writeback/flush/commit behavior by the file
system.
There are currently two ways (that I know of) to sync a single super_block:
- BLKFLSBUF ioctl on the block device: That also invalidates the bdev
mapping, which isn't usually desirable, and doesn't work for non-block
file systems.
- 'mount -o remount,rw' will call sync_filesystem as an artifact of the
current implemention. Relying on this little-known side effect for
something like data safety sounds foolish.
Both of these approaches require root privileges, which some applications
do not have (nor should they need?) given that sync(2) is an unprivileged
operation.
This patch introduces a new system call syncfs(2) that takes an fd and
syncs only the file system it references. Maybe someday we can
$ sync /some/path
and not get
sync: ignoring all arguments
The syscall is motivated by comments by Al and Christoph at the last LSF.
syncfs(2) seems like an appropriate name given statfs(2).
A similar ioctl was also proposed a while back, see
http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now cleanup_highmap actually is in two steps: one is early in head64.c
and only clears above _end; a second one is in init_memory_mapping() and
tries to clean from _brk_end to _end.
It should check if those boundaries are PMD_SIZE aligned but currently
does not.
Also init_memory_mapping() is called several times for numa or memory
hotplug, so we really should not handle initial kernel mappings there.
This patch moves cleanup_highmap() down after _brk_end is settled so
we can do everything in one step.
Also we honor max_pfn_mapped in the implementation of cleanup_highmap.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
LKML-Reference: <alpine.DEB.2.00.1103171739050.3382@kaball-desktop>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The following patch solves the problems introduced by Robert's
commit 41bf498 and reported by Arun Sharma. This commit gets rid
of the base + index notation for reading and writing PMU msrs.
The problem is that for fixed counters, the new calculation for
the base did not take into account the fixed counter indexes,
thus all fixed counters were read/written from fixed counter 0.
Although all fixed counters share the same config MSR, they each
have their own counter register.
Without:
$ task -e unhalted_core_cycles -e instructions_retired -e baclears noploop 1 noploop for 1 seconds
242202299 unhalted_core_cycles (0.00% scaling, ena=1000790892, run=1000790892)
2389685946 instructions_retired (0.00% scaling, ena=1000790892, run=1000790892)
49473 baclears (0.00% scaling, ena=1000790892, run=1000790892)
With:
$ task -e unhalted_core_cycles -e instructions_retired -e baclears noploop 1 noploop for 1 seconds
2392703238 unhalted_core_cycles (0.00% scaling, ena=1000840809, run=1000840809)
2389793744 instructions_retired (0.00% scaling, ena=1000840809, run=1000840809)
47863 baclears (0.00% scaling, ena=1000840809, run=1000840809)
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ming.m.lin@intel.com
Cc: robert.richter@amd.com
Cc: asharma@fb.com
Cc: perfmon2-devel@lists.sf.net
LKML-Reference: <20110319172005.GB4978@quad>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Flush TLB if PGD entry is changed in i386 PAE mode
x86, dumpstack: Correct stack dump info when frame pointer is available
x86: Clean up csum-copy_64.S a bit
x86: Fix common misspellings
x86: Fix misspelling and align params
x86: Use PentiumPro-optimized partial_csum() on VIA C7
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
trace, filters: Initialize the match variable in process_ops() properly
trace, documentation: Fix branch profiling location in debugfs
oprofile, s390: Cleanups
oprofile, s390: Remove hwsampler_files.c and merge it into init.c
perf: Fix tear-down of inherited group events
perf: Reorder & optimize perf_event_context to remove alignment padding on 64 bit builds
perf: Handle stopped state with tracepoints
perf: Fix the software events state check
perf, powerpc: Handle events that raise an exception without overflowing
perf, x86: Use INTEL_*_CONSTRAINT() for all PEBS event constraints
perf, x86: Clean up SandyBridge PEBS events
perf lock: Fix sorting by wait_min
perf tools: Version incorrect with some versions of grep
perf evlist: New command to list the names of events present in a perf.data file
perf script: Add support for H/W and S/W events
perf script: Add support for dumping symbols
perf script: Support custom field selection for output
perf script: Move printing of 'common' data from print_event and rename
perf tracing: Remove print_graph_cpu and print_graph_proc from trace-event-parse
perf script: Change process_event prototype
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
Current stack dump code scans entire stack and check each entry
contains a pointer to kernel code. If CONFIG_FRAME_POINTER=y it
could mark whether the pointer is valid or not based on value of
the frame pointer. Invalid entries could be preceded by '?' sign.
However this was not going to happen because scan start point
was always higher than the frame pointer so that they could not
meet.
Commit 9c0729dc80 ("x86: Eliminate bp argument from the stack
tracing routines") delayed bp acquisition point, so the bp was
read in lower frame, thus all of the entries were marked
invalid.
This patch fixes this by reverting above commit while retaining
stack_frame() helper as suggested by Frederic Weisbecker.
End result looks like below:
before:
[ 3.508329] Call Trace:
[ 3.508551] [<ffffffff814f35c9>] ? panic+0x91/0x199
[ 3.508662] [<ffffffff814f3739>] ? printk+0x68/0x6a
[ 3.508770] [<ffffffff81a981b2>] ? mount_block_root+0x257/0x26e
[ 3.508876] [<ffffffff81a9821f>] ? mount_root+0x56/0x5a
[ 3.508975] [<ffffffff81a98393>] ? prepare_namespace+0x170/0x1a9
[ 3.509216] [<ffffffff81a9772b>] ? kernel_init+0x1d2/0x1e2
[ 3.509335] [<ffffffff81003894>] ? kernel_thread_helper+0x4/0x10
[ 3.509442] [<ffffffff814f6880>] ? restore_args+0x0/0x30
[ 3.509542] [<ffffffff81a97559>] ? kernel_init+0x0/0x1e2
[ 3.509641] [<ffffffff81003890>] ? kernel_thread_helper+0x0/0x10
after:
[ 3.522991] Call Trace:
[ 3.523351] [<ffffffff814f35b9>] panic+0x91/0x199
[ 3.523468] [<ffffffff814f3729>] ? printk+0x68/0x6a
[ 3.523576] [<ffffffff81a981b2>] mount_block_root+0x257/0x26e
[ 3.523681] [<ffffffff81a9821f>] mount_root+0x56/0x5a
[ 3.523780] [<ffffffff81a98393>] prepare_namespace+0x170/0x1a9
[ 3.523885] [<ffffffff81a9772b>] kernel_init+0x1d2/0x1e2
[ 3.523987] [<ffffffff81003894>] kernel_thread_helper+0x4/0x10
[ 3.524228] [<ffffffff814f6880>] ? restore_args+0x0/0x30
[ 3.524345] [<ffffffff81a97559>] ? kernel_init+0x0/0x1e2
[ 3.524445] [<ffffffff81003890>] ? kernel_thread_helper+0x0/0x10
-v5:
* fix build breakage with oprofile
-v4:
* use 0 instead of regs->bp
* separate out printk changes
-v3:
* apply comment from Frederic
* add a couple of printk fixes
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Soren Sandmann <ssp@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1300416006-3163-1-git-send-email-namhyung@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
They were generated by 'codespell' and then manually reviewed.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates/2.6.39' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (55 commits)
KVM: unbreak userspace that does not sets tss address
KVM: MMU: cleanup pte write path
KVM: MMU: introduce a common function to get no-dirty-logged slot
KVM: fix rcu usage in init_rmode_* functions
KVM: fix kvmclock regression due to missing clock update
KVM: emulator: Fix permission checking in io permission bitmap
KVM: emulator: Fix io permission checking for 64bit guest
KVM: SVM: Load %gs earlier if CONFIG_X86_32_LAZY_GS=n
KVM: x86: Remove useless regs_page pointer from kvm_lapic
KVM: improve comment on rcu use in irqfd_deassign
KVM: MMU: remove unused macros
KVM: MMU: cleanup page alloc and free
KVM: MMU: do not record gfn in kvm_mmu_pte_write
KVM: MMU: move mmu pages calculated out of mmu lock
KVM: MMU: set spte accessed bit properly
KVM: MMU: fix kvm_mmu_slot_remove_write_access dropping intermediate W bits
KVM: Start lock documentation
KVM: better readability of efer_reserved_bits
KVM: Clear async page fault hash after switching to real mode
KVM: VMX: Initialize vm86 TSS only once.
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (38 commits)
amd64_edac: Fix decode_syndrome types
amd64_edac: Fix DCT argument type
amd64_edac: Fix ranges signedness
amd64_edac: Drop local variable
amd64_edac: Fix PCI config addressing types
amd64_edac: Fix DRAM base macros
amd64_edac: Fix node id signedness
amd64_edac: Drop redundant declarations
amd64_edac: Enable driver on F15h
amd64_edac: Adjust ECC symbol size to F15h
amd64_edac: Simplify scrubrate setting
PCI: Rename CPU PCI id define
amd64_edac: Improve DRAM address mapping
amd64_edac: Sanitize ->read_dram_ctl_register
amd64_edac: Adjust sys_addr to chip select conversion routine to F15h
amd64_edac: Beef up early exit reporting
amd64_edac: Revamp online spare handling
amd64_edac: Fix channel interleave removal
amd64_edac: Correct node interleaving removal
amd64_edac: Add support for interleaved region swapping
...
Fix up trivial conflict in include/linux/pci_ids.h due to
AMD_15H_NB_MISC being renamed as AMD_15H_NB_F3 next to the new
AMD_15H_NB_LINK entry.
WARNING: arch/x86/built-in.o(.text+0x1bb74): Section mismatch in reference from the function kvm_guest_cpu_online() to the function .cpuinit.text:kvm_guest_cpu_init()
The function kvm_guest_cpu_online() references
the function __cpuinit kvm_guest_cpu_init().
This is often because kvm_guest_cpu_online lacks a __cpuinit
annotation or the annotation of kvm_guest_cpu_init is wrong.
This patch fixes the warning.
Tested with linux-next (next-20101231)
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
With increasing number of PCI function ids, add the PCI function id
in the define name instead of its symbolic name in the BKDG for more
clarity.
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Remove a couple of assigment statements that appear twice.
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Signed-off-by: Dave Jones <davej@redhat.com>
* 'x86-trampoline-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix binutils-2.21 symbol related build failures
x86-64, trampoline: Remove unused variable
x86, reboot: Fix the use of passed arguments in 32-bit BIOS reboot
x86, reboot: Move the real-mode reboot code to an assembly file
x86: Make the GDT_ENTRY() macro in <asm/segment.h> safe for assembly
x86, trampoline: Use the unified trampoline setup for ACPI wakeup
x86, trampoline: Common infrastructure for low memory trampolines
Fix up trivial conflicts in arch/x86/kernel/Makefile
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: (21 commits)
PM / Hibernate: Reduce autotuned default image size
PM / Core: Introduce struct syscore_ops for core subsystems PM
PM QoS: Make pm_qos settings readable
PM / OPP: opp_find_freq_exact() documentation fix
PM: Documentation/power/states.txt: fix repetition
PM: Make system-wide PM and runtime PM treat subsystems consistently
PM: Simplify kernel/power/Kconfig
PM: Add support for device power domains
PM: Drop pm_flags that is not necessary
PM: Allow pm_runtime_suspend() to succeed during system suspend
PM: Clean up PM_TRACE dependencies and drop unnecessary Kconfig option
PM: Remove CONFIG_PM_OPS
PM: Reorder power management Kconfig options
PM: Make CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME)
PM / ACPI: Remove references to pm_flags from bus.c
PM: Do not create wakeup sysfs files for devices that cannot wake up
USB / Hub: Do not call device_set_wakeup_capable() under spinlock
PM: Use appropriate printk() priority level in trace.c
PM / Wakeup: Don't update events_check_enabled in pm_get_wakeup_count()
PM / Wakeup: Make pm_save_wakeup_count() work as documented
...
* 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu, x86: Add arch-specific this_cpu_cmpxchg_double() support
percpu: Generic support for this_cpu_cmpxchg_double()
alpha: use L1_CACHE_BYTES for cacheline size in the linker script
percpu: align percpu readmostly subsection to cacheline
Fix up trivial conflict in arch/x86/kernel/vmlinux.lds.S due to the
percpu alignment having changed ("x86: Reduce back the alignment of the
per-CPU data section")
PEBS_EVENT_CONSTRAINT() is just a duplicate of INTEL_UEVENT_CONSTRAINT().
Remove it and use INTEL_UEVENT_CONSTRAINT() instead.
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299684089-22835-3-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use INTEL_EVENT_CONSTRAINT() for the events where all umasks support PEBS.
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299684089-22835-2-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Support for Always Running APIC timer (ARAT) was introduced in
commit db954b5898. This feature
allows us to avoid switching timers from LAPIC to something else
(e.g. HPET) and go into timer broadcasts when entering deep
C-states.
AMD processors don't provide a CPUID bit for that feature but
they also keep APIC timers running in deep C-states (except for
cases when the processor is affected by erratum 400). Therefore
we should set ARAT feature bit on AMD CPUs.
Tested-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: Mark Langsdorf <mark.langsdorf@amd.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com>
LKML-Reference: <1300205624-4813-1-git-send-email-ostr@amd64.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 7f74f8f28a
(x86 quirk: Fix polarity for IRQ0 pin2 override on SB800
systems) introduced a regression. It removed some SB600 specific
code to determine the revision ID without adapting a
corresponding revision ID check for SB600.
See this mail thread:
http://marc.info/?l=linux-kernel&m=129980296006380&w=2
This patch adapts the corresponding check to cover all SB600
revisions.
Tested-by: Wang Lei <f3d27b@gmail.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org # 38.x, 37.x, 32.x
LKML-Reference: <20110315143137.GD29499@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
x86: Clean up apic.c and apic.h
x86: Remove superflous goal definition of tsc_sync
x86: dt: Correct local apic documentation in device tree bindings
x86: dt: Cleanup local apic setup
x86: dt: Fix OLPC=y/INTEL_CE=n build
rtc: cmos: Add OF bindings
x86: ce4100: Use OF to setup devices
x86: ioapic: Add OF bindings for IO_APIC
x86: dtb: Add generic bus probe
x86: dtb: Add support for PCI devices backed by dtb nodes
x86: dtb: Add device tree support for HPET
x86: dtb: Add early parsing of IO_APIC
x86: dtb: Add irq domain abstraction
x86: dtb: Add a device tree for CE4100
x86: Add device tree support
x86: e820: Remove conditional early mapping in parse_e820_ext
x86: OLPC: Make OLPC=n build again
x86: OLPC: Remove extra OLPC_OPENFIRMWARE_DT indirection
x86: OLPC: Cleanup config maze completely
x86: OLPC: Hide OLPC_OPENFIRMWARE config switch
...
Fix up conflicts in arch/x86/platform/ce4100/ce4100.c
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (93 commits)
x86, tlb, UV: Do small micro-optimization for native_flush_tlb_others()
x86-64, NUMA: Don't call numa_set_distanc() for all possible node combinations during emulation
x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation()
x86-64, NUMA: Clean up initmem_init()
x86-64, NUMA: Fix numa_emulation code with node0 without RAM
x86-64, NUMA: Revert NUMA affine page table allocation
x86: Work around old gas bug
x86-64, NUMA: Better explain numa_distance handling
x86-64, NUMA: Fix distance table handling
mm: Move early_node_map[] reverse scan helpers under HAVE_MEMBLOCK
x86-64, NUMA: Fix size of numa_distance array
x86: Rename e820_table_* to pgt_buf_*
bootmem: Move __alloc_memory_core_early() to nobootmem.c
bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c
bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c
x86-64, NUMA: Seperate out numa_alloc_distance() from numa_set_distance()
x86-64, NUMA: Add proper function comments to global functions
x86-64, NUMA: Move NUMA emulation into numa_emulation.c
x86-64, NUMA: Prepare numa_emulation() for moving NUMA emulation into a separate file
x86-64, NUMA: Do not scan two times for setup_node_bootmem()
...
Fix up conflicts in arch/x86/kernel/smpboot.c
* 'x86-mem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64, mem: Convert memmove() to assembly file and fix return value bug
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (116 commits)
x86: Enable forced interrupt threading support
x86: Mark low level interrupts IRQF_NO_THREAD
x86: Use generic show_interrupts
x86: ioapic: Avoid redundant lookup of irq_cfg
x86: ioapic: Use new move_irq functions
x86: Use the proper accessors in fixup_irqs()
x86: ioapic: Use irq_data->state
x86: ioapic: Simplify irq chip and handler setup
x86: Cleanup the genirq name space
genirq: Add chip flag to force mask on suspend
genirq: Add desc->irq_data accessor
genirq: Add comments to Kconfig switches
genirq: Fixup fasteoi handler for oneshot mode
genirq: Provide forced interrupt threading
sched: Switch wait_task_inactive to schedule_hrtimeout()
genirq: Add IRQF_NO_THREAD
genirq: Allow shared oneshot interrupts
genirq: Prepare the handling of shared oneshot interrupts
genirq: Make warning in handle_percpu_event useful
x86: ioapic: Move trigger defines to io_apic.h
...
Fix up trivial(?) conflicts in arch/x86/pci/xen.c due to genirq name
space changes clashing with the Xen cleanups. The set_irq_msi() had
moved to xen_bind_pirq_msi_to_irq().
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Combine printk()s in show_regs_common()
x86: Don't call dump_stack() from arch_trigger_all_cpu_backtrace_handler()
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix and clean up generic_processor_info()
x86: Don't copy per_cpu cpuinfo for BSP two times
x86: Move llc_shared_map out of cpu_info
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, binutils, xen: Fix another wrong size directive
x86: Remove dead config option X86_CPU
x86: Really print supported CPUs if PROCESSOR_SELECT=y
x86: Fix a bogus unwind annotation in lib/semaphore_32.S
um, x86-64: Fix UML build after adding CFI annotations to lib/rwsem_64.S
x86: Remove unused bits from lib/thunk_*.S
x86: Use {push,pop}_cfi in more places
x86-64: Add CFI annotations to lib/rwsem_64.S
x86, asm: Cleanup unnecssary macros in asm-offsets.c
x86, system.h: Drop unused __SAVE/__RESTORE macros
x86: Use bitmap library functions
x86: Partly unify asm-offsets_{32,64}.c
x86: Reduce back the alignment of the per-CPU data section
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (62 commits)
posix-clocks: Check write permissions in posix syscalls
hrtimer: Remove empty hrtimer_init_hres_timer()
hrtimer: Update hrtimer->state documentation
hrtimer: Update base[CLOCK_BOOTTIME].offset correctly
timers: Export CLOCK_BOOTTIME via the posix timers interface
timers: Add CLOCK_BOOTTIME hrtimer base
time: Extend get_xtime_and_monotonic_offset() to also return sleep
time: Introduce get_monotonic_boottime and ktime_get_boottime
hrtimers: extend hrtimer base code to handle more then 2 clockids
ntp: Remove redundant and incorrect parameter check
mn10300: Switch do_timer() to xtimer_update()
posix clocks: Introduce dynamic clocks
posix-timers: Cleanup namespace
posix-timers: Add support for fd based clocks
x86: Add clock_adjtime for x86
posix-timers: Introduce a syscall for clock tuning.
time: Splitout compat timex accessors
ntp: Add ADJ_SETOFFSET mode bit
time: Introduce timekeeping_inject_offset
posix-timer: Update comment
...
Fix up new system-call-related conflicts in
arch/x86/ia32/ia32entry.S
arch/x86/include/asm/unistd_32.h
arch/x86/include/asm/unistd_64.h
arch/x86/kernel/syscall_table_32.S
(name_to_handle_at()/open_by_handle_at() vs clock_adjtime()), and some
due to movement of get_jiffies_64() in:
kernel/time.c
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (184 commits)
perf probe: Clean up probe_point_lazy_walker() return value
tracing: Fix irqoff selftest expanding max buffer
tracing: Align 4 byte ints together in struct tracer
tracing: Export trace_set_clr_event()
tracing: Explain about unstable clock on resume with ring buffer warning
ftrace/graph: Trace function entry before updating index
ftrace: Add .ref.text as one of the safe areas to trace
tracing: Adjust conditional expression latency formatting.
tracing: Fix event alignment: skb:kfree_skb
tracing: Fix event alignment: mce:mce_record
tracing: Fix event alignment: kvm:kvm_hv_hypercall
tracing: Fix event alignment: module:module_request
tracing: Fix event alignment: ftrace:context_switch and ftrace:wakeup
tracing: Remove lock_depth from event entry
perf header: Stop using 'self'
perf session: Use evlist/evsel for managing perf.data attributes
perf top: Don't let events to eat up whole header line
perf top: Fix events overflow in top command
ring-buffer: Remove unused #include <linux/trace_irq.h>
tracing: Add an 'overwrite' trace_option.
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits)
tidy the trailing symlinks traversal up
Turn resolution of trailing symlinks iterative everywhere
simplify link_path_walk() tail
Make trailing symlink resolution in path_lookupat() iterative
update nd->inode in __do_follow_link() instead of after do_follow_link()
pull handling of one pathname component into a helper
fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH
Allow passing O_PATH descriptors via SCM_RIGHTS datagrams
readlinkat(), fchownat() and fstatat() with empty relative pathnames
Allow O_PATH for symlinks
New kind of open files - "location only".
ext4: Copy fs UUID to superblock
ext3: Copy fs UUID to superblock.
vfs: Export file system uuid via /proc/<pid>/mountinfo
unistd.h: Add new syscalls numbers to asm-generic
x86: Add new syscalls for x86_64
x86: Add new syscalls for x86_32
fs: Remove i_nlink check from file system link callback
fs: Don't allow to create hardlink for deleted file
vfs: Add open by file handle support
...
The isci driver needs to retrieve its preboot OROM image which contains
necessary runtime parameters like platform specific sas addresses and
phy configuration. There is no ROM BAR associated with this area,
instead we will need to scan legacy expansion ROM space.
1/ Promote the probe_roms_32 implementation to x86-64
2/ Add a facility to find and map an adapter rom by pci device (according to
PCI Firmware Specification Revision 3.0)
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
LKML-Reference: <20110308183226.6246.90354.stgit@localhost6.localdomain6>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Intel Archiecture Software Developer's Manual section 7.1.3 specifies that a
core serializing instruction such as "cpuid" should be executed on _each_ core
before the new instruction is made visible.
Failure to do so can lead to unspecified behavior (Intel XMC erratas include
General Protection Fault in the list), so we should avoid this at all cost.
This problem can affect modified code executed by interrupt handlers after
interrupt are re-enabled at the end of stop_machine, because no core serializing
instruction is executed between the code modification and the moment interrupts
are reenabled.
Because stop_machine_text_poke performs the text modification from the first CPU
decrementing stop_machine_first, modified code executed in thread context is
also affected by this problem. To explain why, we have to split the CPUs in two
categories: the CPU that initiates the text modification (calls text_poke_smp)
and all the others. The scheduler, executed on all other CPUs after
stop_machine, issues an "iret" core serializing instruction, and therefore
handles core serialization for all these CPUs. However, the text modification
initiator can continue its execution on the same thread and access the modified
text without any scheduler call. Given that the CPU that initiates the code
modification is not guaranteed to be the one actually performing the code
modification, it falls into the XMC errata.
Q: Isn't this executed from an IPI handler, which will return with IRET (a
serializing instruction) anyway?
A: No, now stop_machine uses per-cpu workqueue, so that handler will be
executed from worker threads. There is no iret anymore.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20110303160137.GB1590@Krystal>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: <stable@kernel.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch adds new syscalls to x86_32
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The variable pm_flags is used to prevent APM from being enabled
along with ACPI, which would lead to problems. However, acpi_init()
is always called before apm_init() and after acpi_init() has
returned, it is known whether or not ACPI will be used. Namely, if
acpi_disabled is not set after acpi_init() has returned, this means
that ACPI is enabled. Thus, it is sufficient to check acpi_disabled
in apm_init() to prevent APM from being enabled in parallel with
ACPI.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Len Brown <len.brown@intel.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: ce4100: Set pci ops via callback instead of module init
x86/mm: Fix pgd_lock deadlock
x86/mm: Handle mm_fault_error() in kernel space
x86: Don't check for BIOS corruption in first 64K when there's no need to
The caller of ioapic_register_intr() has a pointer to the irq_cfg for
the irq already. Hand it in to avoid a full lookup.
In msi_compose_msg() the pointer to irq_cfg is already available. No
need to look it up again.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the functions which take irq_data. We already have a pointer to
irq_data. That avoids a sparse irq lookup in move_*_irq.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use the state information in irq_data. That avoids a radix-tree lookup
from apic_ack_level() and simplifies setup_ioapic_dest().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
genirq is switching to a consistent name space for the irq related
functions. Convert x86. Conversion was done with coccinelle.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The latest binutils (2.21.0.20110302/Ubuntu) breaks the build
yet another time, under CONFIG_XEN=y due to a .size directive that
refers to a slightly differently named (hence, to the now very
strict and unforgiving assembler, non-existent) symbol.
[ mingo:
This unnecessary build breakage caused by new binutils
version 2.21 gets escallated back several kernel releases spanning
several years of Linux history, affecting over 130,000 upstream
kernel commits (!), on CONFIG_XEN=y 64-bit kernels (i.e. essentially
affecting all major Linux distro kernel configs).
Git annotate tells us that this slight debug symbol code mismatch
bug has been introduced in 2008 in commit 3d75e1b8:
3d75e1b8 (Jeremy Fitzhardinge 2008-07-08 15:06:49 -0700 1231) ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
The 'bug' is just a slight assymetry in ENTRY()/END()
debug-symbols sequences, with lots of assembly code between the
ENTRY() and the END():
ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
...
END(do_hypervisor_callback)
Human reviewers almost never catch such small mismatches, and binutils
never even warned about it either.
This new binutils version thus breaks the Xen build on all upstream kernels
since v2.6.27, out of the blue.
This makes a straightforward Git bisection of all 64-bit Xen-enabled kernels
impossible on such binutils, for a bisection window of over hundred
thousand historic commits. (!)
This is a major fail on the side of binutils and binutils needs to turn
this show-stopper build failure into a warning ASAP. ]
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <kees.cook@canonical.com>
LKML-Reference: <1299877178-26063-1-git-send-email-heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch moves some functions and variables into init
sections, makes a function static and removes some lines of
cruft.
Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <1299826956-8607-2-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The extra tsc_sync.o goal definition is superflous.
CONFIG_X86_64_SMP depends on CONFIG_SMP
and tsc_sync.o is already in the definition of CONFIG_SMP.
Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <1299826956-8607-1-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently the index to the ret_stack is updated and the real return address
is saved in the ret_stack. Then we call the trace function. The trace
function could decide that it doesn't want to trace this function
(ex. set_graph_function does not match) and it will return 0 which means
not to trace this call.
The normal function graph tracer has this code:
if (!(trace->depth || ftrace_graph_addr(trace->func)) ||
ftrace_graph_ignore_irqs())
return 0;
What this states is, if the trace depth (which is curr_ret_stack)
is zero (top of nested functions) then test if we want to trace this
function. If this function is not to be traced, then return 0 and
the rest of the function graph tracer logic will not trace this function.
The problem arises when an interrupt comes in after we updated the
curr_ret_stack. The next function that gets called will have a trace->depth
of 1. Which fools this trace code into thinking that we are in a nested
function, and that we should trace. This causes interrupts to be traced
when they should not be.
The solution is to trace the function first and then update the ret_stack.
Reported-by: zhiping zhong <xzhong86@163.com>
Reported-by: wu zhangjin <wuzhangjin@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Return 0 on failure. This will cause the initialization of the driver
to fail and prevent the driver from loading if the BIOS cannot handle
the PCC interface command to "get frequency". Otherwise, the driver
will load and display a very high value like "4294967274" (which is
actually -EINVAL) for frequency:
# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
4294967274
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
CC: stable@kernel.org
Signed-off-by: Dave Jones <davej@redhat.com>
Due to commit 781c5a67f1 it is
likely that the number of areas to scan for BIOS corruption is 0
-- especially when the first 64K is already reserved
(X86_RESERVE_LOW is 64K by default).
If that's the case then don't set up the scan.
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: <stable@kernel.org>
LKML-Reference: <20110225202838.2229.71011.sendpatchset@nchumbalkar.americas.hpqcorp.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
New binutils version 2.21.0.20110302-1 started checking that the symbol
parameter to the .size directive matches the entry name's
symbol parameter, unearthing two mismatches:
AS arch/x86/kernel/acpi/wakeup_rm.o
arch/x86/kernel/acpi/wakeup_rm.S: Assembler messages:
arch/x86/kernel/acpi/wakeup_rm.S:12: Error: .size expression with symbol `wakeup_code_start' does not evaluate to a constant
arch/x86/kernel/entry_32.S: Assembler messages:
arch/x86/kernel/entry_32.S:1421: Error: .size expression with
symbol `apf_page_fault' does not evaluate to a constant
The problem was discovered while using Debian's binutils
(2.21.0.20110302-1) and experimenting with binutils from
upstream.
Thanks Alexander and H.J. for the vital help.
Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
LKML-Reference: <1299620364-21644-1-git-send-email-sedat.dilek@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
You can crash the kernel (with root/admin privileges) using kprobe tracer by running:
echo "p system_call_after_swapgs" > ./kprobe_events
echo 1 > ./events/kprobes/enable
The reason is that at the system_call_after_swapgs label, the
kernel stack is not set up. If optimized kprobes are enabled,
the user space stack is being used in this case (see optimized
kprobe template) and this might result in a crash.
There are several places like this over the entry code
(entry_$BIT). As it seems there's no any reasonable/maintainable
way to disable only those places where the stack is not ready, I
switched off the whole entry code from kprobe optimizing.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: acme@redhat.com
Cc: fweisbec@gmail.com
Cc: ananth@in.ibm.com
Cc: davem@davemloft.net
Cc: a.p.zijlstra@chello.nl
Cc: eric.dumazet@gmail.com
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <1298298313-5980-3-git-send-email-jolsa@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Put x86 entry code into a separate link section: .entry.text.
Separating the entry text section seems to have performance
benefits - caused by more efficient instruction cache usage.
Running hackbench with perf stat --repeat showed that the change
compresses the icache footprint. The icache load miss rate went
down by about 15%:
before patch:
19417627 L1-icache-load-misses ( +- 0.147% )
after patch:
16490788 L1-icache-load-misses ( +- 0.180% )
The motivation of the patch was to fix a particular kprobes
bug that relates to the entry text section, the performance
advantage was discovered accidentally.
Whole perf output follows:
- results for current tip tree:
Performance counter stats for './hackbench/hackbench 10' (500 runs):
19417627 L1-icache-load-misses ( +- 0.147% )
2676914223 instructions # 0.497 IPC ( +- 0.079% )
5389516026 cycles ( +- 0.144% )
0.206267711 seconds time elapsed ( +- 0.138% )
- results for current tip tree with the patch applied:
Performance counter stats for './hackbench/hackbench 10' (500 runs):
16490788 L1-icache-load-misses ( +- 0.180% )
2717734941 instructions # 0.502 IPC ( +- 0.079% )
5414756975 cycles ( +- 0.148% )
0.206747566 seconds time elapsed ( +- 0.137% )
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: masami.hiramatsu.pt@hitachi.com
Cc: ananth@in.ibm.com
Cc: davem@davemloft.net
Cc: 2nddept-manager@sdl.hitachi.co.jp
LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I'm sure it was a mere oversight that the CONFIG_ prefixes are
missing.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Dave Jones <davej@redhat.com>
LKML-Reference: <4D7118D30200007800034F79@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299119690-13991-5-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Intel Nehalem and Westmere CPUs the generic perf LLC-* events count the
L2 caches, not the real L3 LLC - this was inconsistent with behavior on
other CPUs.
Fixing this requires the use of the special OFFCORE_RESPONSE
events which need a separate mask register.
This has been implemented by the previous patch, now use this infrastructure
to set correct events for the LLC-* on Nehalem and Westmere.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299119690-13991-3-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change logs against Andi's original version:
- Extends perf_event_attr:config to config{,1,2} (Peter Zijlstra)
- Fixed a major event scheduling issue. There cannot be a ref++ on an
event that has already done ref++ once and without calling
put_constraint() in between. (Stephane Eranian)
- Use thread_cpumask for percore allocation. (Lin Ming)
- Use MSR names in the extra reg lists. (Lin Ming)
- Remove redundant "c = NULL" in intel_percore_constraints
- Fix comment of perf_event_attr::config1
Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event
that can be used to monitor any offcore accesses from a core.
This is a very useful event for various tunings, and it's
also needed to implement the generic LLC-* events correctly.
Unfortunately this event requires programming a mask in a separate
register. And worse this separate register is per core, not per
CPU thread.
This patch:
- Teaches perf_events that OFFCORE_RESPONSE needs extra parameters.
The extra parameters are passed by user space in the
perf_event_attr::config1 field.
- Adds support to the Intel perf_event core to schedule per
core resources. This adds fairly generic infrastructure that
can be also used for other per core resources.
The basic code has is patterned after the similar AMD northbridge
constraints code.
Thanks to Stephane Eranian who pointed out some problems
in the original version and suggested improvements.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1299119690-13991-2-git-send-email-ming.m.lin@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch updates PEBS event constraints for Intel Atom, Nehalem, Westmere.
This patch also reorganizes the PEBS format/constraint detection code. It is
now based on processor model and not PEBS format. Two processors may use the
same PEBS format without have the same list of PEBS events.
In this second version, we simplified the initialization of the PEBS
constraints by leveraging the existing switch() statement in perf_event_intel.c.
We also renamed the constraint tables to be more consistent with regular
constraints.
In this 3rd version, we drop BR_INST_RETIRED.MISPRED from Intel Atom as it does
not seem to work. Use MISPREDICTED_BRANCH_RETIRED instead. Also add FP_ASSIST.*
o both Intel Nehalem and Westmere. I misssed those in the earlier patches.
Events were tested using libpfm4 perf_examples.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4d6e6b02.815bdf0a.637b.07a7@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch reverts NUMA affine page table allocation added by commit
1411e0ec31 (x86-64, numa: Put pgtable to local node memory).
The commit made an undocumented change where the kernel linear mapping
strictly follows intersection of e820 memory map and NUMA
configuration. If the physical memory configuration has holes or NUMA
nodes are not properly aligned, this leads to using unnecessarily
smaller mapping size which leads to increased TLB pressure. For
details,
http://thread.gmane.org/gmane.linux.kernel/1104672
Patches to fix the problem have been proposed but the underlying code
needs more cleanup and the approach itself seems a bit heavy handed
and it has been determined to revert the feature for now and come back
to it in the next developement cycle.
http://thread.gmane.org/gmane.linux.kernel/1105959
As init_memory_mapping_high() callsites have been consolidated since
the commit, reverting is done manually. Also, the RED-PEN comment in
arch/x86/mm/init.c is not restored as the problem no longer exists
with memblock based top-down early memory allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Make functions used strictly in bool context return bool. Also,
fixup used types and comments, and make a local function static,
while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Borislav Petkov <bp@amd64.org>
LKML-Reference: <20110303115932.GA8603@aftab>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds basic SandyBridge support, including hardware
cache events and PEBS events support.
It has been tested on SandyBridge CPUs with perf stat and also
with PEBS based profiling - both work fine.
The patch does not affect other models.
v2 -> v3:
- fix PEBS event 0xd0 with right umask combinations
- move snb pebs constraint assignment to intel_pmu_init
v1 -> v2:
- add more raw and PEBS events constraints
- use offcore events for LLC-* cache events
- remove the call to Nehalem workaround enable_all function
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <1299072424.2175.24.camel@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do the notifier registration later, so we don't have to worry
about freeing it if we fail the msr allocation.
Signed-off-by: Dave Jones <davej@redhat.com>
It appears that when powernow-k8 finds that
No compatible ACPI _PSS objects found.
and suggests
Try again with latest BIOS.
it fails the module load, but does not unregister the cpu_notifier that was
registered in powernowk8_init
This ends up leaving freed memory on the cpu notifier list for some other
poor module (e.g. md/raid5) to come along and trip over.
The following might be a partial fix, but I suspect there is probably other
clean-up that is needed.
( https://bugzilla.novell.com/show_bug.cgi?id=655215 has full dmesg traces).
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Cleaning up and shortening code...
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
LKML-Reference: <4D6BD35002000078000341DA@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
PAGE_SIZE_asm, PAGE_SHIFT_asm, THREAD_SIZE_asm can be safely removed from
asm-offsets.c, and be replaced by their non-'_asm' counterparts in the code
that uses them, since the _AC macro defined in include/linux/const.h makes
PAGE_SIZE/PAGE_SHIFT/THREAD_SIZE work with as.
Signed-off-by: Stratos Psomadakis <psomas@cslab.ece.ntua.gr>
LKML-Reference: <1298666774-17646-2-git-send-email-psomas@cslab.ece.ntua.gr>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86 quirk: Fix polarity for IRQ0 pin2 override on SB800 systems
x86/mrst: Fix apb timer rating when lapic timer is used
x86: Fix reboot problem on VersaLogic Menlow boards
Up to now we force enable the local apic in the devicetree setup
uncoditionally and set smp_found_config unconditionally to 1 when a
devicetree blob is available. This breaks, when local apic is disabled
in the Kconfig.
Make it consistent by initializing device tree explicitely before
smp_get_config() so a non lapic configuration could be used as well.
To be functional that would require to implement PIT as an interrupt
host, but the only user of this code until now is ce4100 which
requires apics to be available. So we leave this up to those who need
it.
Tested-by: Sebastian Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On some SB800 systems polarity for IOAPIC pin2 is wrongly
specified as low active by BIOS. This caused system hangs after
resume from S3 when HPET was used in one-shot mode on such
systems because a timer interrupt was missed (HPET signal is
high active).
For more details see:
http://marc.info/?l=linux-kernel&m=129623757413868
Tested-by: Manoj Iyer <manoj.iyer@canonical.com>
Tested-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org # 37.x, 32.x
LKML-Reference: <20110224145346.GD3658@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The function do_suspend_lowlevel() is specific to x86 and defined in
assembly code, so it should be called from the x86 low-level suspend
code rather than from acpi_suspend_enter().
Merge do_suspend_lowlevel() into the x86's acpi_save_state_mem() and
change the name of the latter to acpi_suspend_lowlevel(), so that the
function's purpose is better reflected by its name.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Both OLPC and CE4100 activate CONFIG_OF. OLPC uses PROMTREE while CE
uses FLATTREE. Compiling for OLPC only breaks due to missing flat tree
functions and variables.
Use proper wrappers and provide an empty x86_flattree_get_config()
inline so OF=y FLATTREE=n builds and works.
[ tglx: Make it work with HPET_TIMER=n and make a function static ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Need to adjust the clockevent device rating for the structure
that will be registered with clockevent system instead of the
temporary structure.
Without this fix, APB timer rating will be higher than LAPIC
timer such that it can not be released later to be used as the
broadcast timer.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
LKML-Reference: <1298506046-439-1-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This allows to load the OF driver based informations from the device
tree. Systems without BIOS may need to perform some initialization.
PowerPC creates a PNP device from the OF information and performs this
kind of initialization in their private PCI quirk. This looks more
generic.
This patch also avoids registering the platform RTC driver on X86 if
we have a device tree blob. Otherwise we would setup the device based
on the hardcoded information in arch/x86 rather than the device tree
based one.
[ tglx: Changed "int of_have_populated_dt()" to bool as recommended by
Grant ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
Cc: rtc-linux@googlegroups.com
Cc: Alessandro Zummo <a.zummo@towertech.it>
LKML-Reference: <1298405266-1624-12-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ioapic_xlate provides a translation from the information in device tree
to ioapic related informations. This includes
- obtaining hw irq which is the vector number "=> pin number + gsi"
- obtaining type (level/edge/..)
- programming this information into ioapic
ioapic_add_ofnode adds an irq_domain based on informations from the device
tree. This information (irq_domain) is required in order to map a device to
its proper interrupt controller.
[ tglx: Adapted to the io_apic changes, which let us move that whole code
to devicetree.c ]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-10-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For now we probe these busses and we change this to board dependent
probes once we have to.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-9-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
x86_of_pci_init() does two things:
- it provides a generic irq enable and disable function. enable queries
the device tree for the interrupt information, calls ->xlate on the
irq host and updates the pci->irq information for the device.
- it walks through PCI bus(es) in the device tree and adds its children
(device) nodes to appropriate pci_dev nodes in kernel. So the dtb
node information is available at probe time of the PCI device.
Adding a PCI bus based on the information in the device tree is
currently not supported. Right now direct access via ioports is used.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-8-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Set hpet_address based on information provied form DTB
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Dirk Brandewie <dirk.brandewie@gmail.com>
LKML-Reference: <1298405266-1624-7-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
APIC and IO_APIC have to be added to the system early because
native_init_IRQ() requires it.
In order to obtain the address of the ioapic the device tree has to be
unflattened so of_address_to_resource() works.
The device tree is relocated to ensure it is always covered by the
kernel mapping. That way the boot loader does not have to make
any assumptions about kernel's memory layout.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
Cc: Dirk Brandewie <dirk.brandewie@gmail.com>
LKML-Reference: <1298405266-1624-6-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The here introduced irq_domain abstraction represents a generic irq
controller. It is a subset of powerpc's irq_host which is going to be
renamed to irq_domain and then become generic. This implementation will
be removed once it is generic.
The xlate callback is resposible to parse irq informations like irq type
and number and returns the hardware irq number which is reported by the
hardware as active.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-5-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds minimal support for device tree on x86. The device
tree blob is passed to the kernel via setup_data which requires at
least boot protocol 2.09.
Memory size, restricted memory regions, boot arguments are gathered
the traditional way so things like cmd_line are just here to let the
code compile.
The current plan is use the device tree as an extension and to gather
information which can not be enumerated and would have to be hardcoded
otherwise. This includes things like
- which devices are on this I2C/SPI bus?
- how are the interrupts wired to IO APIC?
- where could my hpet be?
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-3-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch ensures that the memory passed from parse_setup_data() is
large enough to cover the complete data structure. That means that the
conditional mapping in parse_e820_ext() can go.
While here, I also attempt not to map two pages if the address is not
aligned to a page boundary.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Cc: sodaville@linutronix.de
Cc: devicetree-discuss@lists.ozlabs.org
LKML-Reference: <1298405266-1624-2-git-send-email-bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
io_apic_set_pci_routing() and mp_save_irq() check the pin_programmed
bit before calling io_apic_setup_irq_pin() and set the bit when the
pin was setup.
Move that duplicated code into a separate function and use it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no point to have irq_trigger() and irq_polarity() as wrappers
around the MPBIOS_* camel case functions. Get rid of both the inlines
and the ugly camel case.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The only difference here is that we did not call
__add_pin_to_irq_node() for the legacy irqs, but that's not worth 30
lines of extra code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove the duplicated code and call the function. It does not matter
whether we allocated the cfg before calling setup_local_APIC() and we
can set the irq chip and handler after that as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There are about four places in the ioapic code which do exactly the
same setup sequence. Also the OF based ioapic setup needs that
function to avoid putting the OF specific code into ioapic.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Two consecutive
for(...)
for(...)
lines to avoid an extra indentation are just horrible to read. I had
to look more than once to figure out what the code is doing.
Split out the inner loop into a separate function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is debug code and it does not matter at all whether we print each
not connected pin in an extra line or try to be extra clever.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch adds IOAPIC dummy functions for compilation
with local APIC, but without IOAPIC.
The local variable ioapic_entries in enable_IR_x2apic()
does not need initialization anymore, since the dummy
returns NULL.
Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
LKML-Reference: <1298385487-4708-4-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently arch_disable_smp_support() on x86 disables only the
support for the IOAPIC and is also compiled in if SMP-support is
not.
Therefore this function is renamed to disable_ioapic_support(),
which meets its purpose and is only compiled in the kernel
when IOAPIC support is also.
A new arch_disable_smp_support() is created in smpboot.c,
which calls disable_ioapic_support() and gets only compiled
in the kernel when SMP support is also.
Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
LKML-Reference: <1298385487-4708-3-git-send-email-henne@nachtwindheim.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Neither CONFIG_OLPC_OPENFIRMWARE nor CONFIG_OLPC_OPENFIRMWARE_DT are
really necessary.
OLPC selects OLPC_OPENFIRMWARE unconditionally, so move the "select
OF" part under OLPC config option and fixup the dependencies in
Makefiles and code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andres Salomon <dilinger@queued.net>
Reason: Import mainline device tree changes on which further patches
depend on or conflict.
Trivial conflict in: drivers/spi/pxa2xx_spi_pci.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This converts the remaining x86 clocksources to use
clocksource_register_hz/khz.
CC: jacob.jun.pan@intel.com
CC: Glauber Costa <glommer@redhat.com>
CC: Dimitri Sivanich <sivanich@sgi.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Jeremy Fitzhardinge <jeremy@xensource.com>
CC: Chris McDermott <lcm@us.ibm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
VersaLogic Menlow based boards hang on reboot unless reboot=bios
is used. Add quirk to reboot through the BIOS.
Tested on at least four boards.
Signed-off-by: Kushal Koolwal <kushalkoolwal@gmail.com>
LKML-Reference: <1298152563-21594-1-git-send-email-kushalkoolwal@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
install_equiv_cpu_table() returns type int. It uses negative
error codes so using an unsigned type breaks the error handling.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: open list:AMD MICROCODE UPD... <amd64-microcode@amd64.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
LKML-Reference: <20110218091716.GA4384@bicker>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Removed unused variable left over from development.
Reported-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <AANLkTik6UJ680mWJcu_W+jerLcqPjwjvaXyxB1jAMaG0@mail.gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthieu Castet <castet.matthieu@free.fr>
The initial version of this patch had %eax being a segment and %ecx
being the mode. I had changed the interfaces, but not the actual
implementation!
Reported-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <AANLkTikxqk=HEw9R-Du=v-1ti1HDGAY9vaNUep2XARaz@mail.gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthieu Castet <castet.matthieu@free.fr>