Impact: simplification
In the current code, for level triggered migration, we need to modify the
io-apic RTE with the update vector information, along with modifying interrupt
remapping table entry(IRTE) with vector and destination. This is to ensure that
remote IRR bit inthe IOAPIC RTE gets cleared when the cpu does EOI.
With this patch, for level triggered, we eliminate the io-apic RTE modification
(with the updated vector information), by using a virtual vector (io-apic pin
number). Real vector that is used for interrupting cpu will be coming from
the interrupt-remapping table entry. Trigger mode in the IRTE will always be
edge, and the actual level or edge trigger will be setup in the IO-APIC RTE.
So a level triggered interrupt will appear as an edge to the local apic
cpu but still as level to the IO-APIC.
With this change, level irq migration can be done by simply modifying
the interrupt-remapping table entry with out changing the io-apic RTE.
And as the interrupt appears as edge at the cpu, in addition to do the
local apic EOI, we need to do IO-APIC directed EOI to clear the remote
IRR bit in the IO-APIC RTE.
This simplies the irq migration in the presence of interrupt-remapping.
Idea-by: Rajesh Sankaran <rajesh.sankaran@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup, paranoia
We were not clearing the local APIC in clear_local_APIC() in the
presence of x2apic. Fix it.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: make kexec work with x2apic
disable_IO_APIC() gets called during crashdump aswell, which configures the
IO-APIC/LAPIC so that legacy interrupts can be delivered for the kexec'd kernel.
In the presence of interrupt-remapping, we need to change the
interrupt-remapping configuration aswell as modifying IO-APIC for virtual wire
B mode.
To keep things simple during the crash, use virtual wire A mode
(for which we don't need to touch io-apic and interrupt-remapping tables).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: interface augmentation (not yet used)
Enable fault handling flow for intr-remapping aswell. Fault handling
code now shared by both dma-remapping and intr-remapping.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: build fix with CONFIG_RELOCATABLE
Move _end into a dummy section, so that relocs.c will know it is a
relocatable symbol.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Impact: disambiguate real .bss variables from .brk storage
Add a .brk section after the .bss section. This has no effect
on the final vmlinux, but it more clearly distinguishes the space
taken by actual .bss symbols, and the variable space reserved
by .brk users.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Impact: Tighten bound to avoid masking errors
The definition of MAPPING_BEYOND_END was excessive; this has a nasty
tendency to mask bugs. We have learned over time that this kind of
bug hiding can cause some very strange errors. Therefore, tighten the
bound to only need to map the actual kernel area.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Impact: cleanup
ALLOCATOR_SLOP is a vestigial remain from when we used the
bootmem allocator to allocate the kernel's linear memory mapping.
Now we directly reserve pages from the e820 mapping, and no
longer require secondary structures to keep track of allocated
pages.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: crash fix
head_32.S needs to map the kernel itself, and enough space so
that mm/init.c can allocate space from the e820 allocator
for the linear map of low memory.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Don't boost at the addresses which are listed on exception tables,
because major page fault will occur on those addresses. In that case,
kprobes can not ensure that when instruction buffer can be freed since
some processes will sleep on the buffer.
kprobes-ia64 already has same check.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order for ntpd to correctly synchronize the clocks, the frequency of
the system clock must not be off by more than 500 ppm (or, put another
way, 1:2000), or ntpd will end up giving up on trying to synchronize
properly, and ends up reseting the clock in jumps instead.
The fast TSC PIT calibration sometimes failed this test - it was
assuming that the PIT reads always took about one microsecond each (2us
for the two reads to get a 16-bit timer), and that calibrating TSC to
the PIT over 15ms should thus be sufficient to get much closer than
500ppm (max 2us error on both sides giving 4us over 15ms: a 270 ppm
error value).
However, that assumption does not always hold: apparently some hardware
is either very much slower at reading the PIT registers, or there was
other noise causing at least one machine to get 700+ ppm errors.
So instead of using a fixed 15ms timing loop, this changes the fast PIT
calibration to read the TSC delta over the individual PIT timer reads,
and use the result to calculate the error bars on the PIT read timing
properly. We then successfully calibrate the TSC only if the maximum
error bars fall below 500ppm.
In the process, we also relax the timing to allow up to 25ms for the
calibration, although it can happen much faster depending on hardware.
Reported-and-tested-by: Jesper Krogh <jesper@krogh.cc>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During bootup, when we reprogram the PIT (programmable interval timer)
to start counting down from 0xffff in order to use it for the fast TSC
calibration, we should also make sure to delay a bit afterwards to allow
the PIT hardware to actually start counting with the new value.
That will happens at the next CLK pulse (1.193182 MHz), so the easiest
way to do that is to just wait at least one microsecond after
programming the new PIT counter value. We do that by just reading the
counter value back once - which will take about 2us on PC hardware.
Reported-and-tested-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: don't trim e820 according to wrong mtrr
Ozan reports that his server emits strange warning.
it turns out the BIOS sets the MTRRs incorrectly.
Ignore those strange ranges, and don't trim e820,
just emit one warning about BIOS
Reported-by: Ozan Çağlayan <ozan@pardus.org.tr>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49BEE1E7.7020706@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Bug fix on UP
Referring commit cc3ca22063,
Peter removed __cpuinit annotations for mce_cpu_features()
and its successor functions, which caused troubles on UP
configurations.
However the intel_init_cmci() was introduced after that and
it also has __cpuinit annotation even though it is called from
mce_cpu_features(). Remove the annotation from that function
too.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: help debug e820 bugs
Try to print out more info, to catch wrong call parameters.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49BCB557.3030000@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot crash
Need to exit early if the addr is far above 64k.
The crash got exposed by:
78a8b35: x86: make e820_update_range() handle small range update
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@kernel.org>
LKML-Reference: <49BC2279.2030101@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: makes vmlinux section information more useful
Don't use ram after _end blindly for pagetables. aka init pages is before _end
put those pg table into .bss
[Adapted to use brk segment - Jeremy]
v2: keep initial page table up to 512M only.
v4: put initial page tables just before _end
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: new interface; remove hard-coded limit
Add RESERVE_BRK(name, size) macro to reserve space in the brk
area. This should be a conservative (ie, larger) estimate of
how much space might possibly be required from the brk area.
Any unused space will be freed, so there's no real downside
on making the reservation too large (within limits).
The name should be unique within a given file, and somewhat
descriptive.
The C definition of RESERVE_BRK() ends up being more complex than
one would expect to work around a cluster of gcc infelicities:
The first attempt was to simply try putting __section(.brk_reservation)
on a variable. This doesn't work because it ends up making it a
@progbits section, which gets actual space allocated in the vmlinux
executable.
The second attempt was to emit the space into a section using asm,
but gcc doesn't allow arguments to be passed to file-level asm()
statements, making it hard to pass in the size.
The final attempt is to wrap the asm() in a function to allow
it to have arguments, and put the function itself into the
.discard section, which vmlinux*.lds drops entirely from the
emitted vmlinux.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: simplification
We only need to map the kernel in head_32.S, not the whole of
lowmem. We use 512MB as a reasonable (but arbitrary) limit on
the maximum size of the kernel image.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: use new interface instead of previous ad hoc implementation
Use extend_brk() to allocate memory for DMI rather than having an
ad-hoc allocator.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: use new interface instead of previous ad hoc implementation
Rather than having special purpose init_pg_table_start/end variables
to delimit the kernel pagetable built by head_32.S, just use the brk
mechanism to extend the bss for the new pagetable.
This patch removes init_pg_table_start/end and pg0, defines __brk_base
(which is page-aligned and immediately follows _end), initializes
the brk region to start there, and uses it for the 32-bit pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: build fix
The brk initialization functions were incorrectly located inside
an #ifdef CONFIG_VLK_DEV_INITRD block, causing the obvious build failure in
minimal configurations.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Impact: new interface
Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.
The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.
The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss is reserved
up to __bss_stop.
Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.
Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
Move the symbols delimiting a section part of the section
(section relative) rather than absolute. This avoids any
unexpected gaps between the section-start symbol and the first
data in the section, which could be caused by implicit
alignment of the section data. It also makes the general
form of vmlinux_64.lds.S consistent with vmlinux_32.lds.S.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
there should be no difference, except:
* the 64bit variant now also initializes the padlock unit.
* ->c_early_init() is executed again from ->c_init()
* the 64bit fixups made into 32bit path.
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: herbert@gondor.apana.org.au
LKML-Reference: <1237029843-28076-2-git-send-email-sebastian@breakpoint.cc>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: print out fewer lines
1. print continuous range with same type together
2. change _INFO to _DEBUG
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49BACB61.8000302@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix debug warning
Jaswinder noticed that there is a warning about smp_processor_id()
in get_mtrr().
Fix it by wrapping the printout into a get/put_cpu() pair.
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49BAB7FF.4030107@kernel.org>
[ changed to get/put_cpu(), cleaned up surrounding code a it. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enhance e820 code to handle more cases
Try to handle new range which could be covered by one entry.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: jbeulich@novell.com
LKML-Reference: <49B9F0C1.10402@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Complete/fix the cleanups of cpu/common.c:
- fix ugly warning due to asm/topology.h -> linux/topology.h change
- standardize the style across the file
- simplify/refactor the code flow where possible
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
LKML-Reference: <1237009789.4387.2.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix:
arch/x86/kernel/entry_32.S:446: Warning: 00000000080001d1 shortened to 00000000000001d1
arch/x86/kernel/entry_32.S:457: Warning: 000000000800feff shortened to 000000000000feff
arch/x86/kernel/entry_32.S:527: Warning: 00000000080001d1 shortened to 00000000000001d1
arch/x86/kernel/entry_32.S:541: Warning: 000000000800feff shortened to 000000000000feff
arch/x86/kernel/entry_32.S:676: Warning: 0000000008000091 shortened to 0000000000000091
TIF_SYSCALL_FTRACE is 0x08000000 and until now we checked the
first 16 bits of the work mask - bit 27 falls outside of that.
Update the entry_32.S code to check the full 32-bit mask.
[ %cx => %ecx fix from Cyrill Gorcunov <gorcunov@gmail.com> ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "H. Peter Anvin" <hpa@kernel.org>
LKML-Reference: <1237012693.18733.3.camel@ht.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
kernel/built-in.o: In function `ftrace_syscall_exit':
(.text+0x76667): undefined reference to `syscall_nr_to_meta'
ftrace.o is built:
obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
But now a CONFIG_FTRACE_SYSCALLS dependency is needed too.
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
One usage site was missed in the sizeof_field -> sizeof_ptr_field
rename.
Cc: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090313104218.A30096@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Perform a selftest of branch trace store when a cpu is initialized.
WARN and disable branch trace store support if the selftest fails.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090313104507.A30125@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: more robust DS feature enumeration
Detect the size of the pointer-type fields in the DS area
configuration via the DTES64 features rather than based on
the cpuid.
Rename a variable to denote that size to reflect that it only
covers the pointer-type fields.
Add more boot-time diagnostics giving the detected size and
the sizes of BTS and PEBS records.
Use the size of the BTS/PEBS record to indicate that the
respective feature is not available (if the record size is zero).
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090313104218.A30096@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit c2724775ce put a statement
after return, which makes that statement unreachable.
Move that statement before return.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090313075622.GB8933@hack>
Cc: <stable@kernel.org> # .29 only
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: bug fix + BIOS workaround
BIOS is expected to clear the SYSCFG[MtrrFixDramModEn] on AMD CPUs
after fixed MTRRs are configured.
Some BIOSes do not clear SYSCFG[MtrrFixDramModEn] on BP (and on APs).
This can lead to obfuscation in Linux when this bit is not cleared on
BP but cleared on APs. A consequence of this is that the saved
fixed-MTRR state (from BP) differs from the fixed-MTRRs of APs --
because RdDram/WrDram bits are read as zero when
SYSCFG[MtrrFixDramModEn] is cleared -- and Linux tries to sync
fixed-MTRR state from BP to AP. This implies that Linux sets
SYSCFG[MtrrFixDramEn] and activates those bits.
More important is that (some) systems change these bits in SMM when
ACPI is enabled. Hence it is racy if Linux modifies RdMem/WrMem bits,
too.
(1) The patch modifies an old fix from Bernhard Kaindl to get
suspend/resume working on some Acer Laptops. Bernhard's patch
tried to sync RdMem/WrMem bits of fixed MTRR registers and that
helped on those old Laptops. (Don't ask me why -- can't test it
myself). But this old problem was not the motivation for the
patch. (See http://lkml.org/lkml/2007/4/3/110)
(2) The more important effect is to fix issues on some more current systems.
On those systems Linux panics or just freezes, see
http://bugzilla.kernel.org/show_bug.cgi?id=11541
(and also duplicates of this bug:
http://bugzilla.kernel.org/show_bug.cgi?id=11737http://bugzilla.kernel.org/show_bug.cgi?id=11714)
The affected systems boot only using acpi=ht, acpi=off or
when the kernel is built with CONFIG_MTRR=n.
The acpi options prevent full enablement of ACPI. Obviously when
ACPI is enabled the BIOS/SMM modfies RdMem/WrMem bits. When
CONFIG_MTRR=y Linux also accesses and modifies those bits when it
needs to sync fixed-MTRRs across cores (Bernhard's fix, see (1)).
How do you synchronize that? You can't. As a consequence Linux
shouldn't touch those bits at all (Rationale are AMD's BKDGs which
recommend to clear the bit that makes RdMem/WrMem accessible).
This is the purpose of this patch. And (so far) this suffices to
fix (1) and (2).
I suggest not to touch RdDram/WrDram bits of fixed-MTRRs and
SYSCFG[MtrrFixDramEn] and to clear SYSCFG[MtrrFixDramModEn] as
suggested by AMD K8, and AMD family 10h/11h BKDGs.
BIOS is expected to do this anyway. This should avoid that
Linux and SMM tread on each other's toes ...
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: trenn@suse.de
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20090312163937.GH20716@alberich.amd.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide the x86 trace callbacks to trace syscalls.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix left range size on head
| commit 5c0e6f035d
| x86: fix code paths used by update_mptable
| Impact: fix crashes under Xen due to unrobust e820 code
fixes one e820 bug, but introduces another bug.
Need to update size for left range at first in case it is header.
also add __e820_add_region take more parameter.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: jbeulich@novell.com
LKML-Reference: <49B9E286.502@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
We are removing cpumask_t in favour of struct cpumask: mainly as a
marker of what code is now CONFIG_CPUMASK_OFFSTACK-safe.
The only non-trivial change here is vector_allocation_domain():
explicitly clear the mask and set the first word, rather than using
assignment.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup
It's not legal to do assignments into cpumask_var_t; they will soon be of
variable length.
So explicitly clear the mask and set the first word, rather than using
assignment.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup, remove cpumask from stack
summit_send_IPI_allbutself might as well call
default_send_IPI_mask_allbutself_logical(). Also change cpumask_t to
struct cpumask and &cpu_online_map to cpu_online_mask while here.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: reduce kernel memory usage when CONFIG_CPUMASK_OFFSTACK=y
Straightforward conversion: done for 32 and 64 bit kernels.
node_to_cpumask_map is now a cpumask_var_t array.
64-bit used to be a dynamic cpumask_t array, and 32-bit used to be a
static cpumask_t array.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup
We take the 64-bit code and use it on 32-bit as well. The new file
is called mm/numa.c.
In a minor cleanup, we use cpu_none_mask instead of declaring a local
cpu_mask_none.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: reduce kernel memory usage when CONFIG_CPUMASK_OFFSTACK=y
Simple conversion of mce_device_initialized to cpumask_var_t. We don't
check the alloc_cpumask_var() return since it's boot-time only, and
the misc_register() in that same function isn't checked.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: reduce per-cpu size for CONFIG_CPUMASK_OFFSTACK=y
In most places it's cleaner to use the accessors cpu_sibling_mask()
and cpu_core_mask() wrappers which already exist.
I couldn't avoid cleaning up the access in oprofile, either.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup, reduce memory usage for CONFIG_CPUMASK_OFFSTACK=y
I *think* every path calls check_nmi_watchdog before using the
watchdog, so that's the right place for the initialization.
If that's wrong, we'll get a nice NULL-deref with
CONFIG_CPUMASK_OFFSTACK=y, and have uncovered another bug.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup
(Thanks to Al Viro for reminding me of this, via Ingo)
CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so:
#define CPU_MASK_ALL (cpumask_t) { { ... } }
Taking the address of such a temporary is questionable at best,
unfortunately 321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added
CPU_MASK_ALL_PTR:
#define CPU_MASK_ALL_PTR (&CPU_MASK_ALL)
Which formalizes this practice. One day gcc could bite us over this
usage (though we seem to have gotten away with it so far).
So replace everywhere which used &CPU_MASK_ALL or CPU_MASK_ALL_PTR
with the modern "cpu_all_mask" (a real const struct cpumask *), and remove
CPU_MASK_ALL_PTR altogether.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Travis <travis@sgi.com>
Impact: cleanup
mtrr main.c is too big, seperate mtrr cleanup and mtrr e820 trim
code to another file.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B87C7B.80809@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: print more debug info
Keep it consistent with autodetect version.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B87C0A.4010105@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve MTRR debugging messages
There's still inefficiencies suspected with the MTRR sanitizing
code, so make sure we get all the info we need from a dmesg.
- Remove unneeded mtrr_show
(It will only printout one time by first cpu, so it is no big deal.)
- Also print out directly from get_mtrr, because it doesn't update mtrr_state.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B9BA5A.40108@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crashes under Xen due to unrobust e820 code
find_e820_area_size() must return a properly distinguishable and
out-of-bounds value when it fails, and -1UL does not meet that
criteria on i386/PAE. Additionally, callers of the function must
check against that value.
early_reserve_e820() should be prepared for the region found to be
outside of the addressable range on 32-bits.
e820_update_range_map() should not blindly update e820, but should do
all it work on the map it got a pointer passed for (which in 50% of the
cases is &e820_saved). It must also not call e820_add_region(), as that
again acts on e820 unconditionally.
The issues were found when trying to make this option work in our Xen
kernel (i.e. where some of the silent assumptions made in the code
would not hold).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B9171B.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Without apic=verbose, using the update_mptable option would result in
garbled and confusing output due to the inconsistent use of printk() vs
apic_printk().
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B914B6.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: 32/64-bit consolidation
In a first step, this allows fixing phys_addr_valid() for PAE (which
until now reported all addresses to be valid). Subsequently, this will
also allow simplifying some MTRR handling code.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B9101E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change /proc/interrupts output ABI
With the number of interrupts on large systems growing, assumptions on
the width an interrupt number requires when converted to a decimal
string turn invalid. Therefore, calculate the maximum number of digits
dynamically.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B911EB.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: debuggability and micro-optimization
Putting whatever is possible into the (final) .rodata section increases
the likelihood of catching memory corruption bugs early, and reduces
false cache line sharing.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B90961.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: mark save_paranoid as non-kprobe-able code
This appears to be necessary as the function gets called from
kprobes-unsafe exception handling stubs (i.e. which themselves
live in .kprobes.text).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B8F44F.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
These got left in needlessly when ret_from_fork got simplified.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B8F355.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move store_ldt outside the CONFIG_PARAVIRT section and
also clean up the code a bit.
Signed-off-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: saving power _very_ little
round_jiffies() round up absolute jiffies to full second.
round_jiffies_relative() round up relative jiffies to full second.
The "t->expires" is absolute jiffies. Then, round_jiffies() should be
used instead round_jiffies_relative().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: New major feature
This patch add kexec jump support for x86_64. More information about
kexec jump can be found in corresponding x86_32 support patch.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Fix corner case that cannot yet occur
image->start may be outside of 0 ~ max_pfn, for example when jumping
back to original kernel from kexeced kenrel. This patch add identity
map for pages at image->start.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Cleanup
Fix some coding style issue for kexec x86.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: micro-optimization
There's a number of variables in the sched_clock() path that are
in .data/.bss - but not marked __read_mostly. This creates the
danger of accidental false cacheline sharing with some other,
write-often variable.
So mark them __read_mostly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce:
cat /sys/kernel/debug/x86/cpu/*
for Intel and AMD processors to view / debug the state of each CPU.
By using this we can debug whole range of registers and other
cpu information for debugging purpose and monitor how things
are changing.
This can be useful for developers as well as for users.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
LKML-Reference: <1236701373.3387.4.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: code reorganization
Separate out embedding first chunk setup helper from x86 embedding
first chunk allocator and put it in mm/percpu.c. This will be used by
the default percpu first chunk allocator and possibly by other archs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup, more flexibility for first chunk init
Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto.
This restriction stemmed from implementation detail and made things a
bit less intuitive. This patch allows @dyn_size to be specified
regardless of @unit_size and swaps the positions of @dyn_size and
@unit_size so that the parameter order makes more sense (static,
reserved and dyn sizes followed by enclosing unit_size).
While at it, add @unit_size >= PCPU_MIN_UNIT_SIZE sanity check.
Signed-off-by: Tejun Heo <tj@kernel.org>
This reverts commit e088e4c9cd.
Removing the sysfs interface for p4-clockmod was flagged as a
regression in bug 12826.
Course of action:
- Find out the remaining causes of overheating, and fix them
if possible. ACPI should be doing the right thing automatically.
If it isn't, we need to fix that.
- mark p4-clockmod ui as deprecated
- try again with the removal in six months.
It's not really feasible to printk about the deprecation, because
it needs to happen at all the sysfs entry points, which means adding
a lot of strcmp("p4-clockmod".. calls to the core, which.. bleuch.
Signed-off-by: Dave Jones <davej@redhat.com>
Use and actual unsigned long bitmap instead of casting our way around.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
LKML-Reference: <1236508459.22914.3645.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup and code size reduction on 64-bit
This code is only applied to Intel Pentium and AMD K7 32-bit cpus.
Move those checks to intel_init()/amd_init() for 32-bit
so 64-bit will not build this code.
Also change to use cpu_index check to see if we need to emit warning.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B377D2.8030108@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In uv_flush_tlb_others() (arch/x86/kernel/tlb_uv.c),
the "WARN_ON(!in_atomic())" fails if CONFIG_PREEMPT is not enabled.
And CONFIG_PREEMPT is not enabled by default in the distribution that
most UV owners will use.
We could #ifdef CONFIG_PREEMPT the warning, but that is not good form.
And there seems to be no suitable fix to in_atomic() when CONFIG_PREMPT
is not on.
As Ingo commented:
> and we have no proper primitive to test for atomicity. (mainly
> because we dont know about atomicity on a non-preempt kernel)
So we drop the WARN_ON.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use fixmaps instead of vmap/vunmap in text_poke() for avoiding
page allocation and delayed unmapping.
At the result of above change, text_poke() becomes atomic and can be called
from stop_machine() etc.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
LKML-Reference: <49B14352.2040705@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use the mutual exclusion provided by the text edit lock in alternatives code.
Since alternative_smp_* will be called from module init code, etc,
we'd better protect it from other subsystems.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
LKML-Reference: <49B14332.9030109@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ds_write_config() can write the BTS as well as the PEBS part of
the DS config. ds_request_pebs() passes the wrong qualifier, which
results in the wrong configuration to be written.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090305085721.A22550@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In case a ptraced task is reaped (while the tracer is still attached),
ds_exit_thread() is called before ptrace_exit(). The latter will
release the bts_tracer and remove the thread's ds_ctx.
The former will WARN() if the context is not NULL.
Oleg Nesterov submitted patches that move ptrace_exit() before
exit_thread() and thus reverse the order of the above calls.
Remove the bad warning. I will add it again when Oleg's changes are in.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
LKML-Reference: <20090305084954.A22000@sedona.ch.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix relocation overflow during module load
x86_64 uses 32bit relocations for symbol access and static percpu
symbols whether in core or modules must be inside 2GB of the percpu
segement base which the dynamic percpu allocator doesn't guarantee.
This patch makes x86_64 reserve PERCPU_MODULE_RESERVE bytes in the
first chunk so that module percpu areas are always allocated from the
first chunk which is always inside the relocatable range.
This problem exists for any percpu allocator but is easily triggered
when using the embedding allocator because the second chunk is located
beyond 2GB on it.
This patch also changes the meaning of PERCPU_DYNAMIC_RESERVE such
that it only indicates the size of the area to reserve for dynamic
allocation as static and dynamic areas can be separate. New
PERCPU_DYNAMIC_RESERVED is increased by 4k for both 32 and 64bits as
the reserved area separation eats away some allocatable space and
having slightly more headroom (currently between 4 and 8k after
minimal boot sans module area) makes sense for common case
performance.
x86_32 can address anywhere from anywhere and doesn't need reserving.
Mike Galbraith first reported the problem first and bisected it to the
embedding percpu allocator commit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Impact: add reserved allocation functionality and use it for module
percpu variables
This patch implements reserved allocation from the first chunk. When
setting up the first chunk, arch can ask to set aside certain number
of bytes right after the core static area which is available only
through a separate reserved allocator. This will be used primarily
for module static percpu variables on architectures with limited
relocation range to ensure that the module perpcu symbols are inside
the relocatable range.
If reserved area is requested, the first chunk becomes reserved and
isn't available for regular allocation. If the first chunk also
includes piggy-back dynamic allocation area, a separate chunk mapping
the same region is created to serve dynamic allocation. The first one
is called static first chunk and the second dynamic first chunk.
Although they share the page map, their different area map
initializations guarantee they serve disjoint areas according to their
purposes.
If arch doesn't setup reserved area, reserved allocation is handled
like any other allocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: reduce unnecessary memory usage on certain configurations
Embedding percpu allocator allocates unit_size *
smp_num_possible_cpus() bytes consecutively and use it for the first
chunk. However, if the static area is small, this can result in
excessive prellocated free space in the first chunk due to
PCPU_MIN_UNIT_SIZE restriction.
This patch makes embedding percpu allocator preallocate only what's
necessary as described by PERPCU_DYNAMIC_RESERVE and return the
leftover to the bootmem allocator.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: argument semantic cleanup
In pcpu_setup_first_chunk(), zero @unit_size and @dyn_size meant
auto-sizing. It's okay for @unit_size as 0 doesn't make sense but 0
dynamic reserve size is valid. Alos, if arch @dyn_size is calculated
from other parameters, it might end up passing in 0 @dyn_size and
malfunction when the size is automatically adjusted.
This patch makes both @unit_size and @dyn_size ssize_t and use -1 for
auto sizing.
Signed-off-by: Tejun Heo <tj@kernel.org>
The below completes the K7+ performance counter support:
- IRQ support
- NMI support
KernelTop output works now as well.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jaswinder Singh Rajput <jaswinder@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1236273633.5187.286.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, micro-optimization
Pre-initialize boot_cpu_data.x86_phys_bits to a reasonable default
to remove the use of system_state tests in __virt_addr_valid()
and __phys_addr().
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: decrease hangs risks with the graph tracer on slow systems
Since the function graph tracer can spend too much time on timer
interrupts, it's better now to use the more lightweight local
clock. Anyway, the function graph traces are more reliable on a
per cpu trace.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <49af243d.06e9300a.53ad.ffff840c@mx.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We try to avoid this type of ifdef and we can safely remove this
ifdef.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The latency of p4-clockmod sucks so hard that scaling on a regular
basis with ondemand is a really bad idea.
Signed-off-by: Matthew Garrett <mjg59@srcf.ucam.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Dell XPS710 will hang on reboot. This is resolved by adding a quirk to
set bios reboot.
Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Cc: "manoj.iyer" <manoj.iyer@canonical.com>
Cc: <stable@kernel.org>
LKML-Reference: <1236196380.3231.89.camel@emiko>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save a bit of RAM
Get the exact size for the reserve_bootmem() call.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49AE4922.605@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot with mptable above max_low_mapped
Try to use early_ioremap() to map MPC to make sure it works even it is
at the end of ram.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49AE4901.3090801@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-and-tested-by: Kevin O'Connor <kevin@koconnor.net>
Impact: fix math-emu related crash while using GDB/ptrace
init_fpu() calls finit to initialize a task's xstate, while finit always
works on the current task. If we use PTRACE_GETFPREGS on another
process and both processes did not already use floating point, we get
a null pointer exception in finit.
This patch creates a new function finit_task that takes a task_struct
parameter. finit becomes a wrapper that simply calls finit_task with
current. On the plus side this avoids many calls to get_current which
would each resolve to an inline assembler mov instruction.
An empty finit_task has been added to i387.h to avoid linker errors in
case the compiler still emits the call in init_fpu when
CONFIG_MATH_EMULATION is not defined.
The declaration of finit in i387.h has been removed as the remaining
code using this function gets its prototype from fpu_proto.h.
Signed-off-by: Daniel Glöckner <dg@emlix.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Bill Metzenthen <billm@melbpc.org.au>
LKML-Reference: <E1Lew31-0004il-Fg@mailer.emlix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch provides a high resolution clock/timer source using the
SGI UV system-wide synchronized RTC clock/timer hardware.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090304185918.GC24419@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch allocates a system interrupt vector for various platform
specific uses.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090304185605.GA24419@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Fix boot failure on EFI system with large runtime memory range
Brian Maly reported that some EFI system with large runtime memory
range can not boot. Because the FIX_MAP used to map runtime memory
range is smaller than run time memory range.
This patch fixes this issue by re-implement efi_ioremap() with
init_memory_mapping().
Reported-and-tested-by: Brian Maly <bmaly@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Brian Maly <bmaly@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236135513.6204.306.camel@yhuang-dev.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reactivate DMI quirks on EFI hardware
DMI tables are loaded by EFI, so the dmi calls must happen after
efi_init() and not before.
Currently Apple hardware uses DMI to determine the framebuffer mappings
for efifb. Without DMI working you also have no video on MacBook Pro.
This patch resolves the DMI issue for EFI hardware (DMI is now properly
detected at boot), and additionally efifb now loads on Apple hardware
(i.e. video works).
Signed-off-by: Brian Maly <bmaly@redhat>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: ying.huang@intel.com
LKML-Reference: <49ADEDA3.1030406@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/setup.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
Impact: build fix
The APIC code rewrite in the x86 tree broke the x86/mce branch:
arch/x86/kernel/cpu/mcheck/threshold.c: In function ‘mce_threshold_interrupt’:
arch/x86/kernel/cpu/mcheck/threshold.c:24: error: implicit declaration of function ‘ack_APIC_irq’
Also tidy up the file a bit while at it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bad frame in rt_sigreturn on 64-bit
After commit 97286a2b64 some applications
fail to return from signal handler:
[ 145.150133] firefox[3250] bad frame in rt_sigreturn frame:00007f902b44eb28 ip:352e80b307 sp:7f902b44ef70 orax:ffffffffffffffff in libpthread-2.9.so[352e800000+17000]
[ 665.519017] firefox[5420] bad frame in rt_sigreturn frame:00007faa8deaeb28 ip:352e80b307 sp:7faa8deaef70 orax:ffffffffffffffff in libpthread-2.9.so[352e800000+17000]
The root cause is forgetting to keep 64 byte aligned value of
fpstate for next stack pointer calculation.
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
LKML-Reference: <49AC85C1.7060600@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call. A 64-bit process make a 32-bit system call with int $0x80.
In both these cases, audit_syscall_entry() will use the wrong system
call number table and the wrong system call argument registers. This
could be used to circumvent a syscall audit configuration that filters
based on the syscall numbers or argument details.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With x86-32 and -64 using the same mechanism for managing the
tss io permissions bitmap, large chunks of process*.c are
trivially unifyable, including:
- exit_thread
- flush_thread
- __switch_to_xtra (along with tsc enable/disable)
and as bonus pickups:
- sys_fork
- sys_vfork
(Note: asmlinkage expands to empty on x86-64)
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove 32-bit optimization to prepare unification
x86-32 and -64 differ in the way they context-switch tasks
with io permission bitmaps. x86-64 simply copies the next
tasks io bitmap into place (if any) on context switch. x86-32
invalidates the bitmap on context switch, so that the next
IO instruction will fault; at that point it installs the
appropriate IO bitmap.
This makes context switching IO-bitmap-using tasks a bit more
less expensive, at the cost of making the next IO instruction
slower due to the extra fault. This tradeoff only makes sense
if IO-bitmap-using processes are relatively common, but they
don't actually use IO instructions very often.
However, in a typical desktop system, the only process likely
to be using IO bitmaps is the X server, and nothing at all on
a server. Therefore the lazy context switch doesn't really win
all that much, and its just a gratuitious difference from
64-bit code.
This patch removes the lazy context switch, with a view to
unifying this code in a later change.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove __cpuinitdata section placement for translation_table
structure, since it is referenced from a functions within .text.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Remove __init section placement for some functions/data, so that
we don't get section mismatch warnings.
Also make inline function instead of empty setup_summit macro.
[v2]
One of them was not caught by
DEBUG_SECTION_MISMATCH=y
magic. Fix it.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Remove __init section placement for some functions, so that we don't
get section mismatch warnings.
[v2]:
2 of them were not caught by
DEBUG_SECTION_MISMATCH=y
magic. Fix it.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
using pr_info in perf_counter.c fixes various 80 characters warnings and
also indenting for conditional statement
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
making decent declrations for struct pmc_x86_ops and
fix checkpatch error:
ERROR: Macros with complex values should be enclosed in parenthesis
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Perform same-cluster checking even for masks with all (nr_cpu_ids)
bits set and report correct apicid on success instead.
While at it, convert it to for_each_cpu and newer cpumask api.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Perform same-cluster checking even for masks with all (nr_cpu_ids)
bits set and report BAD_APICID on failure.
While at it, convert it to for_each_cpu.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove es7000_cpu_mask_to_apicid_cluster completely, because it's
almost the same as es7000_cpu_mask_to_apicid except 2 code paths.
One of them is about to be removed soon, the another should be
BAD_APICID (it's a fail path).
The _cluster one was not invoked on apic->cpu_mask_to_apicid_and
anyway, since there was no _cluster_and variant.
Also use newer cpumask functions.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The ones which go only into struct apic are de-inlined
by compiler anyway, so remove the inline specifier from them.
Afterwards, remove bigsmp_setup_portio_remap completely as it
is unused.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: unification
show_cpuinfo_core is identical for 32 and 64 bit and can be unified,
and CONFIG_X86_HT inherently depends on CONFIG_X86_SMP.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
Add missing __user annotation to the parameter of get_sigframe().
Also change cast type to void __user * of *fpstate.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning:
arch/x86/kernel/cpu/intel_cacheinfo.c:139: warning: ‘k8_nb_id’ defined but not used
arch/x86/kernel/cpu/intel_cacheinfo.c:527: warning: ‘free_cache_attributes’ defined but not used
arch/x86/kernel/cpu/intel_cacheinfo.c:538: warning: ‘detect_cache_attributes’ defined but not used
Unused variables in the !CONFIG_SYSCTL case.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If the TSC is constant and non-stop, also set it reliable.
(We will turn this off in DMI quirks for multi-chassis systems)
The performance number on a 16-way Nehalem system running
32 tasks that context-switch between each other is significant:
sched_clock_stable=0 sched_clock_stable=1
.................... ....................
22.456925 million/sec 24.306972 million/sec [+8.2%]
lmbench's "lat_ctx -s 0 2" goes from 0.63 microseconds to
0.59 microseconds - a 6.7% increase in context-switching
performance.
Perfstat of 1 million pipe context switches between two tasks:
Performance counter stats for './pipe-test-1m':
[before] [after]
............ ............
37621.421089 36436.848378 task clock ticks (msecs)
0 0 CPU migrations (events)
2000274 2000189 context switches (events)
194 193 pagefaults (events)
8433799643 8171016416 CPU cycles (events) -3.21%
8370133368 8180999694 instructions (events) -2.31%
4158565 3895941 cache references (events) -6.74%
44312 46264 cache misses (events)
2349.287976 2279.362465 wall-time (msecs) -3.06%
The speedup comes straight from the reduction in the instruction
count. sched_clock_cpu() got simpler and the whole workload thus
executes faster.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/apic/es7000_32.c:702: error: 'es7000_acpi_madt_oem_check_cluster' undeclared here (not in a function)
Provide a es7000_acpi_madt_oem_check_cluster() definition in the !ACPI
case too.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
init_deasserted is only available on SMP. Make the secondary-wakeup
function conditional on SMP.
Also clean up the file some.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
- rename apic->wakeup_cpu to apic->wakeup_secondary_cpu, to
make it apparent that this is an SMP-only method
- handle NULL ->wakeup_secondary_cpus to mean the default INIT
wakeup sequence - this allows simplification of the APIC
driver templates.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
that is only needed when CONFIG_X86_VSMP is defined with 64bit
also remove dead code about PCI, because CONFIG_X86_VSMP depends on PCI
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
x86_quirks->update_apic() calling looks crazy. so try to remove it:
1. every apic take wakeup_cpu member directly
2. separate es7000_apic to es7000_apic_cluster
3. use uv_wakeup_cpu directly
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Not owning an nforce2 is a sign of good taste, not an error.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
http://bugzilla.kernel.org/show_bug.cgi?id=10968
[ Updated for current tree, and fixed compile failure
when p4-clockmod was built modular -- davej]
From: Matthias-Christian Ott <ott@mirix.org>
Signed-off-by: Dominik Brodowski <linux@brodo.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Change the link order of the cpufreq modules to ensure that they're
probed in the preferred order when statically linked in.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
This is the typical message you get if you plug in a CPU
which is newer than your BIOS. It's annoying seeing this
message for each core.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
powernow-k8 driver should always try to get cpufreq info from ACPI.
Otherwise it will not be able to detect the transition latency correctly
which results in ondemand governor taking a wrong sampling rate which will
then result in sever performance loss.
Let the user not shoot himself in the foot and always compile in ACPI
support for powernow-k8.
This also fixes a wrong message if ACPI_PROCESSOR is compiled as a module and
#ifndef CONFIG_ACPI_PROCESSOR
path is chosen.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
This driver has so many long function names, and deep nested if's
The remaining warnings will need some code restructuring to clean up.
Signed-off-by: Dave Jones <davej@redhat.com>
The remaining warning about the simple_strtoul conversion
to strict_strtoul seems kind of pointless to me.
Signed-off-by: Dave Jones <davej@redhat.com>
GNU indent complains about this being ambiguous, because it's dumb.
One of my automated tests relies on the output of indent, so this shuts
it up.
Signed-off-by: Dave Jones <davej@redhat.com>
Recent changes in setup_percpu.c made a now meaningless DBG()
statement fail to compile and introduced a
comparison-of-different-types warning. Fix them.
Compile failure is reported by Ingo Molnar.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
Impact: Fix marginal race condition
One the first CPU the machine checks are enabled early before
the local APIC is enabled. This could in theory lead
to some lost CMCI events very early during boot because
CMCIs cannot be delivered with disabled LAPIC.
The poller also doesn't recover from this because it doesn't
check CMCI banks.
Add an explicit CMCI banks check after the LAPIC is enabled.
This is only done for CPU #0, the other CPUs only initialize
machine checks after the LAPIC is on.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Avoids confusing other OSes.
Disable the CMCI vector on reboot to avoid confusing other OS.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix on UP
The MCE code is reinitialized from resume, so we can't use
__cpuinit/__cpuexit for most of the code. Remove those annotations
for anything downstream of mce_init().
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Major new feature
Intel CMCI (Corrected Machine Check Interrupt) is a new
feature on Nehalem CPUs. It allows the CPU to trigger
interrupts on corrected events, which allows faster
reaction to them instead of with the traditional
polling timer.
Also use CMCI to discover shared banks. Machine check banks
can be shared by CPU threads or even cores. Using the CMCI enable
bit it is possible to detect the fact that another CPU already
saw a specific bank. Use this to assign shared banks only
to one CPU to avoid reporting duplicated events.
On CPU hot unplug bank sharing is re discovered. This is done
using a thread that cycles through all the CPUs.
To avoid races between the poller and CMCI we only poll
for banks that are not CMCI capable and only check CMCI
owned banks on a interrupt.
The shared banks ownership information is currently only used for
CMCI interrupts, not polled banks.
The sharing discovery code follows the algorithm recommended in the
IA32 SDM Vol3a 14.5.2.1
The CMCI interrupt handler just calls the machine check poller to
pick up the machine check event that caused the interrupt.
I decided not to implement a separate threshold event like
the AMD version has, because the threshold is always one currently
and adding another event didn't seem to add any value.
Some code inspired by Yunhong Jiang's Xen implementation,
which was in term inspired by a earlier CMCI implementation
by me.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Define a per cpu bitmap that contains the banks polled by the machine
check poller. This is needed for the CMCI code in the next patches
to be able to disable polling on specific banks.
The bank by default contains all banks, so there is no behaviour
change. Only future code will remove some banks from the polling
set.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: behavior change, use common code
Use a standard leaky bucket ratelimit for the machine check
warning print interval instead of waiting every check_interval.
Also decrease the limit to twice per minute.
This interacts better with threshold interrupts because
they can happen more often than check_interval.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: minor bugfix
The threshold handler on AMD (and soon on Intel) could be theoretically
reentered by the hardware. This could lead to corrupted events
because the machine check poll code assumes it is not reentered.
Move the APIC ACK to the end of the interrupt handler to let
the hardware avoid that.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup; preparation for feature
The mce_amd_64 code has an own private MC threshold vector with an own
interrupt handler. Since Intel needs a similar handler
it makes sense to share the vector because both can not
be active at the same time.
I factored the common APIC handler code into a separate file which can
be used by both the Intel or AMD MC code.
This is needed for the next patch which adds an Intel specific
CMCI handler.
This patch should be a nop for AMD, it just moves some code
around.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Cleanup (code movement)
Move MAX_NR_BANKS into mce.h because it's needed there
for followup patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The ones which go only into struct genapic are de-inlined
by compiler anyway, so remove the inline specifier from them.
Afterwards, remove summit_setup_portio_remap completely as it
is unused.
Remove inline also from summit_cpu_mask_to_apicid, since it's
not worth it (it is used in struct genapic too).
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use BAD_APICID instead of 0xFF constants in summit_cpu_mask_to_apicid.
Also remove bogus comments about what we actually return.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
native_usergs_sysret64 is described as
extern void native_usergs_sysret64(void)
so lets add ENDPROC here
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: heukelum@fastmail.fm
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
NEXT_PAGE already has 'balign' so no
need to keep this redundant one.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: heukelum@fastmail.fm
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add better first percpu allocation for NUMA
On NUMA, embedding allocator can't be used as different units can't be
made to fall in the correct NUMA nodes. To use large page mapping,
each unit needs to be remapped. However, percpu areas are usually
much smaller than large page size and unused space hurts a lot as the
number of cpus grow. This allocator remaps large pages for each chunk
but gives back unused part to the bootmem allocator making the large
pages mapped twice.
This adds slightly to the TLB pressure but is much better than using
4k mappings while still being NUMA-friendly.
Ingo suggested that this would be the correct approach for NUMA.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Impact: add better first percpu allocation for !NUMA
On !NUMA, we can simply allocate contiguous memory and use it for the
first chunk without mapping it into vmalloc area. As the memory area
is covered by the large page physical memory mapping, it allows the
dynamic perpcu allocator to not add any TLB overhead for the static
percpu area and whatever falls into the first chunk and the
implementation is very simple too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: modularize percpu first chunk allocation
x86 is gonna have a few different strategies for the first chunk
allocation. Modularize it by separating out the current allocation
mechanism into pcpu_alloc_bootmem() and setup_pcpu_4k().
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: more latitude for first percpu chunk allocation
The first percpu chunk serves the kernel static percpu area and may or
may not contain extra room for further dynamic allocation.
Initialization of the first chunk needs to be done before normal
memory allocation service is up, so it has its own init path -
pcpu_setup_static().
It seems archs need more latitude while initializing the first chunk
for example to take advantage of large page mapping. This patch makes
the following changes to allow this.
* Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space
to reserve in the first chunk for further dynamic allocation.
* Rename pcpu_setup_static() to pcpu_setup_first_chunk().
* Make pcpu_setup_first_chunk() much more flexible by fetching page
pointer by callback and adding optional @unit_size, @free_size and
@base_addr arguments which allow archs to selectively part of chunk
initialization to their likings.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: minor change to populate_extra_pte() and addition of pmd flavor
Update populate_extra_pte() to return pointer to the pte_t for the
specified address and add populate_extra_pmd() which only populates
till the pmd and returns pointer to the pmd entry for the address.
For 64bit, pud/pmd/pte fill functions are separated out from
set_pte_vaddr[_pud]() and used for set_pte_vaddr[_pud]() and
populate_extra_{pte|pmd}().
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Bug fix when CPU hotplug is disabled
Correct the following broken __cpuinit/__cpuexit annotations:
- mce_cpu_features() is called from mce_resume(), and so cannot be
__cpuinit.
- mce_disable_cpu() and mce_reenable_cpu() are called from
mce_cpu_callback(), and so cannot be __cpuexit().
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: Cleanup
Checkin be44d2aabc eliminates the use of
a 16-bit stack for espfix. However, at least one instruction remained
that only operated on the low 16 bits of %esp.
This is not a bug per se because the kernel stack is always an aligned
4K or 8K block. Therefore it cannot cross 64K boundaries; this code,
in fact, relies strictly on that fact.
However, it's a lot cleaner (and, for that matter, smaller) to operate
on the entire 32-bit register.
Signed-off-by: Stas Sergeev <stsp@aknet.ru>
CC: Zachary Amsden <zach@vmware.com>
CC: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: fix early crash on LinuxBIOS systems
Kevin O'Connor reported that Coreboot aka LinuxBIOS tries to put
mptable somewhere very high, well above max_low_pfn (below which
BIOSes generally put the mptable), causing a panic.
The BIOS will probably be changed to be compatible with older
Linus versions, but nevertheless the MP-spec does not forbid
an MP-table in arbitrary system RAM, so make sure it all
works even if the table is in an unexpected place.
Check physptr with max_low_pfn * PAGE_SIZE.
Reported-by: Kevin O'Connor <kevin@koconnor.net>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Stefan Reinauer <stepan@coresystems.de>
Cc: coreboot@coreboot.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Make x86_quirks support more transparent. The highlevel
methods are now named:
extern void x86_quirk_pre_intr_init(void);
extern void x86_quirk_intr_init(void);
extern void x86_quirk_trap_init(void);
extern void x86_quirk_pre_time_init(void);
extern void x86_quirk_time_init(void);
This makes it clear that if some platform extension has to
do something here that it is considered ... weird, and is
discouraged.
Also remove arch_hooks.h and move it into setup.h (and other
header files where appropriate).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove dead code
Remove:
- pre_setup_arch_hook()
- mca_nmi_hook()
If needed they can be added back via an x86_quirk handler.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the sysdev_suspend/resume from the callee to the callers, with
no real change in semantics, so that we can rework the disabling of
interrupts during suspend/hibernation.
This is based on an earlier patch from Linus.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now nobody cares, but the suspend/resume code will eventually want
to suspend device interrupts without suspending the timer, and will
depend on this flag to know.
The modern x86 timer infrastructure uses the local APIC timers and never
shows up as a device interrupt at all, so it isn't affected and doesn't
need any of this.
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If BIOS hands over the control to OS in legacy xapic mode, select
legacy xapic related ops in the early apic probe and shift to x2apic
ops later in the boot sequence, only after enabling x2apic mode.
If BIOS hands over the control in x2apic mode, select x2apic related
ops in the early apic probe.
This fixes the early boot panic, where we were selecting x2apic ops,
while the cpu is still in legacy xapic mode.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix system hang on some systems operating with HZ_1000
On a system that stalled with HZ_1000, the first value written to
T0_CMP (when the main counter was not stopped) did not trigger an
interrupt. Instead after the main counter wrapped around (after
several minutes) an interrupt was triggered and afterwards the
periodic interrupt took effect.
This can be fixed by implementing HPET spec recommendation for
programming the periodic mode (i.e. stopping the main counter).
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Mark Hounschell <markh@compro.net>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix these sparse warnings:
arch/x86/kernel/machine_kexec_32.c:124:22: warning: Using plain integer as NULL pointer
arch/x86/kernel/traps.c:950:24: warning: Using plain integer as NULL pointer
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As acpi_enter_sleep_state can fail, take this into account in
do_suspend_lowlevel and don't return to the do_suspend_lowlevel's
caller. This would break (currently) fpu status and preempt count.
Technically, this means use `call' instead of `jmp' and `jmp' to
the `resume_point' after the `call' (i.e. if
acpi_enter_sleep_state returns=fails). `resume_point' will handle
the restore of fpu and preempt count gracefully.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
- remove %ds re-set, it's already set in wakeup_long64
- remove double labels and alignment (ENTRY already adds both)
- use meaningful resume point labelname
- skip alignment while jumping from wakeup_long64 to the resume point
- remove .size, .type and unused labels
[v2]
- added ENDPROCs
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: Bug fix on UP
Checkin 6ec68bff3c:
x86, mce: reinitialize per cpu features on resume
introduced a call to mce_cpu_features() in the resume path, in order
for the MCE machinery to get properly reinitialized after a resume.
However, this function (and its successors) was flagged __cpuinit,
which becomes __init on UP configurations (on SMP suspend/resume
requires CPU hotplug and so this would not be seen.)
Remove the offending __cpuinit annotations for mce_cpu_features() and
its successor functions.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
Rename TASK_SIZE64 to TASK_SIZE_MAX, and provide the
define on 32-bit too. (mapped to TASK_SIZE)
This allows 32-bit code to make use of the (former-) TASK_SIZE64
symbol as well, in a clean way.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix to prevent NMI lockup
If the page fault handler produces a WARN_ON in the modifying of
text, and the system is setup to have a high frequency of NMIs,
we can lock up the system on a failure to modify code.
The modifying of code with NMIs allows all NMIs to modify the code
if it is about to run. This prevents a modifier on one CPU from
modifying code running in NMI context on another CPU. The modifying
is done through stop_machine, so only NMIs must be considered.
But if the write causes the page fault handler to produce a warning,
the print can slow it down enough that as soon as it is done
it will take another NMI before going back to the process context.
The new NMI will perform the write again causing another print and
this will hang the box.
This patch turns off the writing as soon as a failure is detected
and does not wait for it to be turned off by the process context.
This will keep NMIs from getting stuck in this back and forth
of print outs.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: keep kernel text read only
Because dynamic ftrace converts the calls to mcount into and out of
nops at run time, we needed to always keep the kernel text writable.
But this defeats the point of CONFIG_DEBUG_RODATA. This patch converts
the kernel code to writable before ftrace modifies the text, and converts
it back to read only afterward.
The kernel text is converted to read/write, stop_machine is called to
modify the code, then the kernel text is converted back to read only.
The original version used SYSTEM_STATE to determine when it was OK
or not to change the code to rw or ro. Andrew Morton pointed out that
using SYSTEM_STATE is a bad idea since there is no guarantee to what
its state will actually be.
Instead, I moved the check into the set_kernel_text_* functions
themselves, and use a local variable to determine when it is
OK to change the kernel text RW permissions.
[ Update: Ingo Molnar suggested moving the prototypes to cacheflush.h ]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
clean up vmi_read_cycles to use max()
Reported-b: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Zach Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new dynamic allocator, unified access to static/dynamic
percpu memory
Convert to the new dynamic percpu allocator.
* implement populate_extra_pte() for both 32 and 64
* update setup_per_cpu_areas() to use pcpu_setup_static()
* define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr()
* define config HAVE_DYNAMIC_PER_CPU_AREA
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
There are two allocated per-cpu accessor macros with almost identical
spelling. The original and far more popular is per_cpu_ptr (44
files), so change over the other 4 files.
tj: kill percpu_ptr() and update UP too
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: mingo@redhat.com
Cc: lenb@kernel.org
Cc: cpufreq@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: economize memory for large NR_CPUS
percpu data is setup earlier than irq, we can use percpu data
to economize memory.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: fix time warps under vmware
Similar to the check for TSC going backwards in the TSC clocksource,
we also need this check for VMI clocksource.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Zachary Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Impact: Cleanup
The standard spelling of a printf pattern for long long is "ll", not
"L", which is for long double.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup, performance enhancement
The machine check poller is diverging more and more from the fatal
exception handler. Instead of adding more special cases separate the code
paths completely. The corrected poll path is actually quite simple,
and this doesn't result in much code duplication.
This makes both handlers much easier to read and results in
cleaner code flow. The exception handler now only needs to care
about uncorrected errors, which also simplifies the handling of multiple
errors. The corrected poller also now always runs in standard interrupt
context and does not need to do anything special to handle NMI context.
Minor behaviour changes:
- MCG status is now not cleared on polling.
- Only the banks which had corrected errors get cleared on polling
- The exception handler only clears banks with errors now
v2: Forward port to new patch order. Add "uc" argument.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup
This merely factors out duplicated code to set up
the initial struct mce state into a single function.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup; making code future proof; memory saving on small systems
This patch replaces the hardcoded max number of machine check banks with
dynamic allocation depending on what the CPU reports. The sysfs
data structures and the banks array are dynamically allocated.
There is still a hard bank limit (128) because the mcelog protocol uses
banks >= 128 as pseudo banks to escape other events. But we expect
that 128 banks is beyond any reasonable CPU for now.
This supersedes an earlier patch by Venki, but it solves the problem
more completely by making the limit fully dynamic (up to the 128
boundary).
This saves some memory on machines with less than 6 banks because
they won't need sysdevs for unused ones and also allows to
use sysfs to control these banks on possible future CPUs with
more than 6 banks.
This is an updated patch addressing Venki's comments. I also added in
another patch from Thomas which fixed the error allocation path (that
patch was previously separated)
Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdown
x86, mce: use force_sig_info to kill process in machine check
x86, mce: reinitialize per cpu features on resume
x86, rcu: fix strange load average and ksoftirqd behavior
There is nothing really arch specific of the push and pop functions
used by the function graph tracer. This patch moves them to generic
code.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: bugfix
Considering the situation as follow:
before: mcelog.next == 1, mcelog.entry[0].finished = 1
+--------------------------------------------------------------------------
R W1 W2 W3
read mcelog.next (1)
mcelog.next++ (2)
(working on entry 1,
finished == 0)
mcelog.next = 0
mcelog.next++ (1)
(working on entry 0)
mcelog.next++ (2)
(working on entry 1)
<----------------- race ---------------->
(done on entry 1,
finished = 1)
(done on entry 1,
finished = 1)
To fix the race condition, a cmpxchg loop is added to mce_read() to
ensure no new MCE record can be added between mcelog.next reading and
mcelog.next = 0.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Lower priority bug fix
Offlined CPUs could still get machine checks, but the machine check handler
cannot handle them properly, leading to an unconditional crash. Disable
machine checks on CPUs that are going down.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: bug fix, in this case the resume handler shouldn't run which
avoids incorrectly reenabling machine checks on resume
When MCEs are completely disabled on the command line don't set
up the sysdev devices for them either.
Includes a comment fix from Thomas Gleixner.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Higher priority bug fix
The machine check poller runs a single timer and then broadcasted an
IPI to all CPUs to check them. This leads to unnecessary
synchronization between CPUs. The original CPU running the timer has
to wait potentially a long time for all other CPUs answering. This is
also real time unfriendly and in general inefficient.
This was especially a problem on systems with a lot of events where
the poller run with a higher frequency after processing some events.
There could be more and more CPU time wasted with this, to
the point of significantly slowing down machines.
The machine check polling is actually fully independent per CPU, so
there's no reason to not just do this all with per CPU timers. This
patch implements that.
Also switch the poller also to use standard timers instead of work
queues. It was using work queues to be able to execute a user program
on a event, but mce_notify_user() handles this case now with a
separate callback. So instead always run the poll code in in a
standard per CPU timer, which means that in the common case of not
having to execute a trigger there will be less overhead.
This allows to clean up the initialization significantly, because
standard timers are already up when machine checks get init'ed. No
multiple initialization functions.
Thanks to Thomas Gleixner for some help.
Cc: thockin@google.com
v2: Use del_timer_sync() on cpu shutdown and don't try to handle
migrated timers.
v3: Add WARN_ON for timer running on unexpected CPU
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Needed for bug fix in next patch
This relaxes the requirement that mce_notify_user has to run in process
context. Useful for future changes, but also leads to cleaner
behaviour now. Now instead mce_notify_user can be called directly
from interrupt (but not NMI) context.
The work queue only uses a single global work struct, which can be done safely
because it is always free to reuse before the trigger function is executed.
This way no events can be lost.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: low priority bug fix
This removes part of a a patch I added myself some time ago. After some
consideration the patch was a bad idea. In particular it stopped machine check
exceptions during code patching.
To quote the comment:
* MCEs only happen when something got corrupted and in this
* case we must do something about the corruption.
* Ignoring it is worse than a unlikely patching race.
* Also machine checks tend to be broadcast and if one CPU
* goes into machine check the others follow quickly, so we don't
* expect a machine check to cause undue problems during to code
* patching.
So undo the machine check related parts of
8f4e956b31 NMIs are still disabled.
This only removes code, the only additions are a new comment.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix
During suspend it is not reliable to process machine check
exceptions, because CPUs disappear but can still get machine check
broadcasts. Also the system is slightly more likely to
machine check them, but the handler is typically not a position
to handle them in a meaningfull way.
So disable them during suspend and enable them during resume.
Also make sure they are always disabled on hot-unplugged CPUs.
This new code assumes that suspend always hotunplugs all
non BP CPUs.
v2: Remove the WARN_ONs Thomas objected to.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bugfix
The ifdef for the apic clear on shutdown for the 64bit intel thermal
vector was incorrect and never triggered. Fix that.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: bug fix (with tolerant == 3)
do_exit cannot be called directly from the exception handler because
it can sleep and the exception handler runs on the exception stack.
Use force_sig() instead.
Based on a earlier patch by Ying Huang who debugged the problem.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix
This fixes a long standing bug in the machine check code. On resume the
boot CPU wouldn't get its vendor specific state like thermal handling
reinitialized. This means the boot cpu wouldn't ever get any thermal
events reported again.
Call the respective initialization functions on resume
v2: Remove ancient init because they don't have a resume device anyways.
Pointed out by Thomas Gleixner.
v3: Now fix the Subject too to reflect v2 change
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, vm86: fix preemption bug
x86, olpc: fix model detection without OFW
x86, hpet: fix for LS21 + HPET = boot hang
x86: CPA avoid repeated lazy mmu flush
x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
x86/cpa: make sure cpa is safe to call in lazy mmu mode
x86, ptrace, mm: fix double-free on race
Impact: build fix, cleanup
A couple of arch setup callbacks were mistakenly in apic_32.c, breaking
the build.
Also simplify the code a bit.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the 32-bit extended-arch APIC drivers to arch/x86/kernel/apic/
too, and rename apic_64.c to probe_64.c.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/ is getting a bit crowded, and the APIC
drivers are scattered into various different files.
Move them to arch/x86/kernel/apic/*, and also remove
the 'gen' prefix from those which had it.
Also move APIC related functionality: the IO-APIC driver,
the NMI and the IPI code.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Now that all APIC code is consolidated there's nothing 'gen' about
apics anymore - so rename 'struct genapic' to 'struct apic'.
This shortens the code and is nicer to read as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
- misc other cleanups that change the md5 signature
- consolidate global variables
- remove unnecessary __numaq_mps_oem_check() wrapper
- make numaq_mps_oem_check static
- update copyrights
- misc other cleanups pointed out by checkpatch
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These are cleanups that change the md5 signature:
- asm/ => linux/ include conversion
- simplify the code flow of find_unisys_acpi_oem_table()
- move ACPI methods into one #ifdef block
- remove 0/NULL initialization of statics
- simplify/standardize printouts
- update copyrights
- more cleanups, pointed out by checkpatch
arch/x86/kernel/es7000_32.o:
text data bss dec hex filename
2693 192 44 2929 b71 es7000_32.o.before
2688 192 44 2924 b6c es7000_32.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
- a number of structure definitions were stale
- remove needless wrappers around apic definitions
- fix details noticed by checkpatch
No code changed:
md5:
029d8fde0aaf6e934ea63bd8b36430fd es7000_32.o.before.asm
029d8fde0aaf6e934ea63bd8b36430fd es7000_32.o.after.asm
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
In the subarch times there were a number of externs between
various bits of the ES7000 code. Now that there's a single
es7000-platform support file, the externs can be removed and
the functions can be changed the statics.
Beyond the cleanup factor, this also shrinks the size of the
kernel image a bit:
arch/x86/kernel/es7000_32.o:
text data bss dec hex filename
2813 192 44 3049 be9 es7000_32.o.before
2693 192 44 2929 b71 es7000_32.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There were multiple definitions of apicid_cluster() scattered around
in APIC drivers - but the definitions are equivalent to the already
existing generic APIC_CLUSTER() method.
So remove apicid_cluster() and change all users to APIC_CLUSTER().
No code changed:
md5:
1b8244ba8d3d6a454593ce10f09dfa58 summit_32.o.before.asm
1b8244ba8d3d6a454593ce10f09dfa58 summit_32.o.after.asm
md5:
a593d98a882bf534622c70d9568497ac es7000_32.o.before.asm
a593d98a882bf534622c70d9568497ac es7000_32.o.after.asm
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
- remove unnecessary indirections that were artifacts of the subarch code
- clean up include file section
- clean up various small details
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
APIC_DEFINITION was a hack from the x86 subarch times, it has no
meaning anymore - remove it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- make oprofile build
- select X86_X2APIC from X86_UV - it relies on it
- export genapic for oprofile modular build
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
make it simpler, don't need have one extra struct.
v2: fix the sgi_uv build
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
so could deselect x2apic
and INTR_REMAP will select x2apic
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
1. move localise_nmi_watchdog() later
2. change setup_boot_APIC_clock() to setup_boot_clock() for 64-bit
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
set default value early - this allows the removal of a number
of dynamic initialization codepaths, and an #ifdef.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix powernow-k8 when acpi=off (or other error).
There was a spurious change introduced into powernow-k8 in this patch:
so that we try to "restore" the cpus_allowed we never saved. We revert
that file.
See lkml "[PATCH] x86/powernow: fix cpus_allowed brokage when
acpi=off" from Yinghai for the bug report.
Cc: Mike Travis <travis@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Commit 3d2a71a596 ("x86, traps: converge
do_debug handlers") changed the preemption disable logic of do_debug()
so vm86_handle_trap() is called with preemption disabled resulting in:
BUG: sleeping function called from invalid context at include/linux/kernel.h:155
in_atomic(): 1, irqs_disabled(): 0, pid: 3005, name: dosemu.bin
Pid: 3005, comm: dosemu.bin Tainted: G W 2.6.29-rc1 #51
Call Trace:
[<c050d669>] copy_to_user+0x33/0x108
[<c04181f4>] save_v86_state+0x65/0x149
[<c0418531>] handle_vm86_trap+0x20/0x8f
[<c064e345>] do_debug+0x15b/0x1a4
[<c064df1f>] debug_stack_correct+0x27/0x2c
[<c040365b>] sysenter_do_call+0x12/0x2f
BUG: scheduling while atomic: dosemu.bin/3005/0x10000001
Restore the original calling convention and reenable preemption before
calling handle_vm86_trap().
Reported-by: Michal Suchanek <hramrach@centrum.cz>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix wrong disabling of cpu features
an amd system got this strange output:
CPU: CPU feature monitor disabled due to lack of CPUID level 0x5
but in /proc/cpuinfo I have:
cpuid level : 5
on intel system:
CPU: CPU feature monitor disabled due to lack of CPUID level 0x5
CPU: CPU feature dca disabled due to lack of CPUID level 0x9
but in /proc/cpuinfo i have:
cpuid level : 11
Tt turns out there is a typo, and we should use level member in df.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix "garbled display, laptop is unusable" bug
Commit e51a1ac2df ("x86, olpc: fix endian
bug in openfirmware workaround") breaks model comparison on OLPC; the value
0xc2 needs to be scaled up by olpc_board().
The pre-patch version was wrong, but accidentally worked anyway
(big-endian 0xc2 is big enough to satisfy all other board revisions,
but little endian 0xc2 is not).
Signed-off-by: Chris Ball <cjb@laptop.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Andres Salomon <dilinger@queued.net>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In general, the only definitions that assembly files can use
are in _types.S headers (where available), so convert them.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
The uv_hub_send_ipi() function needs to set the full apicid in the
UVH_IPI_INT mmr.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Convert the c/p state "power" tracer to use tracepoints. Avoids a
function call when the tracer is disabled.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Between 2.6.23 and 2.6.24-rc1 a change was made that broke IBM LS21
systems that had the HPET enabled in the BIOS, resulting in boot hangs
for x86_64.
Specifically commit b8ce335906, which
merges the i386 and x86_64 HPET code.
Prior to this commit, when we setup the HPET timers in x86_64, we did
the following:
hpet_writel(HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
HPET_TN_32BIT, HPET_T0_CFG);
However after the i386/x86_64 HPET merge, we do the following:
cfg = hpet_readl(HPET_Tn_CFG(timer));
cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC |
HPET_TN_SETVAL | HPET_TN_32BIT;
hpet_writel(cfg, HPET_Tn_CFG(timer));
However on LS21s with HPET enabled in the BIOS, the HPET_T0_CFG register
boots with Level triggered interrupts (HPET_TN_LEVEL) enabled. This
causes the periodic interrupt to be not so periodic, and that results in
the boot time hang I reported earlier in the delay calibration.
My fix: Always disable HPET_TN_LEVEL when setting up periodic mode.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Catch cases where lazy MMU state is active in a preemtible context
arch_flush_lazy_mmu_cpu() has been changed to disable preemption so
the checks in enter/leave will never trigger. Put the preemtible()
check into arch_flush_lazy_mmu_cpu() to catch such cases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: avoid access to percpu vars in preempible context
They are intended to be used whenever there's the possibility
that there's some stale state which is going to be overwritten
with a queued update, or to force a state change when we may be
in lazy mode. Either way, we could end up calling it with
preemption enabled, so wrap the functions in their own little
preempt-disable section so they can be safely called in any
context (though preemption should never be enabled if we're actually
in a lazy state).
(Move out of line to avoid #include dependencies.)
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: cleanup
With the recent changes in the 32-bit code to make system calls which
use struct pt_regs take a pointer, sys_rt_sigreturn() have become
identical between 32 and 64 bits, and both are empty wrappers around
do_rt_sigreturn(). Remove both wrappers and rename both to
sys_rt_sigreturn().
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Some syscalls need to access the pt_regs structure, either to copy
user register state or to modifiy it. This patch adds stubs to load
the address of the pt_regs struct into the %eax register, and changes
the syscalls to take the pointer as an argument instead of relying on
the assumption that the pt_regs structure overlaps the function
arguments.
Drop the use of regparm(1) due to concern about gcc bugs, and to move
in the direction of the eventual removal of regparm(0) for asmlinkage.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
arch/x86/kernel/mpparse.c: In function ‘smp_scan_config’:
arch/x86/kernel/mpparse.c:696: warning: format ‘%08lx’ expects type ‘long unsigned int’, but argument 3 has type ‘phys_addr_t’
arch/x86/kernel/mpparse.c: In function ‘update_mp_table’:
arch/x86/kernel/mpparse.c:1014: warning: format ‘%lx’ expects type ‘long unsigned int’, but argument 2 has type ‘phys_addr_t’
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: fix TIMER_ABSTIME for process wide cpu timers
timers: split process wide cpu clocks/timers, fix
x86: clean up hpet timer reinit
timers: split process wide cpu clocks/timers, remove spurious warning
timers: split process wide cpu clocks/timers
signal: re-add dead task accumulation stats.
x86: fix hpet timer reinit for x86_64
sched: fix nohz load balancer on cpu offline
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ptrace, x86: fix the usage of ptrace_fork()
i8327: fix outb() parameter order
x86: fix math_emu register frame access
x86: math_emu info cleanup
x86: include correct %gs in a.out core dump
x86, vmi: put a missing paravirt_release_pmd in pgd_dtor
x86: find nr_irqs_gsi with mp_ioapic_routing
x86: add clflush before monitor for Intel 7400 series
x86: disable intel_iommu support by default
x86: don't apply __supported_pte_mask to non-present ptes
x86: fix grammar in user-visible BIOS warning
x86/Kconfig.cpu: make Kconfig help readable in the console
x86, 64-bit: print DMI info in the oops trace
Ptrace_detach() races with __ptrace_unlink() if the traced task is
reaped while detaching. This might cause a double-free of the BTS
buffer.
Change the ptrace_detach() path to only do the memory accounting in
ptrace_bts_detach() and leave the buffer free to ptrace_bts_untrace()
which will be called from __ptrace_unlink().
The fix follows a proposal from Oleg Nesterov.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that no functions rely on struct pt_regs being passed by value,
various "no stack protector" annotations can be dropped.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some syscalls need to access the pt_regs structure, either to copy
user register state or to modifiy it. This patch adds stubs to load
the address of the pt_regs struct into the %eax register, and changes
the syscalls to regparm(1) to receive the pt_regs pointer as the
first argument.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The generic exception handler (error_code) passes in the pt_regs
pointer and the error code (unused in this case). The commit
"x86: fix math_emu register frame access" changed this to pass by
value, which doesn't work correctly with stack protector enabled.
Change it back to use the pt_regs pointer.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix x86_32 stack protector
Brian Gerst found out that %gs was being initialized to stack_canary
instead of stack_canary - 20, which basically gave the same canary
value for all threads. Fixing this also exposed the following bugs.
* cpu_idle() didn't call boot_init_stack_canary()
* stack canary switching in switch_to() was being done too late making
the initial run of a new thread use the old stack canary value.
Fix all of them and while at it update comment in cpu_idle() about
calling boot_init_stack_canary().
Reported-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With refactoring of wake_cpu macros the 32bit code in tip doesn't
execute generic_apic_probe if CONFIG_X86_32_NON_STANDARD is not set.
Even on a x86 STANDARD cpu we need to execute the generic_apic_probe
function, as we rely on this function to execute the update_genapic
quirk which initilizes apic->wakeup_cpu.
Failing to do so results in we making a call to a null function in do_boot_cpu.
The stack trace without the patch goes like this.
Booting processor 1 APIC 0x1 ip 0x6000
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<(null)>] (null)
*pdpt = 0000000000839001 *pde = 0000000000c97067 *pte = 0000000000000163
Oops: 0000 [#1] SMP
last sysfs file:
Modules linked in:
Pid: 1, comm: swapper Not tainted (2.6.29-rc4-tip #18) VMware Virtual Platform
EIP: 0062:[<00000000>] EFLAGS: 00010293 CPU: 0
EIP is at 0x0
EAX: 00000001 EBX: 00006000 ECX: c077ed00 EDX: 00006000
ESI: 00000001 EDI: 00000001 EBP: ef04cf40 ESP: ef04cf1c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 006a
Process swapper (pid: 1, ti=ef04c000 task=ef050000 task.ti=ef04c000)
Stack:
c0644e52 00000000 ef04cf24 ef04cf24 c064468d c0886dc0 00000000 c0702aea
ef055480 00000001 00000101 dead4ead ffffffff ffffffff c08af530 00000000
c0709715 ef04cf60 ef04cf60 00000001 00000000 00000000 dead4ead ffffffff
Call Trace:
[<c0644e52>] ? native_cpu_up+0x2de/0x45b
[<c064468d>] ? do_fork_idle+0x0/0x19
[<c0645c5e>] ? _cpu_up+0x88/0xe8
[<c0645d20>] ? cpu_up+0x42/0x4e
[<c07e7462>] ? kernel_init+0x99/0x14b
[<c07e73c9>] ? kernel_init+0x0/0x14b
[<c040375f>] ? kernel_thread_helper+0x7/0x10
Code: Bad EIP value.
EIP: [<00000000>] 0x0 SS:ESP 006a:ef04cf1c
I think we should call generic_apic_probe unconditionally for 32 bit now.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The constraint used for retrieving and restoring the parent function
pointer is incorrect. The parent variable is a pointer, and the
address of the pointer is modified by the asm statement and not
the pointer itself. It is incorrect to pass it in as an output
constraint since the asm will never update the pointer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new perf_counter feature
This extends the perf_counter_hw_event struct with bits that specify
that events in user, kernel and/or hypervisor mode should not be
counted (i.e. should be excluded), and adds code to program the PMU
mode selection bits accordingly on x86 and powerpc.
For software counters, we don't currently have the infrastructure to
distinguish which mode an event occurs in, so we currently fail the
counter initialization if the setting of the hw_event.exclude_* bits
would require us to distinguish. Context switches and CPU migrations
are currently considered to occur in kernel mode.
On x86, this changes the previous policy that only root can count
kernel events. Now non-root users can count kernel events or exclude
them. Non-root users still can't use NMI events, though. On x86 we
don't appear to have any way to control whether hypervisor events are
counted or not, so hw_event.exclude_hv is ignored.
On powerpc, the selection of whether to count events in user, kernel
and/or hypervisor mode is PMU-wide, not per-counter, so this adds a
check that the hw_event.exclude_* settings are the same as other events
on the PMU. Counters being added to a group have to have the same
settings as the other hardware counters in the group. Counters and
groups can only be enabled in hw_perf_group_sched_in or power_perf_enable
if they have the same settings as any other counters already on the
PMU. If we are not running on a hypervisor, the exclude_hv setting
is ignored (by forcing it to 0) since we can't ever get any
hypervisor events.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Impact: fix to prevent a kernel crash on fault
If for some reason the pointer to the parent function on the
stack takes a fault, the fix up code will not return back to
the original faulting code. This can lead to unpredictable
results and perhaps even a kernel panic.
A fault should not happen, but if it does, we should simply
disable the tracer, warn, and continue running the kernel.
It should not lead to a kernel crash.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The constraint used for retrieving and restoring the parent function
pointer is incorrect. The parent variable is a pointer, and the
address of the pointer is modified by the asm statement and not
the pointer itself. It is incorrect to pass it in as an output
constraint since the asm will never update the pointer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
In i8237A_resume(), when resetting the DMA controller, the parameters to
dma_outb() were mixed up.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
[ cleaned up the file a tiny bit. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: stack protector for x86_32
Implement stack protector for x86_32. GDT entry 28 is used for it.
It's set to point to stack_canary-20 and have the length of 24 bytes.
CONFIG_CC_STACKPROTECTOR turns off CONFIG_X86_32_LAZY_GS and sets %gs
to the stack canary segment on entry. As %gs is otherwise unused by
the kernel, the canary can be anywhere. It's defined as a percpu
variable.
x86_32 exception handlers take register frame on stack directly as
struct pt_regs. With -fstack-protector turned on, gcc copies the
whole structure after the stack canary and (of course) doesn't copy
back on return thus losing all changed. For now, -fno-stack-protector
is added to all files which contain those functions. We definitely
need something better.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: pt_regs changed, lazy gs handling made optional, add slight
overhead to SAVE_ALL, simplifies error_code path a bit
On x86_32, %gs hasn't been used by kernel and handled lazily. pt_regs
doesn't have place for it and gs is saved/loaded only when necessary.
In preparation for stack protector support, this patch makes lazy %gs
handling optional by doing the followings.
* Add CONFIG_X86_32_LAZY_GS and place for gs in pt_regs.
* Save and restore %gs along with other registers in entry_32.S unless
LAZY_GS. Note that this unfortunately adds "pushl $0" on SAVE_ALL
even when LAZY_GS. However, it adds no overhead to common exit path
and simplifies entry path with error code.
* Define different user_gs accessors depending on LAZY_GS and add
lazy_save_gs() and lazy_load_gs() which are noop if !LAZY_GS. The
lazy_*_gs() ops are used to save, load and clear %gs lazily.
* Define ELF_CORE_COPY_KERNEL_REGS() which always read %gs directly.
xen and lguest changes need to be verified.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
On x86_32, %gs is handled lazily. It's not saved and restored on
kernel entry/exit but only when necessary which usually is during task
switch but there are few other places. Currently, it's done by
calling savesegment() and loadsegment() explicitly. Define
get_user_gs(), set_user_gs() and task_user_gs() and use them instead.
While at it, clean up register access macros in signal.c.
This cleans up code a bit and will help future changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Use .macro instead of cpp #define where approriate. This cleans up
code and will ease future changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do_device_not_available() is the handler for #NM and it declares that
it takes a unsigned long and calls math_emu(), which takes a long
argument and surprisingly expects the stack frame starting at the zero
argument would match struct math_emu_info, which isn't true regardless
of configuration in the current code.
This patch makes do_device_not_available() take struct pt_regs like
other exception handlers and initialize struct math_emu_info with
pointer to it and pass pointer to the math_emu_info to math_emulate()
like normal C functions do. This way, unless gcc makes a copy of
struct pt_regs in do_device_not_available(), the register frame is
correctly accessed regardless of kernel configuration or compiler
used.
This doesn't fix all math_emu problems but it at least gets it
somewhat working.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unstatic ioapic_write_entry and setup_ioapic_entry functions so that
the Xen code can do its own ioapic routing setup.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Add mp_find_ioapic_pin() to find an IO APIC's specific pin from a GSI,
and use this function within acpi/boot. Make it non-static so other
code can use it too.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] powernow-k8: Get transition latency from ACPI _PSS table
[CPUFREQ] Make ignore_nice_load setting of ondemand work as expected.
to prevent wrongly overwriting fixmap that still want to use.
ACPI used to rely on low mappings being all linearly mapped and
grew a habit: it never really unmapped certain kinds of tables
after use.
This can cause problems - for example the hypothetical case
when some spurious access still references it.
v2: remove prev_map and prev_size in __apci_map_table
v3: let acpi_os_unmap_memory() call early_iounmap too, so remove extral calling to
early_acpi_os_unmap_memory
v4: fix typo in one acpi_get_table_with_size calling
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On x86, __acpi_map_table uses early_ioremap() to create the mapping,
replacing the previous mapping with a new one. Once enough of the
kernel is up an running it switches to using normal ioremap(). At
that point, we need to clean up the final mapping to avoid a warning
from the early_ioremap subsystem.
This can be removed after all the instances in the ACPI code are fixed
that rely on early-ioremap's implicit overmapping of previously
mapped tables.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Always map acpi tables, rather than assuming we can use the normal
linear mapping to access the acpi tables. This is necessary in a
virtual environment where the linear mappings are to pseudo-physical
memory, but the acpi tables exist at a real physical address. It
doesn't hurt to map in the normal non-virtual case, so just do it
unconditionally.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
__acpi_map_table() effectively reimplements early_ioremap(). Rather
than have that duplication, just implement it in terms of
early_ioremap().
However, unlike early_ioremap(), __acpi_map_table() just maintains a
single mapping which gets replaced each call, and has no corresponding
unmap function. Implement this by just removing the previous mapping
each time its called. Unfortunately, this will leave a stray mapping
at the end.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 6194ba6ff6 ("x86: don't special-case
pmd allocations as much") made changes to the way we handle pmd allocations,
and while doing that it dropped a call to paravirt_release_pd on the
pgd page from the pgd_dtor code path.
As a result of this missing release, the hypervisor is now unaware of the
pgd page being freed, and as a result it ends up tracking this page as a
page table page.
After this the guest may start using the same page for other purposes, and
depending on what use the page is put to, it may result in various performance
and/or functional issues ( hangs, reboots).
Since this release is only required for VMI, I now release the pgd page from
the (vmi)_pgd_free hook.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
I noticed that kerneltop interrupts were accounted as NMI, but not their
perf counter origin.
Account NMI performance counter interrupts.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/cpu/perf_counter.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Impact: find right nr_irqs_gsi on some systems.
One test-system has gap between gsi's:
[ 0.000000] ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
[ 0.000000] IOAPIC[0]: apic_id 4, version 0, address 0xfec00000, GSI 0-23
[ 0.000000] ACPI: IOAPIC (id[0x05] address[0xfeafd000] gsi_base[48])
[ 0.000000] IOAPIC[1]: apic_id 5, version 0, address 0xfeafd000, GSI 48-54
[ 0.000000] ACPI: IOAPIC (id[0x06] address[0xfeafc000] gsi_base[56])
[ 0.000000] IOAPIC[2]: apic_id 6, version 0, address 0xfeafc000, GSI 56-62
...
[ 0.000000] nr_irqs_gsi: 38
So nr_irqs_gsi is not right. some irq for MSI will overwrite with io_apic.
need to get that with acpi_probe_gsi when acpi io_apic is used
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the differences in interrupt handling hoisted into handle_irq(),
do_IRQ is more or less identical between 32 and 64 bit, so unify it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Xen uses a different interrupt path, so introduce handle_irq() to
allow interrupts to be inserted into the normal interrupt path. This
is handled slightly differently on 32 and 64-bit.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/early_printk.c: In function ‘early_dbgp_init’:
arch/x86/kernel/early_printk.c:827: error: ‘PAGE_KERNEL_NOCACHE’ undeclared (first use in this function)
arch/x86/kernel/early_printk.c:827: error: (Each undeclared identifier is reported only once
arch/x86/kernel/early_printk.c:827: error: for each function it appears in.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For Intel 7400 series CPUs, the recommendation is to use a clflush on the
monitored address just before monitor and mwait pair [1].
This clflush makes sure that there are no false wakeups from mwait when the
monitored address was recently written to.
[1] "MONITOR/MWAIT Recommendations for Intel Xeon Processor 7400 series"
section in specification update document of 7400 series
http://download.intel.com/design/xeon/specupdt/32033601.pdf
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the function graph tracer picks a return address, it ensures this address
is really a kernel text one by calling __kernel_text_address()
Actually this path has never been taken.Its role was more likely to debug the tracer
on the beginning of its development but this function is wasteful since it is called
for every traced function.
The fault check is already sufficient.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move the power tracer headers to trace/power.h to keep ftrace.h and power bits
more easy to maintain as separated topics.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup and bug fix
Use the linker to create symbols for certain per-cpu variables
that are offset by __per_cpu_load. This allows the removal of
the runtime fixup of the GDT pointer, which fixes a bug with
resume reported by Jiri Slaby.
Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Without frame pointers enabled, the x86 stack traces should not
pretend to be reliable; instead they should just be what they are:
unreliable.
The effect of this is that they have a '?' printed in the stacktrace,
to warn the reader that these entries are guesses rather than known
based on more reliable information.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: find right nr_irqs_gsi on some systems.
One test-system has gap between gsi's:
[ 0.000000] ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])
[ 0.000000] IOAPIC[0]: apic_id 4, version 0, address 0xfec00000, GSI 0-23
[ 0.000000] ACPI: IOAPIC (id[0x05] address[0xfeafd000] gsi_base[48])
[ 0.000000] IOAPIC[1]: apic_id 5, version 0, address 0xfeafd000, GSI 48-54
[ 0.000000] ACPI: IOAPIC (id[0x06] address[0xfeafc000] gsi_base[56])
[ 0.000000] IOAPIC[2]: apic_id 6, version 0, address 0xfeafc000, GSI 56-62
...
[ 0.000000] nr_irqs_gsi: 38
So nr_irqs_gsi is not right. some irq for MSI will overwrite with io_apic.
need to get that with acpi_probe_gsi when acpi io_apic is used
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make check-timer more robust potentially solve boot fragility
For edge trigger io-apic routing, we already unmasked the pin via
setup_IO_APIC_irq(), so don't unmask it again.
Also call local_irq_disable() between timer_irq_works(), because it
calls local_irq_enable() inside.
Also remove not needed apic version reading for 64-bit
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make nr_irqs depend more on cards used in a system
depend on nr_irq_gsi more, and have a ratio for MSI.
v2: make nr_irqs less than NR_VECTORS * nr_cpu_ids
aka if only one cpu, we only can support nr_irqs = NR_VECTORS
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
Now that a generic in_nmi is available, this patch removes the
special code in the ring_buffer and implements the in_nmi generic
version instead.
With this change, I was also able to rename the "arch_ftrace_nmi_enter"
back to "ftrace_nmi_enter" and remove the code from the ring buffer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
The function graph tracer piggy backed onto the dynamic ftracer
to use the in_nmi custom code for dynamic tracing. The problem
was (as Andrew Morton pointed out) it really only wanted to bail
out if the context of the current CPU was in NMI context. But the
dynamic ftrace in_nmi custom code was true if _any_ CPU happened
to be in NMI context.
Now that we have a generic in_nmi interface, this patch changes
the function graph code to use it instead of the dynamic ftarce
custom code.
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: clean up
The in_nmi variable in x86 arch ftrace.c is a misnomer.
Andrew Morton pointed out that the in_nmi variable is incremented
by all CPUS. It can be set when another CPU is running an NMI.
Since this is actually intentional, the fix is to rename it to
what it really is: "nmi_running"
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: prevent deadlock in NMI
The ring buffers are not yet totally lockless with writing to
the buffer. When a writer crosses a page, it grabs a per cpu spinlock
to protect against a reader. The spinlocks taken by a writer are not
to protect against other writers, since a writer can only write to
its own per cpu buffer. The spinlocks protect against readers that
can touch any cpu buffer. The writers are made to be reentrant
with the spinlocks disabling interrupts.
The problem arises when an NMI writes to the buffer, and that write
crosses a page boundary. If it grabs a spinlock, it can be racing
with another writer (since disabling interrupts does not protect
against NMIs) or with a reader on the same CPU. Luckily, most of the
users are not reentrant and protects against this issue. But if a
user of the ring buffer becomes reentrant (which is what the ring
buffers do allow), if the NMI also writes to the ring buffer then
we risk the chance of a deadlock.
This patch moves the ftrace_nmi_enter called by nmi_enter() to the
ring buffer code. It replaces the current ftrace_nmi_enter that is
used by arch specific code to arch_ftrace_nmi_enter and updates
the Kconfig to handle it.
When an NMI is called, it will set a per cpu variable in the ring buffer
code and will clear it when the NMI exits. If a write to the ring buffer
crosses page boundaries inside an NMI, a trylock is used on the spin
lock instead. If the spinlock fails to be acquired, then the entry
is discarded.
This bug appeared in the ftrace work in the RT tree, where event tracing
is reentrant. This workaround solved the deadlocks that appeared there.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Fix compile problem:
CC arch/x86/kernel/early_printk.o
In file included from /home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/early_printk.c:17:
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h: In function 'pmd_page':
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h:516: error: implicit declaration of function '__pfn_to_section'
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h:516: warning: initialization makes pointer from integer without a cast
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h:516: error: implicit declaration of function '__section_mem_map_addr'
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h:516: warning: return makes pointer from integer without a cast
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h: In function 'pud_page':
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h:586: warning: initialization makes pointer from integer without a cast
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h:586: warning: return makes pointer from integer without a cast
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h: In function 'pgd_page':
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h:625: warning: initialization makes pointer from integer without a cast
/home/jeremy/hg/xen/paravirt/linux/arch/x86/include/asm/pgtable.h:625: warning: return makes pointer from integer without a cast
This is a cycling dependency between asm/pgtable.h and linux/mmzone.h
when using CONFIG_SPARSEMEM. Rather than hacking up the headers some
more, remove asm/pgtable.h, since early_printk.c doesn't actually need
it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Implement Linus's suggestion: introduce the hpet_cnt_ahead()
helper function to compare hpet time values - like other
wrapping counter comparisons are abstracted away elsewhere.
(jiffies, ktime_t, etc.)
Reported-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
only leave _default_ipi_xx etc in .h
Beyond the cleanup factor, this saves a bit of code size as well:
text data bss dec hex filename
7281931 1630144 1463304 10375379 9e50d3 vmlinux.before
7281753 1630144 1463304 10375201 9e5021 vmlinux.after
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
disable_ioapic_setup() in init/main.c is ugly as the function is
x86-specific. The #ifdef inline prototype there is ugly too.
Replace it with a generic arch_disable_smp_support() function - which
has a weak alias for non-x86 architectures and for non-ioapic x86 builds.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
At this time, the PowerNow! driver for K8 uses an experimentally
derived formula to calculate transition latency. The value it
provides is orders of magnitude too large on modern systems.
This patch replaces the formula with ACPI _PSS latency values
for more accuracy and better performance.
I've tested it on two 2nd generation Opteron systems, a 3rd
generation Operton system, and a Turion X2 without seeing any
stability problems.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Dave Jones <davej@redhat.com>
There's a small problem with hpet_rtc_reinit function - it checks
for the:
hpet_readl(HPET_COUNTER) - hpet_t1_cmp > 0
to continue increasing both the HPET_T1_CMP (register) and the
hpet_t1_cmp (variable).
But since the HPET_COUNTER is always 32-bit, if the hpet_t1_cmp
is 64-bit this condition will always be FALSE once the latter hits
the 32-bit boundary, and we can have a situation, when we don't
increase the HPET_T1_CMP register high enough.
The result - timer stops ticking, since HPET_T1_CMP becomes less,
than the COUNTER and never increased again.
The solution is (based on Linus's suggestion) to not compare 64-bits
(on 64-bit x86), but to do the comparison on 32-bit signed
integers.
Reported-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch echoes what we already do on 32-bit since
90f7d25c6b, and prints the DMI
product name in show_regs, so that system specific problems can be
easily identified.
Signed-off-by: Kyle McMartin <kyle@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With oprofile as a module, and unloaded by profiling script,
both oprofile and kerneltop work fine.. unless you leave kerneltop
running when you start profiling, then you may see badness.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
They were long enough set deprecated...
Update Documentation/cpu-freq/users-guide.txt:
The deprecated files listed there seen not to exist for some time anymore
already.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: reduce kernel BSS size by 7 pages, improve code readability
Two page tables are used in current x86_64 kexec implementation. One
is used to jump from kernel virtual address to identity map address,
the other is used to map all physical memory. In fact, on x86_64,
there is no conflict between kernel virtual address space and physical
memory space, so just one page table is sufficient. The page table
pages used to map control page are dynamically allocated to save
memory if kexec image is not loaded. ASM code used to map control page
is replaced by C code too.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: fix to enable APIC for AMD Fam10h on chipsets with a missing/b0rked
ACPI MP table (MADT)
Booting a 32bit kernel on an AMD Fam10h CPU running on chipsets with
missing/b0rked MP table leads to a hang pretty early in the boot process
due to the APIC not being initialized. Fix that by falling back to the
default APIC base address in 32bit code, as it is done in the 64bit
codepath.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Fixes dumpstack and KDB on 64 bits
This re-adds the old stack pointer to the top of the irqstack to help
with unwinding. It was removed in commit d99015b1ab
as part of the save_args out-of-line work.
Both dumpstack and KDB require this information.
Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Zach says:
> Enable/Disable have no clobbers at all.
> Save clobbers only return value, %eax
> Restore also clobbers nothing.
This is precisely compatible with the calling convention, so we can
just call them directly without wrapping.
(Compile tested only.)
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Fix:
arch/x86/kernel/irqinit_32.c:124: warning: 'smp_intr_init' defined but not used
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Eric Paris reported:
> I have an hp dl785g5 which is unable to successfully run
> 2.6.29-0.66.rc3.fc11.x86_64 or 2.6.29-rc2-next-20090126. During bootup
> (early in userspace daemons starting) I get the below BUG, which quickly
> renders the machine dead. I assume it is because sparse_irq_lock never
> gets released when the BUG kills that task.
Adjust lock sequence when migrating a descriptor with
CONFIG_NUMA_MIGRATE_IRQ_DESC enabled.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, ds, bts: cleanup/fix DS configuration
ring-buffer: reset timestamps when ring buffer is reset
trace: set max latency variable to zero on default
trace: stop all recording to ring buffer on ftrace_dump
trace: print ftrace_dump at KERN_EMERG log level
ring_buffer: reset write when reserve buffer fail
tracing/function-graph-tracer: fix a regression while suspend to disk
ring-buffer: fix alignment problem
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86 setup: fix asm constraints in vesa_store_edid
xen: make sysfs files behave as their names suggest
x86: tone down mtrr_trim_uncached_memory() warning
x86: correct the CPUID pattern for MSR_IA32_MISC_ENABLE availability
[
mingo@elte.hu: these fixes are a subset of changes cherry-picked from:
git://git.kernel.org:/pub/scm/linux/kernel/git/jejb/voyager-2.6.git
They fix various problems that recent x86 changes caused in the Voyager
subarchitecture: both APIC changes and cpumask changes and certain
cleanups caused subarch assumptions to break.
Most of these changes are obsolete as the subarch code has been removed
from the x86 development tree - but we merge them upstream to make Voyager
build and boot.
]
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: split out a function, no functional change
Xen needs to be able to access percpu data from very early on. For
various reasons, it cannot also load the gdt at that time. It does,
however, have a pefectly functional gdt at that point, so there's no
pressing need to reload the gdt.
Split the function to load the segment registers off, so Xen can call
it directly.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup, prepare for xen boot fix.
Xen needs to call this function very early to setup the GDT and
per-cpu segments. Remove the call to smp_processor_id() and just
pass in the cpu number.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: fix possible tlb mis-flushing on UV
uv_flush_send_and_wait() should return a pointer if the broadcast
remote tlb shootdown requests fail. That causes the conventional IPI
method of shootdown to be used.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Optimization
In the native case, pte_val, make_pte, etc are all just identity
functions, so there's no need to clobber a lot of registers over them.
(This changes the 32-bit callee-save calling convention to return both
EAX and EDX so functions can return 64-bit values.)
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Optimization
One of the problems with inserting a pile of C calls where previously
there were none is that the register pressure is greatly increased.
The C calling convention says that the caller must expect a certain
set of registers may be trashed by the callee, and that the callee can
use those registers without restriction. This includes the function
argument registers, and several others.
This patch seeks to alleviate this pressure by introducing wrapper
thunks that will do the register saving/restoring, so that the
callsite doesn't need to worry about it, but the callee function can
be conventional compiler-generated code. In many cases (particularly
performance-sensitive cases) the callee will be in assembler anyway,
and need not use the compiler's calling convention.
Standard calling convention is:
arguments return scratch
x86-32 eax edx ecx eax ?
x86-64 rdi rsi rdx rcx rax r8 r9 r10 r11
The thunk preserves all argument and scratch registers. The return
register is not preserved, and is available as a scratch register for
unwrapped callee code (and of course the return value).
Wrapped function pointers are themselves wrapped in a struct
paravirt_callee_save structure, in order to get some warning from the
compiler when functions with mismatched calling conventions are used.
The most common paravirt ops, both statically and dynamically, are
interrupt enable/disable/save/restore, so handle them first. This is
particularly easy since their calls are handled specially anyway.
XXX Deal with VMI. What's their calling convention?
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Fix latent bug
The clobber is trying to say that anything except RDI is available for
clobbering, but actually clobbers everything. This hasn't mattered
because the clobbers were basically ignored, but subsequent patches
will rely on them.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Optimization
Several paravirt ops implementations simply return their arguments,
the most obvious being the make_pte/pte_val class of operations on
native.
On 32-bit, the identity function is literally a no-op, as the calling
convention uses the same registers for the first argument and return.
On 64-bit, it can be implemented with a single "mov".
This patch adds special identity functions for 32 and 64 bit argument,
and machinery to recognize them and replace them with either nops or a
mov as appropriate.
At the moment, the only users for the identity functions are the
pagetable entry conversion functions.
The result is a measureable improvement on pagetable-heavy benchmarks
(2-3%, reducing the pvops overhead from 5 to 2%).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: fix rare crash on 32-bit
The 32-bit APIC drivers had their send_IPI_self vectors set to NULL,
but ioapic_retrigger_irq() depends on it being always set. Fix it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
just like 64 bit switch from flat logical APIC messages to
flat physical mode automatically.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: 32-bit should use logical version
there are two version: for default_send_IPI_mask_sequence/allbutself
one in ipi.h and one in ipi.c for 32bit
it seems .h version overwrote ipi.c for a while.
restore it so 32 bit could use its old logical version.
also remove dupicated functions in .c
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move DMA-mapping.txt to Documentation/PCI/.
DMA-mapping.txt was supposed to be moved from Documentation/ to
Documentation/PCI/. The 00-INDEX files in those two directories
were updated, along with a few other text files, but the file
itself somehow escaped being moved, so move it and update more
text files and source files with its new location.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: Cleans up printk formatting
When LOCAL APIC was calibrated, the debug message is displayed as follows.
CPU0: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz stepping 06
Using local APIC timer interrupts.
calibrating APIC timer ...
... lapic delta = 3773131
... PM timer delta = 812434
APIC calibration not consistent with PM Timer: 226ms instead of 100ms
APIC delta adjusted to PM-Timer: 1662420 (3773131)
TSC delta adjusted to PM-Timer: 159592409 (362220564)
..... delta 1662420
..... mult: 71411249
..... calibration result: 265987
..... CPU clock speed is 1595.0924 MHz.
..... host bus clock speed is 265.0987 MHz.
There are three type of PM-Timer (PM-Timer, PM Timer, and PM timer),
in this message. This patch unifies those messages to PM-Timer.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: Fixes incorrect printk
LOCAL APIC is corrected by PM-Timer, when SMI occurred while LOCAL APIC is calibrated.
In this case, LOCAL APIC debug message(Boot with apic=debug) is displayed correctly,
however, CPU clock speed debug message is displayed wrongly .
When SMI occured on my machine, which has 1.6GHz CPU, CPU clock speed is displayed
3622.0205 MHz as follow.
CPU0: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz stepping 06
Using local APIC timer interrupts.
calibrating APIC timer ...
... lapic delta = 3773130
... PM timer delta = 812434
APIC calibration not consistent with PM Timer: 226ms instead of 100ms
APIC delta adjusted to PM-Timer: 1662420 (3773130)
..... delta 1662420
..... mult: 71411249
..... calibration result: 265987
..... CPU clock speed is 3622.0205 MHz. =====> here
..... host bus clock speed is 265.0987 MHz.
This patch fixes to displaying CPU clock speed correctly as follow.
CPU0: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz stepping 06
Using local APIC timer interrupts.
calibrating APIC timer ...
... lapic delta = 3773131
... PM timer delta = 812434
APIC calibration not consistent with PM Timer: 226ms instead of 100ms
APIC delta adjusted to PM-Timer: 1662420 (3773131)
TSC delta adjusted to PM-Timer: 159592409 (362220564)
..... delta 1662420
..... mult: 71411249
..... calibration result: 265987
..... CPU clock speed is 1595.0924 MHz.
..... host bus clock speed is 265.0987 MHz.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
X86_PC is the only remaining 'sub' architecture, so we dont need
it anymore.
This also cleans up a few spurious references to X86_PC in the
driver space - those certainly should be X86.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
xapic fix for 32bit platform with less than 8 cpu's.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
We may use macros from processor-flags.h instead
of hardcoding bits. Actually it's not direct mapping
of old instructions with new ones -- BTS does change
CF flag while MOV does not. But i didn't find any
dependency on CF in this code.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
X86_GENERICARCH is a misnomer - it contains non-PC 32-bit architectures
that are not included in the default build.
Rename it to X86_32_NON_STANDARD.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86/Voyager had this Kconfig quirk:
config X86_FIND_SMP_CONFIG
def_bool y
depends on X86_MPPARSE || X86_VOYAGER
Which splits off the find_smp_config() callback into a build-time quirk.
Voyager should use the existing x86_quirks.mach_find_smp_config() callback
to introduce SMP-config quirks. NUMAQ-32 and VISWS already use this.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Voyager has this Kconfig quirk:
config X86_BIOS_REBOOT
bool
depends on !X86_VOYAGER
default y
Voyager should use the existing machine_ops.emergency_restart reboot
quirk mechanism instead of a build-time quirk.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86/Voyager can boot on non-zero processors. While that can probably
be fixed by properly remapping the physical CPU IDs, keep boot_cpu_id
for now for easier transition - and expand it to all of x86.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86/Voyager subarch used to have this distinction between
'x86 SMP support' and 'Voyager SMP support':
config X86_SMP
bool
depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
This is a pointless distinction - Voyager can (and already does) use
smp_ops to implement various SMP quirks it has - and it can be extended
more to cover all the specialities of Voyager.
So remove this complication in the Kconfig space.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the 32-bit subarchitecture support code.
All subarchitectures but Voyager have been converted. Voyager will be
done later or will be removed.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We are getting rid of subarchitecture support - move the hook files
to asm/. (These are now stale and should be replaced with more explicit
runtime mechanisms - but the transition is simpler this way.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove remaining bits of the subarchitecture code. Now that all the
special platforms are runtime probed and runtime handled, we can remove
these facilities.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move all code to arch/x86/kernel/bigsmp_32.c.
With this it ceases to rely on any build-time subarch features.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move all NUMAQ code into arch/x86/kernel/numaq.c.
With this it ceases to rely on any build-time subarch features.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move all ES7000 code into arch/x86/kernel/es7000_32.c.
With this it ceases to rely on any build-time subarch features.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kerneloops.org is reporting a lot of these warnings that come due to
vmware not setting up any MTRRs for emulated CPUs:
| Reported 709 times (14696 total reports)
| BIOS bug (often in VMWare) where the MTRR's are set up incorrectly
| or not at all
|
| This warning was last seen in version 2.6.29-rc2-git1, and first
| seen in 2.6.24.
|
| More info:
| http://www.kerneloops.org/searchweek.php?search=mtrr_trim_uncached_memory
Keep a one-liner KERN_INFO about it - so that we have so notice if empty
MTRRs are caused by native hardware/BIOS weirdness.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Nothing exciting - a few subarches dont want APIC remote reads to
be performed - the others are content with the default method.
- extend the generic code to handle NULL methods
- clear out dummy methods and replace them with NULL
- clean up: remove wrapper macros, etc.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Only NUMAQ does something substantial here, because it initializes
via NMIs (not via INIT as standard SMP startup) - so it needs to
store and restore the NMI vector.
- extend the generic code to handle NULL methods
- clear out dummy methods and replace them with NULL
- clean up: remove wrapper macros, etc.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Only NUMAQ does something substantial here, because it initializes
via NMIs (not via INIT as standard SMP startup) - so it needs to
reset the APIC.
- extend the generic code to handle NULL methods
- clear out dummy methods and replace them with NULL
- clean up: remove wrapper macros, etc.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- spread out the namespace on a per APIC driver basis
- handle a NULL ->wait_for_init_deassert() as a 'dont wait' default method
- remove NUMAQ and Summit handlers
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit x86 has zero for ->trampoline_phys_low/high, but the smpboot
code can use these values - so it's better to set them up to their
correct values.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Our send_IPI_*() methods and definitions are a twisted mess: the same
symbol is defined to different things depending on .config details,
in a non-transparent way.
- spread out the quirks into separately named per apic driver methods
- prefix the standard PC methods with default_
- get rid of wrapper macro obfuscation
- clean up various details
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Call all the registered MPS quirk handlers early. These methods scan
low RAM typically for specific signatures so are safe to be called
early.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Refactor the ->phys_pkg_id() methods:
- namespace separation
- macro wrapper removal
- open-coded calls to the methods in the generic code
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- unify the call signature of 64-bit to that of 32-bit
- clean up the types all around
- clean up namespace contamination
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- eliminate the needless es7000_enable_apic_mode() complication which
was not apparent prior the namespace cleanups
- clean up the control flow in es7000_enable_apic_mode()
- other cleanups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Only ES7000 has a real ->enable_apic_mode() method, the other
subarchitectures define it but keep it empty.
So mark the vector as NULL, extend the generic code to handle
NULL -setup_portio_remap() entries and remove all the empty
handlers.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- spread out the namespace to per driver methods
- extend it to 64-bit as well so that we can use
apic->check_phys_apicid_present() unconditionally
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Only NUMAQ has a real ->setup_portio_remap() method, the other
subarchitectures define it but keep it empty.
So mark the vector as NULL, extend the generic code to handle
NULL -setup_portio_remap() entries and remove all the empty
handlers.
Also move the NUMAQ method from the header file into the
apic driver .c file.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
only NUMAQ uses this quirk: to prevent the timer IRQ from being added
on secondary nodes.
All other genapic templates can have a NULL ->multi_timer_check()
callback.
Also, extend the generic code to treat a NULL pointer accordingly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The bigsmp and es7000 subarchitectures un-defined APIC_DEST_LOGICAL in
a rather nasty way by re-defining it to zero. That is infinitely
fragile and makes it very hard to see what to code really does in
a given context. The very same constant has different meanings and
values - depending on which subarch is enabled.
Untangle this mess by never undefining the constant, but instead
propagating the right values into the genapic driver templates.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
the ->ESR_DISABLE shouting variant was used to enable the esr_disable
macro wrappers. Those ugly macros are removed now so we can rename
->ESR_DISABLE to ->disable_esr
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Most subarchitectures want to disable the APIC ESR (Error Status Register),
because they generally have hardware hacks that wrap standard CPUs into
a bigger system and hence the APIC bus is quite non-standard and weirdnesses
(lockups) have been seen with ESR reporting.
Remove the esr_disable macros and put the desired flag into each
subarchitecture's genapic template directly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the wrapper macros IRQ_DEST_MODE and IRQ_DELIVERY_MODE.
The typical 32-bit and the 64-bit build all dereference via the genapic,
so it's pointless to hide that indirection via these ugly macros.
Furthermore, it also obscures subarchitecture details.
So replace it with apic->irq_dest_mode / etc. accesses.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
int_delivery_mode is supposed to mean 'interrupt delivery mode', but
it's quite a misnomer as 'int' we usually think of as an integer type ...
The standard naming for such attributes is 'irq' - so rename the following
fields and macros:
int_delivery_mode => irq_delivery_mode
INT_DELIVERY_MODE => IRQ_DELIVERY_MODE
int_dest_mode => irq_dest_mode
INT_DEST_MODE => IRQ_DEST_MODE
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
x86 subarchitectures each defined a "apic_id_registered()" method,
which could be an inline function depending on which subarch we build
for, and which was also the name of a genapic field.
Untangle this namespace spaghetti by giving each of the instances
a separate name.
Also remove wrapper macro obfuscation.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: refactor code
x86 subarchitectures each defined a "acpi_madt_oem_check()" method,
which could be an inline function, or an extern, or a static function,
and which was also the name of a genapic field.
Untangle this namespace spaghetti by setting ->acpi_madt_oem_check()
to NULL on those subarchitectures that have no detection quirks,
and rename the other ones (summit, es7000) that do.
Also change default_acpi_madt_oem_check() to handle NULL entries,
and clean its control flow up as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- reorder fields so that they appear in struct genapic field ordering
- add zero-initialized fields too so that it's apparent which functionality
is default / missing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- reorder fields so that they appear in struct genapic field ordering
- add zero-initialized fields too so that it's apparent which functionality
is default / missing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- reorder fields so that they appear in struct genapic field ordering
- add zero-initialized fields too so that it's apparent which functionality
is default / missing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- reorder fields so that they appear in struct genapic field ordering
- add zero-initialized fields too so that it's apparent which functionality
is default / missing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- reorder fields so that they appear in struct genapic field ordering
- add zero-initialized fields too so that it's apparent which functionality
is default / missing.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rename genapic-> to apic-> references because in a future chagne we'll
open-code all the indirect calls (instead of obscuring them via macros),
so we want this reference to be as short as possible.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Cleanup
While I was looking through the new and improved bootstrap code - great
work that, thanks! I found the below a slight improvement.
Remove unnecessary ugly #ifdef construct around debug register clear.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
fix the following warning:
CC arch/x86/kernel/cpu/intel_cacheinfo.o
arch/x86/kernel/cpu/intel_cacheinfo.c:314: warning: 'cpuid4_cache_lookup' defined but not used
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
x86_cpu_to_apicid and x86_bios_cpu_apicid aren't defined for voyage.
Earlier patch forgot to conditionalize early percpu clearing. Fix it.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: sync 32 and 64-bit code
Merge load_gs_base() into switch_to_new_gdt(). Load the GDT and
per-cpu state for the boot cpu when its new area is set up.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
Rename init_gdt() to setup_percpu_segment(), and move it to
setup_percpu.c.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: standardize all x86 platforms on same setup code
With the preceding changes, Voyager can use the same per-cpu setup
code as all the other x86 platforms.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Small cleanup
Define BOOT_PERCPU_OFFSET and use it for this_cpu_offset and
__per_cpu_offset initializers.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Code movement
Move the variable definitions to apic.c. Ifdef the copying of
the two early per-cpu variables, since Voyager doesn't use them.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
The way the code is written, align is always PAGE_SIZE. Simplify
the code by removing the align variable.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Code movement, no functional change.
Move setup_cpu_local_masks() to kernel/cpu/common.c, where the
masks are defined.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Code movement, no functional change.
Move the 64-bit NUMA code from setup_percpu.c to numa_64.c
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: minor optimization
Eliminates the need for two loops over possible cpus.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
1. add smp_intr_init and apic_intr_init for 32bit, the same as 64bit
2. move the apic_intr_init() call before set gate with interrupt[i]
3. for 64bit, if ia32_emulation is not used, will make per_cpu to use 0x80 vector.
[ v2: should use !test_bit() instead of test_bit() with 32bit ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
debugobjects: add and use INIT_WORK_ON_STACK
rcu: remove duplicate CONFIG_RCU_CPU_STALL_DETECTOR
relay: fix lock imbalance in relay_late_setup_files
oprofile: fix uninitialized use of struct op_entry
rcu: move Kconfig menu
softlock: fix false panic which can occur if softlockup_thresh is reduced
rcu: add __cpuinit to rcu_init_percpu_data()
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimers: fix inconsistent lock state on resume in hres_timers_resume
time-sched.c: tick_nohz_update_jiffies should be static
locking, hpet: annotate false positive warning
kernel/fork.c: unused variable 'ret'
itimers: remove the per-cpu-ish-ness
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
xen: unitialised return value in xenbus_write_transaction
x86: fix section mismatch warning
x86: unmask CPUID levels on Intel CPUs, fix
x86: work around PAGE_KERNEL_WC not getting WC in iomap_atomic_prot_pfn.
x86: use standard PIT frequency
xen: handle highmem pages correctly when shrinking a domain
x86, mm: fix pte_free()
xen: actually release memory when shrinking domain
x86: unmask CPUID levels on Intel CPUs
x86: add MSR_IA32_MISC_ENABLE bits to <asm/msr-index.h>
x86: fix PTE corruption issue while mapping RAM using /dev/mem
x86: mtrr fix debug boot parameter
x86: fix page attribute corruption with cpa()
Revert "x86: signal: change type of paramter for sys_rt_sigreturn()"
x86: use early clobbers in usercopy*.c
x86: remove kernel_physical_mapping_init() from init section
fix: crash: IP: __bitmap_intersects+0x48/0x73
cpufreq: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
work_on_cpu: Use our own workqueue.
work_on_cpu: don't try to get_online_cpus() in work_on_cpu.
...
Impact: re-enable CPUID unmasking on affected processors
As far as I am capable of discerning from the documentation,
MSR_IA32_MISC_ENABLE should be available for all family 0xf CPUs, as
well as family 6 for model >= 0xd (newer Pentium M).
The documentation on this isn't ideal, so we need to be on the lookout
for errors, still.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Here function vmi_activate calls a init function activate_vmi , which
causes the following section mismatch warnings:
LD arch/x86/kernel/built-in.o
WARNING: arch/x86/kernel/built-in.o(.text+0x13ba9): Section mismatch
in reference from the function vmi_activate() to the function
.init.text:vmi_time_init()
The function vmi_activate() references
the function __init vmi_time_init().
This is often because vmi_activate lacks a __init
annotation or the annotation of vmi_time_init is wrong.
WARNING: arch/x86/kernel/built-in.o(.text+0x13bd1): Section mismatch
in reference from the function vmi_activate() to the function
.devinit.text:vmi_time_bsp_init()
The function vmi_activate() references
the function __devinit vmi_time_bsp_init().
This is often because vmi_activate lacks a __devinit
annotation or the annotation of vmi_time_bsp_init is wrong.
WARNING: arch/x86/kernel/built-in.o(.text+0x13bdb): Section mismatch
in reference from the function vmi_activate() to the function
.devinit.text:vmi_time_ap_init()
The function vmi_activate() references
the function __devinit vmi_time_ap_init().
This is often because vmi_activate lacks a __devinit
annotation or the annotation of vmi_time_ap_init is wrong.
Fix it by marking vmi_activate() as __init too.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove such constructs:
#ifdef CONFIG_EARLY_PRINTK
call early_printk
#else
call printk
#endif
Not only are they ugly, they are also pointless: a call to printk()
maps to early_printk during early bootup anyway, if CONFIG_EARLY_PRINTK
is enabled.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot hang on pre-model-15 Intel CPUs
rdmsrl_safe() does not work in very early bootup code yet, because we
dont have the pagefault handler installed yet so exception section
does not get parsed. rdmsr_safe() will just crash and hang the bootup.
So limit the MSR_IA32_MISC_ENABLE MSR read to those CPU types that
support it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Fixes potential crashes on misconfigured systems.
Some CPU features require specific CPUID levels to be available in
order to function, as they contain information about the operation of
a specific feature. However, some BIOSes and virtualization software
provide the ability to mask CPUID levels in order to support legacy
operating systems. We try to enable such CPUID levels when we know
how to do it, but for the remaining cases, filter out such CPU
features when there is no way for us to support them.
Do this in one place, in the CPUID code, with a table-driven approach.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: Cleanup
When PAT was originally introduced, it was handled specially for a few
reasons:
- PAT bugs are hard to track down, so we wanted to maintain a
whitelist of CPUs.
- The i386 and x86-64 CPUID code was not yet unified.
Both of these are now obsolete, so handle PAT like any other features,
including ordinary feature blacklisting due to known bugs.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: use new framework
Use {get|put}_user_try, catch, and _ex in arch/x86/kernel/signal.c.
Note: this patch contains "WARNING: line over 80 characters", because when
introducing new block I insert an indent to avoid mistakes by edit.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Fix:
WARNING: arch/x86/kernel/built-in.o(.text+0xdd0f): Section mismatch in reference from the function pmc_generic_enable() to the function .cpuinit.text:perf_counters_lapic_init()
The function pmc_generic_enable() references
the function __cpuinit perf_counters_lapic_init().
This is often because pmc_generic_enable lacks a __cpuinit
annotation or the annotation of perf_counters_lapic_init is wrong.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ratelimit performance counter interrupts to 100KHz per CPU.
This replaces the irq-delta-time based method.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The file is used for 32 and 64 bit since:
commit cfb80c9eae
Author: Jeremy Fitzhardinge <jeremy@goop.org>
Date: Tue Dec 16 12:17:36 2008 -0800
x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Starting kerneltop with only -c 100 seems to be a bad idea, it can
easily lock the system due to perfcounter IRQ overload.
So add throttling: if a new IRQ arrives in a shorter than
PERFMON_MIN_PERIOD_NS time, turn off perfcounters and untrottle them
from the next timer tick.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
APIC definitions aren't needed here. Remove the include and fix
up the fallout.
tj: added include to mce_intel_64.c.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: bogus irq_cpustat field removed
idle_timestamp is left over from the removed irqbalance code.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
pte_flags() was introduced as a new pvop in order to extract just the
flags portion of a pte, which is a potentially cheaper operation than
extracting the page number as well. It turns out this operation is
not needed, because simply using a mask to extract the flags from a
pte is sufficient for all current users.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cleanup the cpuid check for DS configuration.
This also fixes a Corei7 CPUID enumeration bug.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Fix debugobjects warning
debugobject enabled kernels spit out a warning in hpet code due to a
workqueue which is initialized on stack.
Add INIT_WORK_ON_STACK() which calls init_timer_on_stack() and use it
in hpet.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: Fixes crashes with misconfigured BIOSes on XSAVE hardware
Avuton Olrich reported early boot crashes with v2.6.28 and
bisected it down to dc1e35c6e9
("x86, xsave: enable xsave/xrstor on cpus with xsave support").
If the CPUID limit bit in MSR_IA32_MISC_ENABLE is set, clear it to
make all CPUID information available. This is required for some
features to work, in particular XSAVE.
Reported-and-bisected-by: Avuton Olrich <avuton@gmail.com>
Tested-by: Avuton Olrich <avuton@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Make X86 SGI Ultraviolet support configurable. Saves about 13K of text size
on my modest config.
text data bss dec hex filename
6770537 1158680 694356 8623573 8395d5 vmlinux
6757492 1157664 694228 8609384 835e68 vmlinux.nouv
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
while looking at:
http://bugzilla.kernel.org/show_bug.cgi?id=11541
I realized that the mtrr.show param cannot work, because
the code is processed much too early.
This patch:
- Declares mtrr.show as early_param
- Stays consistent with the previous param (which I doubt
that it ever worked), so mtrr.show=1 would still work
- Declares mtrr_show as initdata
Signed-off-by: Thomas Renninger <trenn@suse.de>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Now that it's unified, move the (SMP) TLB flushing code from arch/x86/kernel/
to arch/x86/mm/, where it belongs logically.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
arch/x86/kernel/tlb_32.c
Merge it here because both the cpumask changes and the ongoing percpu
work is touching the TLB code. The percpu changes take precedence, as
they eliminate tlb_32.c altogether.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 4217458daf.
Justin Madru bisected this commit, it was causing weird Firefox
crashes.
The reason is that GCC mis-optimizes (re-uses) the on-stack parameters of
the calling frame, which corrupts the syscall return pt_regs state and
thus corrupts user-space register state.
So we go back to the slightly less clean but more optimization-safe
method of getting to pt_regs. Also add a comment to explain this.
Resolves: http://bugzilla.kernel.org/show_bug.cgi?id=12505
Reported-and-bisected-by: Justin Madru <jdm64@gawab.com>
Tested-by: Justin Madru <jdm64@gawab.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: less contention when issuing invalidate IPI, cleanup
Make x86_32 use the same tlb code as 64bit. The 64bit code uses
multiple IPI vectors for tlb shootdown to reduce contention. This
patch makes x86_32 allocate the same 8 IPIs as x86_64 and share the
code paths.
Note that the usage of asmlinkage is inconsistent for x86_32 and 64
and calls for further cleanup. This has been noted with a FIXME
comment in tlb_64.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: clean up, ipi vector number reordering for x86_32
Make the following changes to prepare for tlb merge.
* reorder x86_32 ip vectors
* adjust tlb_32.c and tlb_64.c such that their logics coincide exactly
- on spurious invalidate ipi, tlb_32 acks the irq
- tlb_64 now has proper memory barriers around clearing
flush_cpumask (no change in generated code)
* unexport flush_tlb_page from tlb_32.c, there's no user
* use unsigned int for cpu id
* drop unnecessary includes from tlb_64.c
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
Make the following uv related cleanups.
* collect visible uv related definitions and interfaces into uv/uv.h
and use it. this cleans up the messy situation where on 64bit, uv
is defined properly, on 32bit generic it's dummy and on the rest
undefined. after this clean up, uv is defined on 64 and dummy on
32.
* update uv_flush_tlb_others() such that it takes cpumask of
to-be-flushed cpus as argument, instead of that minus self, and
returns yet-to-be-flushed cpumask, instead of modifying the passed
in parameter. this interface change will ease dummy implementation
of uv_flush_tlb_others() and makes uv tlb flush related stuff
defined in tlb_uv proper.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup, better irq_regs code generation for x86_64
Make 64-bit use the same optimizations as 32-bit.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
%fs is currently set to __KERNEL_DS at boot, and conditionally
switched to __KERNEL_PERCPU for secondary cpus. Instead, initialize
GDT_ENTRY_PERCPU to the same attributes as GDT_ENTRY_KERNEL_DS and
set %fs to __KERNEL_PERCPU unconditionally.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup && more compact percpu area layout with future changes
Move 64-bit GDT to page-aligned section and clean up comment
formatting.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
Remove byte locks implementation, which was introduced by Jeremy in
8efcbab6 ("paravirt: introduce a "lock-byte" spinlock implementation"),
but turned out to be dead code that is not used by any in-kernel
virtualization guest (Xen uses its own variant of spinlocks implementation
and KVM is not planning to move to byte locks).
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cleanup the cpuid check for DS configuration.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dump the branch trace on an oops (based on ftrace_dump_on_oops).
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: x86_64 percpu area layout change, irq_stack now at the beginning
Now that the PDA is empty except for the stack canary, it can be removed.
The irqstack is moved to the start of the per-cpu section. If the stack
protector is enabled, the canary overlaps the bottom 48 bytes of the irqstack.
tj: * updated subject
* dropped asm relocation of irq_stack_ptr
* updated comments a bit
* rebased on top of stack canary changes
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
Use cpu_number to determine if the adjustment is necessary.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
Copy the code to cpu_init() to satisfy the requirement that the cpu
be reinitialized. Remove all other calls, since the segments are
already initialized in head_64.S.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
Make the following cleanups.
* remove duplicate comment from boot_init_stack_canary() which fits
better in the other place - cpu_idle().
* move stack_canary offset check from __switch_to() to
boot_init_stack_canary().
Signed-off-by: Tejun Heo <tj@kernel.org>
-tip testing found this crash:
> [ 35.258515] calling acpi_cpufreq_init+0x0/0x127 @ 1
> [ 35.264127] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 35.267554] IP: [<ffffffff80478092>] __bitmap_intersects+0x48/0x73
> [ 35.267554] PGD 0
> [ 35.267554] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c is still broken: there's no
allocation of the variable mask, so we pass in an uninitialized cmd.mask
field to drv_read(), which then passes it to the scheduler which then
crashes ...
Switch it over to the much simpler constant-cpumask-pointers approach.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new work_on_cpu function to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for drv_read() and drv_write().
Basically converts do_drv_{read,write} into "work_on_cpu" functions that
are now called by drv_read and drv_write.
Note: This patch basically reverts 50c668d6 which reverted 7503bfba, now
that the work_on_cpu() function is more stable.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Dieter Ries <clip2@gmx.de>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <cpufreq@vger.kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since commit 75c46fa, "x64, x2apic/intr-remap: MSI and MSI-X
support for interrupt remapping infrastructure", x86 has had an
implementation of arch_setup_msi_irqs().
That implementation does not call arch_setup_msi_irq(), instead it calls
setup_irq(). No other x86 code calls arch_setup_msi_irq().
That leaves only arch_setup_msi_irqs() in drivers/pci/msi.c, but that
routine is overridden by the x86 version of arch_setup_msi_irqs().
So arch_setup_msi_irq() is dead code, remove it.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The function setup_cpu_local_masks() has been marked __init, in
order to remove the following section mismatch messages:
WARNING: vmlinux.o(.text+0x3c2c7): Section mismatch in reference from the function setup_cpu_local_masks() to the function .init.text:alloc_bootmem_cpumask_var()
The function setup_cpu_local_masks() references
the function __init alloc_bootmem_cpumask_var().
This is often because setup_cpu_local_masks lacks a __init
annotation or the annotation of alloc_bootmem_cpumask_var is wrong.
WARNING: vmlinux.o(.text+0x3c2d3): Section mismatch in reference from the function setup_cpu_local_masks() to the function .init.text:alloc_bootmem_cpumask_var()
The function setup_cpu_local_masks() references
the function __init alloc_bootmem_cpumask_var().
This is often because setup_cpu_local_masks lacks a __init
annotation or the annotation of alloc_bootmem_cpumask_var is wrong.
WARNING: vmlinux.o(.text+0x3c2df): Section mismatch in reference from the function setup_cpu_local_masks() to the function .init.text:alloc_bootmem_cpumask_var()
The function setup_cpu_local_masks() references
the function __init alloc_bootmem_cpumask_var().
This is often because setup_cpu_local_masks lacks a __init
annotation or the annotation of alloc_bootmem_cpumask_var is wrong.
WARNING: vmlinux.o(.text+0x3c2eb): Section mismatch in reference from the function setup_cpu_local_masks() to the function .init.text:alloc_bootmem_cpumask_var()
The function setup_cpu_local_masks() references
the function __init alloc_bootmem_cpumask_var().
This is often because setup_cpu_local_masks lacks a __init
annotation or the annotation of alloc_bootmem_cpumask_var is wrong.
Signed-off-by: Leonardo Potenza <lpotenza@inwind.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add debug warning
Fire off one message if two apic's discovered with different
apic versions. (this code is only called during CPU init)
The goal of this is to pave the way of the removal of the apic_version[]
array. We dont expect any apic version incompatibilities in the x86
landscape of systems [if so we dont handle them very well and probably
never will handle deep apic version assymetries well], but it's prudent
to have a debug check for one kernel cycle nevertheless.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/x86/include/asm/pda.h
We merge tip/core/percpu into tip/perfcounters/core because of a
semantic and contextual conflict: the former eliminates the PDA,
while the latter extends it with apic_perf_irqs field.
Resolve the conflict by moving the new field to the irq_cpustat
structure on 64-bit too.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tj: * in asm-offsets_64.c, pda.h inclusion shouldn't be removed as pda
is still referenced in the file
* s/oldrsp/old_rsp/
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Also clean up PER_CPU_VAR usage in xen-asm_64.S
tj: * remove now unused stack_thread_info()
* s/kernelstack/kernel_stack/
* added FIXME comment in xen-asm_64.S
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
tj: moved cpu_number definition out of CONFIG_HAVE_SETUP_PER_CPU_AREA
for voyager.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Move the exception stacks to per-cpu, removing specific allocation code.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Move the irqstackptr variable from the PDA to per-cpu. Make the
stacks themselves per-cpu, removing some specific allocation code.
Add a seperate flag (is_boot_cpu) to simplify the per-cpu boot
adjustments.
tj: * sprinkle some underbars around.
* irq_stack_ptr is not used till traps_init(), no reason to
initialize it early. On SMP, just leaving it NULL till proper
initialization in setup_per_cpu_areas() works. Dropped
is_boot_cpu and early irq_stack_ptr initialization.
* do DECLARE/DEFINE_PER_CPU(char[IRQ_STACK_SIZE], irq_stack)
instead of (char, irq_stack[IRQ_STACK_SIZE]).
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: use new work_on_cpu function to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for drv_read() and drv_write().
Basically converts do_drv_{read,write} into "work_on_cpu" functions that
are now called by drv_read and drv_write.
Note: This patch basically reverts 50c668d6 which reverted 7503bfba, now
that the work_on_cpu() function is more stable.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Tested-by: Dieter Ries <clip2@gmx.de>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <cpufreq@vger.kernel.org>
Fix two compilation warnings in drivers/acpi/sleep.c, one triggered
by unsetting CONFIG_SUSPEND and the other triggered by unsetting
CONFIG_HIBERNATION, by moving some code under the appropriate
#ifdefs .
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Check CONFIG_FREEZER instead of CONFIG_PM because kprobe booster
depends on freeze_processes() and thaw_processes() when CONFIG_PREEMPT=y.
This fixes a linkage error which occurs when CONFIG_PREEMPT=y, CONFIG_PM=y
and CONFIG_FREEZER=n.
Reported-by: Cheng Renquan <crquan@gmail.com>
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Len Brown <len.brown@intel.com>
On x86_64, if get_per_cpu_var() is used before per cpu area is setup
(if lockdep is turned on, it happens), it needs this_cpu_off to point
to __per_cpu_load. Initialize accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
It is an optimization and a cleanup, and adds the following new
generic percpu methods:
percpu_read()
percpu_write()
percpu_add()
percpu_sub()
percpu_and()
percpu_or()
percpu_xor()
and implements support for them on x86. (other architectures will fall
back to a default implementation)
The advantage is that for example to read a local percpu variable,
instead of this sequence:
return __get_cpu_var(var);
ffffffff8102ca2b: 48 8b 14 fd 80 09 74 mov -0x7e8bf680(,%rdi,8),%rdx
ffffffff8102ca32: 81
ffffffff8102ca33: 48 c7 c0 d8 59 00 00 mov $0x59d8,%rax
ffffffff8102ca3a: 48 8b 04 10 mov (%rax,%rdx,1),%rax
We can get a single instruction by using the optimized variants:
return percpu_read(var);
ffffffff8102ca3f: 65 48 8b 05 91 8f fd mov %gs:0x7efd8f91(%rip),%rax
I also cleaned up the x86-specific APIs and made the x86 code use
these new generic percpu primitives.
tj: * fixed generic percpu_sub() definition as Roel Kluin pointed out
* added percpu_and() for completeness's sake
* made generic percpu ops atomic against preemption
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Tejun Heo <tj@kernel.org>
Do the following cleanups:
* kill x86_64_init_pda() which now is equivalent to pda_init()
* use per_cpu_offset() instead of cpu_pda() when initializing
initial_gs
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
pda is now a percpu variable and there's no reason it can't use plain
x86 percpu accessors. Add x86_test_and_clear_bit_percpu() and replace
pda op implementations with wrappers around x86 percpu accessors.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Based on original patch from Christoph Lameter and Mike Travis. ]
As pda is now allocated in percpu area, it can easily be made a proper
percpu variable. Make it so by defining per cpu symbol from linker
script and declaring it in C code for SMP and simply defining it for
UP. This change cleans up code and brings SMP and UP closer a bit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that pda is allocated as part of percpu, percpu doesn't need to be
accessed through pda. Unify x86_64 SMP percpu access with x86_32 SMP
one. Other than the segment register, operand size and the base of
percpu symbols, they behave identical now.
This patch replaces now unnecessary pda->data_offset with a dummy
field which is necessary to keep stack_canary at its place. This
patch also moves per_cpu_offset initialization out of init_gdt() into
setup_per_cpu_areas(). Note that this change also necessitates
explicit per_cpu_offset initializations in voyager_smp.c.
With this change, x86_OP_percpu()'s are as efficient on x86_64 as on
x86_32 and also x86_64 can use assembly PER_CPU macros.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Based on original patch from Christoph Lameter and Mike Travis. ]
Currently pdas and percpu areas are allocated separately. %gs points
to local pda and percpu area can be reached using pda->data_offset.
This patch folds pda into percpu area.
Due to strange gcc requirement, pda needs to be at the beginning of
the percpu area so that pda->stack_canary is at %gs:40. To achieve
this, a new percpu output section macro - PERCPU_VADDR_PREALLOC() - is
added and used to reserve pda sized chunk at the start of the percpu
area.
After this change, for boot cpu, %gs first points to pda in the
data.init area and later during setup_per_cpu_areas() gets updated to
point to the actual pda. This means that setup_per_cpu_areas() need
to reload %gs for CPU0 while clearing pda area for other cpus as cpu0
already has modified it when control reaches setup_per_cpu_areas().
This patch also removes now unnecessary get_local_pda() and its call
sites.
A lot of this patch is taken from Mike Travis' "x86_64: Fold pda into
per cpu area" patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
_cpu_pda array first uses statically allocated storage in data.init
and then switches to allocated bootmem to conserve space. However,
after folding pda area into percpu area, _cpu_pda array will be
removed completely. Drop the reallocation part to simplify the code
for soon-to-follow changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Based on original patch from Christoph Lameter and Mike Travis. ]
CPU startup code in head_64.S loaded address of a zero page into %gs
for temporary use till pda is loaded but address to the actual pda is
available at the point. Load the real address directly instead.
This will help unifying percpu and pda handling later on.
This patch is mostly taken from Mike Travis' "x86_64: Fold pda into
per cpu area" patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
[ Based on original patch from Christoph Lameter and Mike Travis. ]
This patch makes percpu symbols zerobased on x86_64 SMP by adding
PERCPU_VADDR() to vmlinux.lds.h which helps setting explicit vaddr on
the percpu output section and using it in vmlinux_64.lds.S. A new
PHDR is added as existing ones cannot contain sections near address
zero. PERCPU_VADDR() also adds a new symbol __per_cpu_load which
always points to the vaddr of the loaded percpu data.init region.
The following adjustments have been made to accomodate the address
change.
* code to locate percpu gdt_page in head_64.S is updated to add the
load address to the gdt_page offset.
* __per_cpu_load is used in places where access to the init data area
is necessary.
* pda->data_offset is initialized soon after C code is entered as zero
value doesn't work anymore.
This patch is mostly taken from Mike Travis' "x86_64: Base percpu
variables at zero" patch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make vmlinux_32.lds.S use the generic PERCPU() macro instead of open
coding it. This will ease future changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[ Based on original patch from Christoph Lameter and Mike Travis. ]
* Ruggedize some calls in setup_percpu.c to prevent mishaps
in early calls, particularly for non-critical functions.
* Cleanup DEBUG_PER_CPU_MAPS usages and some comments.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make early_per_cpu() a lvalue as per_cpu() is and use it where
applicable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The function uv_wait_completion() spins on reads of a memory-mapped
register, waiting for completion of BAU hardware replies.
It should call "cpu_relax()" between those reads to improve performance
on hyperthreaded configurations.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-tip testing found this crash:
> [ 35.258515] calling acpi_cpufreq_init+0x0/0x127 @ 1
> [ 35.264127] BUG: unable to handle kernel NULL pointer dereference at (null)
> [ 35.267554] IP: [<ffffffff80478092>] __bitmap_intersects+0x48/0x73
> [ 35.267554] PGD 0
> [ 35.267554] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c is still broken: there's no
allocation of the variable mask, so we pass in an uninitialized cmd.mask
field to drv_read(), which then passes it to the scheduler which then
crashes ...
Switch it over to the much simpler constant-cpumask-pointers approach.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen the effect of the 'nolapic' boot parameter
"nolapic" should not only suppress SMP and use of the LAPIC, but it
also ought to have the effect of disabling all IO-APIC related activity
as well as PCI MSI and HT-IRQs.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: micro-optimization, memory reduction
On x86_64 flush tlb data is stored in per_cpu variables. This is
unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
are accessed.
This patch aims at making the code less confusing (there's nothing
really "per_cpu") by using a plain array. It also would save some memory
on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
[ Ravikiran G Thirumalai also pointed out that the correct alignment
is ____cacheline_internodealigned_in_smp, so that there's no
bouncing on vsmp. ]
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
Acked-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This commit broke flush_tlb_others_ipi() causing boot hangs on a
16 logical cpu system:
> commit 4595f9620c
> Author: Rusty Russell <rusty@rustcorp.com.au>
> Date: Sat Jan 10 21:58:09 2009 -0800
>
> x86: change flush_tlb_others to take a const struct cpumask
This change resulted in sending the invalidate tlb vector to the
sender itself causing the hang. flush_tlb_others_ipi() should exclude
the sender itself from the destination list.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix potential boot crash on MAXSMP
Remove code left over by:
50c668d: Revert "cpumask: use work_on_cpu in acpi-cpufreq.c for drv_read
That cmd.cpumask is not allocated anymore. No impact on default !MAXSMP
kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel "Smackover" x58 BIOS don't have HPET enabled in the BIOS, so allow
to force enable it at least. The register layout is the same as in other
recent ICHs, so all the code can be reused.
Using numerical PCI-ID because it's unlikely the PCI-ID will be used
anywhere else.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit e0c7317557.
This patch was wrong, as lockdep (and thus the irq state tracer)
aren't nmi safe. People are already seeing lockdep warnings due
to this.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 7503bfbae8.
Dieter Ries reported bootup soft-hangs and bisected it back to
this commit, and reverting this commit gave him a working system.
The commit introduces work_on_cpu() use into the cpufreq code,
but that is subtly problematic from a lock hierarchy POV: the
hotplug-cpu lock is an highlevel lock that is taken before
lowlevel locks, and in this codepath we are called with the
policy lock taken.
Dieter did not have lockdep enabled so we dont have a nice stack
trace proof for this, but using work_on_cpu() in such a lowlevel
place certainly looks wrong, so we revert the patch.
work_on_cpu() needs to be reworked to be more generally usable.
Reported-by: Dieter Ries <clip2@gmx.de>
Tested-by: Dieter Ries <clip2@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix this by reintroducing asm/smp.h include in apic.c - later on
I will fix this by removing non-smp data from smp.h
Also fix the __inquire_remote_apic() prototype/inline.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix this by reintroducing asm/smp.h include in mpparse.c - later on
I will fix this by removing non-smp data from smp.h.
Reported-by: Petr Titera <P.Titera@century.cz>
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Alexander Beregalov reported that this warning is caused by the HPET code:
> hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
> hpet0: 3 comparators, 64-bit 14.318180 MHz counter
> ODEBUG: object is on stack, but not annotated
> ------------[ cut here ]------------
> WARNING: at lib/debugobjects.c:251 __debug_object_init+0x2a4/0x352()
> Bisected down to 26afe5f2fb
> (x86: HPET_MSI Initialise per-cpu HPET timers)
The commit is fine - but the on-stack workqueue entry needs annotation.
Reported-and-bisected-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
ERROR: trailing whitespace
ERROR: code indent should use tabs where possible
WARNING: %Ld/%Lu are not-standard C, use %lld/%llu
WARNING: printk() should include KERN_ facility level
ERROR: spaces required around that '=' (ctx:VxW)
total: 13 errors, 2 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: preparation, cleanup, add KERN_INFO printk
Modify references from NR_IRQS to nr_irqs as the later will become
variable-sized based on nr_cpu_ids when CONFIG_SPARSE_IRQS=y.
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: reduce stack usage.
init_intel_cacheinfo() does not use the cpumask so define a subset
of struct _cpuid4_info (_cpuid4_info_regs) that can be used instead.
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: Reduce memory usage, use new cpumask API.
Use cpumask_var_t for 'cpus' cpumask in struct threshold_bank and update
remaining old cpumask_t functions to new cpumask API.
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: Improve tlb flush performance for UV
Calling alloc_cpumask_var a zillion times a second does affect
performance. Replace with static cpumask.
Note: when CONFIG_X86_UV is defined, this extra PER_CPU memory
will be optimized out for non-UV configs as is_uv_system() will
then return a constant 0.
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: reduce stack usage, use new cpumask API.
This is made a little more tricky by uv_flush_tlb_others which
actually alters its argument, for an IPI to be sent to the remaining
cpus in the mask.
I solve this by allocating a cpumask_var_t for this case and falling back
to IPI should this fail.
To eliminate temporaries in the caller, all flush_tlb_others implementations
now do the this-cpu-elimination step themselves.
Note also the curious "cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask)"
which has been there since pre-git and yet f->flush_cpumask is always zero
at this point.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: reduce memory usage, use new cpumask API.
Replace the affinity and pending_masks with cpumask_var_t's. This adds
to the significant size reduction done with the SPARSE_IRQS changes.
The added functions (init_alloc_desc_masks & init_copy_desc_masks) are
in the include file so they can be inlined (and optimized out for the
!CONFIG_CPUMASKS_OFFSTACK case.) [Naming chosen to be consistent with
the other init*irq functions, as well as the backwards arg declaration
of "from, to" instead of the more common "to, from" standard.]
Includes a slight change to the declaration of struct irq_desc to embed
the pending_mask within ifdef(CONFIG_SMP) to be consistent with other
references, and some small changes to Xen.
Tested: sparse/non-sparse/cpumask_offstack/non-cpumask_offstack/nonuma/nosmp on x86_64
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: virtualization@lists.osdl.org
Cc: xen-devel@lists.xensource.com
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Impact: micro-optimization
The patch below removes an unnecessary locked instruction from
switch_to(). TIF_FORK is only ever set in copy_thread() on initial
process creation, and gets cleared during the first scheduling of the
process. As such, it is safe to use an unlocked test for the flag
within switch_to().
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The swiotlb_arch_range_needs_mapping() hook should take a physical
address rather than a virtual address in order to support highmem pages.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
x86: fix section mismatch warnings in mcheck/mce_amd_64.c
x86: offer frame pointers in all build modes
x86: remove duplicated #include's
x86: k8 numa register active regions later
x86: update Alan Cox's email addresses
x86: rename all fields of mpc_table mpc_X to X
x86: rename all fields of mpc_oemtable oem_X to X
x86: rename all fields of mpc_bus mpc_X to X
x86: rename all fields of mpc_cpu mpc_X to X
x86: rename all fields of mpc_intsrc mpc_X to X
x86: rename all fields of mpc_lintsrc mpc_X to X
x86: rename all fields of mpc_iopic mpc_X to X
x86: irqinit_64.c init_ISA_irqs should be static
Documentation/x86/boot.txt: payload length was changed to payload_length
x86: setup_percpu.c fix style problems
x86: irqinit_64.c fix style problems
x86: irqinit_32.c fix style problems
x86: i8259.c fix style problems
x86: irq_32.c fix style problems
x86: ioport.c fix style problems
...
* 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
[IA64] fix typo in cpumask_of_pcibus()
x86: fix x86_32 builds for summit and es7000 arch's
cpumask: use work_on_cpu in acpi-cpufreq.c for read_measured_perf_ctrs
cpumask: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
cpumask: use cpumask_var_t in acpi-cpufreq.c
cpumask: use work_on_cpu in acpi/cstate.c
cpumask: convert struct cpufreq_policy to cpumask_var_t
cpumask: replace CPUMASK_ALLOC etc with cpumask_var_t
x86: cleanup remaining cpumask_t ops in smpboot code
cpumask: update pci_bus_show_cpuaffinity to use new cpumask API
cpumask: update local_cpus_show to use new cpumask API
ia64: cpumask fix for is_affinity_mask_valid()
We found a situation on Linus' machine that the Nvidia timer quirk hit on
a Intel chipset system. The problem is that the system has a fancy Nvidia
card with an own PCI bridge, and the early-quirks code looking for any
NVidia bridge triggered on it incorrectly. This didn't lead a boot
failure by luck, but the timer routing code selecting the wrong timer
first and some ugly messages. It might lead to real problems on other
systems.
I checked all the devices which are currently checked for by early_quirks
and it turns out they are all located in the root bus zero.
So change the early-quirks loop to only scan bus 0. This incidently also
saves quite some unnecessary scanning work, because early_quirks doesn't
go through all the non root busses.
The graphics card is not on bus 0, so it is not matched anymore.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On some boxes there exist both RSDT and XSDT table. But unfortunately
sometimes there exists the following error when XSDT table is used:
a. 32/64X address mismatch
b. The 32/64X FACS address mismatch
In such case the boot option of "acpi=rsdt" is provided so that
RSDT is tried instead of XSDT table when the system can't work well.
http://bugzilla.kernel.org/show_bug.cgi?id=8246
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
cc:Thomas Renninger <trenn@suse.de>
Signed-off-by: Len Brown <len.brown@intel.com>
The Cx Register address obtained from the _CST object is used as the MWAIT
hints if the register type is FFixedHW. And it is used to check whether
the Cx type is supported or not.
On some boxes the following Cx state package is obtained from _CST object:
>{
ResourceTemplate ()
{
Register (FFixedHW,
0x01, // Bit Width
0x02, // Bit Offset
0x0000000000889759, // Address
0x03, // Access Size
)
},
0x03,
0xF5,
0x015E }
In such case we should use the bit[7:4] of Cx address to check whether
the Cx type is supported or not.
mask the MWAIT hint to avoid array address overflow
Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
Acked-by:Venki Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: cleanup, solve 80 columns wrap problems
It would be cleaner to rename all the mpf->mpf_X fields to
mpf->X - that alone would give 4 characters per usage site.
(we already know that it's an 'mpf' entity -
no need to duplicate that in the field too)
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
intel_mp_floating should be renamed to mpf_intel.
The reason: the 'f' in MPF already means 'floating'
which means MP Floating pointer structure -
no need to repeat that in the type name.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (98 commits)
PCI PM: Put PM callbacks in the order of execution
PCI PM: Run default PM callbacks for all devices using new framework
PCI PM: Register power state of devices during initialization
PCI PM: Call pci_fixup_device from legacy routines
PCI PM: Rearrange code in pci-driver.c
PCI PM: Avoid touching devices behind bridges in unknown state
PCI PM: Move pci_has_legacy_pm_support
PCI PM: Power-manage devices without drivers during suspend-resume
PCI PM: Add suspend counterpart of pci_reenable_device
PCI PM: Fix poweroff and restore callbacks
PCI: Use msleep instead of cpu_relax during ASPM link retraining
PCI: PCIe portdrv: Add kerneldoc comments to remining core funtions
PCI: PCIe portdrv: Rearrange code so that related things are together
PCI: PCIe portdrv: Fix suspend and resume of PCI Express port services
PCI: PCIe portdrv: Add kerneldoc comments to some core functions
x86/PCI: Do not use interrupt links for devices using MSI-X
net: sfc: Use pci_clear_master() to disable bus mastering
PCI: Add pci_clear_master() as opposite of pci_set_master()
PCI hotplug: remove redundant test in cpq hotplug
PCI: pciehp: cleanup register and field definitions
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
trivial: chack -> check typo fix in main Makefile
trivial: Add a space (and a comma) to a printk in 8250 driver
trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx
trivial: Fix misspelling of "firmware" in powerpc Makefile
trivial: Fix misspelling of "firmware" in usb.c
trivial: Fix misspelling of "firmware" in qla1280.c
trivial: Fix misspelling of "firmware" in a100u2w.c
trivial: Fix misspelling of "firmware" in megaraid.c
trivial: Fix misspelling of "firmware" in ql4_mbx.c
trivial: Fix misspelling of "firmware" in acpi_memhotplug.c
trivial: Fix misspelling of "firmware" in ipw2100.c
trivial: Fix misspelling of "firmware" in atmel.c
trivial: Fix misspelled firmware in Kconfig
trivial: fix an -> a typos in documentation and comments
trivial: fix then -> than typos in comments and documentation
trivial: update Jesper Juhl CREDITS entry with new email
trivial: fix singal -> signal typo
trivial: Fix incorrect use of "loose" in event.c
trivial: printk: fix indentation of new_text_line declaration
trivial: rtc-stk17ta8: fix sparse warning
...
This patch is part of a larger patch series which will remove
the "char bus_id[20]" name string from struct device. The device
name is managed in the kobject anyway, and without any size
limitation, and just needlessly copied into "struct device".
To set and read the device name dev_name(dev) and dev_set_name(dev)
must be used. If your code uses static kobjects, which it shouldn't
do, "const char *init_name" can be used to statically provide the
name the registered device should have. At registration time, the
init_name field is cleared, to enforce the use of dev_name(dev) to
access the device name at a later time.
We need to get rid of all occurrences of bus_id in the entire tree
to be able to enable the new interface. Please apply this patch,
and possibly convert any remaining remaining occurrences of bus_id.
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Mark the function local_allocate_threshold_blocks() with __cpuinit,
in order to remove the following section mismatch messages:
WARNING: arch/x86/kernel/cpu/mcheck/built-in.o(.text+0x1363): Section mismatch in reference from the function local_allocate_threshold_blocks() to the function .cpuinit.text:allocate_threshold_blocks()
The function local_allocate_threshold_blocks() references
the function __cpuinit allocate_threshold_blocks().
This is often because local_allocate_threshold_blocks lacks a __cpuinit
annotation or the annotation of allocate_threshold_blocks is wrong.
WARNING: arch/x86/kernel/cpu/built-in.o(.text+0x1def): Section mismatch in reference from the function local_allocate_threshold_blocks() to the function .cpuinit.text:allocate_threshold_blocks()
The function local_allocate_threshold_blocks() references
the function __cpuinit allocate_threshold_blocks().
This is often because local_allocate_threshold_blocks lacks a __cpuinit
annotation or the annotation of allocate_threshold_blocks is wrong.
WARNING: arch/x86/kernel/built-in.o(.text+0xef2b): Section mismatch in reference from the function local_allocate_threshold_blocks() to the function .cpuinit.text:allocate_threshold_blocks()
The function local_allocate_threshold_blocks() references
the function __cpuinit allocate_threshold_blocks().
This is often because local_allocate_threshold_blocks lacks a __cpuinit
annotation or the annotation of allocate_threshold_blocks is wrong.
All the callsites of this function are __cpuinit already, and all the
functions it calls are __cpuinit as well.
Signed-off-by: Leonardo Potenza <lpotenza@inwind.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: fix rcutorture bug
rcu: eliminate synchronize_rcu_xxx macro
rcu: make treercu safe for suspend and resume
rcu: fix rcutree grace-period-latency bug on small systems
futex: catch certain assymetric (get|put)_futex_key calls
futex: make futex_(get|put)_key() calls symmetric
locking, percpu counters: introduce separate lock classes
swiotlb: clean up EXPORT_SYMBOL usage
swiotlb: remove unnecessary declaration
swiotlb: replace architecture-specific swiotlb.h with linux/swiotlb.h
swiotlb: add support for systems with highmem
swiotlb: store phys address in io_tlb_orig_addr array
swiotlb: add hwdev to swiotlb_phys_to_bus() / swiotlb_sg_to_bus()
Add kprobe_insn_mutex for protecting kprobe_insn_pages hlist, and remove
kprobe_mutex from architecture dependent code.
This allows us to call arch_remove_kprobe() (and free_insn_slot) while
holding kprobe_mutex.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds swiotlb_map_page and swiotlb_unmap_page to lib/swiotlb.c and
remove IA64 and X86's swiotlb_map_page and swiotlb_unmap_page.
This also removes unnecessary swiotlb_map_single, swiotlb_map_single_attrs,
swiotlb_unmap_single and swiotlb_unmap_single_attrs.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This converts X86 and IA64 to use include/linux/dma-mapping.h.
It's a bit large but pretty boring. The major change for X86 is
converting 'int dir' to 'enum dma_data_direction dir' in DMA mapping
operations. The major changes for IA64 is using map_page and
unmap_page instead of map_single and unmap_single.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch converts dma_map_single and dma_unmap_single to use
map_page and unmap_page respectively and removes unnecessary
map_single and unmap_single in struct dma_mapping_ops.
This leaves intel-iommu's dma_map_single and dma_unmap_single since
IA64 uses them. They will be removed after the unification.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a preparation of struct dma_mapping_ops unification. We use
map_page and unmap_page instead of map_single and unmap_single.
We will remove map_single hook in the last patch in this patchset.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a preparation of struct dma_mapping_ops unification. We use
map_page and unmap_page instead of map_single and unmap_single.
We will remove map_single and unmap_single hooks in the last patch in
this patchset.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a preparation of struct dma_mapping_ops unification. We use
map_page and unmap_page instead of map_single and unmap_single.
We will remove map_single and unmap_single hooks in the last patch in
this patchset.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a preparation of struct dma_mapping_ops unification. We use
map_page and unmap_page instead of map_single and unmap_single.
We will remove map_single and unmap_single hooks in the last patch in
this patchset.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is a preparation of struct dma_mapping_ops unification. We use
map_page and unmap_page instead of map_single and unmap_single.
This is sorta temporary workaround. We will move them to lib/swiotlb.c
to enable x86's swiotlb code to directly use them.
We will remove map_single and unmap_single hooks in the last patch in
this patchset.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new cpumask API to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for read_measured_perf_ctrs().
Basically splits off the work function from get_measured_perf which is
run on the designated cpu. Moves definition of struct perf_cur out of
function local namespace, and is used as the work function argument.
References in get_measured_perf use values in the perf_cur struct.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new cpumask API to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for drv_read() and drv_write().
Basically converts do_drv_{read,write} into "work_on_cpu" functions that
are now called by drv_read and drv_write.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce stack usage, use new cpumask API.
Replace the cpumask_t in struct drv_cmd with a cpumask_var_t. Remove unneeded
online_policy_cpus cpumask_t in acpi_cpufreq_target. Update refs to use
new cpumask API.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new cpumask API to reduce stack usage
Replace the saving of current->cpus_allowed and set_cpus_allowed_ptr() with
a work_on_cpu function for the acpi_processor_ffh_cstate_probe() function.
Basically splits acpi_processor_ffh_cstate_probe() into two functions, the
other being acpi_processor_ffh_cstate_probe_cpu which is the work function
run on the designated cpu.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new cpumask API to reduce memory usage
This is part of an effort to reduce structure sizes for machines
configured with large NR_CPUS. cpumask_t gets replaced by
cpumask_var_t, which is either struct cpumask[1] (small NR_CPUS) or
struct cpumask * (large NR_CPUS).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
There's only one user, and it's a fairly easy conversion.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
swiotlb: Don't include linux/swiotlb.h twice in lib/swiotlb.c
intel-iommu: fix build error with INTR_REMAP=y and DMAR=n
swiotlb: add missing __init annotations
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] Fix on resume, now preserves user policy min/max.
[CPUFREQ] Add Celeron Core support to p4-clockmod.
[CPUFREQ] add to speedstep-lib additional fsb values for core processors
[CPUFREQ] Disable sysfs ui for p4-clockmod.
[CPUFREQ] p4-clockmod: reduce noise
[CPUFREQ] clean up speedstep-centrino and reduce cpumask_t usage
Impact: cleanup, solve 80 columns wrap problems
It would be cleaner to rename all the mpc->mpc_X fields to
mpc->X - that alone would give 4 characters per usage site.
(we already know that it's an 'mpc' entity -
no need to duplicate that in the field too)
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
It would be cleaner to rename all the mpc->oem_X fields to
mpc->X - that alone would give 4 characters per usage site.
(we already know that it's an 'oem' entity -
no need to duplicate that in the field too)
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
It would be cleaner to rename all the mpc->mpc_X fields to
mpc->X - that alone would give 4 characters per usage site.
(we already know that it's an 'mpc' entity -
no need to duplicate that in the field too)
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
It would be cleaner to rename all the mpc->mpc_X fields to
mpc->X - that alone would give 4 characters per usage site.
(we already know that it's an 'mpc' entity -
no need to duplicate that in the field too)
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
It would be cleaner to rename all the mpc->mpc_X fields to
mpc->X - that alone would give 4 characters per usage site.
(we already know that it's an 'mpc' entity -
no need to duplicate that in the field too)
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
It would be cleaner to rename all the mpc->mpc_X fields to
mpc->X - that alone would give 4 characters per usage site.
(we already know that it's an 'mpc' entity -
no need to duplicate that in the field too)
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
It would be cleaner to rename all the mpc->mpc_X fields to
mpc->X - that alone would give 4 characters per usage site.
(we already know that it's an 'mpc' entity -
no need to duplicate that in the field too)
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
WARNING: Use #include <linux/smp.h> instead of <asm/smp.h>
WARNING: Use #include <linux/percpu.h> instead of <asm/percpu.h>
WARNING: Use #include <linux/topology.h> instead of <asm/topology.h>
WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
ERROR: spaces required around that '?' (ctx:VxW)
ERROR: spaces required around that ':' (ctx:VxV)
total: 2 errors, 4 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
WARNING: Use #include <linux/io.h> instead of <asm/io.h>
WARNING: Use #include <linux/delay.h> instead of <asm/delay.h>
ERROR: trailing whitespace
WARNING: externs should be avoided in .c files
ERROR: space required after that ',' (ctx:VxV)
WARNING: space prohibited between function name and open parenthesis '('
total: 2 errors, 4 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
WARNING: Use #include <linux/acpi.h> instead of <asm/acpi.h>
WARNING: Use #include <linux/io.h> instead of <asm/io.h>
WARNING: Use #include <linux/delay.h> instead of <asm/delay.h>
ERROR: code indent should use tabs where possible
total: 1 errors, 3 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new cpumask API to reduce memory and stack usage
Allocate the following local cpumasks based on the number of cpus that
are present. References will use new cpumask API. (Currently only
modified for x86_64, x86_32 continues to use the *_map variants.)
cpu_callin_mask
cpu_callout_mask
cpu_initialized_mask
cpu_sibling_setup_mask
Provide the following accessor functions:
struct cpumask *cpu_sibling_mask(int cpu)
struct cpumask *cpu_core_mask(int cpu)
Other changes are when setting or clearing the cpu online, possible
or present maps, use the accessor functions.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
ERROR: space prohibited after that open parenthesis '('
ERROR: space prohibited before that close parenthesis ')'
total: 4 errors, 0 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, fix style problems, more readable
Fix:
WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
WARNING: Use #include <linux/smp.h> instead of <asm/smp.h>
ERROR: code indent should use tabs where possible
total: 9 errors, 2 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
WARNING: Use #include <linux/io.h> instead of <asm/io.h>
WARNING: Use #include <linux/kdebug.h> instead of <asm/kdebug.h>
WARNING: Use #include <linux/smp.h> instead of <asm/smp.h>
ERROR: "foo * bar" should be "foo *bar"
ERROR: trailing whitespace
ERROR: spaces required around that ':' (ctx:WxO)
ERROR: spaces required around that ':' (ctx:OxW)
total: 7 errors, 4 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
WARNING: Use #include <linux/smp.h> instead of <asm/smp.h>
WARNING: Use #include <linux/nmi.h> instead of <asm/nmi.h>
WARNING: Use #include <linux/timex.h> instead of <asm/timex.h>
WARNING: line over 80 characters
ERROR: else should follow close brace '}'
total: 2 errors, 4 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
mpc_config_oemtable should be renamed to mpc_oemtable.
The reason: the 'c' in MPC already means 'config' -
no need to repeat that in the type name.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
mpc_config_lintsrc should be renamed to mpc_lintsrc.
The reason: the 'c' in MPC already means 'config' -
no need to repeat that in the type name.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
mpc_config_intsrc should be renamed to mpc_intsrc.
The reason: the 'c' in MPC already means 'config' -
no need to repeat that in the type name.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
mpc_config_ioapic should be renamed to mpc_ioapic.
The reason: the 'c' in MPC already means 'config' -
no need to repeat that in the type name.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
mpc_config_processor should be renamed to mpc_cpu.
The reason: the 'c' in MPC already means 'config' -
no need to repeat that in the type name.
Plus 'processor' is a lot longer than 'cpu' - so we try to use 'cpu' in all
type names, as much as possible.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
mpc_config_bus should be renamed to mpc_bus.
The reason: the 'c' in MPC already means 'config' -
no need to repeat that in the type name.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, solve 80 columns wrap problems
mp_config_table should be renamed to mpc_table.
The reason: the 'c' in MPC already means 'config' -
no need to repeat that in the type name.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix debug/crash printout
Since errorcode is popped out, RIP is on the top of the stack.
Use real RIP value instead of wrong CS.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits)
x86: setup_per_cpu_areas() cleanup
cpumask: fix compile error when CONFIG_NR_CPUS is not defined
cpumask: use alloc_cpumask_var_node where appropriate
cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t
x86: use cpumask_var_t in acpi/boot.c
x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids
sched: put back some stack hog changes that were undone in kernel/sched.c
x86: enable cpus display of kernel_max and offlined cpus
ia64: cpumask fix for is_affinity_mask_valid()
cpumask: convert RCU implementations, fix
xtensa: define __fls
mn10300: define __fls
m32r: define __fls
h8300: define __fls
frv: define __fls
cris: define __fls
cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS
cpumask: zero extra bits in alloc_cpumask_var_node
cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/
cpumask: convert mm/
...
Impact: cleanup
__alloc_bootmem and __alloc_bootmem_node do panic
for us in case of fail so no need for additional
checks here.
Also lets use pr_*() macros for printing.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reduce inter-node memory traffic.
Reduces inter-node memory traffic (offloading the global system bus)
by allocating referenced struct cpumasks on the same node as the
referring struct.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reduce memory usage, use new API.
This is part of an effort to reduce structure sizes for machines
configured with large NR_CPUS. cpumask_t gets replaced by
cpumask_var_t, which is either struct cpumask[1] (small NR_CPUS) or
struct cpumask * (large NR_CPUS).
(Changes to powernow-k* by <travis>.)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reduce stack size, use new API.
Replace cpumask_t with cpumask_var_t.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reduce future system panics due to cpumask operations using NR_CPUS
Insure that code does not look at bits >= nr_cpu_ids as when cpumasks are
allocated based on nr_cpu_ids, these extra bits will not be defined.
Also some other minor updates:
* change in to use cpu accessor function set_cpu_present() instead of
directly accessing cpu_present_map w/cpu_clear() [arch/x86/kernel/reboot.c]
* use cpumask_of() instead of &cpumask_of_cpu() [arch/x86/kernel/reboot.c]
* optimize some cpu_mask_to_apicid_and functions.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: enables /sys/devices/system/cpu/{kernel_max,offline} user interface
By setting total_cpus, the drivers/base/cpu.c will display the
values of kernel_max (NR_CPUS-1) and the offlined cpu map.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The #ifdef's are no longer necessary when the iommu-api and the amd
iommu updates are merged together.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/amd_iommu.c:1299:6: warning: symbol 'prealloc_protection_domains' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Imapct: add a new struct member to 'struct protection_domain'
When using protection domains for dma_ops and KVM its better to know for
which subsystem it was allocated. Add a flags member to struct
protection domain for that purpose.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: save unneeded logic to add and remove domains to the list
The removal of a protection domain from the iommu_pd_list is not
necessary. Another benefit is that we save complexity because we don't
have to readd it later when the device no longer uses the domain.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: split one function into three
The separate functions are required synchronize commands across all
hardware IOMMUs in the system.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: change code to free pagetables from protection domains
The dma_ops_free_pagetable function can only free pagetables from
dma_ops domains. Change that to free pagetables of pure protection
domains.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: function rename
The iommu_map function maps only one page. Make this clear in the
function name.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: reduce printk noise
The message "alloc irq_2_pin on cpu 0 node 0" is printed way too often.
% dmesg|grep irq_2_pin|wc -l
20
Get rid of the debug printks.
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce kernel size a bit
The current kernel build warns:
WARNING: vmlinux.o(.text+0x11458): Section mismatch in reference from the function swiotlb_alloc_boot() to the function .init.text:__alloc_bootmem_low()
The function swiotlb_alloc_boot() references
the function __init __alloc_bootmem_low().
This is often because swiotlb_alloc_boot lacks a __init
annotation or the annotation of __alloc_bootmem_low is wrong.
WARNING: vmlinux.o(.text+0x1011f2): Section mismatch in reference from the function swiotlb_late_init_with_default_size() to the function .init.text:__alloc_bootmem_low()
The function swiotlb_late_init_with_default_size() references
the function __init __alloc_bootmem_low().
This is often because swiotlb_late_init_with_default_size lacks a __init
annotation or the annotation of __alloc_bootmem_low is wrong.
and indeed the functions calling __alloc_bootmem_low() can be marked
__init as well.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
x86: export vector_used_by_percpu_irq
x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
sched: nominate preferred wakeup cpu, fix
x86: fix lguest used_vectors breakage, -v2
x86: fix warning in arch/x86/kernel/io_apic.c
sched: fix warning in kernel/sched.c
sched: move test_sd_parent() to an SMP section of sched.h
sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
sched: framework for sched_mc/smt_power_savings=N
sched: convert BALANCE_FOR_xx_POWER to inline functions
x86: use possible_cpus=NUM to extend the possible cpus allowed
x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
x86: update io_apic.c to the new cpumask code
x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
x86: xen: use smp_call_function_many()
x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
...
Fixed up trivial conflict in kernel/time/tick-sched.c manually
* 'kvm-updates/2.6.29' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (140 commits)
KVM: MMU: handle large host sptes on invlpg/resync
KVM: Add locking to virtual i8259 interrupt controller
KVM: MMU: Don't treat a global pte as such if cr4.pge is cleared
MAINTAINERS: Maintainership changes for kvm/ia64
KVM: ia64: Fix kvm_arch_vcpu_ioctl_[gs]et_regs()
KVM: x86: Rework user space NMI injection as KVM_CAP_USER_NMI
KVM: VMX: Fix pending NMI-vs.-IRQ race for user space irqchip
KVM: fix handling of ACK from shared guest IRQ
KVM: MMU: check for present pdptr shadow page in walk_shadow
KVM: Consolidate userspace memory capability reporting into common code
KVM: Advertise the bug in memory region destruction as fixed
KVM: use cpumask_var_t for cpus_hardware_enabled
KVM: use modern cpumask primitives, no cpumask_t on stack
KVM: Extract core of kvm_flush_remote_tlbs/kvm_reload_remote_mmus
KVM: set owner of cpu and vm file operations
anon_inodes: use fops->owner for module refcount
x86: KVM guest: kvm_get_tsc_khz: return khz, not lpj
KVM: MMU: prepopulate the shadow on invlpg
KVM: MMU: skip global pgtables on sync due to cr3 switch
KVM: MMU: collapse remote TLB flushes on root sync
...
Impact: cleanup, fix style problems, more readable
Fixes style problems:
WARNING: Use #include <linux/smp.h> instead of <asm/smp.h>
WARNING: Use #include <linux/acpi.h> instead of <asm/acpi.h>
WARNING: suspect code indent for conditional statements (8, 17)
WARNING: space prohibited between function name and open parenthesis '('
total: 0 errors, 5 warnings
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crash on x86/UV
UV is the SGI "UltraViolet" machine, which is x86_64 based.
BAU is the "Broadcast Assist Unit", used for TLB shootdown in UV.
This patch removes the allocation and initialization of an unused table.
This table is left over from a development test mode. It is unused in
the present code.
And it was incorrectly initialized: 8 entries allocated but 17 initialized,
causing slab corruption.
This patch should go into 2.6.27 and 2.6.28 as well as the current tree.
Diffed against 2.6.28 (linux-next, 12/30/08)
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The pda rework (commit 3461b0af02)
to remove static boot cpu pdas introduced a performance bug.
_boot_cpu_pda is the actual pda used by the boot cpu and is definitely
not "__read_mostly" and ended up polluting the read mostly section with
writes. This bug caused regression of about 8-10% on certain syscall
intensive workloads.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Acked-by: Mike Travis <travis@sgi.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (34 commits)
nfsd race fixes: jfs
nfsd race fixes: reiserfs
nfsd race fixes: ext4
nfsd race fixes: ext3
nfsd race fixes: ext2
nfsd/create race fixes, infrastructure
filesystem notification: create fs/notify to contain all fs notification
fs/block_dev.c: __read_mostly improvement and sb_is_blkdev_sb utilization
kill ->dir_notify()
filp_cachep can be static in fs/file_table.c
fix f_count description in Documentation/filesystems/files.txt
make INIT_FS use the __RW_LOCK_UNLOCKED initialization
take init_fs to saner place
kill vfs_permission
pass a struct path * to may_open
kill walk_init_root
remove incorrect comment in inode_permission
expand some comments (d_path / seq_path)
correct wrong function name of d_put in kernel document and source comment
fix switch_names() breakage in short-to-short case
...
* 'irq-fixes-for-linus-4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sparseirq: move __weak symbols into separate compilation unit
sparseirq: work around __weak alias bug
sparseirq: fix hang with !SPARSE_IRQ
sparseirq: set lock_class for legacy irq when sparse_irq is selected
sparseirq: work around compiler optimizing away __weak functions
sparseirq: fix desc->lock init
sparseirq: do not printk when migrating IRQ descriptors
sparseirq: remove duplicated arch_early_irq_init()
irq: simplify for_each_irq_desc() usage
proc: remove ifdef CONFIG_SPARSE_IRQ from stat.c
irq: for_each_irq_desc() move to irqnr.h
hrtimer: remove #include <linux/irq.h>
kvm_get_tsc_khz() currently returns the previously-calculated preset_lpj
value, but it is in loops-per-jiffy, not kHz. The current code works
correctly only when HZ=1000.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently, we only set the KVM paravirt signature in case
of CONFIG_KVM_GUEST. However, it is possible to have it turned
off, while CONFIG_KVM_CLOCK is turned on. This is also a paravirt
case, and should be shown accordingly.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
On emergency_restart, we may need to use an NMI to disable virtualization
on all CPUs. We do that using nmi_shootdown_cpus() if VMX is enabled.
Note: With this patch, we will run the NMI stuff only when the CPU where
emergency_restart() was called has VMX enabled. This should work on most
cases because KVM enables VMX on all CPUs, but we may miss the small
window where KVM is doing that. Also, I don't know if all code using
VMX out there always enable VMX on all CPUs like KVM does. We have two
other alternatives for that:
a) Have an API that all code that enables VMX on any CPU should use
to tell the kernel core that it is going to enable VMX on the CPUs.
b) Always call nmi_shootdown_cpus() if the CPU supports VMX. This is
a bit intrusive and more risky, as it would run nmi_shootdown_cpus()
on emergency_reboot() even on systems where virtualization is never
enabled.
Finding a proper point to hook the nmi_shootdown_cpus() call isn't
trivial, as the non-emergency machine_restart() (that doesn't need the
NMI tricks) uses machine_emergency_restart() directly.
The solution to make this work without adding a new function or argument
to machine_ops was setting a 'reboot_emergency' flag that tells if
native_machine_emergency_restart() needs to do the virt cleanup or not.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We need to disable virtualization extensions on all CPUs before booting
the kdump kernel, otherwise the kdump kernel booting will fail, and
rebooting after the kdump kernel did its task may also fail.
We do it using cpu_emergency_vmxoff() and cpu_emergency_svm_disable(),
that should always work, because those functions check if the CPUs
support SVM or VMX before doing their tasks.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
For KVM can reuse the type define, and need them to support shadow MTRR.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, sparseirq: clean up Kconfig entry
x86: turn CONFIG_SPARSE_IRQ off by default
sparseirq: fix numa_migrate_irq_desc dependency and comments
sparseirq: add kernel-doc notation for new member in irq_desc, -v2
locking, irq: enclose irq_desc_lock_class in CONFIG_LOCKDEP
sparseirq, xen: make sure irq_desc is allocated for interrupts
sparseirq: fix !SMP building, #2
x86, sparseirq: move irq_desc according to smp_affinity, v7
proc: enclose desc variable of show_stat() in CONFIG_SPARSE_IRQ
sparse irqs: add irqnr.h to the user headers list
sparse irqs: handle !GENIRQ platforms
sparseirq: fix !SMP && !PCI_MSI && !HT_IRQ build
sparseirq: fix Alpha build failure
sparseirq: fix typo in !CONFIG_IO_APIC case
x86, MSI: pass irq_cfg and irq_desc
x86: MSI start irq numbering from nr_irqs_gsi
x86: use NR_IRQS_LEGACY
sparse irq_desc[] array: core kernel and x86 changes
genirq: record IRQ_LEVEL in irq_desc[]
irq.h: remove padding from irq_desc on 64bits
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
hrtimers: fix warning in kernel/hrtimer.c
x86: make sure we really have an hpet mapping before using it
x86: enable HPET on Fujitsu u9200
linux/timex.h: cleanup for userspace
posix-timers: simplify de_thread()->exit_itimers() path
posix-timers: check ->it_signal instead of ->it_pid to validate the timer
posix-timers: use "struct pid*" instead of "struct task_struct*"
nohz: suppress needless timer reprogramming
clocksource, acpi_pm.c: put acpi_pm_read_slow() under CONFIG_PCI
nohz: no softirq pending warnings for offline cpus
hrtimer: removing all ur callback modes, fix
hrtimer: removing all ur callback modes, fix hotplug
hrtimer: removing all ur callback modes
x86: correct link to HPET timer specification
rtc-cmos: export second NVRAM bank
Fixed up conflicts in sound/drivers/pcsp/pcsp.c and sound/core/hrtimer.c
manually.
* 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
stacktrace: provide save_stack_trace_tsk() weak alias
rcu: provide RCU options on non-preempt architectures too
printk: fix discarding message when recursion_bug
futex: clean up futex_(un)lock_pi fault handling
"Tree RCU": scalable classic RCU implementation
futex: rename field in futex_q to clarify single waiter semantics
x86/swiotlb: add default swiotlb_arch_range_needs_mapping
x86/swiotlb: add default phys<->bus conversion
x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
x86: add swiotlb allocation functions
swiotlb: consolidate swiotlb info message printing
swiotlb: support bouncing of HighMem pages
swiotlb: factor out copy to/from device
swiotlb: add arch hook to force mapping
swiotlb: allow architectures to override phys<->bus<->phys conversions
swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
rcu: fix rcutorture behavior during reboot
resources: skip sanity check of busy resources
swiotlb: move some definitions to header
swiotlb: allow architectures to override swiotlb pool allocation
...
Fix up trivial conflicts in
arch/x86/kernel/Makefile
arch/x86/mm/init_32.c
include/linux/hardirq.h
as per Ingo's suggestions.
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/xsave.c:162:5: warning: symbol 'restore_user_xstate' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/apic.c:270:5: warning: symbol 'x2apic_icr_read' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/bios_uv.c:28:18: warning: symbol 'uv_systab' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup, reduce kernel size a bit, avoid sparse warnings
Fixes sparse warnings:
arch/x86/kernel/genx2apic_phys.c:164:6: warning: symbol 'x2apic_send_IPI_self' was not declared. Should it be static?
arch/x86/kernel/genx2apic_phys.c:169:6: warning: symbol 'init_x2apic_ldr' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/amd_iommu.c:1299:6: warning: symbol 'prealloc_protection_domains' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/amd_iommu_init.c:246:13: warning: symbol 'iommu_enable' was not declared. Should it be static?
arch/x86/kernel/amd_iommu_init.c:259:13: warning: symbol 'iommu_enable_event_logging' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
Now that arch/x86/pci/pci.h is used in a number of other places as well,
move the lowlevel x86 pci definitions into the architecture include files.
(not to be confused with the existing arch/x86/include/asm/pci.h file,
which provides public details about x86 PCI)
Tested on: X86_32_UP, X86_32_SMP and X86_64_SMP
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/pci-gart_64.c:55:5: warning: symbol 'iommu_fullflush' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/io_apic.c:709:6: warning: symbol 'io_apic_sync' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix section mismatch warning
Commit b2bb855491 ("x86: Remove cpumask games
in x86/kernel/cpu/intel_cacheinfo.c") introduced get_cpu_leaves(), which
references __cpuinit cpuid4_cache_lookup().
Mark get_cpu_leaves() with a __cpuinit annotation.
Signed-off-by: Sergio Luis <sergio@larces.uece.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
sched, trace: update trace_sched_wakeup()
tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
Revert "x86: disable X86_PTRACE_BTS"
ring-buffer: prevent false positive warning
ring-buffer: fix dangling commit race
ftrace: enable format arguments checking
x86, bts: memory accounting
x86, bts: add fork and exit handling
ftrace: introduce tracing_reset_online_cpus() helper
tracing: fix warnings in kernel/trace/trace_sched_switch.c
tracing: fix warning in kernel/trace/trace.c
tracing/ring-buffer: remove unused ring_buffer size
trace: fix task state printout
ftrace: add not to regex on filtering functions
trace: better use of stack_trace_enabled for boot up code
trace: add a way to enable or disable the stack tracer
x86: entry_64 - introduce FTRACE_ frame macro v2
tracing/ftrace: add the printk-msg-only option
tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
x86, bts: correctly report invalid bts records
...
Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
being already partly merged by the SH merge.
Impact: extend functions with a (yet unused) parameter, update callsites
Some architectures need it - in preparation for highmem swiotlb.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix panic on null pointer with sparseirq
Some GCC versions seem to inline the weak global function,
when that function is empty.
Work it around, by making the functions return a (dummy) integer.
Signed-off-by: Yinghai <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce kernel size a bit, avoid sparse warning
Fixes sparse warning:
arch/x86/kernel/apic.c:103:5: warning: symbol 'disable_x2apic' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, avoid sparse warning
Include "../pci/pci.h" for port_cf9_safe
Fixes this sparse warning:
arch/x86/kernel/reboot.c:43:6: warning: symbol 'port_cf9_safe' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: New API
Like cpu_coregroup_map, but returns a (const) pointer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Impact: New APIs
The old node_to_cpumask/node_to_pcibus returned a cpumask_t: these
return a pointer to a struct cpumask. Part of removing cpumasks from
the stack.
Also makes __pcibus_to_node take a const pointer.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
all for_each_irq_desc() usage point have !desc check.
then its check can move into for_each_irq_desc() macro.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
enable_mtrr_cleanup is static, and is never set to anything but 0 or 1.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
On 32 bits, we may suffer IRQ 13, or supposedly we might have a buggy
implementation which gives spurious trap 16. We did not check for
this on 64 bits, but there is no reason we can't make the code the
same in both cases. Furthermore, this is presumably rare, so do the
spurious check last, instead of first.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup, avoid warning on X86_64
Fixes this warning on X86_64:
CC arch/x86/kernel/traps.o
arch/x86/kernel/traps.c:695:5: warning: "CONFIG_X86_32" is not defined
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix a crash/hard-reboot on certain configs while enabling cpu runtime
On some archs, the boot of a secondary cpu can have an early fragile state.
On x86-64, the pda is not initialized on the first stage of a cpu boot but
it is needed to get the cpu number and the current task pointer. This data
is needed during tracing. As they were dereferenced at this stage, we got a
crash while tracing a cpu being enabled at runtime.
Some other archs like ia64 can have such kind of issue too.
Changes on v2:
We dropped the previous solution of a per-arch called function to guess the
current state of a cpu. That could slow down the tracing.
This patch removes the -pg flag on arch/x86/kernel/cpu/common.c where
the low level cpu boot functions exist, on start_secondary() and a helper
function used at this stage.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
lguest can be built as a module and makes use of this new symbol:
ERROR: "vector_used_by_percpu_irq" [drivers/lguest/lg.ko] undefined!
export it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These commits:
commit 95d313cf1c
Author: Mike Travis <travis@sgi.com>
Date: Tue Dec 16 17:33:54 2008 -0800
x86: Add cpu_mask_to_apicid_and
and
commit 6eeb7c5a99
Author: Mike Travis <travis@sgi.com>
Date: Tue Dec 16 17:33:55 2008 -0800
x86: update add-cpu_mask_to_apicid_and to use struct cpumask*
broke interrupt delivery on x2apic platforms. As x2apic cluster mode uses
logical delivery mode, we need to use logical apicid instead of physical apicid
in x2apic_cpu_mask_to_apicid_and()
Impact: fixes the broken interrupt delivery issue on generic x2apic platforms.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix lguest, clean up
32-bit lguest used used_vectors to record vectors, but that model of
allocating vectors changed and got broken, after we changed vector
allocation to a per_cpu array.
Try enable that for 64bit, and the array is used for all vectors that
are not managed by vector_irq per_cpu array.
Also kill system_vectors[], that is now a duplication of the
used_vectors bitmap.
[ merged in cpus4096 due to io_apic.c cpumask changes. ]
[ -v2, fix build failure ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend performance counter support on x86 Intel CPUs
Modern Intel CPUs have 3 "fixed-function" performance counters, which
count these hardware events:
Instr_Retired.Any
CPU_CLK_Unhalted.Core
CPU_CLK_Unhalted.Ref
Add support for them to the performance counters subsystem.
Their use is transparent to user-space: the counter scheduler is
extended to automatically recognize the cases where a fixed-function
PMC can be utilized instead of a generic PMC. In such cases the
generic PMC is kept available for more counters.
The above fixed-function events map to these generic counter hw events:
PERF_COUNT_INSTRUCTIONS
PERF_COUNT_CPU_CYCLES
PERF_COUNT_BUS_CYCLES
(The 'bus' cycles are in reality often CPU-ish cycles, just with a fixed
frequency.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Generalize "bus cycles" hw events - and map them to CPU_CLK_Unhalted.Ref
on x86. (which is a good enough approximation)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Allow lowlevel ->enable() op to return an error if a counter can not be
added. This can be used to handle counter constraints.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Enumerate fixed-mode PMCs based on CPUID, and feed that into the
perfcounter code.
Does not use fixed-mode PMCs yet.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: refactor the x86 code for fixed-mode PMCs
Extend the data structures and rename the existing facilities
to allow for a 'generic' versus 'fixed' counter distinction.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: rename include file
We'll be providing an asm/perf_counter.h to the generic perfcounter code,
so use the already existing x86 file for this purpose and rename it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When ACPI is asked to find an MADT (APIC table)
and fails, then ACPI expects to run in PIC mode.
However, if an MP Table is was found, IRQs will be
registered as if an IOAPIC is being used, even
though ACPI is configuring interrupt links links for PIC mode.
In this scenario, disable MPS so that IRQs
are registered in PIC mode, consistent with ACPI.
http://bugzilla.kernel.org/show_bug.cgi?id=12257
Signed-off-by: Len Brown <len.brown@intel.com>
In the case of multiple FPU errors, prioritize the error codes,
instead of returning __SI_FAULT, which ends up pushing a 0 as the
error code to userspace, a POSIX violation.
For i386, we will simply return if there are no errors at all; for
x86-64 this is probably a "can't happen" (and the code should be
unified), but for this patch, return __SI_FAULT|SI_KERNEL if this ever
happens.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: fix deadlock
This is in response to the following bug report:
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12100
Subject : resume (S2R) broken by Intel microcode module, on A110L
Submitter : Andreas Mohr <andi@lisas.de>
Date : 2008-11-25 08:48 (19 days old)
Handled-By : Dmitry Adamushko <dmitry.adamushko@gmail.com>
[ The deadlock scenario has been discovered by Andreas Mohr ]
I think I might have a logical explanation why the system:
(http://bugzilla.kernel.org/show_bug.cgi?id=12100)
might hang upon resuming, OTOH it should have likely hanged each and every time.
(1) possible deadlock in microcode_resume_cpu() if either 'if' section is
taken;
(2) now, I don't see it in spec. and can't experimentally verify it (newer
ucodes don't seem to be available for my Core2duo)... but logically-wise, I'd
think that when read upon resuming, the 'microcode revision' (MSR 0x8B) should
be back to its original one (we need to reload ucode anyway so it doesn't seem
logical if a cpu doesn't drop the version)... if so, the comparison with
memcmp() for the full 'struct cpu_signature' is wrong... and that's how one of
the aforementioned 'if' sections might have been triggered - leading to a
deadlock.
Obviously, in my tests I simulated loading/resuming with the ucode of the same
version (just to see that the file is loaded/re-loaded upon resuming) so this
issue has never popped up.
I'd appreciate if someone with an appropriate system might give a try to the
2nd patch (titled "fix a comparison && deadlock...").
In any case, the deadlock situation is a must-have fix.
Reported-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Tested-by: Andreas Mohr <andi@lisas.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: move the BTS buffer accounting to the mlock bucket
Add alloc_locked_buffer() and free_locked_buffer() functions to mm/mlock.c
to kalloc a buffer and account the locked memory to current.
Account the memory for the BTS buffer to the tracer.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: introduce new ptrace facility
Add arch_ptrace_untrace() function that is called when the tracer
detaches (either voluntarily or when the tracing task dies);
ptrace_disable() is only called on a voluntary detach.
Add ptrace_fork() and arch_ptrace_fork(). They are called when a
traced task is forked.
Clear DS and BTS related fields on fork.
Release DS resources and reclaim memory in ptrace_untrace(). This
releases resources already when the tracing task dies. We used to do
that when the traced task dies.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, avoid sparse warnings, reduce kernel size a bit
Fixes these sparse warnings:
arch/x86/kernel/cpu/common.c:869:6: warning: symbol 'boot_cpu_stack' was not declared. Should it be static?
arch/x86/kernel/cpu/common.c:910:6: warning: symbol 'boot_exception_stacks' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, avoid sparse warnings, reduce kernel size a bit
Fixes these sparse warnings:
arch/x86/kernel/cpu/perf_counter.c:44:11: warning: symbol 'intel_perfmon_event_map' was not declared. Should it be static?
arch/x86/kernel/cpu/perf_counter.c:54:11: warning: symbol 'max_intel_perfmon_events' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reduce kconfig variable scope and clean up
Bartlomiej pointed out that the config dependencies and comments are not right.
update it depend to NUMA, and fix some comments
Reported-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On some machines it may be necessary to disable the saving/restoring
of the ACPI NVS memory region during hibernation/resume. For this
purpose, introduce new ACPI kernel command line option
acpi_sleep=s4_nonvs.
Based on a patch by Zhang Rui.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
Introduce new initcall for marking the ACPI NVS memory at startup, so
that it can be saved/restored during hibernation/resume.
Based on a patch by Zhang Rui.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
this warning:
arch/x86/kernel/io_apic.c: In function ‘ir_set_msi_irq_affinity’:
arch/x86/kernel/io_apic.c:3373: warning: ‘cfg’ may be used uninitialized in this function
triggers because the variable was truly uninitialized. We'd crash on
entering this code.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix wrong cache sharing detection on platforms supporting > 8 bit apicid's
In the presence of extended topology eumeration leaf 0xb provided
by cpuid, 32bit extended initial_apicid in cpuinfo_x86 struct will be
updated by detect_extended_topology(). At this instance, we should also
reinit the apicid (which could also potentially be extended to 32bit).
With out this there will potentially be duplicate apicid's populated in the
per cpu's cpuinfo_x86 struct, resulting in wrong cache sharing topology etc
detected by init_intel_cacheinfo().
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
this warning:
arch/x86/kernel/microcode_amd.c: In function ‘apply_microcode_amd’:
arch/x86/kernel/microcode_amd.c:163: warning: cast from pointer to integer of different size
arch/x86/kernel/microcode_amd.c:163: warning: cast from pointer to integer of different size
triggers because we want to pass the address to the microcode MSR,
which is 64-bit even on 32-bit. Cast it explicitly to express this.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Use rt_sigframe_ia32 instead of rt_sigframe32.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup, avoid sparse warning
Included asm/idle.h for c1e_remove_cpu() declaration. Fixes this
sparse warning:
CHECK arch/x86/kernel/process.c
arch/x86/kernel/process.c:284:6: warning: symbol 'c1e_remove_cpu' was not declared. Should it be static?
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove annoying bootup printk
It's perfectly normal for no AMD GART to be present, e.g., if you have
Intel CPUs. None of the other iommu_init() functions makes noise when
it finds nothing.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new boot parameter
Use possible_cpus=NUM kernel parameter to extend the number of possible
cpus.
The ability to HOTPLUG ON cpus that are "possible" but not "present" is
dealt with in a later patch.
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: fix potential APIC crash
In determining the destination apicid, there are usually three cpumasks
that are considered: the incoming cpumask arg, cfg->domain and the
cpu_online_mask. Since we are just introducing the cpu_mask_to_apicid_and
function, make sure it includes the cpu_online_mask in it's evaluation.
[Added with this patch.]
There are two io_apic.c functions that did not previously use the
cpu_online_mask: setup_IO_APIC_irq and msi_compose_msg. Both of these
simply used cpu_mask_to_apicid(cfg->domain & TARGET_CPUS), and all but
one arch (NUMAQ[*]) returns only online cpus in the TARGET_CPUS mask,
so the behavior is identical for all cases.
[*: NUMAQ bug?]
Note that alloc_cpumask_var is only used for the 32-bit cases where
it's highly likely that the cpumask set size will be small and therefore
CPUMASK_OFFSTACK=n. But if that's not the case, failing the allocate
will cause the same return value as the default.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This done for conflict prevention: we merge it into the cpus4096 tree
because upcoming cpumask changes will touch apic.c that would collide
with x86/apic otherwise.
Impact: cleanup, move header file
Move arch/x86/kernel/sigframe.h to arch/x86/include/asm/sigframe.h.
It will be used in arch/x86/ia32/ia32_signal.c.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, prepare to move sigframe.h
On 32-bit, rename struct sigrame to struct sigframe_ia32, struct rt_sigframe
to struct rt_sigframe_ia32 and several structures.
And add helper macros to access the above data in arch/x86/kernel/signal.c.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move declarations of ia32_setup_rt_frame() and ia32_setup_frame() into
arch/x86/kernel/signal.c.
This is for future use of sigframe.h.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Xen will override these later on.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Xen will override these later on.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
swiotlb on 32 bit will be used by Xen domain 0 support.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add x86-specific swiotlb allocation functions. These are purely
default for the moment.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Remove cpumask_t's from stack.
Simple transition to work_on_cpu(), rather than cpumask games.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robert Richter <robert.richter@amd.com>
Cc: jacob.shin@amd.com
Impact: remove cpumask_t from stack.
We should not try to save and restore cpus_allowed on current.
We can't use work_on_cpu() here, since it's in the hotplug cpu path
(if anyone else tries to get the hotplug lock from a workqueue we
could deadlock against them).
Fortunately, we can just use smp_call_function_single() since the
function can run from an interrupt.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Impact: use new API
Use the accessors rather than frobbing bits directly. Most of this is
in arch code I haven't even compiled, but is straightforward.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: cleanup, futureproof
In fact, all cpumask ops will only be valid (in general) for bit
numbers < nr_cpu_ids. So use that instead of NR_CPUS in various
places.
This is always safe: no cpu number can be >= nr_cpu_ids, and
nr_cpu_ids is initialized to NR_CPUS at boot.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
This patch simply changes cpumask_t to struct cpumask and similar
trivial modernizations.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: cleanup, remove on-stack cpumask.
The "map" arg is always cpu_online_mask. Importantly, set_affinity
always ands the argument with cpu_online_mask anyway, so we don't need
to do it in fixup_irqs(), avoiding a temporary.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: cleanup, consolidate patches, use new API
Consolidate the following into a single patch to adapt to new
sparseirq code in arch/x86/kernel/io_apic.c, add allocation of
cpumask_var_t's in domain and old_domain, and reduce further
merge conflicts. Only one file (arch/x86/kernel/io_apic.c) is
changed in all of these patches.
0006-x86-io_apic-change-irq_cfg-domain-old_domain-to.patch
0007-x86-io_apic-set_desc_affinity.patch
0008-x86-io_apic-send_cleanup_vector.patch
0009-x86-io_apic-eliminate-remaining-cpumask_ts-from-st.patch
0021-x86-final-cleanups-in-io_apic-to-use-new-cpumask-AP.patch
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: use updated APIs
Various API updates for x86:add-cpu_mask_to_apicid_and
(Note: separate because previous patch has been "backported" to 2.6.27.)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Impact: new API
Add a helper function that takes two cpumask's, and's them and then
returns the apicid of the result. This removes a need in io_apic.c
that uses a temporary cpumask to hold (mask & cfg->domain).
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup, better debugging
This has proven useful in debugging, *before* we try to use
for_each_possible_cpu(). It also now shows nr_cpumask_bits.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup, change parameter passing
* Change genapic interfaces to accept cpumask_t pointers where possible.
* Modify external callers to use cpumask_t pointers in function calls.
* Create new send_IPI_mask_allbutself which is the same as the
send_IPI_mask functions but removes smp_processor_id() from list.
This removes another common need for a temporary cpumask_t variable.
* Functions that used a temp cpumask_t variable for:
cpumask_t allbutme = cpu_online_map;
cpu_clear(smp_processor_id(), allbutme);
if (!cpus_empty(allbutme))
...
become:
if (!cpus_equal(cpu_online_map, cpumask_of_cpu(cpu)))
...
* Other minor code optimizations (like using cpus_clear instead of
CPU_MASK_NONE, etc.)
Applies to linux-2.6.tip/master.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
Itroduce MCOUNT_SAVE/RESTORE_FRAME which allow us to
save a number of lines on source level.
Also fix a comment in ftrace.h.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve NUMA handling by migrating irq_desc on smp_affinity changes
if CONFIG_NUMA_MIGRATE_IRQ_DESC is set:
- make irq_desc to go with affinity aka irq_desc moving etc
- call move_irq_desc in irq_complete_move()
- legacy irq_desc is not moved, because they are allocated via static array
for logical apic mode, need to add move_desc_in_progress_in_same_domain,
otherwise it will not be moved ==> also could need two phases to get
irq_desc moved.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar wrote:
> allyes64 build failure:
>
> arch/x86/kernel/io_apic.c: In function ‘set_ir_ioapic_affinity_irq_desc’:
> arch/x86/kernel/io_apic.c:2295: error: incompatible type for argument 2 of
> ‘migrate_ioapic_irq_desc’
> arch/x86/kernel/io_apic.c: In function ‘ir_set_msi_irq_affinity’:
> arch/x86/kernel/io_apic.c:3205: error: incompatible type for argument 2 of
> ‘set_extra_move_desc’
> make[1]: *** wait: No child processes. Stop.
Here's a small patch to correct the build error with the post-merge tree.
Built and boot-tested. I'll will reset the follow on patches in my brand
new git tree to accommodate this change.
Fix two references in io_apic.c that were incorrect.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Call signal_fault() in error route of sys_sigreturn().
Change log level to KERN_EMERG if current is init.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crash
xpc needs to pass the physical address, not virtual.
Testing uncovered this problem. The virtual address happens to work
most of the time due to the way bios was masking off the node bits.
Passing the physical address makes it work all of the time.
Signed-off-by: Russ Anderson <rja@sgi.com>
Acked-by: Dean Nelson <dcn@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix disabled MCE after resume
Don't prevent multiple initialization of MCEs.
Back from early prehistory mcheck_init() has a reentry check. Presumably
that was needed in very old kernels to prevent it entering twice.
But as Andreas points out this prevents CPU hotplug (and therefore resume)
to correctly reinitialize MCEs when a AP boots again after being
offlined.
Just drop the check.
Reported-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix UV boot crash
This fixes a UV bug related to generating global memory addresses
on partitioned systems. Partition systems do not have physical memory
at address 0. Instead, a chunk of high memory is remapped by the chipset
so that it appears to be at address 0. This remapping is INVISIBLE to most
of the OS. The only OS functions that need to be aware of the remaping are
functions that directly interface to the chipset. The GRU is one example.
Also, delete a couple of unused macros related to global memory addresses.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, remove dead debug code
I ran across some old debugging code in vmi paravirt-ops code that was
already dead, but still potentially useful. After reviewing recent
changes to the way kernel page tables are allocated and initialized, and
the lack of bugs caught by this debugging code, I've concluded it is now
totally useless to have around, and it's already been #if 0'd for quite
some time.
There's no rush to get this in mainline, but it's also totally harmless,
so I'll let the x86 maintainers decide where it should be tucked. I've
been out of the mainstream dev loop for a couple months, so apologies if
I haven't got any protocol changes in order.
Remove mummified remains found in vmi_32.c
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: prepare the hpet code for Xen dom0 booting
When booting in Xen dom0, the hpet isn't really accessible, so make
sure the mapping is non-NULL before use.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, remove dead code
The last usage was removed by the patch set culminating in
| commit e3c449f526
| Author: Joerg Roedel <joerg.roedel@amd.com>
| Date: Wed Oct 15 22:02:11 2008 -0700
|
| x86, AMD IOMMU: convert driver to generic iommu_num_pages function
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reward non-stop TSCs with good TSC-based clocksources, etc.
Add support for CPUID_0x80000007_Bit8 on Intel CPUs as well. This bit means
that the TSC is invariant with C/P/T states and always runs at constant
frequency.
With Intel CPUs, we have 3 classes
* CPUs where TSC runs at constant rate and does not stop n C-states
* CPUs where TSC runs at constant rate, but will stop in deep C-states
* CPUs where TSC rate will vary based on P/T-states and TSC will stop in deep
C-states.
To cover these 3, one feature bit (CONSTANT_TSC) is not enough. So, add a
second bit (NONSTOP_TSC). CONSTANT_TSC indicates that the TSC runs at
constant frequency irrespective of P/T-states, and NONSTOP_TSC indicates
that TSC does not stop in deep C-states.
CPUID_0x8000000_Bit8 indicates both these feature bit can be set.
We still have CONSTANT_TSC _set_ and NONSTOP_TSC _not_set_ on some older Intel
CPUs, based on model checks. We can use TSC on such CPUs for time, as long as
those CPUs do not support/enter deep C-states.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: auto-enable HPET on Fujitsu u9200
HPET timer is listed in the ACPI table, but needs a quirk entry in order to
work. Unfortunately, the quirk code runs after first HPET hpet_enable() which
has already determined that the timer doesn't work (reads 0xFFFFFFFF). This
patch allows hpet_enable() to be called again after running the quirk code.
Signed-off-by: Janne Kulmala <janne.t.kulmala@tut.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove dead/incorrect code
Currently there is no chipset specific ucode. The checks are incorrect
anyway (e.g. pci device IDs are 16 bit and not 8 bit).
Thus I remove the stuff for the time being and will reintroduce it if
it's foreseeable that it is really needed.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix build warning
CC arch/x86/kernel/microcode_amd.o
arch/x86/kernel/microcode_amd.c: In function ‘request_microcode_fw’:
arch/x86/kernel/microcode_amd.c:393: warning: passing argument 2 of ‘generic_load_microcode’ discards qualifiers from pointer target type
(Respect "const" qualifier of firmware->data.)
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bug resulting in non-loaded AMD microcode
mc_header->processor_rev_id is a 2 byte value. Similar is true for
equiv_cpu in an equiv_cpu_entry -- only 2 bytes are of interest.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change the reporting of empty BTS records
Correctly report a cleared BTS record as invalid. Used to be reported
as branch from 0 to 0.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Optimistically allocate a DS context. It is extremely unlikely that
one already existed. This simplifies the code a lot.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
VMI initialiation can relocate the fixmap, causing early_ioremap to
malfunction if it is initialized before the relocation. To fix this,
VMI activation is split into two phases; the detection, which must
happen before setting up ioremap, and the activation, which must happen
after parsing early boot parameters.
This fixes a crash on boot when VMI is enabled under VMware.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: restructure code
Change counter math from absolute values to clear delta logic.
We try to extract elapsed deltas from the raw hw counter - and put
that into the generic counter.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change calling convention of existing clock_event APIs
struct clock_event_timer's cpumask field gets changed to take pointer,
as does the ->broadcast function.
Another single-patch change. For safety, we BUG_ON() in
clockevents_register_device() if it's not set.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Impact: change existing irq_chip API
Not much point with gentle transition here: the struct irq_chip's
setaffinity method signature needs to change.
Fortunately, not widely used code, but hits a few architectures.
Note: In irq_select_affinity() I save a temporary in by mangling
irq_desc[irq].affinity directly. Ingo, does this break anything?
(Folded in fix from KOSAKI Motohiro)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: jeremy@xensource.com
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Impact: change calling convention of existing cpumask APIs
Most cpumask functions started with cpus_: these have been replaced by
cpumask_ ones which take struct cpumask pointers as expected.
These four functions don't have good replacement names; fortunately
they're rarely used, so we just change them over.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: paulus@samba.org
Cc: mingo@redhat.com
Cc: tony.luck@intel.com
Cc: ralf@linux-mips.org
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: cl@linux-foundation.org
Cc: srostedt@redhat.com
Impact: cleanup
Each SMP arch defines these themselves. Move them to a central
location.
Twists:
1) Some archs (m32, parisc, s390) set possible_map to all 1, so we add a
CONFIG_INIT_ALL_POSSIBLE for this rather than break them.
2) mips and sparc32 '#define cpu_possible_map phys_cpu_present_map'.
Those archs simply have phys_cpu_present_map replaced everywhere.
3) Alpha defined cpu_possible_map to cpu_present_map; this is tricky
so I just manipulate them both in sync.
4) IA64, cris and m32r have gratuitous 'extern cpumask_t cpu_possible_map'
declarations.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Mike Travis <travis@sgi.com>
Cc: ink@jurassic.park.msu.ru
Cc: rmk@arm.linux.org.uk
Cc: starvik@axis.com
Cc: tony.luck@intel.com
Cc: takata@linux-m32r.org
Cc: ralf@linux-mips.org
Cc: grundler@parisc-linux.org
Cc: paulus@samba.org
Cc: schwidefsky@de.ibm.com
Cc: lethal@linux-sh.org
Cc: wli@holomorphy.com
Cc: davem@davemloft.net
Cc: jdike@addtoit.com
Cc: mingo@redhat.com
Impact: remove obsolete code
The later versions of SimNow! actually all have serial console
emulation, so the direct interface isn't needed anymore. So remove
the undocumented simnow earlyprintk console.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: remove deprecated export
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We merge the irq/sparseirq, x86/quirks and x86/reboot trees into the
cpus4096 tree because the io-apic changes in the sparseirq change
conflict with the cpumask changes in the cpumask tree, and we
want to resolve those.
Impact: move most important x86 irq entry-points to a separate subsection
Annotate do_IRQ and smp_apic_timer_interrupt to put them into the .irqentry.text
subsection. These function will so be recognized as hardirq entrypoints for the
function-graph-tracer. We could also annotate other irq entries but the others
are far less important but they can be added on request.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: let the function-graph-tracer be aware of the irq entrypoints
Add a new .irqentry.text section to store the irq entrypoints functions
inside the same section. This way, the tracer will be able to signal
an interrupts triggering on output by recognizing these entrypoints.
Also, make this section recordable for dynamic tracing.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
arch/x86/kernel/ds.c: In function 'ds_request':
arch/x86/kernel/ds.c:236: sorry, unimplemented: inlining failed in call to 'ds_get_context': recursive inlining
but the recursion here is scary ...
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move the BTS bits from ptrace.c into ds.c.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make the ds code more debuggable
Turn BUG_ON's into WARN_ON_ONCE.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Introduce a proper enum for the 3 states of a counter:
PERF_COUNTER_STATE_OFF = -1
PERF_COUNTER_STATE_INACTIVE = 0
PERF_COUNTER_STATE_ACTIVE = 1
and rename counter->active to counter->state and propagate the
changes everywhere.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Rename them to better match up the usual IRQ disable/enable APIs:
hw_perf_disable_all() => hw_perf_save_disable()
hw_perf_restore_ctrl() => hw_perf_restore()
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new perf-counter type
The 'CPU clock' counter counts the amount of CPU clock time that is
elapsing, in nanoseconds. (regardless of how much of it the task is
spending on a CPU executing)
This counter type is a Linux kernel based abstraction, it is available
even if the hardware does not support native hardware performance counters.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: restructure code, introduce hw_ops driver abstraction
Introduce this abstraction to handle counter details:
struct hw_perf_counter_ops {
void (*hw_perf_counter_enable) (struct perf_counter *counter);
void (*hw_perf_counter_disable) (struct perf_counter *counter);
void (*hw_perf_counter_read) (struct perf_counter *counter);
};
This will be useful to support assymetric hw details, and it will also
be useful to implement "software counters". (Counters that count kernel
managed sw events such as pagefaults, context-switches, wall-clock time
or task-local time.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add group counters
This patch adds the "counter groups" abstraction.
Groups of counters behave much like normal 'single' counters, with a
few semantic and behavioral extensions on top of that.
A counter group is created by creating a new counter with the open()
syscall's group-leader group_fd file descriptor parameter pointing
to another, already existing counter.
Groups of counters are scheduled in and out in one atomic group, and
they are also roundrobin-scheduled atomically.
Counters that are member of a group can also record events with an
(atomic) extended timestamp that extends to all members of the group,
if the record type is set to PERF_RECORD_GROUP.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up new API
Thorough cleanup of the new perf counters API, we now get clean separation
of the various concepts:
- introduce perf_counter_hw_event to separate out the event source details
- move special type flags into separate attributes: PERF_COUNT_NMI,
PERF_COUNT_RAW
- extend the type to u64 and reserve it fully to the architecture in the
raw type case.
And make use of all these changes in the core and x86 perfcounters code.
Also change the syscall signature to:
asmlinkage int sys_perf_counter_open(
struct perf_counter_hw_event *hw_event_uptr __user,
pid_t pid,
int cpu,
int group_fd);
( Note that group_fd is unused for now - it's reserved for the counter
groups abstraction. )
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change syscall, cleanup
Make use of the new perf_counters event type.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix rare lost events problem
There are CPUs whose performance counters misbehave on CSTATE transitions,
so provide a way to just disable/enable them around deep idle methods.
(hw_perf_enable_all() is cheap on x86.)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix spurious missed counter wakeups
In the case of NMI events, close a race window that can occur if an NMI
hits counter code that temporarily disables+enables a counter, and the NMI
leaks into the disabled section.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
these warnings:
arch/x86/kernel/paravirt-spinlocks.c: In function ‘default_spin_lock_flags’:
arch/x86/kernel/paravirt-spinlocks.c:12: warning: passing argument 1 of ‘__raw_spin_lock’ from incompatible pointer type
arch/x86/kernel/paravirt-spinlocks.c: At top level:
arch/x86/kernel/paravirt-spinlocks.c:11: warning: ‘default_spin_lock_flags’ defined but not used
showed that the prototype of default_spin_lock_flags() was confused about
what type spinlocks have.
the proper type on UP is raw_spinlock_t.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make perfcounter NMI and IRQ sequence more robust
Make __smp_perf_counter_interrupt() a bit more conservative: first disable
all counters, then read out the status. Most invocations are because there
are real events, so there's no performance impact.
Code flow gets a bit simpler as well this way.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement performance counters for x86 Intel CPUs.
It's simplified right now: the PERFMON CPU feature is assumed,
which is available in Core2 and later Intel CPUs.
The design is flexible to be extended to more CPU types as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup on 32-bit
Peter pointed this parameter can be changed.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Provide a way to pause the function graph tracer
As suggested by Steven Rostedt, the previous patch that prevented from
spinlock function tracing shouldn't use the raw_spinlock to fix it.
It's much better to follow lockdep with normal spinlock, so this patch
adds a new flag for each task to make the function graph tracer able
to be paused. We also can send an ftrace_printk whithout worrying of
the irrelevant traced spinlock during insertion.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: trace more functions
When the function graph tracer is configured, three more files are not
traced to prevent only four functions to be traced. And this impacts the
normal function tracer too.
arch/x86/kernel/process_64/32.c:
I had crashes when I let this file traced. After some debugging, I saw
that the "current" task point was changed inside__swtich_to(), ie:
"write_pda(pcurrent, next_p);" inside process_64.c Since the tracer store
the original return address of the function inside current, we had
crashes. Only __switch_to() has to be excluded from tracing.
kernel/module.c and kernel/extable.c:
Because of a function used internally by the function graph tracer:
__kernel_text_address()
To let the other functions inside these files to be traced, this patch
introduces the __notrace_funcgraph function prefix which is __notrace if
function graph tracer is configured and nothing if not.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
reorder exit path in __get_smp_config().
also move two print outs to acpi_process_madt
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: simplify code
Pass irq_desc and cfg around, instead of raw IRQ numbers - this way
we dont have to look it up again and again.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: sanitize MSI irq number ordering from top-down to bottom-up
Increase new MSI IRQs starting from nr_irqs_gsi (which is somewhere below
256), instead of decreasing from NR_IRQS. (The latter method can result
in confusingly high IRQ numbers - if NR_CPUS is set to a high value and
NR_IRQS scales up to a high value.)
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new feature
Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
NR_CPUS set to large values. The goal is to be able to scale up to much
larger NR_IRQS value without impacting the (important) common case.
To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
irq_desc pointers.
When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
this also makes the IRQ descriptors NUMA-local (to the site that calls
request_irq()).
This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
uses desc->chip_data for x86 to store irq_cfg.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix trampoline sizing bug, save space
While debugging a suspend-to-RAM related issue it occured to me that
if the trampoline code had grown past 4 KB, we would have been
allocating too little memory for it, since the 4 KB size of the
trampoline is hardcoded into arch/x86/kernel/e820.c . Change that
by making the kernel compute the trampoline size and allocate as much
memory as necessary.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add Celeron Core support to p4-clockmod.
Signed-off-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Signed-off-by: Dave Jones <davej@redhat.com>
Add additional fsb values to pentium_core_get_frequency, from latest edition
(September 2008) of Intel 64 and IA-32 Architectures Software Develper's Manual,
Volume 3B: System Programming Guide, Part 2. Values added are to detect 800,
1067 and 1333 FSB types.
Signed-off-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
Signed-off-by: Dave Jones <davej@redhat.com>
p4-clockmod has a long history of abuse. It pretends to be a CPU
frequency scaling driver, even though it doesn't actually change
the CPU frequency, but instead just modulates the frequency with
wait-states.
The biggest misconception is that when running at the lower 'frequency'
p4-clockmod is saving power. This isn't the case, as workloads running
slower take longer to complete, preventing the CPU from entering deep C states.
However p4-clockmod does have a purpose. It can prevent overheating.
Having it hooked up to the cpufreq interfaces is the wrong way to achieve
cooling however. It should instead be hooked up to ACPI.
This diff introduces a means for a cpufreq driver to register with the
cpufreq core, but not present a sysfs interface.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
On those CPUs which are SpeedStep (EST) capable, we do not care at all if
p4-clockmod does not work, since a technically superior CPU frequency
management technology is to be used.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: cleanup
1) The #ifdef CONFIG_HOTPLUG_CPU seems unnecessary these days.
2) The loop can simply skip over offline cpus, rather than creating a tmp mask.
3) set_mask is set to either a single cpu or all online cpus in a policy.
Since it's just used for set_cpus_allowed(), any offline cpus in a policy
don't matter, so we can just use cpumask_of_cpu() or the policy->cpus.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Just come across this when booting on an old hw..
Looks somewhat ugly, that single missing space ;)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot crash with numcpus=0 on certain systems
Fix early exception in __get_smp_config with nosmp.
Bail out early when there is no MP table.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The access to the iommu->need_sync member needs to be protected by the
iommu->lock. Otherwise this is a possible race condition. Fix it with
this patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In some rare cases a request can arrive an IOMMU with its originial
requestor id even it is aliased. Handle this by setting the device table
entry to the same protection domain for the original and the aliased
requestor id.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: remove stale IOTLB entries
In the non-default nofullflush case the GART is only flushed when
next_bit wraps around. But it can happen that an unmap operation unmaps
memory which is behind the current next_bit location. If these addresses
are reused it may result in stale GART IO/TLB entries. Fix this by
setting the GART next_bit always behind an unmapped location.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Import: robustness checks
Add more checks in the function graph code to detect errors and
perhaps print out better information if a bug happens.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature, let entry function decide to trace or not
This patch lets the graph tracer entry function decide if the tracing
should be done at the end as well. This requires all function graph
entry functions return 1 if it should trace, or 0 if the return should
not be traced.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: better dumpstack output
I noticed in my crash dumps and even in the stack tracer that a
lot of functions listed in the stack trace are simply
return_to_handler which is ftrace graphs way to insert its own
call into the return of a function.
But we lose out where the actually function was called from.
This patch adds in hooks to the dumpstack mechanism that detects
this and finds the real function to print. Both are printed to
let the user know that a hook is still in place.
This does give a funny side effect in the stack tracer output:
Depth Size Location (80 entries)
----- ---- --------
0) 4144 48 save_stack_trace+0x2f/0x4d
1) 4096 128 ftrace_call+0x5/0x2b
2) 3968 16 mempool_alloc_slab+0x16/0x18
3) 3952 384 return_to_handler+0x0/0x73
4) 3568 -240 stack_trace_call+0x11d/0x209
5) 3808 144 return_to_handler+0x0/0x73
6) 3664 -128 mempool_alloc+0x4d/0xfe
7) 3792 128 return_to_handler+0x0/0x73
8) 3664 -32 scsi_sg_alloc+0x48/0x4a [scsi_mod]
As you can see, the real functions are now negative. This is due
to them not being found inside the stack.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new ftrace_graph_stop function
While developing more features of function graph, I hit a bug that
caused the WARN_ON to trigger in the prepare_ftrace_return function.
Well, it was hard for me to find out that was happening because the
bug would not print, it would just cause a hard lockup or reboot.
The reason is that it is not safe to call printk from this function.
Looking further, I also found that it calls unregister_ftrace_graph,
which grabs a mutex and calls kstop machine. This would definitely
lock the box up if it were to trigger.
This patch adds a fast and safe ftrace_graph_stop() which will
stop the function tracer. Then it is safe to call the WARN ON.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: consistency change for function graph
This patch makes function graph record the mcount caller address
the same way the function tracer does.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
There exists macros for x86 asm to handle x86_64 and i386.
This patch updates function graph asm to use them.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge x86/dumpstack into tracing/ftrace because upcoming ftrace changes
depend on cleanups already in x86/dumpstack.
Also merge to latest upstream -rc.
Impact: remove stale IOTLB entries
In the non-default nofullflush case the GART is only flushed when
next_bit wraps around. But it can happen that an unmap operation unmaps
memory which is behind the current next_bit location. If these addresses
are reused it may result in stale GART IO/TLB entries. Fix this by
setting the GART next_bit always behind an unmapped location.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove CONFIG_APM_REAL_MODE_POWER_OFF like CONFIG_APM_POWER_OFF which
has been done for linux-2.2.14pre8 (http://lkml.org/lkml/1999/11/23/3).
Re-introducing CONFIG_APM_POWER_OFF got nack-ed. Stephen didn't bother
to remove CONFIG_APM_REAL_MODE_POWER_OFF, let's get rid of it now.
Reference: http://lkml.org/lkml/2008/5/7/97
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend and enable the function graph tracer to 64-bit x86
This patch implements the support for function graph tracer under x86-64.
Both static and dynamic tracing are supported.
This causes some small CPP conditional asm on arch/x86/kernel/ftrace.c I
wanted to use probe_kernel_read/write to make the return address
saving/patching code more generic but it causes tracing recursion.
That would be perhaps useful to implement a notrace version of these
function for other archs ports.
Note that arch/x86/process_64.c is not traced, as in X86-32. I first
thought __switch_to() was responsible of crashes during tracing because I
believed current task were changed inside but that's actually not the
case (actually yes, but not the "current" pointer).
So I will have to investigate to find the functions that harm here, to
enable tracing of the other functions inside (but there is no issue at
this time, while process_64.c stays out of -pg flags).
A little possible race condition is fixed inside this patch too. When the
tracer allocate a return stack dynamically, the current depth is not
initialized before but after. An interrupt could occur at this time and,
after seeing that the return stack is allocated, the tracer could try to
trace it with a random uninitialized depth. It's a prevention, even if I
hadn't problems with it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
irq.h: fix missing/extra kernel-doc
genirq: __irq_set_trigger: change pr_warning to pr_debug
irq: fix typo
x86: apic honour irq affinity which was set in early boot
genirq: fix the affinity setting in setup_irq
genirq: keep affinities set from userspace across free/request_irq()
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: always define DECLARE_PCI_UNMAP* macros
x86: fixup config space size of CPU functions for AMD family 11h
x86, bts: fix wrmsr and spinlock over kmalloc
x86, pebs: fix PEBS record size configuration
x86, bts: turn macro into static inline function
x86, bts: exclude ds.c from build when disabled
arch/x86/kernel/pci-calgary_64.c: change simple_strtol to simple_strtoul
x86: use limited register constraint for setnz
xen: pin correct PGD on suspend
x86: revert irq number limitation
x86: fixing __cpuinit/__init tangle, xsave_cntxt_init()
x86: fix __cpuinit/__init tangle in init_thread_xstate()
oprofile: fix an overflow in ppro code
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq:
[CPUFREQ] powernow-k8: ignore out-of-range PstateStatus value
[CPUFREQ] Documentation: Add Blackfin to list of supported processors
Impact: update comment
Clarify that too small aperture is valid reason for this code.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove dead code
If we take a closer look at the rff_trace/rff_action ret_from_fork code,
we have to realize that it does all the wrong things: for example it
checks the TIF flag - while later on jumping back to the ret-from-syscall
path - duplicating the check needlessly.
But checking for _TIF_SYSCALL_TRACE is completely unnecessary here because
we clear that flag for every freshly forked task. So the whole "tracing"
code here, for which there is a out of line jump optimization that makes
it even harder to read, is in reality completely dead code ...
Reported-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Cyrill Gorcunov <gorcunov@gmail.com>
We merge this branch because x86/debug touches code that we started
cleaning up in x86/irq. The two branches started out independent,
but as unexpected amount of activity went into x86/irq, they became
dependent. Resolve that by this cross-merge.
child_rip is called not by its name but indirectly
rather so make it global and aligned.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
entry_32.S is now the only user of KPROBE_ENTRY / KPROBE_END,
treewide. This patch reorders entry_64.S and explicitly generates
a separate section for functions that need the protection. The
generated code before and after the patch is equal.
The KPROBE_ENTRY and KPROBE_END macro's are removed too.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up assembly macros and annotations - with some object impact
entry_64.S is the only user of KPROBE_ENTRY / KPROBE_END on
x86_64. This patch reorders entry_64.S and explicitly generates
a separate section for functions that need the protection. The
generated code before and after the patch is equal.
Implicitly changing sections in assembly files makes it more
difficult to follow why the assembler is doing certain things.
For example,
.p2align 5
KPROBE_ENTRY(...)
was not doing what you would expect. Other section changes
(__ex_table, .fixup, .init.rodata) are done explicitly already.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make global variables and a function static
Fix following sparse warnings:
arch/x86/kernel/microcode_core.c:102:22: warning: symbol
'microcode_ops' was not declared. Should it be static?
arch/x86/kernel/microcode_core.c:206:24: warning: symbol
'microcode_pdev' was not declared. Should it be static?
arch/x86/kernel/microcode_core.c:322:6: warning: symbol
'microcode_update_cpu' was not declared. Should it be static?
arch/x86/kernel/microcode_intel.c:468:22: warning: symbol
'microcode_intel_ops' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new "power-tracer" ftrace plugin
This patch adds a C/P-state ftrace plugin that will generate
detailed statistics about the C/P-states that are being used,
so that we can look at detailed decisions that the C/P-state
code is making, rather than the too high level "average"
that we have today.
An example way of using this is:
mount -t debugfs none /sys/kernel/debug
echo cstate > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
sleep 1
echo 0 > /sys/kernel/debug/tracing/tracing_enabled
cat /sys/kernel/debug/tracing/trace | perl scripts/trace/cstate.pl > out.svg
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: more efficient code for ftrace graph tracer
This patch uses the dynamic patching, when available, to patch
the function graph code into the kernel.
This patch will ease the way for letting both function tracing
and function graph tracing run together.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: feature
This patch sets a C-like output for the function graph tracing.
For this aim, we now call two handler for each function: one on the entry
and one other on return. This way we can draw a well-ordered call stack.
The pid of the previous trace is loosely stored to be compared against
the one of the current trace to see if there were a context switch.
Without this little feature, the call tree would seem broken at
some locations.
We could use the sched_tracer to capture these sched_events but this
way of processing is much more simpler.
2 spaces have been chosen for indentation to fit the screen while deep
calls. The time of execution in nanosecs is printed just after closed
braces, it seems more easy this way to find the corresponding function.
If the time was printed as a first column, it would be not so easy to
find the corresponding function if it is called on a deep depth.
I plan to output the return value but on 32 bits CPU, the return value
can be 32 or 64, and its difficult to guess on which case we are.
I don't know what would be the better solution on X86-32: only print
eax (low-part) or even edx (high-part).
Actually it's thee same problem when a function return a 8 bits value, the
high part of eax could contain junk values...
Here is an example of trace:
sys_read() {
fget_light() {
} 526
vfs_read() {
rw_verify_area() {
security_file_permission() {
cap_file_permission() {
} 519
} 1564
} 2640
do_sync_read() {
pipe_read() {
__might_sleep() {
} 511
pipe_wait() {
prepare_to_wait() {
} 760
deactivate_task() {
dequeue_task() {
dequeue_task_fair() {
dequeue_entity() {
update_curr() {
update_min_vruntime() {
} 504
} 1587
clear_buddies() {
} 512
add_cfs_task_weight() {
} 519
update_min_vruntime() {
} 511
} 5602
dequeue_entity() {
update_curr() {
update_min_vruntime() {
} 496
} 1631
clear_buddies() {
} 496
update_min_vruntime() {
} 527
} 4580
hrtick_update() {
hrtick_start_fair() {
} 488
} 1489
} 13700
} 14949
} 16016
msecs_to_jiffies() {
} 496
put_prev_task_fair() {
} 504
pick_next_task_fair() {
} 489
pick_next_task_rt() {
} 496
pick_next_task_fair() {
} 489
pick_next_task_idle() {
} 489
------------8<---------- thread 4 ------------8<----------
finish_task_switch() {
} 1203
do_softirq() {
__do_softirq() {
__local_bh_disable() {
} 669
rcu_process_callbacks() {
__rcu_process_callbacks() {
cpu_quiet() {
rcu_start_batch() {
} 503
} 1647
} 3128
__rcu_process_callbacks() {
} 542
} 5362
_local_bh_enable() {
} 587
} 8880
} 9986
kthread_should_stop() {
} 669
deactivate_task() {
dequeue_task() {
dequeue_task_fair() {
dequeue_entity() {
update_curr() {
calc_delta_mine() {
} 511
update_min_vruntime() {
} 511
} 2813
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch changes the name of the "return function tracer" into
function-graph-tracer which is a more suitable name for a tracing
which makes one able to retrieve the ordered call stack during
the code flow.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A workaround for AMD CPU family 11h erratum 311 might cause that the
P-state Status Register shows a "current P-state" which is larger than
the "current P-state limit" in P-state Current Limit Register. For the
wrong P-state value there is no ACPI _PSS object defined and
powernow-k8/cpufreq can't determine the proper CPU frequency for that
state.
As a consequence this can cause a panic during boot (potentially with
all recent kernel versions -- at least I have reproduced it with
various 2.6.27 kernels and with the current .28 series), as an
example:
powernow-k8: Found 1 AMD Turion(tm)X2 Ultra DualCore Mobile ZM-82 processors (2 \
)
powernow-k8: 0 : pstate 0 (2200 MHz)
powernow-k8: 1 : pstate 1 (1100 MHz)
powernow-k8: 2 : pstate 2 (600 MHz)
BUG: unable to handle kernel paging request at ffff88086e7528b8
IP: [<ffffffff80486361>] cpufreq_stats_update+0x4a/0x5f
PGD 202063 PUD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 1
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.28-rc3-dirty #16
RIP: 0010:[<ffffffff80486361>] [<ffffffff80486361>] cpufreq_stats_update+0x4a/0\
f
Synaptics claims to have extended capabilities, but I'm not able to read them.<6\
6
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff88006e7528c0
RDX: 00000000ffffffff RSI: ffff88006e54af00 RDI: ffffffff808f056c
RBP: 00000000fffee697 R08: 0000000000000003 R09: ffff88006e73f080
R10: 0000000000000001 R11: 00000000002191c0 R12: ffff88006fb83c10
R13: 00000000ffffffff R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88006fb50740(0000) knlGS:0000000000000000
Unable to initialize Synaptics hardware.
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffff88086e7528b8 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 1, threadinfo ffff88006fb82000, task ffff88006fb816d0)
Stack:
ffff88006e74da50 0000000000000000 ffff88006e54af00 ffffffff804863c7
ffff88006e74da50 0000000000000000 00000000ffffffff 0000000000000000
ffff88006fb83c10 ffffffff8024b46c ffffffff808f0560 ffff88006fb83c10
Call Trace:
[<ffffffff804863c7>] ? cpufreq_stat_notifier_trans+0x51/0x83
[<ffffffff8024b46c>] ? notifier_call_chain+0x29/0x4c
[<ffffffff8024b561>] ? __srcu_notifier_call_chain+0x46/0x61
[<ffffffff8048496d>] ? cpufreq_notify_transition+0x93/0xa9
[<ffffffff8021ab8d>] ? powernowk8_target+0x1e8/0x5f3
[<ffffffff80486687>] ? cpufreq_governor_performance+0x1b/0x20
[<ffffffff80484886>] ? __cpufreq_governor+0x71/0xa8
[<ffffffff80484b21>] ? __cpufreq_set_policy+0x101/0x13e
[<ffffffff80485bcd>] ? cpufreq_add_dev+0x3f0/0x4cd
[<ffffffff8048577a>] ? handle_update+0x0/0x8
[<ffffffff803c2062>] ? sysdev_driver_register+0xb6/0x10d
[<ffffffff8056592c>] ? powernowk8_init+0x0/0x7e
[<ffffffff8048604c>] ? cpufreq_register_driver+0x8f/0x140
[<ffffffff80209056>] ? _stext+0x56/0x14f
[<ffffffff802c2234>] ? proc_register+0x122/0x17d
[<ffffffff802c23a0>] ? create_proc_entry+0x73/0x8a
[<ffffffff8025c259>] ? register_irq_proc+0x92/0xaa
[<ffffffff8025c2c8>] ? init_irq_proc+0x57/0x69
[<ffffffff807fc85f>] ? kernel_init+0x116/0x169
[<ffffffff8020cc79>] ? child_rip+0xa/0x11
[<ffffffff807fc749>] ? kernel_init+0x0/0x169
[<ffffffff8020cc6f>] ? child_rip+0x0/0x11
Code: 05 c5 83 36 00 48 c7 c2 48 5d 86 80 48 8b 04 d8 48 8b 40 08 48 8b 34 02 48\
RIP [<ffffffff80486361>] cpufreq_stats_update+0x4a/0x5f
RSP <ffff88006fb83b20>
CR2: ffff88086e7528b8
---[ end trace 0678bac75e67a2f7 ]---
Kernel panic - not syncing: Attempted to kill init!
In short, aftereffect of the wrong P-state is that
cpufreq_stats_update() uses "-1" as index for some array in
cpufreq_stats_update (unsigned int cpu)
{
...
if (stat->time_in_state)
stat->time_in_state[stat->last_index] =
cputime64_add(stat->time_in_state[stat->last_index],
cputime_sub(cur_time, stat->last_time));
...
}
Fortunately, the wrong P-state value is returned only if the core is
in P-state 0. This fix solves the problem by detecting the
out-of-range P-state, ignoring it, and using "0" instead.
Cc: Mark Langsdorf <mark.langsdorf@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: restructure DS memory allocation to be done by the usage site of DS
Require pre-allocated buffers in ds.h.
Move the BTS buffer allocation for ptrace into ptrace.c.
The pointer to the allocated buffer is stored in the traced task's
task_struct together with the handle returned by ds_request_bts().
Removes memory accounting code.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: generalize the DS code to shared buffers
Change the in-kernel ds.h interface to identify the tracer via a
handle returned on ds_request_~().
Tracers used to be identified via their task_struct.
The changes are required to allow DS to be shared between different
tasks, which is needed for perfmon2 and for ftrace.
For ptrace, the handle is stored in the traced task's task_struct.
This should probably go into a (arch-specific) ptrace context some
time.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sleeping-with-spinlock-held bugs/crashes
- Turn a wrmsr to write the DS_AREA MSR into a wrmsrl.
- Use irqsave variants of spinlocks.
- Do not allocate memory while holding spinlocks.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix DS hw enablement on 64-bit x86
Fix the PEBS record size in the DS configuration.
Reported-by: Stephane Eranian <eranian@googlemail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move the CONFIG guard from the .c file into the makefile.
Reported-by: Andi Kleen <andi-suse@firstfloor.org>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix theoretical option string parsing overflow
Since bridge is unsigned, it would seem better to use simple_strtoul that
simple_strtol.
A simplified version of the semantic patch that makes this change is as
follows: (http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r2@
long e;
position p;
@@
e = simple_strtol@p(...)
@@
position p != r2.p;
type T;
T e;
@@
e =
- simple_strtol@p
+ simple_strtoul
(...)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: muli@il.ibm.com
Cc: jdmason@kudzu.us
Cc: discuss@x86-64.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix build warning
this warning:
arch/x86/kernel/hpet.c:36: warning: ‘hpet_num_timers’ defined but not used
Triggers because hpet_num_timers is unused in the !CONFIG_PCI_MSI case.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reverts sequence of reboot fallbacks
Checkin 14d7ca5c57 changed the default
reboot method to "pci", a.k.a. port CF9. Unfortunately this has been
shown to cause lockups on at least two systems for which REBOOT_KBD
worked, both Thinkpads with Intel chipsets. Checkin
3889d0cea2 reverted the default, but did
not revert the fallback chain. This checkin reverts the fallback
chain; port CF9 is now only done by explicit "reboot=pci" or a future
potential DMI key.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: make global variable static
Fix this sparse warning:
arch/x86/kernel/hpet.c:36:18: warning: symbol 'hpet_num_timers' was
not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve backtrace quality
avoid the confusion in call trace because of the lack of padding at the
tail of function.
When do_exit gets called, the return address behind call instruction is
pushed into stack. If something get wrong in do_exit, for x86_64, the
entry "kernel_execve +0x00/0xXX" rather than "child_rip +0xYY/0xZZ" is
in the call trace.
That looks confusing, so add a u2d to make the return address still part
of the original call site. (This also catches any instances of us returning
from that function somehow.)
Signed-off-by: jia zhang <jia.zhang2008@gmail.com>
Acked-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: move some code out of .kprobes.text
KPROBE_ENTRY switches code generation to .kprobes.text, and KPROBE_END
uses .popsection to get back to the previous section (.text, normally).
Also replace ENDPROC by END, for consistency.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup of entry_64.S
Except for the order and the place of the functions, this
patch should not change the generated code.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move recently introduced dwarf2 macros to dwarf2.h file.
It allow us to not duplicate them in assembly files.
Active usage of _cfi macros don't make assembly files
more obvious to understand but we already have a lot of
macros there which requires to search the definitions
of them *anyway*. But at least it make every cfi usage
one line shorter.
Also some code alignment is done.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix MSIx not enough irq numbers available regression
The manual revert of the sparse_irq patches missed to bring the number
of possible irqs back to the .27 status. This resulted in a regression
when two multichannel network cards were placed in a system with only
one IO_APIC - causing the networking driver to not have the right
IRQ and the device not coming up.
Remove the dynamic allocation logic leftovers and simply return
NR_IRQS in probe_nr_irqs() for now.
Fixes: http://lkml.org/lkml/2008/11/19/354
Reported-by: Jesper Dangaard Brouer <hawk@diku.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jesper Dangaard Brouer <hawk@diku.dk>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sparse build warning
Fix the following sparse warnings:
arch/x86/kernel/cpu/hypervisor.c:37:15: warning: symbol
'get_hypervisor_tsc_freq' was not declared. Should it be static?
arch/x86/kernel/cpu/hypervisor.c:53:16: warning: symbol
'init_hypervisor' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: "Alok N Kataria" <akataria@vmware.com>
Cc: "Dan Hecht" <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sparse build warning
Fix the following sparse warnings:
arch/x86/kernel/cpu/vmware.c:69:5: warning: symbol 'vmware_platform'
was not declared. Should it be static?
arch/x86/kernel/cpu/vmware.c:89:15: warning: symbol
'vmware_get_tsc_khz' was not declared. Should it be static?
arch/x86/kernel/cpu/vmware.c:107:16: warning: symbol
'vmware_set_feature_bits' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: "Alok N Kataria" <akataria@vmware.com>
Cc: "Dan Hecht" <dhecht@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Make the headers portion of signal_32.c and signal_64.c the same.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add new (default-off) tracing visualization feature
Usage example:
mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
.... run application ...
echo 0 >tracing_enabled
Then read one of 'trace','latency_trace','trace_pipe'.
To get the best output you can compile your userspace programs with
frame pointers (at least glibc + the app you are tracing).
Signed-off-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use deeper function tracing depth safely
Some tests showed that function return tracing needed a more deeper depth
of function calls. But it could be unsafe to store these return addresses
to the stack.
So these arrays will now be allocated dynamically into task_struct of current
only when the tracer is activated.
Typical scheme when tracer is activated:
- allocate a return stack for each task in global list.
- fork: allocate the return stack for the newly created task
- exit: free return stack of current
- idle init: same as fork
I chose a default depth of 50. I don't have overruns anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Simplify the irq-sampled stack overflow debug check:
- eliminate an #idef
- use WARN_ONCE() to emit a single warning (all bets are off
after the first such warning anyway)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make stack overflow debug check and printout narrower
stack_overflow_check() should consider the stack usage of pt_regs, and
thus it could warn us in advance. Additionally, it looks better for
the warning time to start at INITIAL_JIFFIES.
Assuming that rsp gets close to the check point before interrupt
arrives: when interrupt really happens, thread_info will be partly
overrode.
Signed-off-by: jia zhang <jia.zhang2008@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Reverts default reboot method.
Checkin 14d7ca5c57 changed the default
reboot method to "pci", a.k.a. port CF9. Unfortunately this has been
shown to cause lockups on at least two systems for which REBOOT_KBD
worked, both Thinkpads with Intel chipsets. This reverts the default
to REBOOT_KBD, while leaving the option to have "reboot=pci" specified
explicitly or via a DMI match.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: fix bootup crash
Even though it tested fine for me, there was still a bug in the
first patch: I have overlooked a call to ptregscall_common. This
patch fixes that, I think, but the code is never executed for
me while running a debian install... (I tested this by putting
an "1:jmp 1b" in there.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
DISABLE_INTERRUPTS(CLBR_NONE)/TRACE_IRQS_OFF is now always
executed just before paranoid_exit. Move it there.
Split out paranoidzeroentry, paranoiderrorentry, and
paranoidzeroentry_ist to get more readable macro's.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, shrink kernel image size
Also expand the paranoid_exit0 macro into nmi_exit inside the
nmi stub in the case of enabled irq-tracing.
This gives a few hundred bytes code size reduction.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The save_rest function completes a partial stack frame for use
by the PTREGSCALL macro. This also avoids the indirect call in
PTREGSCALLs.
This adds the macro movq_cfi_restore to hide the CFI_RESTORE
annotation when restoring a register from the stack frame.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
The break builds with older binutils (2.16.1):
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:282: Error: too many positional arguments
arch/x86/kernel/entry_64.S:283: Error: too many positional arguments
arch/x86/kernel/entry_64.S:284: Error: too many positional arguments
arch/x86/kernel/entry_64.S:285: Error: too many positional arguments
arch/x86/kernel/entry_64.S:286: Error: too many positional arguments
arch/x86/kernel/entry_64.S:287: Error: too many positional arguments
arch/x86/kernel/entry_64.S:288: Error: too many positional arguments
arch/x86/kernel/entry_64.S:289: Error: too many positional arguments
arch/x86/kernel/entry_64.S:290: Error: too many positional arguments
Took some time to figure out the detail that GAS chokes on: it's
negative offsets. Rearrange the calculations to make sure we never
go negative.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When we migrate an interrupt from one CPU to another, we set the
move_in_progress flag and clean up the vectors later once they're not
being used. If you're unlucky and call destroy_irq() before the vectors
become un-used, the move_in_progress flag is never cleared, which causes
the interrupt to become unusable.
This was discovered by Jesse Brandeburg for whom it manifested as an
MSI-X device refusing to use MSI-X mode when the driver was unloaded
and reloaded repeatedly.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This add-on patch to x86: move entry_64.S register saving out
of the macros visually cleans up the appearance of the code by
introducing some basic helper macro's. It also adds some cfi
annotations which were missing.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Annotate xsave_cntxt_init() as "can be called outside of __init".
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix incorrect __init annotation
This patch removes the following section mismatch warning. A patch set
was send previously (http://lkml.org/lkml/2008/11/10/407). But
introduce some other problem, reported by Rufus
(http://lkml.org/lkml/2008/11/11/46). Then Ingo Molnar suggest that,
it's best to remove __init from xsave_cntxt_init(void). Which is the
second patch in this series. Now, this one removes the following
warning.
WARNING: arch/x86/kernel/built-in.o(.cpuinit.text+0x2237): Section
mismatch in reference from the function cpu_init() to the function
.init.text:init_thread_xstate()
The function __cpuinit cpu_init() references
a function __init init_thread_xstate().
If init_thread_xstate is only used by cpu_init then
annotate init_thread_xstate with a matching annotation.
Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Here is a combined patch that moves "save_args" out-of-line for
the interrupt macro and moves "error_entry" mostly out-of-line
for the zeroentry and errorentry macros.
The save_args function becomes really straightforward and easy
to understand, with the possible exception of the stack switch
code, which now needs to copy the return address of to the
calling function. Normal interrupts arrive with ((~vector)-0x80)
on the stack, which gets adjusted in common_interrupt:
<common_interrupt>:
(5) addq $0xffffffffffffff80,(%rsp) /* -> ~(vector) */
(4) sub $0x50,%rsp /* space for registers */
(5) callq ffffffff80211290 <save_args>
(5) callq ffffffff80214290 <do_IRQ>
<ret_from_intr>:
...
An apic interrupt stub now look like this:
<thermal_interrupt>:
(5) pushq $0xffffffffffffff05 /* ~(vector) */
(4) sub $0x50,%rsp /* space for registers */
(5) callq ffffffff80211290 <save_args>
(5) callq ffffffff80212b8f <smp_thermal_interrupt>
(5) jmpq ffffffff80211f93 <ret_from_intr>
Similarly the exception handler register saving function becomes
simpler, without the need of any parameter shuffling. The stub
for an exception without errorcode looks like this:
<overflow>:
(6) callq *0x1cad12(%rip) # ffffffff803dd448 <pv_irq_ops+0x38>
(2) pushq $0xffffffffffffffff /* no syscall */
(4) sub $0x78,%rsp /* space for registers */
(5) callq ffffffff8030e3b0 <error_entry>
(3) mov %rsp,%rdi /* pt_regs pointer */
(2) xor %esi,%esi /* no error code */
(5) callq ffffffff80213446 <do_overflow>
(5) jmpq ffffffff8030e460 <error_exit>
And one for an exception with errorcode like this:
<segment_not_present>:
(6) callq *0x1cab92(%rip) # ffffffff803dd448 <pv_irq_ops+0x38>
(4) sub $0x78,%rsp /* space for registers */
(5) callq ffffffff8030e3b0 <error_entry>
(3) mov %rsp,%rdi /* pt_regs pointer */
(5) mov 0x78(%rsp),%rsi /* load error code */
(9) movq $0xffffffffffffffff,0x78(%rsp) /* no syscall */
(5) callq ffffffff80213209 <do_segment_not_present>
(5) jmpq ffffffff8030e460 <error_exit>
Unfortunately, this last type is more than 32 bytes. But the total space
savings due to this patch is about 2500 bytes on an smp-configuration,
and I think the code is clearer than it was before. The tested kernels
were non-paravirt ones (i.e., without the indirect call at the top of
the exception handlers).
Anyhow, I tested this patch on top of a recent -tip. The machine
was an 2x4-core Xeon at 2333MHz. Measured where the delays between
(almost-)adjacent rdtsc instructions. The graphs show how much
time is spent outside of the program as a function of the measured
delay. The area under the graph represents the total time spent
outside the program. Eight instances of the rdtsctest were
started, each pinned to a single cpu. The histogams are added.
For each kernel two measurements were done: one in mostly idle
condition, the other while running "bonnie++ -f", bound to cpu 0.
Each measurement took 40 minutes runtime. See the attached graphs
for the results. The graphs overlap almost everywhere, but there
are small differences.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Fix:
arch/x86/kernel/setup.c:592: warning: 'dmi_low_memory_corruption' defined but not used
this is only used if CONFIG_X86_RESERVE_LOW_64K is defined.
Signed-off-by: Richard A. Holden III <aciddeath@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce size of the kernel image a bit
Fix:
arch/x86/kernel/genx2apic_uv_x.c:403: warning: 'uv_heartbeat_disable' defined but not used
the function is only used when CONFIG_HOTPLUG_CPU is defined.
Signed-off-by: Richard A. Holden III <aciddeath@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dell Optiplex 330 appears to hang on reboot. This is resolved by adding
a quirk to set bios reboot.
Signed-off-by: Leann Ogasawara <leann.ogasawara@canonical.com>
Signed-off-by: Steve Conklin <steve.conklin@canonical.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
We can autodetect those system that need cluster apic, and update genapic
accordingly.
We can also remove wakeup.h for e7000, because it's default one is now
the same as overall default mach_wakecpu.h
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen the reach of the low-memory-protect DMI quirk
Phoenix BIOSes variously identify their vendor as "Phoenix Technologies,
LTD" or "Phoenix Technologies LTD" (without the comma.)
This patch makes the identification string in the bad_bios_dmi_table
more general (following a suggestion by Ingo Molnar), so that both
versions are handled.
Again, the patched file compiles cleanly and the patch has been tested
successfully on my machine.
Signed-off-by: Philipp Kohlbecher <xt28@gmx.de>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: makes device isolation the default for AMD IOMMU
Some device drivers showed double-free bugs of DMA memory while testing
them with AMD IOMMU. If all devices share the same protection domain
this can lead to data corruption and data loss. Prevent this by putting
each device into its own protection domain per default.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
this compiler warning:
arch/x86/kernel/ds.c: In function 'ds_request':
arch/x86/kernel/ds.c:368: warning: 'context' may be used uninitialized in this function
Shows that the code flow in ds_request() is buggy - it goes into
the unlock+release-context path even when the context is not allocated
yet.
First allocate the context, then do the other checks.
Also, take care with GFP allocations under the ds_lock spinlock.
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: help to find the better depth of trace
We decided to arbitrary define the depth of function return trace as
"20". Perhaps this is not enough. To help finding an optimal depth, we
measure now the overrun: the number of functions that have been missed
for the current thread. By default this is not displayed, we have to
do set a particular flag on the return tracer: echo overrun >
/debug/tracing/trace_options And the overrun will be printed on the
right.
As the trace shows below, the current 20 depth is not enough.
update_wall_time+0x37f/0x8c0 -> update_xtime_cache (345 ns) (Overruns: 2838)
update_wall_time+0x384/0x8c0 -> clocksource_get_next (1141 ns) (Overruns: 2838)
do_timer+0x23/0x100 -> update_wall_time (3882 ns) (Overruns: 2838)
tick_do_update_jiffies64+0xbf/0x160 -> do_timer (5339 ns) (Overruns: 2838)
tick_sched_timer+0x6a/0xf0 -> tick_do_update_jiffies64 (7209 ns) (Overruns: 2838)
vgacon_set_cursor_size+0x98/0x120 -> native_io_delay (2613 ns) (Overruns: 274)
vgacon_cursor+0x16e/0x1d0 -> vgacon_set_cursor_size (33151 ns) (Overruns: 274)
set_cursor+0x5f/0x80 -> vgacon_cursor (36432 ns) (Overruns: 274)
con_flush_chars+0x34/0x40 -> set_cursor (38790 ns) (Overruns: 274)
release_console_sem+0x1ec/0x230 -> up (721 ns) (Overruns: 274)
release_console_sem+0x225/0x230 -> wake_up_klogd (316 ns) (Overruns: 274)
con_flush_chars+0x39/0x40 -> release_console_sem (2996 ns) (Overruns: 274)
con_write+0x22/0x30 -> con_flush_chars (46067 ns) (Overruns: 274)
n_tty_write+0x1cc/0x360 -> con_write (292670 ns) (Overruns: 274)
smp_apic_timer_interrupt+0x2a/0x90 -> native_apic_mem_write (330 ns) (Overruns: 274)
irq_enter+0x17/0x70 -> idle_cpu (413 ns) (Overruns: 274)
smp_apic_timer_interrupt+0x2f/0x90 -> irq_enter (1525 ns) (Overruns: 274)
ktime_get_ts+0x40/0x70 -> getnstimeofday (465 ns) (Overruns: 274)
ktime_get_ts+0x60/0x70 -> set_normalized_timespec (436 ns) (Overruns: 274)
ktime_get+0x16/0x30 -> ktime_get_ts (2501 ns) (Overruns: 274)
hrtimer_interrupt+0x77/0x1a0 -> ktime_get (3439 ns) (Overruns: 274)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix wakeup_secondary_cpu with hotplug
We can not put that into x86_quirks, because that is __initdata.
So try to move that to genapic, and add update_genapic in x86_quirks.
later we even could use that stub to:
1. autodetect CONFIG_ES7000_CLUSTERED_APIC
2. more correct inquire_remote_apic with apic_verbosity setting.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix incorrectly marked unstable TSC clock
Patch (commit 0d12cdd "sched: improve sched_clock() performance") has
a regression on one of the test systems here.
With the patch, I see:
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 28 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to check_tsc_sync_source failed
Whereas, without the patch syncs pass fine on all CPUs:
checking TSC synchronization [CPU#0 -> CPU#1]: passed.
Due to this, TSC is marked unstable, when it is not actually unstable.
This is because syncs in check_tsc_wrap() goes away due to this commit.
As per the discussion on this thread, correct way to fix this is to add
explicit syncs as below?
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix secondary-CPU wakeup/init path with numaq and es7000
While looking at wakeup_secondary_cpu for WAKE_SECONDARY_VIA_NMI:
|#ifdef WAKE_SECONDARY_VIA_NMI
|/*
| * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
| * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
| * won't ... remember to clear down the APIC, etc later.
| */
|static int __devinit
|wakeup_secondary_cpu(int logical_apicid, unsigned long start_eip)
|{
| unsigned long send_status, accept_status = 0;
| int maxlvt;
|...
| if (APIC_INTEGRATED(apic_version[phys_apicid])) {
| maxlvt = lapic_get_maxlvt();
I noticed that there is no warning about undefined phys_apicid...
because WAKE_SECONDARY_VIA_NMI and WAKE_SECONDARY_VIA_INIT can not be
defined at the same time. So NUMAQ is using wrong wakeup_secondary_cpu.
WAKE_SECONDARY_VIA_NMI, WAKE_SECONDARY_VIA_INIT and
WAKE_SECONDARY_VIA_MIP are variants of a weird and fragile
preprocessor-driven "HAL" mechanisms to specify the kind of secondary-CPU
wakeup strategy a given x86 kernel will use.
The vast majority of systems want to use INIT for secondary wakeup - NUMAQ
uses an NMI, (old-style-) ES7000 uses 'MIP' (a firmware driven in-memory
flag to let secondaries continue).
So convert these mechanisms to x86_quirks and add a
->wakeup_secondary_cpu() method to specify the rare exception
to the sane default.
Extend genapic accordingly as well, for 32-bit.
While looking further, I noticed that functions in wakecup.h for numaq
and es7000 are different to the default in mach_wakecpu.h - but smpboot.c
will only use default mach_wakecpu.h with smphook.h.
So we need to add mach_wakecpu.h for mach_generic, to properly support
numaq and es7000, and vectorize the following SMP init methods:
int trampoline_phys_low;
int trampoline_phys_high;
void (*wait_for_init_deassert)(atomic_t *deassert);
void (*smp_callin_clear_local_apic)(void);
void (*store_NMI_vector)(unsigned short *high, unsigned short *low);
void (*restore_NMI_vector)(unsigned short *high, unsigned short *low);
void (*inquire_remote_apic)(int apicid);
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
All blame goes to: color white,red "[^[:graph:]]+$"
in .nanorc ;).
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix es7000 build
CC arch/x86/kernel/es7000_32.o
arch/x86/kernel/es7000_32.c: In function find_unisys_acpi_oem_table:
arch/x86/kernel/es7000_32.c:255: error: implicit declaration of function acpi_get_table_with_size
arch/x86/kernel/es7000_32.c:261: error: implicit declaration of function early_acpi_os_unmap_memory
arch/x86/kernel/es7000_32.c: In function unmap_unisys_acpi_oem_table:
arch/x86/kernel/es7000_32.c:277: error: implicit declaration of function __acpi_unmap_table
make[1]: *** [arch/x86/kernel/es7000_32.o] Error 1
we applied one patch out of order...
| commit a73aaedd95
| Author: Yinghai Lu <yhlu.kernel@gmail.com>
| Date: Sun Sep 14 02:33:14 2008 -0700
|
| x86: check dsdt before find oem table for es7000, v2
|
| v2: use __acpi_unmap_table()
that patch need:
x86: use early_ioremap in __acpi_map_table
x86: always explicitly map acpi memory
acpi: remove final __acpi_map_table mapping before setting acpi_gbl_permanent_mmap
acpi/x86: introduce __apci_map_table, v4
submitted to the ACPI tree but not upstream yet.
fix it until those patches applied, need to revert this one
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix a problem where ds_request() returned an error without releasing the
ds lock.
Reported-by: Stephane Eranian <eranian@gmail.com>
Signed-off-by: Markus Metzger <markus.t.metzger@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds the support for dynamic tracing on the function return tracer.
The whole difference with normal dynamic function tracing is that we don't need
to hook on a particular callback. The only pro that we want is to nop or set
dynamically the calls to ftrace_caller (which is ftrace_return_caller here).
Some security checks ensure that we are not trying to launch dynamic tracing for
return tracing while normal function tracing is already running.
An example of trace with getnstimeofday set as a filter:
ktime_get_ts+0x22/0x50 -> getnstimeofday (2283 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1396 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1825 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1426 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1464 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1524 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1382 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1434 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1464 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1502 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1404 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1397 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1051 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1314 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1344 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1163 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1390 ns)
ktime_get_ts+0x22/0x50 -> getnstimeofday (1374 ns)
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix possible race condition in ftrace function return tracer
This fixes a possible race condition if index incrementation
is not immediately flushed in memory.
Thanks for Andi Kleen and Steven Rostedt for pointing out this issue
and give me this solution.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: allow archs more flexibility on dynamic ftrace implementations
Dynamic ftrace has largly been developed on x86. Since x86 does not
have the same limitations as other architectures, the ftrace interaction
between the generic code and the architecture specific code was not
flexible enough to handle some of the issues that other architectures
have.
Most notably, module trampolines. Due to the limited branch distance
that archs make in calling kernel core code from modules, the module
load code must create a trampoline to jump to what will make the
larger jump into core kernel code.
The problem arises when this happens to a call to mcount. Ftrace checks
all code before modifying it and makes sure the current code is what
it expects. Right now, there is not enough information to handle modifying
module trampolines.
This patch changes the API between generic dynamic ftrace code and
the arch dependent code. There is now two functions for modifying code:
ftrace_make_nop(mod, rec, addr) - convert the code at rec->ip into
a nop, where the original text is calling addr. (mod is the
module struct if called by module init)
ftrace_make_caller(rec, addr) - convert the code rec->ip that should
be a nop into a caller to addr.
The record "rec" now has a new field called "arch" where the architecture
can add any special attributes to each call site record.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit e51af66308, which was
wrongly hoovered up and submitted about a month after a better fix had
already been merged.
The better fix is commit cbda1ba898
("PCI/iommu: blacklist DMAR on Intel G31/G33 chipsets"), where we do
this blacklisting based on the DMI identification for the offending
motherboard, since sometimes this chipset (or at least a chipset with
the same PCI ID) apparently _does_ actually have an IOMMU.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: Fix interrupt via the apicinterrupt macro
Checkin 939b787130 changed the
"interrupt" macro, but the "interrupt" macro is also invoked
indirectly from the "apicinterrupt" macro.
The "apicinterrupt" macro probably should have its own collection of
systematic stubs for the same reason the main IRQ code does; as is it
is a huge amount of replicated code.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Optimize a bit the function return tracer
This patch changes the calling convention of prepare_ftrace_return to
pass its arguments by register. This will optimize it a bit and
prepare it to support dynamic tracing.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove spinlocks and irq disabling in function return tracer.
I've tried to figure out all of the race condition that could happen
when the tracer pushes or pops a return address trace to/from the
current thread_info.
Theory:
_ One thread can only execute on one cpu at a time. So this code
doesn't need to be SMP-safe. Just drop the spinlock.
_ The only race could happen between the current thread and an
interrupt. If an interrupt is raised, it will increase the index of
the return stack storage and then execute until the end of the
tracing to finally free the index it used. We don't need to disable
irqs.
This is theorical. In practice, I've tested it with a two-core SMP and
had no problem at all. Perhaps -tip testing could confirm it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: name change of unlikely tracer and profiler
Ingo Molnar suggested changing the config from UNLIKELY_PROFILE
to BRANCH_PROFILING. I never did like the "unlikely" name so I
went one step farther, and renamed all the unlikely configurations
to a "BRANCH" variant.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add debug check
If none of the perfctrs are free when calculating cpu_khz we default to
using ctr 3 (ie, we just choose 3). This may lead to an incorrect tsc
freq value which can cause the system to be unstable.
To aid in future debugging, WARN the user of a potential problem.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates/2.6.28' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
KVM: Fix pit memory leak if unable to allocate irq source id
KVM: ia64: fix vmm_spin_{un}lock for !CONFIG_SMP
KVM: VMX: Set IGMT bit in EPT entry
KVM: Require the PCI subsystem
x86: KVM guest: fix section mismatch warning in kvmclock.c
KVM: ia64: Use guest signal mask when blocking
KVM: MMU: increase per-vcpu rmap cache alloc size
Older versions of gas don't implement the C-style != operator, they
instead want the Pascal-style <> operator. Change != to <> so we
don't break compilation with those old versions of gas.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (47 commits)
ACPI: pci_link: remove acpi_irq_balance_set() interface
fujitsu-laptop: Add DMI callback for Lifebook S6420
ACPI: EC: Don't do transaction from GPE handler in poll mode.
ACPI: EC: lower interrupt storm treshold
ACPICA: Use spinlock for acpi_{en|dis}able_gpe
ACPI: EC: restart failed command
ACPI: EC: wait for last write gpe
ACPI: EC: make kernel messages more useful when GPE storm is detected
ACPI: EC: revert msleep patch
thinkpad_acpi: fingers off backlight if video.ko is serving this functionality
sony-laptop: fingers off backlight if video.ko is serving this functionality
msi-laptop: fingers off backlight if video.ko is serving this functionality
fujitsu-laptop: fingers off backlight if video.ko is serving this functionality
eeepc-laptop: fingers off backlight if video.ko is serving this functionality
compal: fingers off backlight if video.ko is serving this functionality
asus-acpi: fingers off backlight if video.ko is serving this functionality
Acer-WMI: fingers off backlight if video.ko is serving this functionality
ACPI video: if no ACPI backlight support, use vendor drivers
ACPI: video: Ignore devices that aren't present in hardware
Delete an unwanted return statement at evgpe.c
...
Impact: make nmi_shootdown_cpus() callable from preemptible context
We need to know on which CPU we are running on, and we don't want to be
preempted while doing this.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen nmi_shootdown_cpus() availability
The X86_LOCAL_APIC #ifdef was for kdump. For !SMP, the function simply
does nothing.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make nmi_shootdown_cpus() available to the rest of the x86 platform
Now nmi_shootdown_cpus() is ready to be used by non-kdump code also.
Move it to reboot.c.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make API available to the rest of x86 platform code
Add prototype to asm/reboot.h.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend nmi_shootdown_cpus() with a callback
The reboot code will use a different function on crash_nmi_callback().
Adding a function pointer parameter to nmi_shootdown_cpus() for that.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
For the kdump-specific code that was living on nmi_shootdown_cpus().
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This variable will be moved to non-kdump-specific code.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The NMI CPU-halting code will be used on non-kdump cases, also
(e.g. emergency_reboot when virtualization is enabled).
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bootup crash
the branch tracer missed arch/x86/vdso/vclock_gettime.c from
disabling tracing, which caused such bootup crashes:
[ 201.840097] init[1]: segfault at 7fffed3fe7c0 ip 00007fffed3fea2e sp 000077
also clean up the ugly ifdefs in arch/x86/kernel/vsyscall_64.c by
creating DISABLE_UNLIKELY_PROFILE facility for code to turn off
instrumentation on a per file basis.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new unlikely/likely profiler
Andrew Morton recently suggested having an in-kernel way to profile
likely and unlikely macros. This patch achieves that goal.
When configured, every(*) likely and unlikely macro gets a counter attached
to it. When the condition is hit, the hit and misses of that condition
are recorded. These numbers can later be retrieved by:
/debugfs/tracing/profile_likely - All likely markers
/debugfs/tracing/profile_unlikely - All unlikely markers.
# cat /debug/tracing/profile_unlikely | head
correct incorrect % Function File Line
------- --------- - -------- ---- ----
2167 0 0 do_arch_prctl process_64.c 832
0 0 0 do_arch_prctl process_64.c 804
2670 0 0 IS_ERR err.h 34
71230 5693 7 __switch_to process_64.c 673
76919 0 0 __switch_to process_64.c 639
43184 33743 43 __switch_to process_64.c 624
12740 64181 83 __switch_to process_64.c 594
12740 64174 83 __switch_to process_64.c 590
# cat /debug/tracing/profile_unlikely | \
awk '{ if ($3 > 25) print $0; }' |head -20
44963 35259 43 __switch_to process_64.c 624
12762 67454 84 __switch_to process_64.c 594
12762 67447 84 __switch_to process_64.c 590
1478 595 28 syscall_get_error syscall.h 51
0 2821 100 syscall_trace_leave ptrace.c 1567
0 1 100 native_smp_prepare_cpus smpboot.c 1237
86338 265881 75 calc_delta_fair sched_fair.c 408
210410 108540 34 calc_delta_mine sched.c 1267
0 54550 100 sched_info_queued sched_stats.h 222
51899 66435 56 pick_next_task_fair sched_fair.c 1422
6 10 62 yield_task_fair sched_fair.c 982
7325 2692 26 rt_policy sched.c 144
0 1270 100 pre_schedule_rt sched_rt.c 1261
1268 48073 97 pick_next_task_rt sched_rt.c 884
0 45181 100 sched_info_dequeued sched_stats.h 177
0 15 100 sched_move_task sched.c 8700
0 15 100 sched_move_task sched.c 8690
53167 33217 38 schedule sched.c 4457
0 80208 100 sched_info_switch sched_stats.h 270
30585 49631 61 context_switch sched.c 2619
# cat /debug/tracing/profile_likely | awk '{ if ($3 > 25) print $0; }'
39900 36577 47 pick_next_task sched.c 4397
20824 15233 42 switch_mm mmu_context_64.h 18
0 7 100 __cancel_work_timer workqueue.c 560
617 66484 99 clocksource_adjust timekeeping.c 456
0 346340 100 audit_syscall_exit auditsc.c 1570
38 347350 99 audit_get_context auditsc.c 732
0 345244 100 audit_syscall_entry auditsc.c 1541
38 1017 96 audit_free auditsc.c 1446
0 1090 100 audit_alloc auditsc.c 862
2618 1090 29 audit_alloc auditsc.c 858
0 6 100 move_masked_irq migration.c 9
1 198 99 probe_sched_wakeup trace_sched_switch.c 58
2 2 50 probe_wakeup trace_sched_wakeup.c 227
0 2 100 probe_wakeup_sched_switch trace_sched_wakeup.c 144
4514 2090 31 __grab_cache_page filemap.c 2149
12882 228786 94 mapping_unevictable pagemap.h 50
4 11 73 __flush_cpu_slab slub.c 1466
627757 330451 34 slab_free slub.c 1731
2959 61245 95 dentry_lru_del_init dcache.c 153
946 1217 56 load_elf_binary binfmt_elf.c 904
102 82 44 disk_put_part genhd.h 206
1 1 50 dst_gc_task dst.c 82
0 19 100 tcp_mss_split_point tcp_output.c 1126
As you can see by the above, there's a bit of work to do in rethinking
the use of some unlikelys and likelys. Note: the unlikely case had 71 hits
that were more than 25%.
Note: After submitting my first version of this patch, Andrew Morton
showed me a version written by Daniel Walker, where I picked up
the following ideas from:
1) Using __builtin_constant_p to avoid profiling fixed values.
2) Using __FILE__ instead of instruction pointers.
3) Using the preprocessor to stop all profiling of likely
annotations from vsyscall_64.c.
Thanks to Andrew Morton, Arjan van de Ven, Theodore Tso and Ingo Molnar
for their feed back on this patch.
(*) Not ever unlikely is recorded, those that are used by vsyscalls
(a few of them) had to have profiling disabled.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This removes the acpi_irq_balance_set() interface from the PCI
interrupt link driver.
x86 used acpi_irq_balance_set() to tell the PCI interrupt link
driver to configure links to minimize IRQ sharing. But the link
driver can easily figure out whether to turn on IRQ balancing
based on the IRQ model (PIC/IOAPIC/etc), so we can get rid of
that external interface.
It's better for the driver to figure this out at init-time. If
we set it externally via the x86 code, the interface reduces
modularity, and we depend on the fact that acpi_process_madt()
happens before we process the kernel command line.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: Changes reboot behavior.
If port CF9 seems to be safe to touch, attempt it before trying the
keyboard controller. Port CF9 is not available on all chipsets (a
significant but decreasing number of modern chipsets don't implement
it), but port CF9 itself should in general be safe to poke (no ill
effects if unimplemented) on any system which has PCI Configuration
Method #1 or #2, as it falls inside the PCI configuration port range
in both cases. No chipset without PCI is known to have port CF9,
either, although an explicit "pci=bios" would mean we miss this and
therefore don't use port CF9. An explicit "reboot=pci" can be used to
force the use of port CF9.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Move the IRQ stub generation to assembly to simplify it and for
consistency with 32 bits. Doing it in a C file with asm() statements
doesn't help clarity, and it prevents some optimizations.
Shrink the IRQ stubs down to just over four bytes per (we fit seven
into a 32-byte chunk.) This shrinks the total icache consumption of
the IRQ stubs down to an even kilobyte, if all of them are in active
use.
The downside is that we end up with a double jump, which could have a
negative effect on some pipelines. The double jump is always inside
the same cacheline on any modern chips.
To get the most effect, cache-align the IRQ stubs.
This makes the 64-bit code match changes already done to the 32-bit
code, and should open up irqinit*.c for unification.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Shrink the IRQ stubs on 32 bits down to just over four bytes per (we
fit seven into a 32-byte chunk.) This shrinks the total icache
consumption of the IRQ stubs down to an even kilobyte, if all of them
are in active use.
The downside is that we end up with a double jump, which could have a
negative effect on some pipelines. The double jump is always inside
the same cacheline on any modern chips (the exception being
486/Elan/Geode which have only 16-byte cachelines, but are unlikely to
have too many interrupt sources.)
To get the most effect, cache-align the IRQ stubs.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Don't generate interrupt stubs for interrupt vectors below
FIRST_EXTERNAL_VECTOR, and make the table of interrupt vectors
(interrupt[]) __initconst. Both of these changes both conserve memory
and improve consistency with 64 bits.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
WARNING: arch/x86/kernel/built-in.o(.text+0x1722c): Section mismatch
in reference from the function kvm_setup_secondary_clock() to the
function .devinit.text:setup_secondary_APIC_clock()
The function kvm_setup_secondary_clock() references
the function __devinit setup_secondary_APIC_clock().
This is often because kvm_setup_secondary_clock lacks a __devinit
annotation or the annotation of setup_secondary_APIC_clock is wrong.
Signed-off-by: Md.Rakib H. Mullick <rakib.mullick@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Impact: avoid spurious lapic timer events on shutdown
The apic timer might be close to firing when it is shutdown. We can
not really disable the timer - we just mask the interrupt. That way we
can get an extra interrupt when it is reenabled. Set the counter to
max on shutdown to avoid this.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: really halt all CPUs on halt
Function machine_halt (resp. native_machine_halt) is empty for x86
architectures. When command 'halt -f' is invoked, the message "System
halted." is displayed but this is not really true because all CPUs are
still running.
There are also similar inconsistencies for other arches (some uses
power-off for halt or forever-loop with IRQs enabled/disabled).
IMO there should be used the same approach for all architectures OR
what does the message "System halted" really mean?
This patch fixes it for x86.
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/ftrace.c: In function 'ftrace_return_to_handler':
arch/x86/kernel/ftrace.c:112: error: implicit declaration of function 'cpu_clock'
cpu_clock() is implicitly included via a number of ways, but its real
location is sched.h. (Build failure is triggerable if enough other
kernel components are turned off.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix double entry creation in /proc
There is a collision between two UV functions:
both uv_ptc_init() and gru_proc_init() try to make /proc/sgi_uv
So move it's creation to a single place: uv_system_init()
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/ftrace.c: Assembler messages:
arch/x86/kernel/ftrace.c:140: Error: missing ')'
arch/x86/kernel/ftrace.c:140: Error: junk `(%ebp))' after expression
arch/x86/kernel/ftrace.c:141: Error: missing ')'
arch/x86/kernel/ftrace.c:141: Error: junk `(%ebp))' after expression
the [parent_replaced] is used in an =rm fashion, so that constraint
is correct in isolation - but [parent_old] aliases register %0 and uses
it in an addressing mode that is only valid with registers - so change
the constraint from =rm to =r.
This fixes the build failure.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add infrastructure for function-return tracing
Add low level support for ftrace return tracing.
This plug-in stores return addresses on the thread_info structure of
the current task.
The index of the current return address is initialized when the task
is the first one (init) and when a process forks (the child). It is
not needed when a task does a sys_execve because after this syscall,
it still needs to return on the kernel functions it called.
Note that the code of return_to_handler has been suggested by Steven
Rostedt as almost all of the ideas of improvements in this V3.
For purpose of security, arch/x86/kernel/process_32.c is not traced
because __switch_to() changes the current task during its execution.
That could cause inconsistency in the stored return address of this
function even if I didn't have any crash after testing with tracing on
this function enabled.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some functions that may be called from this handler require that
interrupts are disabled. Also, combining IRQF_DISABLED and
IRQF_SHARED does not reliably disable interrupts in a handler, so
remove IRQF_SHARED from the irq flags (this irq is not shared anyway).
Signed-off-by: Matt Fleming <mjf@gentoo.org>
Cc: mingo@elte.hu
Cc: venkatesh.pallipadi@intel.com
Cc: "Will Newton" <will.newton@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In hpet_next_event() we check that the value we just wrote to
HPET_Tn_CMP(timer) has reached the chip. Currently, we're checking that
the value we wrote to HPET_Tn_CMP(timer) is in HPET_T0_CMP, which, if
timer is anything other than timer 0, is likely to fail.
Signed-off-by: Matt Fleming <mjf@gentoo.org>
Cc: mingo@elte.hu
Cc: venkatesh.pallipadi@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
It is possible to flood the console with call traces if the WARN_ON
condition is true because of the frequency with which this function is
called.
Signed-off-by: Matt Fleming <mjf@gentoo.org>
Cc: mingo@elte.hu
Cc: venkatesh.pallipadi@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: cleanup
It saves us some source lines and shift the code a bit righter.
And a multiline comment style is fixed too :-)
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
lapic_timer_setup is self-protected with local_irq_save/restore
no need to use them in caller and levt is the per-cpu variable so
no concurrent access from another cpu.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: widen BTS/PEBS ptrace enablement to more CPU models
Move BTS initialisation out of an #ifdef CONFIG_X86_64 guard.
Assume core2 BTS and DS layout for future models of family 6 processors.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
setup_ioapic_dest() is called after the non boot cpus have been
brought up. It sets the irq affinity of all already configured
interrupts to all cpus and ignores affinity settings which were
done by the early bootup code.
If the IRQ_NO_BALANCING or IRQ_AFFINITY_SET flags are set then use the
affinity mask from the irq descriptor and not TARGET_CPUS.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Move rdtsc_barrier() use to vsyscall_64.c where it's relied on,
and point out its role in the context of its use.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched_clock() uses cycles_2_ns() needlessly - which is an irq-disabling
variant of __cycles_2_ns().
Most of the time sched_clock() is called with irqs disabled already.
The few places that call it with irqs enabled need to be updated.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit c7ffa6c262.
the assumptio of this change was that this would not break
any existing machine. Andrey Borzenkov reported troubles with
the ACPI reboot method: the system would hang on reboot, necessiating
a power cycle. Probably more systems are affected as well.
Also, there are patches queued up for v2.6.29 to disable virtualization
on emergency_restart() - which was the original motivation of
this change.
Reported-by: Andrey Borzenkov <arvidjaar@mail.ru>
Bisected-by: Andrey Borzenkov <arvidjaar@mail.ru>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Lazy flushing needs to take care of the unmap path too which is not yet
implemented and leads to stale IO/TLB entries. This is fixed by this
patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Impact: fix rare x2apic hang
On x86, x2apic mode accesses for sending IPI's don't have serializing
semantics. If the IPI receivner refers(in lock-free fashion) to some
memory setup by the sender, the need for smp_mb() before sending the
IPI becomes critical in x2apic mode.
Add the smp_mb() in native_flush_tlb_others() before sending the IPI.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
set fpstate field of signal context at setup_sigcontext().
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: quick start and stop of function tracer
This patch adds a way to disable the function tracer quickly without
the need to run kstop_machine. It adds a new variable called
function_trace_stop which will stop the calls to functions from mcount
when set. This is just an on/off switch and does not handle recursion
like preempt_disable().
It's main purpose is to help other tracers/debuggers start and stop tracing
fuctions without the need to call kstop_machine.
The config option HAVE_FUNCTION_TRACE_MCOUNT_TEST is added for archs
that implement the testing of the function_trace_stop in the mcount
arch dependent code. Otherwise, the test is done in the C code.
x86 is the only arch at the moment that supports this.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix boot hang on 32-bit systems with more than 224 IO-APIC pins
On some 32-bit systems with a lot of IO-APICs probe_nr_irqs() can
return a value larger than NR_IRQS. This will lead to probe_irq_on()
overrunning the irq_desc array.
I hit this when running net-next-2.6 (close to 2.6.28-rc3) on a
Supermicro dual Xeon system. NR_IRQS is 224 but probe_nr_irqs() detects
5 IOAPICs and returns 240. Here are the log messages:
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
Tue Nov 4 16:53:47 2008 IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x02] address[0xfec81000] gsi_base[24])
Tue Nov 4 16:53:47 2008 IOAPIC[1]: apic_id 2, version 32, address 0xfec81000, GSI 24-47
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x03] address[0xfec81400] gsi_base[48])
Tue Nov 4 16:53:47 2008 IOAPIC[2]: apic_id 3, version 32, address 0xfec81400, GSI 48-71
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x04] address[0xfec82000] gsi_base[72])
Tue Nov 4 16:53:47 2008 IOAPIC[3]: apic_id 4, version 32, address 0xfec82000, GSI 72-95
Tue Nov 4 16:53:47 2008 ACPI: IOAPIC (id[0x05] address[0xfec82400] gsi_base[96])
Tue Nov 4 16:53:47 2008 IOAPIC[4]: apic_id 5, version 32, address 0xfec82400, GSI 96-119
Tue Nov 4 16:53:47 2008 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Tue Nov 4 16:53:47 2008 ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Tue Nov 4 16:53:47 2008 Enabling APIC mode: Flat. Using 5 I/O APICs
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Should permit VMware detection on older platforms where the
vendor is changed. Could theoretically cause a regression if some
weird serial number scheme contains the string "VMware" by pure
chance. Seems unlikely, especially with the mixed case.
In some user configured cases, VMware may choose not to put a VMware specific
DMI string, but the product serial key is always there and is VMware specific.
Add a interface to check the serial key, when checking for VMware in the DMI
information.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: do not do function-tracing in the early-printk code
this is useful when earlyprintk=vga,keep is used to debug tracer
plugins.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix udelay when "notsc" boot parameter is passed
With notsc passed on commandline, tsc may not be used for
udelays, make sure that we do not use tsc_khz to calculate
the lpj value in such cases.
Reported-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Fix possible failure to calibrate the TSC on Vmware near 4 GHz
The current version of the code to get the tsc frequency from
the VMware hypervisor, will be broken on processor with frequency
(4G-1) HZ, because on such processors eax will have UINT_MAX
and that would be legitimate.
We instead check that EBX did change to decide if we were able to
read the frequency from the hypervisor.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Changes timekeeping on Vmware (or with tsc=reliable).
This is achieved by resetting the CLOCKSOURCE_MUST_VERIFY flag.
We add a tsc=reliable commandline option to enable this.
This enables legacy hardware without HPET, LAPIC, or ACPI timers
to enter high-resolution timer mode.
Along with that have extended this to be used in virtualization environement
too. Now we also set this flag if the X86_FEATURE_TSC_RELIABLE bit is set.
This is important since there is a wrap-around problem with the acpi_pm timer.
The acpi_pm counter is just 24bits and this can overflow in ~4 seconds. With
the NO_HZ kernels in virtualized environment, there can be situations when
the guest is descheduled for longer duration, as a result we may miss the wrap
of the acpi counter. When TSC is used as a clocksource and acpi_pm timer is
being used as the watchdog clocksource this error in acpi_pm results in TSC
being marked as unstable, and essentially results in time dropping in chunks
of 4 seconds whenever this wrap is missed. Since the virtualized TSC is
reliable on VMware, we should always use the TSCs clocksource on VMware, so
we skip the verfication at runtime, by checking for the feature bit.
Since we reset the flag for mgeode systems too, i have combined
the mgeode case with the feature bit check.
Signed-off-by: Jeff Hansen <jhansen@cardaccess-inc.com>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Changes timebase calibration on Vmware.
Use the synthetic TSC_RELIABLE bit to workaround virtualization anomalies.
Virtual TSCs can be kept nearly in sync, but because the virtual TSC
offset is set by software, it's not perfect. So, the TSC
synchronization test can fail. Even then the TSC can be used as a
clocksource since the VMware platform exports a reliable TSC to the
guest for timekeeping purposes. Use this bit to check if we need to
skip the TSC sync checks.
Along with this also set the CONSTANT_TSC bit when on VMware, since we
still want to use TSC as clocksource on VM running over hardware which
has unsynchronized TSC's (opteron's), since the hypervisor will take
care of providing consistent TSC to the guest.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Changes timebase calibration on Vmware.
v3->v2 : Abstract the hypervisor detection and feature (tsc_freq) request
behind a hypervisor.c file
v2->v1 : Add a x86_hyper_vendor field to the cpuinfo_x86 structure.
This avoids multiple calls to the hypervisor detection function.
This patch adds function to detect if we are running under VMware.
The current way to check if we are on VMware is following,
# check if "hypervisor present bit" is set, if so read the 0x40000000
cpuid leaf and check for "VMwareVMware" signature.
# if the above fails, check the DMI vendors name for "VMware" string
if we find one we query the VMware hypervisor port to check if we are
under VMware.
The DMI + "VMware hypervisor port check" is needed for older VMware products,
which don't implement the hypervisor signature cpuid leaf.
Also note that since we are checking for the DMI signature the hypervisor
port should never be accessed on native hardware.
This patch also adds a hypervisor_get_tsc_freq function, instead of
calibrating the frequency which can be error prone in virtualized
environment, we ask the hypervisor for it. We get the frequency from
the hypervisor by accessing the hypervisor port if we are running on VMware.
Other hypervisors too can add code to the generic routine to get frequency on
their platform.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Dan Hecht <dhecht@vmware.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix AMDC1E and XTOPOLOGY conflict in cpufeature
x86: build fix
This makes the late e820 resources use 'insert_resource_expand_to_fit()'
instead of doing a 'reserve_region_with_split()', and also avoids
marking them as IORESOURCE_BUSY.
This results in us being perfectly happy to use pre-existing PCI
resources even if they were marked as being in a reserved region, while
still avoiding any _new_ allocations in the reserved regions. It also
makes for a simpler and more accurate resource tree.
Example resource allocation from Jonathan Corbet, who has firmware that
has an e820 reserved entry that covered a big range (e0000000-fed003ff),
and that had various PCI resources in it set up by firmware.
With old kernels, the reserved range would force us to re-allocate all
pre-existing PCI resources, and his reserved range would end up looking
like this:
e0000000-fed003ff : reserved
fec00000-fec00fff : IOAPIC 0
fed00000-fed003ff : HPET 0
where only the pre-allocated special regions (IOAPIC and HPET) were kept
around.
With 2.6.28-rc2, which uses 'reserve_region_with_split()', Jonathan's
resource tree looked like this:
e0000000-fe7fffff : reserved
fe800000-fe8fffff : PCI Bus 0000:01
fe800000-fe8fffff : reserved
fe900000-fe9d9aff : reserved
fe9d9b00-fe9d9bff : 0000:00:1f.3
fe9d9b00-fe9d9bff : reserved
fe9d9c00-fe9d9fff : 0000:00:1a.7
fe9d9c00-fe9d9fff : reserved
fe9da000-fe9dafff : 0000:00:03.3
fe9da000-fe9dafff : reserved
fe9db000-fe9dbfff : 0000:00:19.0
fe9db000-fe9dbfff : reserved
fe9dc000-fe9dffff : 0000:00:1b.0
fe9dc000-fe9dffff : reserved
fe9e0000-fe9fffff : 0000:00:19.0
fe9e0000-fe9fffff : reserved
fea00000-fea7ffff : 0000:00:02.0
fea00000-fea7ffff : reserved
fea80000-feafffff : 0000:00:02.1
fea80000-feafffff : reserved
feb00000-febfffff : 0000:00:02.0
feb00000-febfffff : reserved
fec00000-fed003ff : reserved
fec00000-fec00fff : IOAPIC 0
fed00000-fed003ff : HPET 0
and because the reserved entry had been split and moved into the
individual resources, and because it used the IORESOURCE_BUSY flag, the
drivers that actually wanted to _use_ those resources couldn't actually
attach to them:
e1000e 0000:00:19.0: BAR 0: can't reserve mem region [0xfe9e0000-0xfe9fffff]
HDA Intel 0000:00:1b.0: BAR 0: can't reserve mem region [0xfe9dc000-0xfe9dffff]
with this patch, the resource tree instead becomes
e0000000-fed003ff : reserved
fe800000-fe8fffff : PCI Bus 0000:01
fe9d9b00-fe9d9bff : 0000:00:1f.3
fe9d9c00-fe9d9fff : 0000:00:1a.7
fe9d9c00-fe9d9fff : ehci_hcd
fe9da000-fe9dafff : 0000:00:03.3
fe9db000-fe9dbfff : 0000:00:19.0
fe9db000-fe9dbfff : e1000e
fe9dc000-fe9dffff : 0000:00:1b.0
fe9dc000-fe9dffff : ICH HD audio
fe9e0000-fe9fffff : 0000:00:19.0
fe9e0000-fe9fffff : e1000e
fea00000-fea7ffff : 0000:00:02.0
fea80000-feafffff : 0000:00:02.1
feb00000-febfffff : 0000:00:02.0
fec00000-fec00fff : IOAPIC 0
fed00000-fed003ff : HPET 0
ie the one reserved region now ends up surrounding all the PCI resources
that were allocated inside of it by firmware, and because it is not
marked BUSY, drivers have no problem attaching to the pre-allocated
resources.
Reported-and-tested-by: Jonathan Corbet <corbet@lwn.net>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robert Hancock <hancockr@shaw.ca>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup
This patch cleans up the NMI safe code for dynamic ftrace as suggested
by Andrew Morton.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change the kexec bootstrap code implementation from assembly to C
This patch transforms the kexec page tables setup code from assembler
code to C code in machine_kexec_prepare. This improves readability and
reduces code line number.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save .text size when kexec is built in but not loaded
This patch adds an architecture specific struct kimage_arch into
struct kimage. The pointers to page table pages used by kexec are
added to struct kimage_arch. The page tables pages are dynamically
allocated in machine_kexec_prepare instead of statically from BSS
segment. This will save up to 20k memory when kexec image is not
loaded.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: save kernel .text by loosening kexec page alignment
This patch removes PAGE_SIZE alignment from relocate_kernel(). Before
kexec jump patches are merged, control page is mapped to
relocate_kernel in kexec page tables, so relocate_kernel must be
PAGE_SIZE aligned. Now, control page is mapped to identity mapped
address, so relocate_kernel need not to be PAGE_SIZE aligned any
more. This can reduce a few KB from kernel text segement.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix on certain UP configs
fix:
arch/x86/kernel/cpu/common.c: In function 'cpu_init':
arch/x86/kernel/cpu/common.c:1141: error: 'boot_cpu_id' undeclared (first use in this function)
arch/x86/kernel/cpu/common.c:1141: error: (Each undeclared identifier is reported only once
arch/x86/kernel/cpu/common.c:1141: error: for each function it appears in.)
Pull in asm/smp.h on UP, so that we get the definition of
boot_cpu_id.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix:
arch/x86/kernel/cpu/common.c: In function 'early_identify_cpu':
arch/x86/kernel/cpu/common.c:553: error: 'struct cpuinfo_x86' has no member named 'cpu_index'
as cpu_index is only available on SMP.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix /proc/cpuinfo output on x86/Voyager
Ever since
| commit 92cb7612ae
| Author: Mike Travis <travis@sgi.com>
| Date: Fri Oct 19 20:35:04 2007 +0200
|
| x86: convert cpuinfo_x86 array to a per_cpu array
We've had an extra field in cpuinfo_x86 which is cpu_index.
Unfortunately, voyager has never initialised this, although the only
noticeable impact seems to be that /proc/cpuinfo shows all zeros for
the processor ids.
Anyway, fix this by initialising the boot CPU properly and setting the
index when the secondaries update.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix on x86/Voyager
Given commits like this:
| Author: Suresh Siddha <suresh.b.siddha@intel.com>
| Date: Tue Jul 29 10:29:19 2008 -0700
|
| x86, xsave: enable xsave/xrstor on cpus with xsave support
Which deliberately expose boot cpu dependence to pieces of the system,
I think it's time to explicitly have a variable for it to prevent this
continual misassumption that the boot CPU is zero.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix x86/Voyager boot
CONFIG_SMP is used for features which work on *all* x86 boxes.
CONFIG_X86_SMP is used for standard PC like x86 boxes (for things like
multi core and apics)
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add more debug info to /debugfs/tracing/dyn_ftrace_total_info
This patch adds dynamic ftrace NMI update statistics to the
/debugfs/tracing/dyn_ftrace_total_info stat file.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crashes that can occur in NMI handlers, if their code is modified
Modifying code is something that needs special care. On SMP boxes,
if code that is being modified is also being executed on another CPU,
that CPU will have undefined results.
The dynamic ftrace uses kstop_machine to make the system act like a
uniprocessor system. But this does not address NMIs, that can still
run on other CPUs.
One approach to handle this is to make all code that are used by NMIs
not be traced. But NMIs can call notifiers that spread throughout the
kernel and this will be very hard to maintain, and the chance of missing
a function is very high.
The approach that this patch takes is to have the NMIs modify the code
if the modification is taking place. The way this works is that just
writing to code executing on another CPU is not harmful if what is
written is the same as what exists.
Two buffers are used: an IP buffer and a "code" buffer.
The steps that the patcher takes are:
1) Put in the instruction pointer into the IP buffer
and the new code into the "code" buffer.
2) Set a flag that says we are modifying code
3) Wait for any running NMIs to finish.
4) Write the code
5) clear the flag.
6) Wait for any running NMIs to finish.
If an NMI is executed, it will also write the pending code.
Multiple writes are OK, because what is being written is the same.
Then the patcher must wait for all running NMIs to finish before
going to the next line that must be patched.
This is basically the RCU approach to code modification.
Thanks to Ingo Molnar for suggesting the idea, and to Arjan van de Ven
for his guidence on what is safe and what is not.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: introduce nmi_watchdog=lapic and nmi_watchdog=ioapic aliases
Add sensible names as "lapic" and "ioapic" to
nmi_watchdog boot parameter. Sometimes it is not
that easy to recall what exactly nmi_watchdog=1
does mean so we allow the using of symbolic names here.
Old numeric values remain valid.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
ftrace: fix current_tracer error return
tracing: fix a build error on alpha
ftrace: use a real variable for ftrace_nop in x86
tracing/ftrace: make boot tracer select the sched_switch tracer
tracepoint: check if the probe has been registered
asm-generic: define DIE_OOPS in asm-generic
trace: fix printk warning for u64
ftrace: warning in kernel/trace/ftrace.c
ftrace: fix build failure
ftrace, powerpc, sparc64, x86: remove notrace from arch ftrace file
ftrace: remove ftrace hash
ftrace: remove mcount set
ftrace: remove daemon
ftrace: disable dynamic ftrace for all archs that use daemon
ftrace: add ftrace warn on to disable ftrace
ftrace: only have ftrace_kill atomic
ftrace: use probe_kernel
ftrace: comment arch ftrace code
ftrace: return error on failed modified text.
ftrace: dynamic ftrace process only text section
...
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: fix irqs on/off ip tracing
lockdep: minor fix for debug_show_all_locks()
x86: restore the old swiotlb alloc_coherent behavior
x86: use GFP_DMA for 24bit coherent_dma_mask
swiotlb: remove panic for alloc_coherent failure
xen: compilation fix of drivers/xen/events.c on IA64
xen: portability clean up and some minor clean up for xencomm.c
xen: don't reload cr3 on suspend
kernel/resource: fix reserve_region_with_split() section mismatch
printk: remove unused code from kernel/printk.c
Impact: fix AMD Family 11h boot hangs / USB device problems
The AMD Fam11h CPUs have a K8 northbridge. This northbridge is different
from other family's because it lacks GART support (as I just learned).
But the kernel implicitly expects a GART if it finds an AMD northbridge.
Fix this by removing the Fam11h northbridge id from the scan list of K8
northbridges. This patch also changes the message in the GART driver
about missing K8 northbridges to tell that the GART is missing which is
the correct information in this case.
Reported-by: Jouni Malinen <jkmalinen@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix kdump crash on 32-bit sparsemem kernels
Since linux-2.6.27, kdump has failed on i386 sparsemem kernel.
1st-kernel gets a panic just before switching to 2nd-kernel.
The cause is that a kernel accesses invalid mem_section by
page_to_pfn(image->swap_page) at machine_kexec().
image->swap_page is allocated if kexec for hibernation, but
it is not allocated if kdump. So if kdump, a kernel should
not access the mem_section corresponding to image->swap_page.
The attached patch fixes this invalid access.
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Cc: kexec-ml <kexec@lists.infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo said mtrr_cleanup() is big and ugly.
so break it up into more functions and make it more readable.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Change UV heartbeat function to use idle_cpu to determine cpu's
"idleness". Realign uv_hub definitions.
Signed-of-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
As promised, now that dumpstack_32 and dumpstack_64 have so many bits
in common, we should merge the in-sync bits into a common file, to
prevent them from diverging again.
This patch removes bits which are common between dumpstack_32.c and
dumpstack_64.c and places them in a common dumpstack.c which is built
for both 32 and 64 bit arches.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/Makefile | 2
arch/x86/kernel/dumpstack.c | 319 +++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/dumpstack.h | 39 +++++
arch/x86/kernel/dumpstack_32.c | 294 -------------------------------------
arch/x86/kernel/dumpstack_64.c | 285 ------------------------------------
5 files changed, 363 insertions(+), 576 deletions(-)
Impact: change NMI watchdog detection and disabling sequence
Currently, if the NMI watchdog fails using IOAPIC method, it'll only disable
interrupts on 8259 if the timer is passing thru it. This patch disables
NMI delivery on LINT0 if the NMI watchdog initial test fails, just for safety.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change/improve the way /proc/sys/kernel/nmi_watchdog works
This patch adds support to enable/disable IOAPIC NMI watchdog in runtime via
procfs.
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
now that the code is moved and converted to a work queue,
there's some minor cleanups that can be done.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: change the implementation of the debug feature
the periodic corruption checks are better off run from a work queue; there's
nothing time critical about them and this way the amount of
interrupt-context work is reduced.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The corruption check code is rather sizable and it's likely to grow over
time when we add checks for more types of corruptions (there's a few
candidates in kerneloops.org that I want to add checks for)... so lets move
it to its own file
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Before moving the code to it's own file, fix some style issues
in the corruption check code.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: avoid section mismatch warning, clean up
The dynamic ftrace determines which nop is safe to use at start up.
When it finds a safe nop for patching, it sets a pointer called ftrace_nop
to point to the code. All call sites are then patched to this nop.
Later, when tracing is turned on, this ftrace_nop variable is again used
to compare the location to make sure it is a nop before we update it to
an mcount call. If this fails just once, a warning is printed and ftrace
is disabled.
Rakib Mullick noted that the code that sets up the nop is a .init section
where as the nop itself is in the .text section. This is needed because
the nop is used later on after boot up. The problem is that the test of the
nop jumps back to the setup code and causes a "section mismatch" warning.
Rakib first recommended to convert the nop to .init.text, but as stated
above, this would fail since that text is used later.
The real solution is to extend Rabik's patch, and to make the ftrace_nop
into an array, and just save the code from the assembly to this array.
Now the section can stay as an init section, and we have a nop to use
later on.
Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: on SGI UV platforms, fix boot crash
UV initialization is currently called too late to call alloc_bootmem_pages().
The current sequence is:
start_kernel()
mem_init()
free_all_bootmem() <--- discard of bootmem
rest_init()
kernel_init()
smp_prepare_cpus()
native_smp_prepare_cpus()
uv_system_init() <--- uses alloc_bootmem_pages()
It should be calling kmalloc().
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: start per CPU heartbeat LED timers on SGI UV systems
The SGI UV system has no LEDS but uses one of the system controller
regs to indicate the online internal state of the cpu. There is a
heartbeat bit indicating that the cpu is responding to interrupts,
and an idle bit indicating whether the cpu is idle when the heartbeat
interrupt occurs. The current period is one second.
When a cpu panics, an error code is written by BIOS to this same reg.
This patchset provides the following:
* x86_64: Add base functionality for writing to the specific SCIR's
for each cpu.
* heartbeat: Invert "heartbeat" bit to indicate the cpu is
"interruptible". If the current thread is the idle thread,
then indicate system is "idle".
* if hotplug enabled, all bits are set (0xff) when the cpu is disabled.
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Thu, Oct 23, 2008 at 04:09:52PM -0700, Alexander Beregalov wrote:
> arch/x86/kernel/built-in.o: In function `iommu_setup':
> pci-dma.c:(.init.text+0x36ad): undefined reference to `forbid_dac'
> pci-dma.c:(.init.text+0x36cc): undefined reference to `forbid_dac'
> pci-dma.c:(.init.text+0x3711): undefined reference to `forbid_dac
This patch partially reverts a patch to add IOMMU support to ia64. The
forbid_dac variable was incorrectly moved to quirks.c, which isn't built
when PCI is disabled.
Tested-by: "Alexander Beregalov" <a.beregalov@gmail.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This restores the old swiotlb alloc_coherent behavior (before the
alloc_coherent rewrite):
http://lkml.org/lkml/2008/8/12/200
The old alloc_coherent avoids GFP_DMA allocation first and if the
allocated address is not fit for the device's coherent_dma_mask, then
dma_alloc_coherent does GFP_DMA allocation. If it fails,
alloc_coherent calls swiotlb_alloc_coherent (in short, we rarely used
swiotlb_alloc_coherent).
After the alloc_coherent rewrite, dma_alloc_coherent
(include/asm-x86/dma-mapping.h) directly calls swiotlb_alloc_coherent.
It means that we possibly can't handle a device having dma_masks >
24bit < 32bits since swiotlb_alloc_coherent doesn't have the above
GFP_DMA retry mechanism.
This patch fixes x86's swiotlb alloc_coherent to use the GFP_DMA retry
mechanism, which dma_generic_alloc_coherent() provides now
(pci-nommu.c and GART IOMMU driver also use
dma_generic_alloc_coherent).
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix section mismatch warning - apic_x2apic_phys
x86: fix section mismatch warning - apic_x2apic_cluster
x86: fix section mismatch warning - apic_x2apic_uv_x
x86: fix section mismatch warning - apic_physflat
x86: fix section mismatch warning - apic_flat
x86: memtest fix use of reserve_early()
x86 syscall.h: fix argument order
x86/tlb_uv: remove strange mc146818rtc include
x86: remove redundant KERN_DEBUG on pr_debug
x86: do_boot_cpu - check if we have ESR register
x86: MAINTAINERS change for AMD microcode patch loader
x86/proc: fix /proc/cpuinfo cpu offline bug
x86: call dmi-quirks for HP Laptops after early-quirks are executed
x86, kexec: fix hang on i386 when panic occurs while console_sem is held
MCE: Don't run 32bit machine checks with interrupts on
x86: SB600: skip IRQ0 override if it is not routed to INT2 of IOAPIC
x86: make variables static
* 'x86/um-header' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits)
x86: canonicalize remaining header guards
x86: drop double underscores from header guards
x86: Fix ASM_X86__ header guards
x86, um: get rid of uml-config.h
x86, um: get rid of arch/um/Kconfig.arch
x86, um: get rid of arch/um/os symlink
x86, um: get rid of excessive includes of uml-config.h
x86, um: get rid of header symlinks
x86, um: merge Kconfig.i386 and Kconfig.x86_64
x86, um: get rid of sysdep symlink
x86, um: trim the junk from uml ptrace-*.h
x86, um: take vm-flags.h to sysdep
x86, um: get rid of uml asm/arch
x86, um: get rid of uml highmem.h
x86, um: get rid of uml unistd.h
x86, um: get rid of system.h -> system.h include
x86, um: uml atomic.h is not needed anymore
x86, um: untangle uml ldt.h
x86, um: get rid of more uml asm/arch uses
x86, um: remove dead header (uml module-generic.h; never used these days)
...
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (123 commits)
dock: make dock driver not a module
ACPI: fix ia64 build warning
ACPI: hack around sysfs warning with link order
ACPI suspend: fix build warning when CONFIG_ACPI_SLEEP=n
intel_menlo: fix build warning
panasonic-laptop: fix build
ACPICA: Update version to 20080926
ACPICA: Add support for zero-length buffer-to-string conversions
ACPICA: New: Validation for predefined ACPI methods/objects
ACPICA: Fix for implicit return compatibility
ACPICA: Fixed a couple memory leaks associated with "implicit return"
ACPICA: Optimize buffer allocation procedure
ACPICA: Fix possible memory leak, error exit path
ACPICA: Fix fault after mem allocation failure in AML parser
ACPICA: Remove unused ACPI register bit definition
ACPICA: Update version to 20080829
ACPICA: Fix possible memory leak in acpi_ns_get_external_pathname
ACPICA: Cleanup for internal Reference Object
ACPICA: Update comments - no functional changes
ACPICA: Update for Reference ACPI_OPERAND_OBJECT
...
* git://git.infradead.org/iommu-2.6:
Admit to maintaining VT-d, for my sins.
dmar: fix uninitialised 'ret' variable in dmar_parse_dev()
intel-iommu: use coherent_dma_mask in alloc_coherent
amd_iommu: fix nasty bug that caused ILLEGAL_DEVICE_TABLE_ENTRY errors
intel-iommu: IA64 support
dmar: remove the quirk which disables dma-remapping when intr-remapping enabled
dmar: Use queued invalidation interface for IOTLB and context invalidation
dmar: context cache and IOTLB invalidation using queued invalidation
dmar: use spin_lock_irqsave() in qi_submit_sync()
The entire file of ftrace.c in the arch code needs to be marked
as notrace. It is much cleaner to do this from the Makefile with
CFLAGS_REMOVE_ftrace.o.
[ powerpc already had this in its Makefile. ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The arch dependent function ftrace_mcount_set was only used by the daemon
start up code. This patch removes it.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Andrew Morton suggested using the proper API for reading and writing
kernel areas that might fault.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add comments to explain what is happening in the x86 arch ftrace code.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Have the ftrace_modify_code return error values:
-EFAULT on error of reading the address
-EINVAL if what is read does not match what it expected
-EPERM if the write fails to update after a successful match.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change header guards named "ASM_X86__*" to "_ASM_X86_*" since:
a. the double underscore is ugly and pointless.
b. no leading underscore violates namespace constraints.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup only, no functionality changed
WARNING: vmlinux.o(.data+0xc008): Section mismatch in reference from the variable apic_x2apic_phys to the function .init.text:x2apic_acpi_madt_oem_check()
The variable apic_x2apic_phys references
the function __init x2apic_acpi_madt_oem_check()
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>