Use knowledge about EFLAGS layout (bits 22:63 are always 0) to distingush
EFLAGS word and kernel address in the spin lock stack frame.
Signed-off-by: Andi Kleen <ak@suse.de>
A few trivial spelling and grammar mistakes picked up in
"arch/x86_64/aperture.c", "arch/x86_64/crash.c" and
"arch/x86_64/apic.c". I think all are correct fixes but am ever aware
of my fallibility :o) This is my first patch submission so all
feedback is appreciated, esp. WRT CCing to Linus, Andi and
trivial@kernel.org, is this correct? And which is the most appropriate
kernel version to diff against? If any.
Should apply cleanly to 2.6.18-rc1
Signed-off-by: Adam Henley <adamazing@gmail.com>
Signed-off-by: Andi Kleen <ak@suse.de>
- adam
Hello,
Following my discussion with Andi. Here is a patch that introduces
two new TIF flags to simplify the context switch code in __switch_to().
The idea is to minimize the number of cache lines accessed in the common
case, i.e., when neither the debug registers nor the I/O bitmap are used.
This patch covers the x86-64 modifications. A patch for i386 follows.
Changelog:
- add TIF_DEBUG to track when debug registers are active
- add TIF_IO_BITMAP to track when I/O bitmap is used
- modify __switch_to() to use the new TIF flags
<signed-off-by>: eranian@hpl.hp.com
Signed-off-by: Andi Kleen <ak@suse.de>
For NUMA optimization and some other algorithms it is useful to have a fast
to get the current CPU and node numbers in user space.
x86-64 added a fast way to do this in a vsyscall. This adds a generic
syscall for other architectures to make it a generic portable facility.
I expect some of them will also implement it as a faster vsyscall.
The cache is an optimization for the x86-64 vsyscall optimization. Since
what the syscall returns is an approximation anyways and user space
often wants very fast results it can be cached for some time. The norma
methods to get this information in user space are relatively slow
The vsyscall is in a better position to manage the cache because it has direct
access to a fast time stamp (jiffies). For the generic syscall optimization
it doesn't help much, but enforce a valid argument to keep programs
portable
I only added an i386 syscall entry for now. Other architectures can follow
as needed.
AK: Also added some cleanups from Andrew Morton
Signed-off-by: Andi Kleen <ak@suse.de>
This patch adds a vgetcpu vsyscall, which depending on the CPU RDTSCP
capability uses either the RDTSCP or CPUID to obtain a CPU and node
numbers and pass them to the program.
AK: Lots of changes over Vojtech's original code:
Better prototype for vgetcpu()
It's better to pass the cpu / node numbers as separate arguments
to avoid mistakes when going from SMP to NUMA.
Also add a fast time stamp based cache using a user supplied
argument to speed things more up.
Use fast method from Chuck Ebbert to retrieve node/cpu from
GDT limit instead of CPUID
Made sure RDTSCP init is always executed after node is known.
Drop printk
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
This patch adds initalization of the RDTSCP auxilliary values to CPU numbers
to time.c. If RDTSCP is available, the MSRs are written with the respective
values. It can be later used to initalize per-cpu timekeeping variables.
AK: Some cleanups. Move externs into headers and fix CPU hotplug.
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
AK: This redoes the changes I temporarily reverted.
Intel now has support for Architectural Performance Monitoring Counters
( Refer to IA-32 Intel Architecture Software Developer's Manual
http://www.intel.com/design/pentium4/manuals/253669.htm ). This
feature is present starting from Intel Core Duo and Intel Core Solo processors.
What this means is, the performance monitoring counters and some performance
monitoring events are now defined in an architectural way (using cpuid).
And there will be no need to check for family/model etc for these architectural
events.
Below is the patch to use this performance counters in nmi watchdog driver.
Patch handles both i386 and x86-64 kernels.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
After a crash we should wait for NMI IPI event and not for external NMI or
NMI watchdog tick.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
When a unknown NMI happened the panic would claim a NMI watchdog timeout.
Also it would check the variable set by nmi_watchdog=panic and panic then.
Fix up the panic message to be generic
Unconditionally panic on unknown NMI when panic on unknown nmi is enabled.
Noticed by Jan Beulich
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Making NMI suspend/resume work with SMP. We use CPU hotplug to offline
APs in SMP suspend/resume. Only BSP executes sysdev's .suspend/.resume
method. APs should follow CPU hotplug code path.
And:
+From: Don Zickus <dzickus@redhat.com>
Makes the start/stop paths of nmi watchdog more robust to handle the
suspend/resume cases more gracefully.
AK: I merged the two patches together
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Clean up some of the output messages on the nmi error paths to make more
sense when they are displayed. This is mainly a cosmetic fix and
shouldn't impact any normal code path.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
To quote Alan Cox:
The default Linux behaviour on an NMI of either memory or unknown is to
continue operation. For many environments such as scientific computing
it is preferable that the box is taken out and the error dealt with than
an uncorrected parity/ECC error get propogated.
A small number of systems do generate NMI's for bizarre random reasons
such as power management so the default is unchanged. In other respects
the new proc/sys entry works like the existing panic controls already in
that directory.
This is separate to the edac support - EDAC allows supported chipsets to
handle ECC errors well, this change allows unsupported cases to at least
panic rather than cause problems further down the line.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Adds a new /proc/sys/kernel/nmi_watchdog call that will enable/disable the
nmi watchdog.
By entering a non-zero value here, a user can enable the nmi watchdog to
monitor the online cpus in the system. By entering a zero value here, a
user can disable the nmi watchdog and free up a performance counter which
could then be utilized by the oprofile subsystem, otherwise oprofile may be
short a counter when in use.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Adds a new /proc/sys/kernel/nmi call that will enable/disable the nmi
watchdog.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Removes the un/set_nmi_callback and reserve/release_lapic_nmi functions as
they are no longer needed. The various subsystems are modified to register
with the die_notifier instead.
Also includes compile fixes by Andrew Morton.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
We need TIF_RESTORE_SIGMASK in order to support ppoll() and pselect()
system calls. This patch originally came from Andi, and was based
heavily on David Howells' implementation of same on i386. I fixed a typo
which was causing do_signal() to use the wrong signal mask.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andi Kleen <ak@suse.de>
This patch cleans up the NMI interrupt path. Instead of being gated by if
the 'nmi callback' is set, the interrupt handler now calls everyone who is
registered on the die_chain and additionally checks the nmi watchdog,
reseting it if enabled. This allows more subsystems to hook into the NMI if
they need to (without being block by set_nmi_callback).
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
This patch includes the changes to make the nmi watchdog on x86_64 SMP
aware. A bunch of code was moved around to make it simpler to read. In
addition, it is now possible to determine if a particular NMI was the result
of the watchdog or not. This feature allows the kernel to filter out
unknown NMIs easier.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Adds basic infrastructure to allow subsystems to reserve performance
counters on the x86 chips. Only UP kernels are supported in this patch to
make reviewing easier. The SMP portion makes a lot more changes.
Think of this as a locking mechanism where each bit represents a different
counter. In addition, each subsystem should also reserve an appropriate
event selection register that will correspond to the performance counter it
will be using (this is mainly neccessary for the Pentium 4 chips as they
break the 1:1 relationship to performance counters).
This will help prevent subsystems like oprofile from interfering with the
nmi watchdog.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Now that the tfm is passed directly to setkey instead of the ctx, we no
longer need to pass the &tfm->crt_flags pointer.
This patch also gets rid of a few unnecessary checks on the key length
for ciphers as the cipher layer guarantees that the key length is within
the bounds specified by the algorithm.
Rather than testing dia_setkey every time, this patch does it only once
during crypto_alloc_tfm. The redundant check from crypto_digest_setkey
is also removed.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This reverts commits 11012d419c and
40dd2d20f2, which allowed us to use the
MMIO accesses for PCI config cycles even without the area being marked
reserved in the e820 memory tables.
Those changes were needed for EFI-environment Intel macs, but broke some
newer Intel 965 boards, so for now it's better to revert to our old
2.6.17 behaviour and at least avoid introducing any new breakage.
Andi Kleen has a set of patches that work with both EFI and the broken
Intel 965 boards, which will be applied once they get wider testing.
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Edgar Hucek <hostmaster@ed-soft.at>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Take default arch/*/kernel/audit.c to lib/, have those with special
needs (== biarch) define AUDIT_ARCH in their Kconfig.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's possible to get an invalid page fault in kernel mode when we try to
write out segments from vsyscall32 when dumping core for a 32bit process if
the vsyscall32 DSO is not mapped in its address space (which can happen if,
for example, ulimit -v 100 is run).
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The values in init_tss.ist[] can change when an IST event occurs. Save
the original IST values for checking stack addresses when debugging or
doing stack traces.
Signed-off-by: Keith Owens <kaos@ocs.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As a replacement for the earlier removal of the e820 MCFG check
we blacklist the Intel SDV with the original BIOS bug that
motivated that check. On those machines don't use MMCONFIG.
This also adds a new pci=mmconf parameter to override the blacklist.
Cc: Greg KH <gregkh@suse.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Noticed by Jan Beulich.
When the kernel was moved from 1MB to 2MB in 2.6.17 the kernel reservation
code wasn't adjusted and it still reserved starting with 1MB. This means 1MB always
were lost.
This patch fixes this by reserving only starting with _text.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The unwinder fallback logic still had potential for falling through to
the legacy stack trace code without printing an indication (at once
serving as a separator) of this.
Further, the stack pointer retrieval for the fallback should be as
restrictive as possible (in order to avoid having the legacy stack
tracer try to access invalid memory). The patch tightens that, but
this could certainly be further improved.
Also making the call_trace command line option now conditional upon
CONFIG_STACK_UNWIND (as it's meaningless otherwise).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One open question: Should these added pushes perhaps be made
conditional upon CONFIG_STACK_UNWIND or CONFIG_UNWIND_INFO?
[AK: Not needed -- these are all very slow paths]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The check for the MCFG table being reserved in the e820 map was originally
added to detect a broken BIOS in a preproduction Intel SDV. However it also
breaks the Apple x86 Macs, which can't supply this properly, but need
a working MCFG. With this patch they wouldn't use the MCFG and not work.
After some discussion I think it's best to remove the heuristic again.
It also failed on some other boxes (although it didn't cause much
problems there because old style port access for PCI config space
still works as fallback), but the preproduction SDVs can just use
pci=nommcfg. Supporting production machines properly is more
important.
Edgar Hucek did all the debugging work.
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Edgar Hucek <hostmaster@ed-soft.at>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously the message was "Fatal exception: panic_on_oops", as introduced
in a recent patch whith removed a somewhat dangerous call to ssleep() in
the panic_on_oops path. However, Paul Mackerras suggested that this was
somewhat confusing, leadind people to believe that it was panic_on_oops
that was the root cause of the fatal exception. On his suggestion, this
patch changes the message to simply "Fatal exception". A suitable oops
message should already have been displayed.
Signed-off-by: Simon Horman <horms@verge.net.au>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If CONFIG_IOMMU_DEBUG is set force_iommu defaults to 1. In the case
where no HW IOMMU is present in the machine and we end up using nommu,
leaving force_iommu set to 1 causes dma_alloc_coherent to do the wrong
thing. Therefore, if we end up using nommu, make sure force_iommu is
0.
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Re-add backlink for old style unwinder to stack switching. Add proper
stack frame and CFI annotations to call_softirq
This prevents a oops when backtracing with fallback through the
interrupt stack top.
Suggested by Jan Beulich and Herbert Xu wanted it in 2.6.18.
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The latest toolchains can produce a new ELF section in DSOs and
dynamically-linked executables. The new section ".gnu.hash" replaces
".hash", and allows for more efficient runtime symbol lookups by the
dynamic linker. The new ld option --hash-style={sysv|gnu|both} controls
whether to produce the old ".hash", the new ".gnu.hash", or both. In some
new systems such as Fedora Core 6, gcc by default passes --hash-style=gnu
to the linker, so that a standard invocation of "gcc -shared" results in
producing a DSO with only ".gnu.hash". The new ".gnu.hash" sections need
to be dealt with the same way as ".hash" sections in all respects; only the
dynamic linker cares about their contents. To work with older dynamic
linkers (i.e. preexisting releases of glibc), a binary must have the old
".hash" section. The --hash-style=both option produces binaries that a new
dynamic linker can use more efficiently, but an old dynamic linker can
still handle.
The new section runs afoul of the custom linker scripts used to build vDSO
images for the kernel. On ia64, the failure mode for this is a boot-time
panic because the vDSO's PT_IA_64_UNWIND segment winds up ill-formed.
This patch addresses the problem in two ways.
First, it mentions ".gnu.hash" in all the linker scripts alongside ".hash".
This produces correct vDSO images with --hash-style=sysv (or old tools),
with --hash-style=gnu, or with --hash-style=both.
Second, it passes the --hash-style=sysv option when building the vDSO
images, so that ".gnu.hash" is not actually produced. This is the most
conservative choice for compatibility with any old userland. There is some
concern that some ancient glibc builds (though not any known old production
system) might choke on --hash-style=both binaries. The optimizations
provided by the new style of hash section do not really matter for a DSO
with a tiny number of symbols, as the vDSO has. If someone wants to use
=gnu or =both for their vDSO builds and worry less about that
compatibility, just change the option and the linker script changes will
make any choice work fine.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Use hotplug version of register_cpu_notifier in late init functions.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch is part of an effort to unify the panic_on_oops behaviour across
all architectures that implement it.
It was pointed out to me by Andi Kleen that if an oops has occured in
interrupt context, then calling sleep() in the oops path will only cause a
panic, and that it would be really better for it not to be in the path at
all.
This patch removes the ssleep() call and reworks the console message
accordinly. I have a slght concern that the resulting console message is
too long, feedback welcome.
For powerpc it also unifies the 32bit and 64bit behaviour.
Fror x86_64, this patch only updates the console message, as ssleep() is
already not present.
Signed-off-by: Horms <horms@verge.net.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Chris Zankel <chris@zankel.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
One of my original comments in machine_kexec was unclear
and this should fix it.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <ak@muc.de>
Acked-by: Horms <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was broken before. But having it is important as possible hardware
bug workaround.
And previously there was no way to force swiotlb if there is another IOMMU.
Side effect is that iommu=force won't force swiotlb anymore even if there
isn't another IOMMU.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As Travis Betak points out it accesses the wrong northbridge subfunction
now. Switch back to the old code.
Cc: "Travis Betak" <betak@mpdtxmail.amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Calgary hits a NULL pointer dereference when booting in a multi-chassis
NUMA system. See Redhat bugzilla number 198498, found by Konrad
Rzeszutek (konradr@redhat.com).
There are many issues that had to be resolved to fix this problem.
Firstly when I originally wrote the code to handle NUMA systems, I
had a large misunderstanding that was not corrected until now. That was
that I thought the "number of nodes online" referred to number of
physical systems connected. So that if NUMA was disabled, there
would only be 1 node and it would only show that node's PCI bus.
In reality if NUMA is disabled, the system displays all of the
connected chassis as one node but is only ignorant of the delays
in accessing main memory. Therefore, references to num_online_nodes()
and MAX_NUMNODES are incorrect and need to be set to the maximum
number of nodes that can be accessed (which are 8). I created a
variable, MAX_NUM_CHASSIS, and set it to 8 to fix this.
Secondly, when walking the PCI in detect_calgary, the code only
checked the first "slot" when looking to see if a device is present.
This will work for most cases, but unfortunately it isn't always the
case. In the NUMA MXE drawers, there are USB devices present on the
3rd slot (with slot 1 being empty). So, to work around this, all
slots (up to 8) are scanned to see if there are any devices present.
Lastly, the bus is being enumerated on large systems in a different
way the we originally thought. This throws the ugly logic we had
out the window. To more elegantly handle this, I reorganized the
kva array to be sparse (which removed the need to have any bus number
to kva slot logic in tce.c) and created a secondary space array to
contain the bus number to phb mapping.
With these changes Calgary boots on an x460 with 4 nodes with and
without NUMA enabled.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fixed off-by-one error in detect_calgary and calgary_init which will
cause arrays to overflow. Also, removed impossible to hit BUG_ON.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On Intel systems generally the TSC stops in C3 or deeper,
so don't use it there. Follows similar logic on i386.
This should fix problems on Meroms.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The dwarf2 unwinder currently often gets stuck because a lot
of assembly code doesn't have proper dwarf2 annotiation yet.
This currently often happens with __down. Should fix this by
adding proper dwarf2 annotation to all inline assembly. However
until that's done we need a quick fix for 2.6.18 to avoid
incomplete backtraces.
So when this happens dump the rest of the stack with the old unwinder
instead of silently not dumping it. There was already a optional
"both" mode that dumped both, but that was too ugly.
I also clarified the headers for the different backtraces a bit.
Also add a clear error message for missing dwarf2
annotation that people can work on.
And I removed a dead variable left over from Ingo's changes.
Cc: mingo@elte.hu
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When int 0x80 is called from long mode r8-r11 would leak out of the
kernel (or rather they would be filled with some values from
the kernel stack). I don't think it's a security issue because
the values come from the fixed stack frame which should be near
always user registers from a previous interrupt.
Still better fix it.
Longer term the register save macros need to be cleaned up
to avoid such mistakes in the future.
Original analysis from Richard Brunner, fix by me.
Cc: Richard.Brunner@amd.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fixes a obscure user space triggerable crash during oprofiling.
Oprofile calls profile_pc from NMIs even when user_mode(regs) is not true and
the program counter is inside the kernel lock section. This opens
a race - when a user program jumps to a kernel lock address and
a NMI happens before the illegal page fault exception is raised
and the program has a unmapped esp or ebp then the kernel could
oops. NMIs have a higher priority than exceptions so that could
happen.
Add user_mode checks to i386/x86-64 profile_pc to prevent that.
Cc: John Levon <levon@movementarian.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We can't safely directly access an compat_alloc_user_space() pointer
with the siginfo copy functions. Bounce it through the stack.
Noticed by Al Viro using sparse
[ This was only added post 2.6.17, not in any released kernel ]
Cc: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently ia32 binaries behave differently with respect to enabling
READ_IMPLIES_EXEC. On i386 a binary with the exec_stack flag set is
executed with READ_IMPLIES_EXEC enabled as well. The same binary
executes without READ_IMPLIES_EXEC on x86-64.
This causes binaries that work on i386 to fail on x86-64 which goes
somewhat against the whole 32 bit emulation idea.
It has been argued that READ_IMPLIES_EXEC should not be enabled at all
for binaries that have the exec_stack flag. Which is probably a valid
point. However until this is clarified I think x86-64 should behave the
same for ia32 binaries as i386.
The following patch brings x86-64 in sync with i386 for ia32 binaries.
Signed-off-by: Markus Schoder <lists@gammarayburst.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
screen_info.h doesn't have anything to do with the tty layer and shouldn't be
included by tty.h. This patches removes the include and modifies all users to
directly include screen_info.h. struct screen_info is mainly used to
communicate with the console drivers in drivers/video/console. Note that this
patch touches every arch and I have no way of testing it. If there is a
mistake the worst thing that will happen is a compile error.
[akpm@osdl.org: fix arm build]
[akpm@osdl.org: fix alpha build]
Signed-off-by: Jon Smirl <jonsmir@gmail.com>
Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
{un}register_die_notifier() is used by kdb... document this so that future
"remove dead export" rounds can skip this export.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up lockdep on-stack-completion initializer. (This also removes the
dependency on waitqueue_lock_key.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
lockdep needs to have the waitqueue lock initialized for on-stack waitqueues
implicitly initialized by DECLARE_COMPLETION(). Annotate on-stack completions
accordingly.
Has no effect on non-lockdep kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make use of local_irq_enable_in_hardirq() API to annotate places that enable
hardirqs in hardirq context.
Has no effect on non-lockdep kernels.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64 uses spinlocks very early - earlier than start_kernel(). So call
lockdep_init() from the arch setup code.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Framework to generate and save stacktraces quickly, without printing anything
to the console. x86_64 support.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Document stack frame nesting internals some more.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Beautify x86_64 stacktraces to be more readable.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
include/linux/version.h contained both actual KERNEL version
and UTS_RELEASE that contains a subset from git SHA1 for when
kernel was compiled as part of a git repository.
This had the unfortunate side-effect that all files including version.h
would be recompiled when some git changes was made due to changes SHA1.
Split it out so we keep independent parts in separate files.
Also update checkversion.pl script to no longer check for UTS_RELEASE.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Use the new IRQF_ constants and remove the SA_INTERRUPT define
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add __start_rodata and __end_rodata to sections.h to avoid extern
declarations. Needed by s390 code (see following patch).
[akpm@osdl.org: update architectures]
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow to tie upper bits of syscall bitmap in audit rules to kernel-defined
sets of syscalls. Infrastructure, a couple of classes (with 32bit counterparts
for biarch targets) and actual tie-in on i386, amd64 and ia64.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
skb_release_data() no longer has any users in other files.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several KConfig files had 'similarity' and 'independent' spelled incorrectly...
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add ->retrigger() irq op to consolidate hw_irq_resend() implementations.
(Most architectures had it defined to NOP anyway.)
NOTE: ia64 needs testing. i386 and x86_64 tested.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Consolidation: remove the irq_affinity[NR_IRQS] array and move it into the
irq_desc[NR_IRQS].affinity field.
[akpm@osdl.org: sparc64 build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch-queue improves the generic IRQ layer to be truly generic, by adding
various abstractions and features to it, without impacting existing
functionality.
While the queue can be best described as "fix and improve everything in the
generic IRQ layer that we could think of", and thus it consists of many
smaller features and lots of cleanups, the one feature that stands out most is
the new 'irq chip' abstraction.
The irq-chip abstraction is about describing and coding and IRQ controller
driver by mapping its raw hardware capabilities [and quirks, if needed] in a
straightforward way, without having to think about "IRQ flow"
(level/edge/etc.) type of details.
This stands in contrast with the current 'irq-type' model of genirq
architectures, which 'mixes' raw hardware capabilities with 'flow' details.
The patchset supports both types of irq controller designs at once, and
converts i386 and x86_64 to the new irq-chip design.
As a bonus side-effect of the irq-chip approach, chained interrupt controllers
(master/slave PIC constructs, etc.) are now supported by design as well.
The end result of this patchset intends to be simpler architecture-level code
and more consolidation between architectures.
We reused many bits of code and many concepts from Russell King's ARM IRQ
layer, the merging of which was one of the motivations for this patchset.
This patch:
rename desc->handler to desc->chip.
Originally i did not want to do this, because it's a big patch. But having
both "desc->handler", "desc->handle_irq" and "action->handler" caused a
large degree of confusion and made the code appear alot less clean than it
truly is.
I have also attempted a dual approach as well by introducing a
desc->chip alias - but that just wasnt robust enough and broke
frequently.
So lets get over with this quickly. The conversion was done automatically
via scripts and converts all the code in the kernel.
This renaming patch is the first one amongst the patches, so that the
remaining patches can stay flexible and can be merged and split up
without having some big monolithic patch act as a merge barrier.
[akpm@osdl.org: build fix]
[akpm@osdl.org: another build fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Memory hotplug code of i386 adds memory to only highmem. So, if
CONFIG_HIGHMEM is not set, CONFIG_MEMORY_HOTPLUG shouldn't be set.
Otherwise, it causes compile error.
In addition, many architecture can't use memory hotplug feature yet. So, I
introduce CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We recently changed x86 to handle more than 256 IRQs. Add a check in do_IRQ()
just to make sure that nothing went wrong with that implementation.
[chrisw@sous-sol.org: do x86_64 too]
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@muc.de>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
/sys/devices/system/cpu/ control the MC/SMT power savings policy for the
scheduler.
Based on the values (1-enable, 0-disable) for these controls, sched groups
cpu power will be determined for different domains. When power savings
policy is enabled and under light load conditions, scheduler will minimize
the physical packages/cpu cores carrying the load and thus conserving
power(with a perf impact based on the workload characteristics... see OLS
2005 CMP kernel scheduler paper for more details..)
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Con Kolivas <kernel@kolivas.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make notifier_blocks associated with cpu_notifier as __cpuinitdata.
__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a
band-aid solution to solve that problem. In the process, i undid all the
changes you both were making to ensure that these notifiers were available
only at init time (unless CONFIG_HOTPLUG_CPU is defined).
We deferred the real fix to 2.6.18. Here is a set of patches that fixes the
XFS problem cleanly and makes the cpu notifiers available only at init time
(unless CONFIG_HOTPLUG_CPU is defined).
If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run
time.
This patch reverts the notifier_call changes made in 2.6.17
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Localize poison values into one header file for better documentation and
easier/quicker debugging and so that the same values won't be used for
multiple purposes.
Use these constants in core arch., mm, driver, and fs code.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the limit of 256 interrupt vectors by changing the value stored in
orig_{e,r}ax to be the complemented interrupt vector. The orig_{e,r}ax
needs to be < 0 to allow the signal code to distinguish between return from
interrupt and return from syscall. With this change applied, NR_IRQS can
be > 256.
Xen extends the IRQ numbering space to include room for dynamically
allocated virtual interrupts (in the range 256-511), which requires a more
permissive interface to do_IRQ.
Signed-off-by: Ian Pratt <ian.pratt@xensource.com>
Signed-off-by: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the name of old add_memory() to arch_add_memory. And use node id to
get pgdat for the node at NODE_DATA().
Note: Powerpc's old add_memory() is defined as __devinit. However,
add_memory() is usually called only after bootup.
I suppose it may be redundant. But, I'm not well known about powerpc.
So, I keep it. (But, __meminit is better at least.)
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial:
typo fixes
Clean up 'inline is not at beginning' warnings for usb storage
Storage class should be first
i386: Trivial typo fixes
ixj: make ixj_set_tone_off() static
spelling fixes
fix paniced->panicked typos
Spelling fixes for Documentation/atomic_ops.txt
move acknowledgment for Mark Adler to CREDITS
remove the bouncing email address of David Campbell
Applies to git & 2.6.17-rc6 after CONFIG_DEBUG_STACKOVERFLOW patch
uses same stack-zeroing mechanism as on i386 to discover maximum stack
excursions.
Signed-off-by: Eric Sandeen <sandeen@sgi.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Take two, now without spurious whitespace :( Applies to git & 2.6.17-rc6
CONFIG_DEBUG_STACKOVERFLOW existed for x86_64 in 2.4, but seems to have gone AWOL in 2.6.
I've pretty much just copied this over from the 2.4 code, with
appropriate tweaks for the 2.6 kernel, plus a bugfix. I'd personally
rather see it printed out the way other arches do it, i.e.
bytes-remaining-until-overflow, rather than having to do the subtraction
yourself. Also, only 128 bytes remaining seems awfully late to issue a
warning. But I'll start here :)
Signed-off-by: Eric Sandeen <sandeen@sgi.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Intel now has support for Architectural Performance Monitoring Counters
( Refer to IA-32 Intel Architecture Software Developer's Manual
http://www.intel.com/design/pentium4/manuals/253669.htm ). This
feature is present starting from Intel Core Duo and Intel Core Solo processors.
What this means is, the performance monitoring counters and some performance
monitoring events are now defined in an architectural way (using cpuid).
And there will be no need to check for family/model etc for these architectural
events.
Below is the patch to use this performance counters in nmi watchdog driver.
Patch handles both i386 and x86-64 kernels.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On some i386/x86_64 systems, sending an NMI IPI as a broadcast will
reset the system. This seems to be a BIOS bug which affects machines
where one or more cpus are not under OS control. It occurs on HT
systems with a version of the OS that is not compiled without HT
support. It also occurs when a system is booted with max_cpus=n where
2 <= n < cpus known to the BIOS. The fix is to always send NMI IPI as
a mask instead of as a broadcast.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Appended patch fixes the "APIC error on CPUX: 00(40)" observed during bootup.
From SDM Vol-3A "Valid Interrupt Vectors" section:
"When an illegal vector value (0-15) is written to an LVT entry
and the delivery mode is Fixed, the APIC may signal an illegal
vector error, with out regard to whether the mask bit is set
or whether an interrupt is actually seen on input."
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow stack growth so the 'enter' instruction works. Also
fixes problem in compat_sys_kexec_load() which could allocate
more than 128 bytes using compat_alloc_user_space().
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Use tail call from clear_user to __clear_user to save some code size
- Use standard memcpy for forward memmove
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Only exports for assembler files are left in x8664_ksyms.c
Originally inspired by a patch from Al Viro
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
x86_64 and i386 behave inconsistently when sending an IPI on vector 2
(NMI_VECTOR). Make both behave the same, so IPI 2 is sent as NMI.
The crash code was abusing send_IPI_allbutself() by passing a code
instead of a vector, it only worked because crash knew about the
internal code of send_IPI_allbutself(). Change crash to use NMI_VECTOR
instead, and remove the comment about how crash was abusing the function.
This patch is a pre-requisite for fixing the problem where sending an
IPI as NMI would reboot some Dell Xeon systems. I cannot fix that
problem while crash continus to abuse send_IPI_allbutself().
It also removes the inconsistency between i386 and x86_64 for
NMI_VECTOR. That will simplify all the RAS code that needs to bring
all the cpus to a clean stop, even when one or more cpus are spinning
disabled.
Signed-off-by: Keith Owens <kaos@sgi.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It turned out that the following change is needed when the speaker is
compiled as a module.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch removes the no longer used sys32_ni_syscall()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently CONFIG_REORDER uses -ffunction-sections for all code;
however, creating a separate section for each function is not useful
for modules and just adds bloat. Moving this option from CFLAGS to
CFLAGS_KERNEL shrinks module object files (e.g., the module tree for a
kernel built with most drivers as modules shrinked from 54M to 46M),
and decreases the number of sysfs files in /sys/module/*/sections/
directories.
Signed-off-by: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Defaulting to a value not evenly divisible by four makes little sense,
as four values are displayed per line (and hence the rest of the line
would otherwise be wasted).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
With (significantly) more than 10 CPUs online, the column headings
drifted off the positions of the column contents with growing CPU
numbers.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The APIC ID returned by hard_smp_processor_id can be beyond
NR_CPUS and then overflow the x86_cpu_to_apic[] array.
Add a check for overflow. If it happens then the slow loop below
will catch.
Bug pointed out by Doug Thompson
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Getting phys_proc_id and cpu_core_id information to be printed at boot
time for AMD processors. Also matching the Node related boot time
information that gets printed for Intel and AMD processors for NUMA
configurations.
Signed-off-by: Rohit Seth <rohitseth@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
During some profiling I noticed that default_idle causes a lot of
memory traffic. I think that is caused by the atomic operations
to clear/set the polling flag in thread_info. There is actually
no reason to make this atomic - only the idle thread does it
to itself, other CPUs only read it. So I moved it into ti->status.
Converted i386/x86-64/ia64 for now because that was the easiest
way to fix ACPI which also manipulates these flags in its idle
function.
Cc: Nick Piggin <npiggin@novell.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Len Brown <len.brown@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
No red zone possible/needed on the alternative stack.
It caused confusion.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- fix an off-by-one error in phys_pmd_init()
- prevent phys_pmd_init() from removing mappings established earlier
- fix the direct mapping early printk to in fact show the end of the range
- remove an apparently orphan comment
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up mce_amd.c for readability and remove code no
longer needed.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add support for mce threshold registers found in future
AMD family 0x10 processors. Backwards compatible with
family 0xF hardware.
AK: fixed build on !SMP
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add support for extended APIC LVT found in future AMD processors.
Signed-off-by: Jacob Shin <jacob.shin@amd.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
After writing the CFG register, the first value written to the T0_CMP
register is the value at which next interrupt should be triggered, every
value after that sets the period of the interrupt. For that reason, the code
needs to write the value twice - to set both the phase and period.
[AK: I had already figured it out by myself, but it's still useful
to have a comment for this.]
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch makes use of the newly added conversion constants
in time.h to x86-64 time.c. The code gets significantly easier
to understand.
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove #ifdefed code to manually enable HPET on AMD8111, where the
BIOS doesn't have ACPI HPET tables and doesn't enable it for us.
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch adds the X86_FEATURE_RDTSCP #define, so that kernel code can
check for the feature easily and also fixes the location of the "rdtscp"
string in the cpuinfo tables.
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename oem_force_hpet_timer to apic_is_clustered_box, to give the
function a better fitting name - it really isn't at all about HPET.
Signed-off-by: Vojtech Pavlik <vojtech@suse.cz>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Most of the fields of cpuinfo are defined in cpuinfo_x86 structure.
This patch moves the phys_proc_id and cpu_core_id for each processor to
cpuinfo_x86 structure as well.
Signed-off-by: Rohit Seth <rohitseth@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch hooks Calgary into the build, the x86-64 IOMMU
initialization paths, and introduces the Calgary specific bits. The
implementation draws inspiration from both PPC (which has support for
the same chip but requires firmware support which we don't have on
x86-64) and gart. Calgary is different from gart in that it support a
translation table per PHB, as opposed to the single gart aperture.
Changes from previous version:
* Addition of boot-time disablement for bus-level translation/isolation
(e.g, enable userspace DMA for things like X)
* Usage of newer IOMMU abstraction functions
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch creates a new interface for IOMMUs by adding a centralized
location for IOMMU allocation (for translation tables/apertures) and
IOMMU initialization. In creating these, code was moved around for
abstraction, uniformity, and consiceness.
Take note of the move of the iommu_setup bootarg parsing code to
__setup. This is enabled by moving back the location of the aperture
allocation/detection to mem init (which while ugly, was already the
location of the swiotlb_init).
While a slight departure from the previous patch, I belive this provides
the true intention of the previous versions of the patch which changed
this code. It also makes the addition of the upcoming calgary code much
cleaner than previous patches.
[AK: Removed one broken change. iommu_setup still has to be called
early]
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
swiotlb relies on the gart specific iommu_aperture variable to know if
we discovered a hardware IOMMU before swiotlb initialization. Introduce
iommu_detected to do the same thing, but in a HW IOMMU neutral manner,
in preparation for adding the Calgary HW IOMMU.
Signed-Off-By: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-Off-By: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
pud_offset_k() equivalent to pud_offset() now. Pointed out by Jan Beulich
Similar for __pud_offset_ok, which needs a small change in the callers.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Clean up arch/{i386,x86_64}/boot/compressed/misc.c a bit to reduce their
differences. Should have zero effect on code generation.
Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If no unwinding is possible at all for a certain exception instance,
fall back to the old style call trace instead of not showing any trace
at all.
Also, allow setting the stack trace mode at the command line.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Adjust the CFA offset for 64- and 32-bit syscall entries so that the five
slots pre-subtracted from the stack pointer do not appear to reside outside
of the current frame.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the switching to/from the IRQ stack so that unwind annotations can
be added for it without requiring CFA expressions.
AK: I cleaned it up a bit, making it unconditional and removing the
obsolete DEBUG_INFO full frame code.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
These are the x86_64-specific pieces to enable reliable stack traces. The
only restriction with this is that it currently cannot unwind across the
interrupt->normal stack boundary, as that transition is lacking proper
annotation.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In x86_64 architecture, if device driver with msi function
gets 0xee vector by assign_irq_vector() function, system will
crash if this interrupt happens. It is because 0xee interrupt
entry is empty. This patch modifies this. This patch is based
on 2.6.17-rc6.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Rename the GART_IOMMU option to IOMMU to make clear it's not
just for AMD
- Rewrite the help text to better emphatise this fact
- Make it an embedded option because too many people get it wrong.
To my astonishment I discovered the aacraid driver tests this
symbol directly. This looks quite broken to me - it's an internal
implementation detail of the PCI DMA API. Can the maintainer
please clarify what this test was intended to do?
Cc: linux-scsi@vger.kernel.org
Cc: alan@redhat.com
Cc: markh@osdl.org
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously it would only work in the first 32bit system call, not during
early process setup.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
include/asm-x86_64/gart-mapping.h is only ever used in
arch/x86_64/kernel/setup.c and none of its contents are referenced.
Looks to be leftover cruft not removed in the dma_ops patch.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It was originally added for 2.4 oprofile, but 2.6 oprofile doesn't
need that anymore. Shouldn't be any use in tree anymore and it doesn't
make much sense to export the ia32 syscalls when the main syscalls
are not exported.
I think Adrian Bunk asked for removing it several times.
Also included hunk from Adrian to remove the .globl ia32_sys_call_table
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Early development of x86-64 Linux was in CVS, but that hasn't been
the case for a long time now. Remove the obsolete $Id$s.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since END()/ENDPROC() are now available, add respective annotations to
x86_64's entry.S. This should help debugging activities.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Sometimes e.g. with crashme the compat layer warnings can be noisy.
Add a way to turn them off by gating all output through compat_printk
that checks a global sysctl. The default is not changed.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix
initcall at 0xffffffff806c5b89: pci_iommu_init+0x0/0x53c(): returned with error code -1
Return -ENODEV instead when the IOMMU is not used.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since assign_irq_vector() can be called at runtime, its access of static
variables should be protected by a lock.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- Factor out the duplicated access/cache code into a single file
* Shared between i386/x86-64.
- Share flush code between AGP and IOMMU
* Fix a bug: AGP didn't wait for end of flush before
- Drop 8 northbridges limit and allocate dynamically
- Add lock to serialize AGP and IOMMU GART flushes
- Add PCI ID for next AMD northbridge
- Random related cleanups
The old K8 NUMA discovery code is unchanged. New systems
should all use SRAT for this.
Cc: "Navin Boppuri" <navin.boppuri@newisys.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A trivial change to have gart_unmap_sg call gart_unmap_single directly,
instead of bouncing through the dma_unmap_single wrapper in
dma-mapping.h.
This change required moving the gart_unmap_single above gart_unmap_sg,
and under gart_map_single (which seems a more logical place that its
current location IMHO).
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously we would just silently provide 64 bit services
for this to 32bit processes.
I also added all the other cases explicitely to the ptrace
compat wrapper to make sure this doesn't happen again.
And removed one bogus check in the wrapper.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow search for a contiguous block of iommu space to cross the next_bit
marker if we have already committed ourselves to flushing the gart.
There shouldn't be any reason why we'd restrict the search.
Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
enable large bzImages on x86_64. (fix is from x86's build.c) Using this
patch i have successfully built and booted an allyesconfig 13MB+ bzImage
on x86_64 too:
$ size64 vmlinux
text data bss dec hex filename
23444831 8202642 3439360 35086833 21761f1 vmlinux
-rw-rw-r-- 1 mingo mingo 13121740 Apr 19 09:32 arch/x86_64/boot/bzImage
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changes are largely identical to the i386 version:
* alternative #define are moved to the new alternative.h file.
* one new elf section with pointers to the lock prefixes which can be
nop'ed out for non-smp.
* two new elf sections simliar to the "classic" alternatives to
replace SMP code with simpler UP code.
* fixup headers to use alternative.h instead of defining their own
LOCK / LOCK_PREFIX macros.
The patch reuses the i386 version of the alternatives code to avoid code
duplication. The code in alternatives.c was shuffled around a bit to
reduce the number of #ifdefs needed. It also got some tweaks needed for
x86_64 (vsyscall page handling) and new features (noreplacement option
which was x86_64 only up to now). Debug printk's are changed from
compile-time to runtime.
Loosely based on a early version from Bastian Blank <waldi@debian.org>
Signed-off-by: Gerd Hoffmann <kraxel@suse.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Intel systems report the cache level data from CPUID 4 in sysfs.
Add a CPUID 4 emulation for AMD CPUs to report the same
information for them. This allows programs to read this
information in a uniform way.
The AMD way to report this is less flexible so some assumptions
are hardcoded (e.g. no L3)
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously the apicid<->coreid split was computed based on the max
number of cores. Now use a new CPUID AMD defined for that. On most
systems right now it should be 0 and the old method will be used.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
vSMPowered systems use apic_cluster too. Forcing apic_physflat works
on these systems too, but only if we change phys_pkg_id to use
hard_smp_prcoessor_id() instead of cpuid_ebx. I am guessing other
multichassi cluster systems would need this too.
Signed-off-by: ravikiran thirumalai <kiran@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Enable some hwmon drivers as modules and tulip and stack unwinding
Kernel image should be somewhat bigger now because of the unwind
information being included, but you'll get exact backtraces for that.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- make firmware edid independent from framebuffer (No need to choose
framebuffer just to disable this option
- enable this option in X86_64
- check if VBE/DDC function is implemented before calling actual function
Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently in the do_page_fault() code path, we call notify_die(DIE_PAGE_FAULT,
...) to notify the page fault. Since notify_die() is highly overloaded, this
page fault notification is currently being sent to all the components
registered with register_die_notification() which uses the same die_chain to
loop for all the registered components which is unnecessary.
In order to optimize the do_page_fault() code path, this critical page fault
notification is now moved to different call chain and the test results showed
great improvements.
And the kprobes which is interested in this notifications, now registers onto
this new call chain only when it need to, i.e Kprobes now registers for page
fault notification only when their are an active probes and unregisters from
this page fault notification when no probes are active.
I have incorporated all the feedback given by Ananth and Keith and everyone,
and thanks for all the review feedback.
This patch:
Overloading of page fault notification with the notify_die() has performance
issues(since the only interested components for page fault is kprobes and/or
kdb) and hence this patch introduces the new notifier call chain exclusively
for page fault notifications their by avoiding notifying unnecessary
components in the do_page_fault() code path.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- written on init only, accessed for every timer read --> __read_mostly
- fix broken sentence
Signed-off-by: Andreas Mohr <andi@lisas.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In a testament to the utter simplicity and logic of the English
language ;-), I found a single correct use - in kernel/panic.c - and
10-15 incorrect ones.
Signed-Off-By: Lee Revell <rlrevell@joe-job.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
The wrapper routines are required when asmlinkage differs from the usual
calling convention. So we need to have them. However, by rearranging
the parameters, they will get optimised away to a single jump for most
people.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Up until now algorithms have been happy to get a context pointer since
they know everything that's in the tfm already (e.g., alignment, block
size).
However, once we have parameterised algorithms, such information will
be specific to each tfm. So the algorithm API needs to be changed to
pass the tfm structure instead of the context pointer.
This patch is basically a text substitution. The only tricky bit is
the assembly routines that need to get the context pointer offset
through asm-offsets.h.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This reverts commits
3e3318dee0 [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages
b6370d96e0 [PATCH] swsusp: i386 mark special saveable/unsaveable pages
ce4ab0012b [PATCH] swsusp: add architecture special saveable pages support
because not only do they apparently cause page faults on x86, the
infrastructure doesn't compile on powerpc.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch will fix a boot memory reservation bug that trashes memory on
the ES7000 when loading the kdump crash kernel.
The code in arch/x86_64/kernel/setup.c to reserve boot memory for the crash
kernel uses the non-numa aware "reserve_bootmem" function instead of the
NUMA aware "reserve_bootmem_generic". I checked to make sure that no other
function was using "reserve_bootmem" and found none, except the ones that
had NUMA ifdef'ed out.
I have tested this patch only on an ES7000 with NUMA on and off (numa=off)
in a single (non-NUMA) and multi-cell (NUMA) configurations.
Signed-off-by: Amul Shah <amul.shah@unisys.com>
Looks-good-to: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (65 commits)
ACPI: suppress power button event on S3 resume
ACPI: resolve merge conflict between sem2mutex and processor_perflib.c
ACPI: use for_each_possible_cpu() instead of for_each_cpu()
ACPI: delete newly added debugging macros in processor_perflib.c
ACPI: UP build fix for bugzilla-5737
Enable P-state software coordination via _PDC
P-state software coordination for speedstep-centrino
P-state software coordination for acpi-cpufreq
P-state software coordination for ACPI core
ACPI: create acpi_thermal_resume()
ACPI: create acpi_fan_suspend()/acpi_fan_resume()
ACPI: pass pm_message_t from acpi_device_suspend() to root_suspend()
ACPI: create acpi_device_suspend()/acpi_device_resume()
ACPI: replace spin_lock_irq with mutex for ec poll mode
ACPI: Allow a WAN module enable/disable on a Thinkpad X60.
sem2mutex: acpi, acpi_link_lock
ACPI: delete unused acpi_bus_drivers_lock
sem2mutex: drivers/acpi/processor_perflib.c
ACPI add ia64 exports to build acpi_memhotplug as a module
ACPI: asus_acpi_init(): propagate correct return value
...
Manual resolve of conflicts in:
arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c
arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c
include/acpi/processor.h
flush_tlb_all uses on_each_cpu, which will disable/enable interrupt.
In suspend/resume time, this will make interrupt wrongly enabled.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pages (Reserved/ACPI NVS/ACPI Data) below end_pfn will be saved/restored by S4
currently. We should mark 'Reserved' pages not saveable.
Pages (Reserved/ACPI NVS/ACPI Data) above end_pfn will not be saved/restored
by S4 currently. We should save the 'ACPI NVS/ACPI Data' pages.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)
Add support for move_pages() on i386 and also add the compat functions
necessary to run 32 bit binaries on x86_64.
Add compat_sys_move_pages to the x86_64 32bit binary layer. Note that it is
not up to date so I added the missing pieces. Not sure if this is done the
right way.
[akpm@osdl.org: compile fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Consolidate the various arch-specific implementations of pxm_to_node() and
node_to_pxm() into a single generic version.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6: (27 commits)
[PATCH] PCI: nVidia quirk to make AER PCI-E extended capability visible
[PATCH] PCI: fix issues with extended conf space when MMCONFIG disabled because of e820
[PATCH] PCI: Bus Parity Status sysfs interface
[PATCH] PCI: fix memory leak in MMCONFIG error path
[PATCH] PCI: fix error with pci_get_device() call in the mpc85xx driver
[PATCH] PCI: MSI-K8T-Neo2-Fir: run only where needed
[PATCH] PCI: fix race with pci_walk_bus and pci_destroy_dev
[PATCH] PCI: clean up pci documentation to be more specific
[PATCH] PCI: remove unneeded msi code
[PATCH] PCI: don't move ioapics below PCI bridge
[PATCH] PCI: cleanup unused variable about msi driver
[PATCH] PCI: disable msi mode in pci_disable_device
[PATCH] PCI: Allow MSI to work on kexec kernel
[PATCH] PCI: AMD 8131 MSI quirk called too late, bus_flags not inherited ?
[PATCH] PCI: Move various PCI IDs to header file
[PATCH] PCI Bus Parity Status-broken hardware attribute, EDAC foundation
[PATCH] PCI: i386/x86_84: disable PCI resource decode on device disable
[PATCH] PCI ACPI: Rename the functions to avoid multiple instances.
[PATCH] PCI: don't enable device if already enabled
[PATCH] PCI: Add a "enable" sysfs attribute to the pci devices to allow userspace (Xorg) to enable devices without doing foul direct access
...
The AGP default doesn't work well with other selects, so use a select for
GART_IOMMU as well. Remove a redundant default for SWIOTLB as well.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On 15 Jun 2006 03:45:10 +0200, Andi Kleen wrote:
> Anyways I would say that if the BIOS can't get MCFG right then
> it's likely not been validated on that board and shouldn't be used.
According to Petr Vandrovec:
... "What is important (and checked) is address of MMCONFIG reported by MCFG
table... Unfortunately code does not bother with printing that address :-(
"Another problem is that code has hardcoded that MMCONFIG area is 256MB large.
Unfortunately for the code PCI specification allows any power of two between 2MB
and 256MB if vendor knows that such amount of busses (from 2 to 128) will be
sufficient for system. With notebook it is quite possible that not full 8 bits
are implemented for MMCONFIG bus number."
So here is a patch. Unfortunately my system still fails the test because
it doesn't reserve any part of the MMCONFIG area, but this may fix others.
Booted on x86_64, only compiled on i386. x86_64 still remaps the max area
(256MB) even though only 2MB is checked... but 2.6.16 had no check at all
so it is still better.
PCI: reduce size of x86 MMCONFIG reserved area check
1. Print the address of the MMCONFIG area when the test for that area
being reserved fails.
2. Only check if the first 2MB is reserved, as that is the minimum.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
From: "Andy Currid" <ACurrid@nvidia.com>
This patch fixes a kernel panic during boot that occurs on NVIDIA platforms
that have HPET enabled.
When HPET is enabled, the standard timer IRQ is routed to IOAPIC pin 2 and is
advertised as such in the ACPI APIC table - but an earlier workaround in the
kernel was ignoring this override. The fix is to honor timer IRQ overrides
from ACPI when HPET is detected on an NVIDIA platform.
Signed-off-by: Andy Currid <acurrid@nvidia.com>
Cc: "Brown, Len" <len.brown@intel.com>
Cc: "Yu, Luming" <luming.yu@intel.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
int_ret_from_syscall already does syscall exit tracing, so
no need to do it again in the caller.
This caused problems for UML and some other special programs doing
syscall interception.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Robert Hentosh <robert_hentosh@dell.com>
Actually, we just stumbled on a different bug found in find_e820_area() in
e820.c. The following code does not handle the edge condition correctly:
while (bad_addr(&addr, size) && addr+size < ei->addr + ei->size)
;
last = addr + size;
if ( last > ei->addr + ei->size )
continue;
The second statement in the while loop needs to be a <= b so that it is the
logical negavite of the if (a > b) outside it. It needs to read:
while (bad_addr(&addr, size) && addr+size <= ei->addr + ei->size)
;
In the case that failed bad_addr was returning an address that is exactly size
bellow the end of the e820 range.
AK: Again together with the earlier avoid edma fix this fixes
boot on a Dell PE6850/16GB
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Daniel Yeisley <dan.yeisley@unisys.com>
It is possible to boot a Unisys ES7000 with CPUs from multiple cells, and not
also include the memory from those cells. This can create a scenario where
node 0 has cpus, but no associated memory. The system will boot fine in a
configuration where node 0 has memory, but nodes 2 and 3 do not.
[AK: I rechecked the code and generic code seems to indeed handle that already.
Dan's original patch had a change for mm/slab.c that seems to be already in now.]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: "Jan Beulich" <jbeulich@novell.com>
The PM timer code updates vxtime.last_tsc, but this update was done
incorrectly in two ways:
- offset_delay being in microseconds requires multiplying with cpu_mhz
rather than cpu_khz
- the multiplication of offset_delay and cpu_khz (both being 32-bit
values) on most current CPUs would overflow (observed value of the
delay was approximately 4000us, yielding an overflow for frequencies
starting a little above 1GHz)
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Complaining about the IOMMU not compiled in doesn't make sense
here because it is clearly compiled in.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
ia32_setup_arg_pages would ignore the passed in random stack top
and use its own static value.
Now it uses the 8bit of randomness native i386 would use too.
This indirectly fixes mmap randomization for 32bit processes too,
which depends on the stack randomization.
Should also give slightly better virtual cache colouring and
possibly better performance with HyperThreading.
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Problem:
If we put a probe onto a callq instruction and the probe is executed,
kernel panic of Bad RIP value occurs.
Root cause:
If resume_execution() found 0xff at first byte of p->ainsn.insn, it must
check the _second_ byte. But current resume_execution check _first_ byte
again.
I changed it checks second byte of p->ainsn.insn.
Kprobes on i386 don't have this problem, because the implementation is a
little bit different from x86_64.
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Satoshi Oshima <soshima@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Extends an earlier patch from John Blackwood to more exception handlers
that also run on the exception stacks.
Expand the use of preempt_conditional_{sti,cli} to all cases where
interrupts are to be re-enabled during exception handling while running
on an IST stack.
Based on original patch from Jan Beulich.
Cc: John Blackwood <john.blackwood@ccur.com>
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This fixes some boot failures on Dell and Unisys systems with memory
hotadd added.
- Set hotadd_percent to 0 by default. This means anybody using hotadd
memory needs to specify the value on the command line. That's
because there are lots of Intel boxes which have a bogus hotplug area
in their SRAT and they would waste a lot of memory before.
- Fix calculation of how much memory to use when the hotplug area
exceeds hotadd_percent
- Fix fallback when the
- Fix fallback if memory hotadd is not compiled in.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This triggers for b44's 1GB DMA workaround which tries to map
first and then bounces.
The 32bit heuristic is reasonable because the IOMMU doesn't attempt
to handle < 32bit masks anyways.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Based on analysis&patch from Robert Hentosch
Observed on a Dell PE6850 with 16GB
The problem occurs very early on, when the kernel allocates space for the
temporary memory map called bootmap. The bootmap overlaps the EBDA region.
EBDA region is not historically reserved in the e820 mapping. When the
bootmap is freed it marks the EBDA region as usable.
If you notice in setup.c there is already code to work around the EBDA
in reserve_ebda_region(), this check however occurs after the bootmap
is allocated and doesn't prevent the bootmap from using this range.
AK: I redid the original patch. Thanks also to Jan Beulich for
spotting some mistakes.
Cc: Robert_Hentosch@dell.com
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Playing with NMI watchdog on x86_64, I discovered that it didn't
do what I expected. It always panic-ed, even when it didn't
happen from interrupt context. This patch solves that
problem for me. Also, in this case, do_exit() will be called
with interrupts disabled, I believe. Would it be wise to also
call local_irq_enable() after nmi_exit()?
[Yes I added it -AK]
Currently, on x86_64, any NMI watchdog timeout will cause a panic
because the irq count will always be set to be in an interrupt
when do_exit() is called from die_nmi(). If we add nmi_exit() to
the die_nmi() call (since the nmi will never exit "normally")
it seems to solve this problem. The following small program
can be used to trigger the NMI watchdog to reproduce this:
main ()
{
iopl(3);
for (;;) asm("cli");
}
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I noticed this when poking around in this area.
The oops_begin() function in x86_64 would only conditionally claim
the die_lock if the call is nested, but oops_end() would always
release the spinlock. This patch adds a nest count for the die lock
so that the release of the lock is only done on the final oops_end().
Signed-off-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The IOMMU code can only deal with 8 northbridges. Error out when
more are found.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch addresses a problem with ACPI SCI interrupt entry, which gets
re-used, and the IRQ is assigned to another unrelated device. The patch
corrects the code such that SCI IRQ is skipped and duplicate entry is
avoided. Second issue came up with VIA chipset, the problem was caused by
original patch assigning IRQs starting 16 and up. The VIA chipset uses
4-bit IRQ register for internal interrupt routing, and therefore cannot
handle IRQ numbers assigned to its devices. The patch corrects this
problem by allowing PCI IRQs below 16.
Cc: len.brown@intel.com
Signed-off by: Natalie Protasevich <Natalie.Protasevich@unisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* 'audit.b10' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
[PATCH] Audit Filter Performance
[PATCH] Rework of IPC auditing
[PATCH] More user space subject labels
[PATCH] Reworked patch for labels on user space messages
[PATCH] change lspp ipc auditing
[PATCH] audit inode patch
[PATCH] support for context based audit filtering, part 2
[PATCH] support for context based audit filtering
[PATCH] no need to wank with task_lock() and pinning task down in audit_syscall_exit()
[PATCH] drop task argument of audit_syscall_{entry,exit}
[PATCH] drop gfp_mask in audit_log_exit()
[PATCH] move call of audit_free() into do_exit()
[PATCH] sockaddr patch
[PATCH] deal with deadlocks in audit_free()
The PC Speaker driver's ->probe() routine doesn't even get called in the
64-bit kernels. The reason for that is that the arch code apparently has
to explictly add a "pcspkr" platform device in order for the driver core to
call the ->probe() routine. arch/i386/kernel/setup.c unconditionally adds
a "pcspkr" device, but the x86_64 kernel has no code at all related to the
PC Speaker.
The patch below copies the relevant code from i386 to x86_64, which makes
the PC Speaker work for me on x86_64.
Cc: Dmitry Torokhov <dtor_core@ameritech.net>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Few of the notifier_chain_register() callers use __init in the definition
of notifier_call. It is incorrect as the function definition should be
available after the initializations (they do not unregister them during
initializations).
This patch fixes all such usages to _not_ have the notifier_call __init
section.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We do this by removing a micro-optimization that tries to avoid grabbing
the iommu_bitmap_lock spinlock and using a bus-locked operation.
This still races with other simultaneous alloc_iommu or free_iommu(size >
1) which both use bus-unlocked operations.
The end result of this race is eventually ending up with an
iommu_gart_bitmap that has bits errornously set all over, making large
contiguous iommu space allocations fail with 'PCI-DMA: Out of IOMMU space'.
Signed-off-by: Mike Waychison <mikew@google.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This quietens warnings and actually fixes a bug. The unwind tables would
come out wrong without -32, causing pthread cancellation during them to
crash in the gcc runtime.
The problem seems to only happen with newer binutils (it doesn't happen
with 2.16.91.0.2 but happens wit 2.16.91.0.5)
Thanks to David Altobelli <david.altobelli@hp.com> and Brian Baker
<Brian.B@hp.com> for test case and initial analysis.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Seems we are trying to init the node_mem_map when we don't need to, for
example when SPARSEMEM is enabled. This causes the error below during
compilation. Use CONFIG_FLAT_NODE_MEM_MAP to gate allocation and init.
arch/x86_64/mm/numa.c: In function `setup_node_zones':
arch/x86_64/mm/numa.c:191: error: structure has no member
named `node_mem_map'
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
AMD K7/K8 CPUs only save/restore the FOP/FIP/FDP x87 registers in FXSAVE
when an exception is pending. This means the value leak through
context switches and allow processes to observe some x87 instruction
state of other processes.
This was actually documented by AMD, but nobody recognized it as
being different from Intel before.
The fix first adds an optimization: instead of unconditionally
calling FNCLEX after each FXSAVE test if ES is pending and skip
it when not needed. Then do a x87 load from a kernel variable to
clear FOP/FIP/FDP.
This means other processes always will only see a constant value
defined by the kernel in their FP state.
I took some pain to make sure to chose a variable that's already
in L1 during context switch to make the overhead of this low.
Also alternative() is used to patch away the new code on CPUs
who don't need it.
Patch for both i386/x86-64.
The problem was discovered originally by Jan Beulich. Richard
Brunner provided the basic code for the workarounds, with contribution
from Jan.
This is CVE-2006-1056
Cc: richard.brunner@amd.com
Cc: jbeulich@novell.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andrew Morton pointed out that compiler might not inline the functions
marked for inline in kprobes. There-by allowing the insertion of probes
on these kprobes routines, which might cause recursion.
This patch removes all such inline and adds them to kprobes section
there by disallowing probes on all such routines. Some of the routines
can even still be inlined, since these routines gets executed after the
kprobes had done necessay setup for reentrancy.
Signed-off-by: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
tee was already there for some reason for native 64bit, but
sys_sync_file_range was missing. Also add it to the compat layer.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o Start booting into the capture kernel after an Oops if system is in a
unrecoverable state. System will boot into the capture kernel, if one is
pre-loaded by the user, and capture the kernel core dump.
o One of the following conditions should be true to trigger the booting of
capture kernel.
- panic_on_oops is set.
- pid of current thread is 0
- pid of current thread is 1
- Oops happened inside interrupt context.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
dmi_scan.c is arch-independent and is used by i386, x86_64, and ia64.
Currently all three arches compile it from arch/i386, which means that ia64
and x86_64 depend on things in arch/i386 that they wouldn't otherwise care
about.
This is simply "mv arch/i386/kernel/dmi_scan.c drivers/firmware/" (removing
trailing whitespace) and the associated Makefile changes. All three
architectures already set CONFIG_DMI in their top-level Kconfig files.
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Andi Kleen <ak@muc.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andrey Panin <pazke@orbita1.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Surprising that it still worked at all with this - yes it was
tested.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nobody should pass NULL here. Could in theory make it a BUG,
but the NULL pointer oops will do as well.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As pointed out by Linus it is useless now because entry.S should
handle it correctly in all cases.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Bugzilla Bug 6299:
A pixel size of 8 bits produces wrong logo colors in x86_64.
The driver has 2 methods for setting the color map, using the protected
mode interface provided by the video BIOS and directly writing to the VGA
registers. The former is not supported in x86_64 and the latter is enabled
only in i386.
Fix by enabling the latter method in x86_64 only if supported by the BIOS.
If both methods are unsupported, change the visual of vesafb to
STATIC_PSEUDOCOLOR.
Signed-off-by: Antonino Daplas <adaplas@pol.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While cleaning up parisc_ksyms.c earlier, I noticed that strpbrk wasn't
being exported from lib/string.c. Investigating further, I noticed a
changeset that removed its export and added it to _ksyms.c on a few more
architectures. The justification was that "other arches do it."
I think this is wrong, since no architecture currently defines
__HAVE_ARCH_STRPBRK, there's no reason for any of them to be exporting it
themselves. Therefore, consolidate the export to lib/string.c.
Signed-off-by: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Current implementations define NODES_SHIFT in include/asm-xxx/numnodes.h for
each arch. Its definition is sometimes configurable. Indeed, ia64 defines 5
NODES_SHIFT values in the current git tree. But it looks a bit messy.
SGI-SN2(ia64) system requires 1024 nodes, and the number of nodes already has
been changeable by config. Suitable node's number may be changed in the
future even if it is other architecture. So, I wrote configurable node's
number.
This patch set defines just default value for each arch which needs multi
nodes except ia64. But, it is easy to change to configurable if necessary.
On ia64 the number of nodes can be already configured in generic ia64 and SN2
config. But, NODES_SHIFT is defined for DIG64 and HP'S machine too. So, I
changed it so that all platforms can be configured via CONFIG_NODES_SHIFT. It
would be simpler.
See also: http://marc.theaimsgroup.com/?l=linux-kernel&m=114358010523896&w=2
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Or rather compute it based on the table length automatically.
This also has the intended side effect of not warning for new system calls
anymore.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix CONFIG_REORDER.
The value of cflags-y was assined to CFLAGS before cflags-y was assigned
the value used for CONFIG_REORDER.
Use cflags-y for all CFLAGS options in the Makefile to avoid this
happening again.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In linux-2.6.16, we have noticed a problem where the gs base value
returned from an arch_prtcl(ARCH_GET_GS, ...) call will be incorrect if:
- the current/calling task has NOT set its own gs base yet to a
non-zero value,
- some other task that ran on the same processor previously set their
own gs base to a non-zero value.
In this situation, the ARCH_GET_GS code will read and return the
MSR_KERNEL_GS_BASE msr register.
However, since the __switch_to() code does NOT load/zero the
MSR_KERNEL_GS_BASE register when the task that is switched IN has a zero
next->gs value, the caller of arch_prctl(ARCH_GET_GS, ...) will get back
the value of some previous tasks's gs base value instead of 0.
Change the arch_prctl() ARCH_GET_GS code to only read and return
the MSR_KERNEL_GS_BASE msr register if the 'gs' register of the calling
task is non-zero.
Side note: Since in addition to using arch_prctl(ARCH_SET_GS, ...),
a task can also setup a gs base value by using modify_ldt() and write
an index value into 'gs' from user space, the patch below reads
'gs' instead of using thread.gs, since in the modify_ldt() case,
the thread.gs value will be 0, and incorrect value would be returned
(the task->thread.gs value).
When the user has not set its own gs base value and the 'gs'
register is zero, then the MSR_KERNEL_GS_BASE register will not be
read and a value of zero will be returned by reading and returning
'task->thread.gs'.
The first patch shown below is an attempt at implementing this
approach.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the HPET timer is enabled, the clock can drift by ~3 seconds a day.
This is due to the HPET timer not being initialized with the correct
setting (still using PIT count).
If HZ changes, this drift can become even more pronounced.
HPET patch initializes tick_nsec with correct tick_nsec settings for
HPET timer.
Vojtech comments:
"It's not entirely correct (it assumes the HPET ticks totally
exactly), but it's significantly better than assuming the PIT error
there."
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mostly to get better handling when a extended config space
access has to fallback to Type1.
Cc: gregkh@suse.de
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Previously only the first bus would be checked against Type 1.
Why 16? Checking all would need too much memory and we
can assume that systems with more than 16 busses have better than
average quality BIOS.
This is an additional defense against bad MCFG tables.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Intel EM64T CPUs handle uncanonical return addresses differently
from AMD CPUs.
The exception is reported in the SYSRET, not the next instruction.
This leads to the kernel exception handler running on the user stack
with the wrong GS because the kernel didn't expect exceptions
on this instruction.
This version of the patch has the teething problems that plagued an earlier
version fixed.
This is CVE-2006-0744
Thanks to Ernie Petrides and Asit B. Mallick for analysis and initial
patches.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Machine checks can stall the machine for a long time and
it's not good to trigger the nmi watchdog during that.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Needed for other checks later in ACPI.
Pointed out by Len Brown
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch introduces a user for the e820_all_mapped function:
There have been several machines that don't have a working MMCONFIG,
often because of a buggy MCFG table in the ACPI bios. This patch adds a
simple sanity check that detects a whole bunch of these cases, and when
it detects it, linux now boots rather than crash-and-burns.
The accuracy of this detection can in principle be improved if there was
a "is this entire range in e820 with THIS attribute", but no such
function exist and the complexity needed for this is not really worth
it; this simple check already catches most cases anyway.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce a e820_all_mapped() function which checks if the entire range
<start,end> is mapped with type.
This is done by moving the local start variable to the end of each
known-good region; if at the end of the function the start address is
still before end, there must be a part that's not of the correct type;
otherwise it's a good region.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Rename e820_mapped to e820_any_mapped since it tests if any part of the
range is mapped according to the type.
Later steps will introduce e820_all_mapped which will check if the
entire range is mapped with the type. Both have their merit.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The node setup code would try to allocate the node metadata in the node
itself, but that fails if there is no memory in there.
This can happen with memory hotplug when the hotplug area defines an so
far empty node.
Now use bootmem to try to allocate the mem_map in other nodes.
And if it fails don't panic, but just ignore the node.
To make this work I added a new __alloc_bootmem_nopanic function that
does what its name implies.
TBD should try to use nearby nodes here. Currently we just use any.
It's hard to do it better because bootmem doesn't have proper fallback
lists yet.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
From: Keith Mannthey, Andi Kleen
Implement memory hotadd without sparsemem. The memory in the SRAT
hotadd area is just preserved instead and can be activated later.
There are a few restrictions:
- Only one continuous hotadd area allowed per node
The main problem is dealing with the many buggy SRAT tables
that are out there. The strategy here is to reject anything
suspicious.
Originally from Keith Mannthey, with several hacks and changes by AK
and also contributions from Andrew Morton
[ TBD: Problems pointed out by KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>:
1) Goto's rebuild_zonelist patch will not work if CONFIG_MEMORY_HOTPLUG=n.
Rebuilding zonelist is necessary when the system has just memory <
4G at boot, and hot add memory > 4G. because x86_64 has DMA32,
ZONE_NORAML is not included into zonelist at boot time if system
doesn't have memory >4G at boot.
[AK: should just force the higher zones at boot time when SRAT tells us]
2) zone and node's spanned_pages and present_pages are not incremented.
They should be.
For example, our server (ia64/Fujitsu PrimeQuest) can equip memory
from 4G to 1T(maybe 2T in future), and SRAT will *always* say we have
possible 1T +memory. (Microsoft requires "write all possible memory
in SRAT") When we reserve memmap for possible 1T memory, Linux will
not work well in +minimum 4G configuraion ;)
[AK: needs limiting to 5-10% of max memory]
]
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Memory hotadd doesn't need SPARSEMEM, but can be handled by just preallocating
mem_maps. This only needs some untangling of ifdefs to enable the necessary
code even without SPARSEMEM.
Originally from Keith Mannthey, hacked by AK.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Just call IRET always, no need for any special cases.
Needed for the next bug fix.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Otherwise, illegal configurations like X86_VOYAGER=y, PCI=y are
possible.
This patch also fixes the options select'ing ACPI to also select PCI.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Len Brown <len.brown@intel.com>
The only user of get_wchan is the proc fs - and proc can't be built modular.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The boot cmdline is parsed in parse_early_param() and
parse_args(,unknown_bootoption).
And __setup() is used in obsolete_checksetup().
start_kernel()
-> parse_args()
-> unknown_bootoption()
-> obsolete_checksetup()
If __setup()'s callback (->setup_func()) returns 1 in
obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
handled.
If ->setup_func() returns 0, obsolete_checksetup() tries other
->setup_func(). If all ->setup_func() that matched a parameter returns 0,
a parameter is seted to argv_init[].
Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
If the app doesn't ignore those arguments, it will warning and exit.
This patch fixes a wrong usage of it, however fixes obvious one only.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>