Currently the IOMMU table's lock protects both the bitmap and access
to the hardware's TCE table. Access to the TCE table is synchronized
through the bitmap; therefore, only hold the lock while modifying the
bitmap. This gives a yummy 10-15% reduction in CPU utilization for
netperf on a large SMP machine.
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No actual code was harmed in the production of this patch.
Thanks to Andrew Morton <akpm@linux-foundation.org> for telling me
about checkpatch.pl.
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup unneeded macros used for register space address calculation.
Now we are using the EBDA to find the space address.
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This works around a bug where DMAs that have the same addresses as
some MEM regions do not go through. Not clear yet if this is due to a
mis-configuration or something deeper.
[akpm@linux-foundation.org: coding style fixlet]
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide seperate versions for Calgary and CalIOC2
Also print out the PCIe Root Complex Status on CalIOC2 errors
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CalIOC2 is a PCI-e implementation of the Calgary logic. Most of the
programming details are the same, but some differ, e.g., TCE cache
flush. This patch introduces CalIOC2 support - detection and various
support routines. It's not expected to work yet (but will with
follow-on patches).
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... in preparation for doing it differently for CalIOC2.
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Calgary and CalIOC2 share most of the same logic. Introduce struct
cal_chipset_ops for quirks and tce flush logic which are
[akpm@linux-foundation.org: make calgary_chip_ops static]
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... will be used by CalIOC2 later
Signed-off-by: Muli Ben-Yehuda <muli@il.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The Rise CPUs were only very short-lived, and there are no reports of
anyone both owning one and running Linux on it.
Googling for the printk string "CPU: Rise iDragon" didn't find any dmesg
available online.
If it turns out that against all expectations there are actually users
reverting this patch would be easy.
This patch will make the kernel images smaller by a few bytes for all
i386 users.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On x86_64 kernel, level triggered irq migration gets initiated in the
context of that interrupt(after executing the irq handler) and following
steps are followed to do the irq migration.
1. mask IOAPIC RTE entry; // write to IOAPIC RTE
2. EOI; // processor EOI write
3. reprogram IOAPIC RTE entry // write to IOAPIC RTE with new destination and
// and interrupt vector due to per cpu vector
// allocation.
4. unmask IOAPIC RTE entry; // write to IOAPIC RTE
Because of the per cpu vector allocation in x86_64 kernels, when the irq
migrates to a different cpu, new vector(corresponding to the new cpu) will
get allocated.
An EOI write to local APIC has a side effect of generating an EOI write for
level trigger interrupts (normally this is a broadcast to all IOAPICs).
The EOI broadcast generated as a side effect of EOI write to processor may
be delayed while the other IOAPIC writes (step 3 and 4) can go through.
Normally, the EOI generated by local APIC for level trigger interrupt
contains vector number. The IOAPIC will take this vector number and search
the IOAPIC RTE entries for an entry with matching vector number and clear
the remote IRR bit (indicate EOI). However, if the vector number is
changed (as in step 3) the IOAPIC will not find the RTE entry when the EOI
is received later. This will cause the remote IRR to get stuck causing the
interrupt hang (no more interrupt from this RTE).
Current x86_64 kernel assumes that remote IRR bit is cleared by the time
IOAPIC RTE is reprogrammed. Fix this assumption by checking for remote IRR
bit and if it still set, delay the irq migration to the next interrupt
arrival event(hopefully, next time remote IRR bit will get cleared before
the IOAPIC RTE is reprogrammed).
Initial analysis and patch from Nanhai.
Clean up patch from Suresh.
Rewritten to be less intrusive, and to contain a big fat comment by Eric.
[akpm@linux-foundation.org: fix comments]
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Nanhai Zou <nanhai.zou@intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Keith Packard <keith.packard@intel.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This helps to reduce the frequency at which the CPU must be taken out of a
lower-power state.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Acked-by: Tim Hockin <thockin@hockin.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch (as921) adds code to the show_regs() routine in i386 and x86_64
to print the contents of the debug registers along with all the others.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following section mismatch warnings were reported by Andrey Borzenkov:
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:amd_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x967a) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:cyrix_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x967f) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text:centaur_init_mtrr from .text between 'mtrr_bp_init' (at offset 0x9684) and 'mtrr_attrib_to_str'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa735) and 'generic_get_mtrr'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa749) and 'generic_get_mtrr'
WARNING: arch/i386/kernel/built-in.o - Section mismatch: reference to .init.text: from .text between 'get_mtrr_state' (at offset 0xa770) and 'generic_get_mtrr'
It was tracked down to a few functions missing __init tag.
Compile tested only.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 59f4e7d572 fixed machine rebooting
on Truxton's machine (when no keyboard was present). But it broke it on
Lee's machine.
The patch reinstates the old (pre-59f4e7d572980a521b7bdba74ab71b21f5995538)
code and if that doesn't work out, try the new,
post-59f4e7d572980a521b7bdba74ab71b21f5995538 code instead.
Cc: Lee Garrett <lee-in-berlin@web.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Background:
The MCE handler has several paths that it can take, depending on various
conditions of the MCE status and the value of the 'tolerant' knob. The
exact semantics are not well defined and the code is a bit twisty.
Description:
This patch makes the MCE handler's behavior more clear by documenting the
behavior for various 'tolerant' levels. It also fixes or enhances
several small things in the handler. Specifically:
* If RIPV is set it is not safe to restart, so set the 'no way out'
flag rather than the 'kill it' flag.
* Don't panic() on correctable MCEs.
* If the _OVER bit is set *and* the _UC bit is set (meaning possibly
dropped uncorrected errors), set the 'no way out' flag.
* Use EIPV for testing whether an app can be killed (SIGBUS) rather
than RIPV. According to docs, EIPV indicates that the error is
related to the IP, while RIPV simply means the IP is valid to
restart from.
* Don't clear the MCi_STATUS registers until after the panic() path.
This leaves the status bits set after the panic() so clever BIOSes
can find them (and dumb BIOSes can do nothing).
This patch also calls nonseekable_open() in mce_open (as suggested by akpm).
Result:
Tolerant levels behave almost identically to how they always have, but
not it's well defined. There's a slightly higher chance of panic()ing
when multiple errors happen (a good thing, IMHO). If you take an MBE and
panic(), the error status bits are not cleared.
Alternatives:
None.
Testing:
I used software to inject correctable and uncorrectable errors. With
tolerant = 3, the system usually survives. With tolerant = 2, the system
usually panic()s (PCC) but not always. With tolerant = 1, the system
always panic()s. When the system panic()s, the BIOS is able to detect
that the cause of death was an MC4. I was not able to reproduce the
case of a non-PCC error in userspace, with EIPV, with (tolerant < 3).
That will be rare at best.
Signed-off-by: Tim Hockin <thockin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Background:
/dev/mcelog is typically polled manually. This is less than optimal for
situations where accurate accounting of MCEs is important. Calling
poll() on /dev/mcelog does not work.
Description:
This patch adds support for poll() to /dev/mcelog. This results in
immediate wakeup of user apps whenever the poller finds MCEs. Because
the exception handler can not take any locks, it can not call the wakeup
itself. Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
caught at the next return from interrupt or exit from idle, calling the
mce_user_notify() routine. This patch also disables the "fake panic"
path of the mce_panic(), because it results in printk()s in the exception
handler and crashy systems.
This patch also does some small cleanup for essentially unused variables,
and moves the user notification into the body of the poller, so it is
only called once per poll, rather than once per CPU.
Result:
Applications can now poll() on /dev/mcelog. When an error is logged
(whether through the poller or through an exception) the applications are
woken up promptly. This should not affect any previous behaviors. If no
MCEs are being logged, there is no overhead.
Alternatives:
I considered simply supporting poll() through the poller and not using
TIF_MCE_NOTIFY at all. However, the time between an uncorrectable error
happening and the user application being notified is *the*most* critical
window for us. Many uncorrectable errors can be logged to the network if
given a chance.
I also considered doing the MCE poll directly from the idle notifier, but
decided that was overkill.
Testing:
I used an error-injecting DIMM to create lots of correctable DRAM errors
and verified that my user app is woken up in sync with the polling interval.
I also used the northbridge to inject uncorrectable ECC errors, and
verified (printk() to the rescue) that the notify routine is called and the
user app does wake up. I built with PREEMPT on and off, and verified
that my machine survives MCEs.
[wli@holomorphy.com: build fix]
Signed-off-by: Tim Hockin <thockin@google.com>
Signed-off-by: William Irwin <bill.irwin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Background:
/dev/mcelog is a clear-on-read interface. It is currently possible for
multiple users to open and read() the device. Users are protected from
each other during any one read, but not across reads.
Description:
This patch adds support for O_EXCL to /dev/mcelog. If a user opens the
device with O_EXCL, no other user may open the device (EBUSY). Likewise,
any user that tries to open the device with O_EXCL while another user has
the device will fail (EBUSY).
Result:
Applications can get exclusive access to /dev/mcelog. Applications that
do not care will be unchanged.
Alternatives:
A simpler choice would be to only allow one open() at all, regardless of
O_EXCL.
Testing:
I wrote an application that opens /dev/mcelog with O_EXCL and observed
that any other app that tried to open /dev/mcelog would fail until the
exclusive app had closed the device.
Caveats:
None.
Signed-off-by: Tim Hockin <thockin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Insert the unclaimed MMCONFIG resources into the resource tree without the
IORESOURCE_BUSY flag during late initialization. This allows the MMCONFIG
regions to be visible in the iomem resource tree without interfering with
other system resources that were discovered during PCI initialization.
[akpm@linux-foundation.org: nanofixes]
Signed-off-by: Aaron Durbin <adurbin@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When we are in the emulated NUMA case, we need to make sure that all existing
apicid_to_node mappings that point to real node ID's now point to the
equivalent fake node ID's.
If we simply iterate over all apicid_to_node[] members for each node, we risk
remapping an entry if it shares a node ID with a real node. Since apicid's
may not be consecutive, we're forced to create an automatic array of
apicid_to_node mappings and then copy it over once we have finished remapping
fake to real nodes.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For NUMA emulation, our SLIT should represent the true NUMA topology of the
system but our proximity domain to node ID mapping needs to reflect the
emulated state.
When NUMA emulation has successfully setup fake nodes on the system, a new
function, acpi_fake_nodes() is called. This function determines the proximity
domain (_PXM) for each true node found on the system. It then finds which
emulated nodes have been allocated on this true node as determined by its
starting address. The node ID to PXM mapping is changed so that each fake
node ID points to the PXM of the true node that it is located on.
If the machine failed to register a SLIT, then we assume there is no special
requirement for emulated node affinity so we use the default LOCAL_DISTANCE,
which is newly exported to this code, as our measurement if the emulated nodes
appear in the same PXM. Otherwise, we use REMOTE_DISTANCE.
PXM_INVAL and NID_INVAL are also exported to the ACPI header file so that we
can compare node_to_pxm() results in generic code (in this case, the SRAT
code).
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The logic in e820_find_active_regions() for determining the true active
regions for an e820 entry given a range of PFN's is needed for
e820_hole_size() as well.
e820_hole_size() is called from the NUMA emulation code to determine the
reserved area within an address range on a per-node basis. Its logic should
duplicate that of finding active regions in an e820 entry because these are
the only true ranges we may register anyway.
[akpm@linux-foundation.org: cleanup]
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds caching of pgds and puds, pmds, pte. That way we can avoid costly
zeroing and initialization of special mappings in the pgd.
A second quicklist is useful to separate out PGD handling. We can carry the
initialized pgds over to the next process needing them.
Also clean up the pgd_list handling to use regular list macros. There is no
need anymore to avoid the lru field.
Move the add/removal of the pgds to the pgdlist into the constructor /
destructor. That way the implementation is congruent with i386.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Acked-by: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Every file should include the headers containing the prototypes for its
global functions.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Constrain __supported_pte_mask and NX handling to just the PAE kernel.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hence remove its handling in the opposite case.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
.. and adjust documentation to properly reflect options that are
x86-64 specific.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Consolidate the three 32-bit system call entry points so that they all
treat registers in similar ways.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Too many remote cpu references due to /proc/stat.
On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
On every call to kstat_irqs, the process brings in per-cpu data from all
online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
/proc/stat is parsed by common commands like top, who etc, causing lots
of cacheline transfers
This statistic seems useless. Other 'big iron' arches disable this.
AK: changed to remove for all SMP setups
AK: add comment
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove unused code and variables and do some codingstyle / whitespace
cleanups while at it.
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xtime can be initialized including the cmos update from the generic
timekeeping code. Remove the arch specific implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>