Add (set|get)_dr callbacks to x86_emulate_ops instead of calling
them directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
ljmp/lcall instruction operand contains address and segment.
It can be 10 bytes long. Currently we decode it as two different
operands. Fix it by introducing new kind of operand that can hold
entire far address.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Dst operand is already initialized during decoding stage. No need to
reinitialize.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This instruction does not need generic decoding for its dst operand.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce read cache which is needed for instruction that require more
then one exit to userspace. After returning from userspace the instruction
will be re-executed with cached read value.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
cr0.ts may change between entries, so we copy cr0 to HOST_CR0 before each
entry. That is slow, so instead, set HOST_CR0 to have TS set unconditionally
(which is a safe value), and issue a clts() just before exiting vcpu context
if the task indeed owns the fpu.
Saves ~50 cycles/exit.
Signed-off-by: Avi Kivity <avi@redhat.com>
Although we always allocate a new dirty bitmap in x86's get_dirty_log(),
it is only used as a zero-source of copy_to_user() and freed right after
that when memslot is clean. This patch uses clear_user() instead of doing
this unnecessary zero-source allocation.
Performance improvement: as we can expect easily, the time needed to
allocate a bitmap is completely reduced. In my test, the improved ioctl
was about 4 to 10 times faster than the original one for clean slots.
Furthermore, reducing memory allocations and copies will produce good
effects to caches too.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
In common cases, guest SRAO MCE will cause corresponding poisoned page
be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
the MCE to guest OS.
But it is reported that if the poisoned page is accessed in guest
after unmapping and before MCE is relayed to guest OS, userspace will
be killed.
The reason is as follows. Because poisoned page has been un-mapped,
guest access will cause guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault can not get the poisoned page for fault
address, so kernel and user space MMIO processing is tried in turn. In
user MMIO processing, poisoned page is accessed again, then userspace
is killed by force_sig_info.
To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
and do not try kernel and user space MMIO processing for poisoned
page.
[xiao: fix warning introduced by avi]
Reported-by: Max Asbock <masbock@linux.vnet.ibm.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The arguments passed to OFW shouldn't be modified; update the 'args'
argument of olpc_ofw to reflect this. This saves us some later
casting away of consts.
Signed-off-by: Andres Salomon <dilinger@queued.net>
LKML-Reference: <20100628220029.1555ac24@debian>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Unconditionally printing EC debug messages was helpful when we were actually
debugging the EC, but during normal operation it can get pretty annoying.
Using pr_debug allows us finer-grained control.
Signed-off-by: Andres Salomon <dilinger@queued.net>
LKML-Reference: <20100616231928.16b539f0@dev.queued.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add package level thermal and power limit feature support.
The two MSRs and features are new starting with Intel's Sandy Bridge processor.
Please check Intel 64 and IA-32 Architectures SDMV Vol 3A 14.5.6 Power Limit
Notification and 14.6 Package Level Thermal Management.
This patch also fixes a bug which defines reverse THERM_INT_LOW_ENABLE bit and
THERM_INT_HIGH_ENABLE bit.
[ hpa: fixed up against current tip:x86/cpu ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
LKML-Reference: <1280448826-12004-2-git-send-email-fenghua.yu@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Use the stop machine context rather than IPI's to rendezvous all the cpus for
MTRR initialization that happens during cpu bringup or for MTRR modifications
during runtime.
This avoids deadlock scenario (reported by Prarit) like:
cpu A holds a read_lock (tasklist_lock for example) with irqs enabled
cpu B waits for the same lock with irqs disabled using write_lock_irq
cpu C doing set_mtrr() (during AP bringup for example), which will try to
rendezvous all the cpus using IPI's
This will result in C and A come to the rendezvous point and waiting
for B. B is stuck forever waiting for the lock and thus not
reaching the rendezvous point.
Using stop cpu (run in the process context of per cpu based keventd) to do
this rendezvous, avoids this deadlock scenario.
Also make sure all the cpu's are in the rendezvous handler before we proceed
with the local_irq_save() on each cpu. This lock step disabling irqs on all
the cpus will avoid other deadlock scenarios (for example involving
with the blocking smp_call_function's etc).
[ This problem is very old. Marking -stable only for 2.6.35 as the
stop_one_cpu_nowait() API is present only in 2.6.35. Any older
kernel interested in this fix need to do some more work in backporting
this patch. ]
Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <1280515602.2682.10.camel@sbsiddha-MOBL3.sc.intel.com>
Acked-by: Prarit Bhargava <prarit@redhat.com>
Cc: stable@kernel.org [2.6.35]
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
commit 2ca1af9aa3285c6a5f103ed31ad09f7399fc65d7 "PCI: MSI: Remove
unsafe and unnecessary hardware access" changed read_msi_msg_desc() to
return the last MSI message written instead of reading it from the
device, since it may be called while the device is in a reduced
power state.
However, the pSeries platform code really does need to read messages
from the device, since they are initially written by firmware.
Therefore:
- Restore the previous behaviour of read_msi_msg_desc()
- Add new functions get_cached_msi_msg{,_desc}() which return the
last MSI message written
- Use the new functions where appropriate
Acked-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This DMI quirk turns on "pci=use_crs" for the ALiveSATA2-GLAN because
amd_bus.c doesn't handle this system correctly.
The system has a single HyperTransport I/O chain, but has two PCI host
bridges to buses 00 and 80. amd_bus.c learns the MMIO range associated
with buses 00-ff and that this range is routed to the HT chain hosted at
node 0, link 0:
bus: [00, ff] on node 0 link 0
bus: 00 index 1 [mem 0x80000000-0xfcffffffff]
This includes the address space for both bus 00 and bus 80, and amd_bus.c
assumes it's all routed to bus 00.
We find device 80:01.0, which BIOS left in the middle of that space, but
we don't find a bridge from bus 00 to bus 80, so we conclude that 80:01.0
is unreachable from bus 00, and we move it from the original, working,
address to something outside the bus 00 aperture, which does not work:
pci 0000:80:01.0: reg 10: [mem 0xfebfc000-0xfebfffff 64bit]
pci 0000:80:01.0: BAR 0: assigned [mem 0xfd00000000-0xfd00003fff 64bit]
The BIOS told us everything we need to know to handle this correctly,
so we're better off if we just pay attention, which lets us leave the
80:01.0 device at the original, working, address:
ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-7f])
pci_root PNP0A03:00: host bridge window [mem 0x80000000-0xff37ffff]
ACPI: PCI Root Bridge [PCI1] (domain 0000 [bus 80-ff])
pci_root PNP0A08:00: host bridge window [mem 0xfebfc000-0xfebfffff]
This was a regression between 2.6.33 and 2.6.34. In 2.6.33, amd_bus.c
was used only when we found multiple HT chains. 3e3da00c01, which
enabled amd_bus.c even on systems with a single HT chain, caused this
failure.
This quirk was written by Graham. If we ever enable "pci=use_crs" for
machines from 2006 or earlir, this quirk should be removed.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=16007
Cc: stable@kernel.org
Reported-by: Graham Ramsey <ramsey.graham@ntlworld.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The Linux kernel assigns BARs that a BIOS did not assign, most likely
to handle broken BIOSes that didn't enumerate the devices correctly.
On UV the BIOS purposely doesn't assign I/O BARs for certain devices/
drivers we know don't use them (examples, LSI SAS, Qlogic FC, ...).
We purposely don't assign these I/O BARs because I/O Space is a very
limited resource. There is only 64k of I/O Space, and in a PCIe
topology that space gets divided up into 4k chucks (this is due to
the fact that a pci-to-pci bridge's I/O decoder is aligned at 4k)...
Thus a system can have at most 16 cards with I/O BARs: (64k / 4k = 16)
SGI needs to scale to >16 devices with I/O BARs. So by not assigning
I/O BARs on devices we know don't use them, we can do that (iff the
kernel doesn't go and assign these BARs that the BIOS purposely didn't
assign).
This patch will not assign a resource to a device BAR if that BAR was
not assigned by the BIOS, and the kernel cmdline option 'pci=nobar'
was specified. This patch is closely modeled after the 'pci=norom'
option that currently exists in the tree.
Signed-off-by: Mike Habeck <habeck@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
pcibios_scan_specific_bus calls pci_scan_bus_on_node which is
__devinit. Mark pcibios_scan_specific_bus __devinit as well since
all users are now __init or __devinit.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This patch introduce a CONFIG_XEN_PVHVM compile time option to
enable/disable Xen PV on HVM support.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
The following two commits fixed a problem that x86 ioremap() doesn't handle
physical address higher than 32-bit properly in X86_32 PAE mode.
ffa71f33a8 (x86, ioremap: Fix incorrect
physical address handling in PAE mode)
35be1b716a (x86, ioremap: Fix normal
ram range check)
But these fixes are not enough, since pat_pagerange_is_ram() in PAT code
also has a same problem. This patch fixes it.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
LKML-Reference: <4C47DDCF.80300@jp.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
HW breakpoints events stopped working correctly with kgdb
as a result of commit: 018cbffe68
(Merge commit 'v2.6.33' into perf/core).
The regression occurred because the behavior changed for setting
NOTIFY_STOP as the return value to the die notifier if the breakpoint
was known to the HW breakpoint API. Because kgdb is using the HW
breakpoint API to register HW breakpoints slots, it must also now
implement the overflow_handler call back else kgdb does not get to see
the events from the die notifier.
The kgdb_ll_trap function will be changed to be general purpose code
which can allow an easy way to implement the hw_breakpoint API
overflow call back.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Dongdong Deng <dongdong.deng@windriver.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
We have two functions for doing exactly the same thing -- emulating
cmpxchg8b on 486 and older hardware -- with different calling
conventions, and yet doing the same thing. Drop the C version and use
the assembly version, via alternatives, for both the local and
non-local versions of cmpxchg8b.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <AANLkTikAmaDPji-TVDarmG1yD=fwbffcsmEU=YEuP+8r@mail.gmail.com>
Move cmpxchg emulation code from arch/x86/kernel/cpu (which is
otherwise CPU identification) to arch/x86/lib, where other emulation
code lives already.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <AANLkTikAmaDPji-TVDarmG1yD=fwbffcsmEU=YEuP+8r@mail.gmail.com>
Exprot the AMD errata definitions, since they are needed by kvm_amd.ko
if that is built as a module. Doing "make allmodconfig" during
testing would have caught this.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Hans Rosenfeld <hans.rosenfeld@amd.com>
LKML-Reference: <1280336972-865982-1-git-send-email-hans.rosenfeld@amd.com>
Remove the __xg() hack to create a memory barrier near xchg and
cmpxchg; it has been there since 1.3.11 but should not be necessary
with "asm volatile" and a "memory" clobber, neither of which were
there in the original implementation.
However, we *should* make this a volatile reference.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <AANLkTikAmaDPji-TVDarmG1yD=fwbffcsmEU=YEuP+8r@mail.gmail.com>
Use the AMD errata checking framework instead of open-coding the test.
Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
LKML-Reference: <1280336972-865982-3-git-send-email-hans.rosenfeld@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Remove check_c1e_idle() and use the new AMD errata checking framework
instead.
Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
LKML-Reference: <1280336972-865982-2-git-send-email-hans.rosenfeld@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Errata are defined using the AMD_LEGACY_ERRATUM() or AMD_OSVW_ERRATUM()
macros. The latter is intended for newer errata that have an OSVW id
assigned, which it takes as first argument. Both take a variable number
of family-specific model-stepping ranges created by AMD_MODEL_RANGE().
Iff an erratum has an OSVW id, OSVW is available on the CPU, and the
OSVW id is known to the hardware, it is used to determine whether an
erratum is present. Otherwise, the model-stepping ranges are matched
against the current CPU to find out whether the erratum applies.
For certain special errata, the code using this framework might have to
conduct further checks to make sure an erratum is really (not) present.
Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
LKML-Reference: <1280336972-865982-1-git-send-email-hans.rosenfeld@amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch simply declares the new sys_fanotify_mark syscall
int fanotify_mark(int fanotify_fd, unsigned int flags, u64_mask,
int dfd const char *pathname)
Signed-off-by: Eric Paris <eparis@redhat.com>
This patch defines a new syscall fanotify_init() of the form:
int sys_fanotify_init(unsigned int flags, unsigned int event_f_flags,
unsigned int priority)
This syscall is used to create and fanotify group. This is very similar to
the inotify_init() syscall.
Signed-off-by: Eric Paris <eparis@redhat.com>
Don't quote $nm in the script for checking the vdso for external
references. Doing so breaks multiword constructs, like using
CROSS_COMPILE='ccache '.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <20100728134252.2e4c27cf.sfr@canb.auug.org.au>
Clean up and simplify set_64bit(). This code is quite old (1.3.11)
and contains a fair bit of auxilliary machinery that current versions
of gcc handle just fine automatically. Worse, the auxilliary
machinery can actually cause an unnecessary spill to memory.
Furthermore, the loading of the old value inside the loop in the
32-bit case is unnecessary: if the value doesn't match, the CMPXCHG8B
instruction will already have loaded the "new previous" value for us.
Clean up the comment, too, and remove page references to obsolete
versions of the Intel SDM.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <tip-*@vger.kernel.org>
gcc 4.4.4 will complain if you use a .discard section for both text and
data ("causes a section type conflict"). Add support for ".discard.*"
sections, and use .discard.text for a dummy function in the x86
RESERVE_BRK() macro.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
xchg() and cmpxchg() modify their memory operands, not merely read
them. For some versions of gcc the "memory" clobber has apparently
dealt with the situation, but not for all.
Originally-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Glauber Costa <glommer@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Peter Palfrader <peter@palfrader.org>
Cc: Greg KH <gregkh@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Zachary Amsden <zamsden@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: <stable@kernel.org>
LKML-Reference: <4C4F7277.8050306@zytor.com>
functions.
We add the glue code that sets up a dma_ops structure with the
xen_swiotlb_* functions. The code turns on xen_swiotlb flag
when it detects it is running under Xen and it is either
in privileged mode or the iommu=soft flag was passed in.
It also disables the bare-metal SWIOTLB if the Xen-SWIOTLB has
been enabled.
Note: The Xen-SWIOTLB is only built when CONFIG_XEN is enabled.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Albert Herranz <albert_herranz@yahoo.es>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Rather than trying to deal with aliases once they appear, just completely
inhibit them. Mostly the removal of aliases was managable, but it comes
unstuck in xen_create_contiguous_region() because it gets executed at
interrupt time (as a result of dma_alloc_coherent()), which causes all
sorts of confusion in the vmap code, as it was never intended to be run
in interrupt context.
This has the unfortunate side effect of removing all the unmap batching
the vmap code so carefully added, but that can't be helped.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch exports the capability of the AMD IOMMU to force
cache coherency of DMA transactions through the IOMMU-API.
This is required to disable some nasty hacks in KVM when
this capability is not available.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This converts the most common of the x86 clocksources over to use
clocksource_register_hz/khz.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-11-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
update_vsyscall() did not provide the wall_to_monotoinc offset,
so arch specific implementations tend to reference wall_to_monotonic
directly. This limits future cleanups in the timekeeping core, so
this patch fixes the update_vsyscall interface to provide
wall_to_monotonic, allowing wall_to_monotonic to be made static
as planned in Documentation/feature-removal-schedule.txt
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Tony Luck <tony.luck@intel.com>
LKML-Reference: <1279068988-21864-7-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now that all arches have been converted over to use generic time via
clocksources or arch_gettimeoffset(), we can remove the GENERIC_TIME
config option and simplify the generic code.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
LKML-Reference: <1279068988-21864-4-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Due to vtime calling vgettimeofday(), its possible that an application
could call time();create("stuff",O_RDRW); only to see the file's
creation timestamp to be before the value returned by time.
A similar way to reproduce the issue is to compare the vsyscall time()
with the syscall time(), and observe ordering issues.
The modified test case from Oleg Nesterov below can illustrate this:
int main(void)
{
time_t sec1,sec2;
do {
sec1 = time(&sec2);
sec2 = syscall(__NR_time, NULL);
} while (sec1 <= sec2);
printf("vtime: %d.000000\n", sec1);
printf("time: %d.000000\n", sec2);
return 0;
}
The proper fix is to make vtime use the same time value as
current_kernel_time() (which is exported via update_vsyscall) instead of
vgettime().
Thanks to Jiri Olsa for bringing up the issue and catching bugs in
earlier verisons of this fix.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
LKML-Reference: <1279068988-21864-2-git-send-email-johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When a pagetable is about to be destroyed, we notify Xen so that the
hypervisor can clear the related shadow pagetable.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Add a xen_emul_unplug command line option to the kernel to unplug
xen emulated disks and nics.
Set the default value of xen_emul_unplug depending on whether or
not the Xen PV frontends and the Xen platform PCI driver have
been compiled for this kernel (modules or built-in are both OK).
The user can specify xen_emul_unplug=ignore to enable PV drivers on HVM
even if the host platform doesn't support unplug.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Use xen_vcpuop_clockevent instead of hpet and APIC timers as main
clockevent device on all vcpus, use the xen wallclock time as wallclock
instead of rtc and use xen_clocksource as clocksource.
The pv clock algorithm needs to work correctly for the xen_clocksource
and xen wallclock to be usable, only modern Xen versions offer a
reliable pv clock in HVM guests (XENFEAT_hvm_safe_pvclock).
Using the hpet as clocksource means a VMEXIT every time we read/write to
the hpet mmio addresses, pvclock give us a better rating without
VMEXITs. Same goes for the xen wallclock and xen_vcpuop_clockevent
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Don Dutile <ddutile@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Do not try to disable hpet if it hasn't been initialized before
x86, i8259: Only register sysdev if we have a real 8259 PIC
The Pstate transition latency check was added for broken F10h BIOSen
which wrongly contain a value of 0 for transition and bus master
latency. Fam11h and later, however, (will) have similar transition
latency so extend that behavior for them too.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
The PCC cpufreq driver unmaps the mailbox address range if any CPUs fail to
initialise, but doesn't do anything to remove the registered CPUs from the
cpufreq core resulting in failures further down the line. We're better off
simply returning a failure - the cpufreq core will unregister us cleanly if
we end up with no successfully registered CPUs. Tidy up the failure path
and also add a sanity check to ensure that the firmware gives us a realistic
frequency - the core deals badly with that being set to 0.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Signed-off-by: Dave Jones <davej@redhat.com>
The pcc specification documents an _OSC method that's incompatible with the
one defined as part of the ACPI spec. This shouldn't be a problem as both
are supposed to be guarded with a UUID. Unfortunately approximately nobody
(including HP, who wrote this spec) properly check the UUID on entry to the
_OSC call. Right now this could result in surprising behaviour if the pcc
driver performs an _OSC call on a machine that doesn't implement the pcc
specification. Check whether the PCCH method exists first in order to reduce
this probability.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Signed-off-by: Dave Jones <davej@redhat.com>
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Use kmalloc() instead of vmalloc() for KVM_[GS]ET_MSR
KVM: MMU: fix conflict access permissions in direct sp
Commit 2a6b69765a
(ACPI: Store NVS state even when entering suspend to RAM) caused the
ACPI suspend code save the NVS area during suspend and restore it
during resume unconditionally, although it is known that some systems
need to use acpi_sleep=s4_nonvs for hibernation to work. To allow
the affected systems to avoid saving and restoring the NVS area
during suspend to RAM and resume, introduce kernel command line
option acpi_sleep=nonvs and make acpi_sleep=s4_nonvs work as its
alias temporarily (add acpi_sleep=s4_nonvs to the feature removal
file).
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16396 .
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-and-tested-by: tomas m <tmezzadra@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
hpet_disable is called unconditionally on machine reboot if hpet support
is compiled in the kernel.
hpet_disable only checks if the machine is hpet capable but doesn't make
sure that hpet has been initialized.
[ tglx: Made it a one liner and removed the redundant hpet_address check ]
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Venkatesh Pallipadi <venki@google.com>
LKML-Reference: <alpine.DEB.2.00.1007211726240.22235@kaball-desktop>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In no-direct mapping, we mark sp is 'direct' when we mapping the
guest's larger page, but its access is encoded form upper page-struct
entire not include the last mapping, it will cause access conflict.
For example, have this mapping:
[W]
/ PDE1 -> |---|
P[W] | | LPA
\ PDE2 -> |---|
[R]
P have two children, PDE1 and PDE2, both PDE1 and PDE2 mapping the
same lage page(LPA). The P's access is WR, PDE1's access is WR,
PDE2's access is RO(just consider read-write permissions here)
When guest access PDE1, we will create a direct sp for LPA, the sp's
access is from P, is W, then we will mark the ptes is W in this sp.
Then, guest access PDE2, we will find LPA's shadow page, is the same as
PDE's, and mark the ptes is RO.
So, if guest access PDE1, the incorrect #PF is occured.
Fixed by encode the last mapping access into direct shadow page
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Suspend/resume requires few different things on HVM: the suspend
hypercall is different; we don't need to save/restore memory related
settings; except the shared info page and the callback mechanism.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Set the callback to receive evtchns from Xen, using the
callback vector delivery mechanism.
The traditional way for receiving event channel notifications from Xen
is via the interrupts from the platform PCI device.
The callback vector is a newer alternative that allow us to receive
notifications on any vcpu and doesn't need any PCI support: we allocate
a vector exclusively to receive events, in the vector handler we don't
need to interact with the vlapic, therefore we avoid a VMEXIT.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Initialize basic pv on hvm features adding a new Xen HVM specific
hypervisor_x86 structure.
Don't try to initialize xen-kbdfront and xen-fbfront when running on HVM
because the backends are not available.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
It turns out that there is a bit in the _CST for Intel FFH C3
that tells the OS if we should be checking BM_STS or not.
Linux has been unconditionally checking BM_STS.
If the chip-set is configured to enable BM_STS,
it can retard or completely prevent entry into
deep C-states -- as illustrated by turbostat:
http://userweb.kernel.org/~lenb/acpi/utils/pmtools/turbostat/
ref: Intel Processor Vendor-Specific ACPI Interface Specification
table 4 "_CST FFH GAS Field Encoding"
Bit 1: Set to 1 if OSPM should use Bus Master avoidance for this C-state
https://bugzilla.kernel.org/show_bug.cgi?id=15886
Signed-off-by: Len Brown <len.brown@intel.com>
and fix the broken case if a core's frequency depends on others.
trace_power_frequency was only implemented in a rather ungeneric
way in acpi-cpufreq driver's target() function only.
-> Move the call to trace_power_frequency to
cpufreq.c:cpufreq_notify_transition() where CPUFREQ_POSTCHANGE
notifier is triggered.
This will support power frequency tracing by all cpufreq
drivers.
trace_power_frequency did not trace frequency changes correctly
when the userspace governor was used or when CPU cores'
frequency depend on each other.
-> Moving this into the CPUFREQ_POSTCHANGE notifier and pass the cpu
which gets switched automatically fixes this.
Robert Schoene provided some important fixes on top of my
initial quick shot version which are integrated in this patch:
- Forgot some changes in power_end trace (TP_printk/variable names)
- Variable dummy in power_end must now be cpu_id
- Use static 64 bit variable instead of unsigned int for cpu_id
[akpm@linux-foundation.org: build fix]
Signed-off-by: Thomas Renninger <trenn@suse.de>
Cc: davej@codemonkey.org.uk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Arjan van de Ven <arjan@infradead.org>
Cc: Robert Schoene <robert.schoene@tu-dresden.de>
Tested-by: Robert Schoene <robert.schoene@tu-dresden.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Load initial_gs as two 32-bit values instead of splitting a 64-bit value.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <1279371808-24804-3-git-send-email-brgerst@gmail.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Use symbolic MSR names instead of hardcoding the MSR index.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <1279371808-24804-2-git-send-email-brgerst@gmail.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
MSR_K6_EFER is unused, and MSR_K6_STAR is redundant with MSR_STAR.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
LKML-Reference: <1279371808-24804-1-git-send-email-brgerst@gmail.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
In the CONFIG_AUDITSYSCALL fast-path for x86 64-bit system calls,
we can pass a bad return value and/or error indication for the
system call to audit_syscall_exit(). This happens when
TIF_NEED_RESCHED was set as the system call returned, so we went
out to schedule() and came back to the exit-audit fast-path. The
fix is to reload the user return value register from the pt_regs
before using it for audit_syscall_exit().
Both the 32-bit kernel's fast path and the 64-bit kernel's 32-bit
system call fast paths work slightly differently, so that they
always leave the fast path entirely to reschedule and don't return
there, so they don't have the analogous bugs.
Reported-by: Alexander Viro <aviro@redhat.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
xstate_enable_boot_cpu() is, as the name implies, only used on the
boot CPU; furthermore, it invokes alloc_bootmem(), which is __init;
hence it needs to be tagged __init rather than __cpuinit.
Furthermore, it is *not* safe in the long run to rely on CPU 0 only
coming online during the early boot -- at some point we're going to
support offlining (and re-onlining) the boot CPU, and at that point we
must not call xstate_enable_boot_cpu() again.
The code is a fair bit more obscure than one would like, because the
__ref overrides aren't quite powerful enough.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Robert Richter <robert.richter@amd.com>
LKML-Reference: <4C476236.1020302@zytor.com>
This is called only from initialization code.
Signed-off-by: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1279731838-1522-6-git-send-email-robert.richter@amd.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The pointer is only used in xsave.c. Making it static.
Signed-off-by: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1279731838-1522-5-git-send-email-robert.richter@amd.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The patch introduces the XSTATE_CPUID macro and adds a check that
tests if XSTATE_CPUID exists.
Signed-off-by: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1279731838-1522-4-git-send-email-robert.richter@amd.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The patch renames xsave_cntxt_init() and __xsave_init() into
xstate_enable_boot_cpu() and xstate_enable() as this names are more
meaningful.
It also removes the duplicate xcr setup for the boot cpu.
Signed-off-by: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1279731838-1522-3-git-send-email-robert.richter@amd.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
As xsave also supports other than fpu features, it should be
initialized independently of the fpu. This patch moves this out of fpu
initialization.
There is also a lot of cross referencing between fpu and xsave
code. This patch reduces this by making xsave_cntxt_init() and
init_thread_xstate() static functions.
The patch moves the cpu_has_xsave check at the beginning of
xsave_init(). All other checks may removed then.
Signed-off-by: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1279731838-1522-2-git-send-email-robert.richter@amd.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
smp_processor_id() returns an int and not an unsigned long.
Also, since the function is small enough, there's no need for a
local variable caching its value.
No functionality change, just cleanup.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20100721124705.GA674@aftab>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pointed out by Lucas who found the new one in a comment in
setup_percpu.c. And then I fixed the others that I grepped
for.
Reported-by: Lucas <canolucas@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clarified few comments and made initialization of %edx/%rdx more uniform
accross __down_write_nested, __up_read and __up_write functions.
Signed-off-by: Michel Lespinasse <walken@google.com>
LKML-Reference: <201007202219.o6KMJkiA021048@imap1.linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Mike Waychison <mikew@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
When count > 0 there is no need to take the call_rwsem_wake path. If
we did take that path, it would just return without doing anything due
to the active count not being zero.
Signed-off-by: Michel Lespinasse <walken@google.com>
LKML-Reference: <201007202219.o6KMJj9x021042@imap1.linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Mike Waychison <mikew@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
x86 early_iounmap(): fix off-by-one error in page alignment of allocation
size for sizes where size%PAGE_SIZE==1.
Signed-off-by: Florian Zumbiehl <florz@florz.de>
LKML-Reference: <201007202219.o6KMJlES021058@imap1.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Without this, adding entries into the address_markers array means adding
more and more of an #ifdef maze in pt_dump_init(). By using indices, we
can keep it a bit saner.
Signed-off-by: Andres Salomon <dilinger@queued.net>
LKML-Reference: <201007202219.o6KMJkUs021052@imap1.linux-foundation.org>
Cc: Jordan Crouse <jordan.crouse@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Commit e534c7c5f8 ("numa: x86_64: use generic percpu var
numa_node_id() implementation") broke numa systems that don't have ram
on node0 when MEMORY_HOTPLUG is enabled, because cpu_up() will call
cpu_to_node() before per_cpu(numa_node) is setup for APs.
When Node0 doesn't have RAM, on x86, cpus already round it to nearest
node with RAM in x86_cpu_to_node_map. and per_cpu(numa_node) is not set
up until in c_init for APs.
When later cpu_up() calling cpu_to_node() will get 0 again, and make it
online even there is no RAM on node0. so later all APs can not booted up,
and later will have panic.
[ 1.611101] On node 0 totalpages: 0
.........
[ 2.608558] On node 0 totalpages: 0
[ 2.612065] Brought up 1 CPUs
[ 2.615199] Total of 1 processors activated (3990.31 BogoMIPS).
...
93.225341] calling loop_init+0x0/0x1a4 @ 1
[ 93.229314] PERCPU: allocation failed, size=80 align=8, failed to populate
[ 93.246539] Pid: 1, comm: swapper Tainted: G W 2.6.35-rc4-tip-yh-04371-gd64e6c4-dirty #354
[ 93.264621] Call Trace:
[ 93.266533] [<ffffffff81125e43>] pcpu_alloc+0x83a/0x8e7
[ 93.270710] [<ffffffff81125f15>] __alloc_percpu+0x10/0x12
[ 93.285849] [<ffffffff8140786c>] alloc_disk_node+0x94/0x16d
[ 93.291811] [<ffffffff81407956>] alloc_disk+0x11/0x13
[ 93.306157] [<ffffffff81503e51>] loop_alloc+0xa7/0x180
[ 93.310538] [<ffffffff8277ef48>] loop_init+0x9b/0x1a4
[ 93.324909] [<ffffffff8277eead>] ? loop_init+0x0/0x1a4
[ 93.329650] [<ffffffff810001f2>] do_one_initcall+0x57/0x136
[ 93.345197] [<ffffffff827486d0>] kernel_init+0x184/0x20e
[ 93.348146] [<ffffffff81034954>] kernel_thread_helper+0x4/0x10
[ 93.365194] [<ffffffff81c7cc3c>] ? restore_args+0x0/0x30
[ 93.369305] [<ffffffff8274854c>] ? kernel_init+0x0/0x20e
[ 93.386011] [<ffffffff81034950>] ? kernel_thread_helper+0x0/0x10
[ 93.392047] loop: out of memory
...
Try to assign per_cpu(numa_node) early
[akpm@linux-foundation.org: tidy up code comment]
Signed-off-by: Yinghai <yinghai@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch moves boot cpu initialization to xsave_init(). Now all cpus
are initialized in one single function.
Signed-off-by: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1279651857-24639-5-git-send-email-robert.richter@amd.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Boot cpu id is always 0, thus simplifying and unifying boot cpu check.
boot_cpu_id is there for historical reasons and was renamed to
boot_cpu_physical_apicid in patch:
c70dcb7 x86: change boot_cpu_id to boot_cpu_physical_apicid
However, there are some remaining occurrences of boot_cpu_id that are
never touched in the kernel and thus its value is always 0.
Signed-off-by: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1279651857-24639-3-git-send-email-robert.richter@amd.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
There are no dependencies to asm/i387.h. Instead, if including only
xsave.h the following error occurs:
.../arch/x86/include/asm/i387.h:110: error: ‘XSTATE_FP’ undeclared (first use in this function)
.../arch/x86/include/asm/i387.h:110: error: (Each undeclared identifier is reported only once
.../arch/x86/include/asm/i387.h:110: error: for each function it appears in.)
This patch fixes this.
Signed-off-by: Robert Richter <robert.richter@amd.com>
LKML-Reference: <1279651857-24639-2-git-send-email-robert.richter@amd.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Just some dead code, no real bugs.
Found by gcc 4.6 -Wall
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <201007202219.o6KMJnQ0021072@imap1.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Avoids quite a lot of warnings with a gcc 4.6 -Wall build
because this happens in a commonly used header file (apic.h)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <201007202219.o6KMJme6021066@imap1.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
My platform makes use of the null_legacy_pic choice and oopses when doing
a shutdown as the shutdown code goes through all the registered sysdevs
and calls their shutdown method which in my case poke on a non-existing
i8259. Imho the i8259 specific sysdev should only be registered if the
i8259 is actually there.
Do not register the sysdev function when the null_legacy_pic is used so
that the i8259 resume, suspend and shutdown functions are not called.
Signed-off-by: Adam Lackorzynski <adam@os.inf.tu-dresden.de>
LKML-Reference: <201007202218.o6KMIJ3m020955@imap1.linux-foundation.org>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: <stable@kernel.org> 2.6.34
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Scan the set of pages we're freeing and make sure they're actually
owned by the domain before freeing. This generally won't happen on a
domU (since Xen gives us contigious memory), but it could happen if
there are some hardware mappings passed through.
We only bother going up to the highest page Xen actually claimed to
give us, since there's definitely nothing of ours above that.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Scan an e820 table and release any memory which lies between e820 entries,
as it won't be used and would just be wasted. At present this is just to
release any memory beyond the end of the e820 map, but it will also deal
with holes being punched in the map.
Derived from patch by Miroslav Rezanina <mrezanin@redhat.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
addon_cpuid_features.c contains exactly two almost completely
unrelated functions, plus has a long and very generic name. Split it
into two files, scattered.c for the scattered feature flags, and
topology.c for the topology information.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <tip-*@git.kernel.org>
Clean up the formatting in cpufeature.h, and remove an unnecessary
name override.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <tip-*@git.kernel.org>
xsaveopt is a more optimized form of xsave specifically designed
for the context switch usage. xsaveopt doesn't save the state that's not
modified from the prior xrstor. And if a specific feature state gets
modified to the init state, then xsaveopt just updates the header bit
in the xsave memory layout without updating the corresponding memory
layout.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100719230205.604014179@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
With xsaveopt, if a processor implementation discern that a processor state
component is in its initialized state it may modify the corresponding bit in
the xsave_hdr.xstate_bv as '0', with out modifying the corresponding memory
layout. Hence wHile presenting the xstate information to the user, we always
ensure that the memory layout of a feature will be in the init state if the
corresponding header bit is zero. This ensures the consistency and avoids the
condition of the user seeing some some stale state in the memory layout during
signal handling, debugging etc.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100719230205.351459480@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Subleaves of the cpuid vector 0xd provides the offset and size of different
feature state that are managed by the xsave/xrstor. Track this for the upcoming
usage during signal handling.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100719230205.262987929@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Enumerate the xsaveopt feature.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100719230205.604014179@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Add cpu feature bit support for the XSAVEOPT instruction.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100719230205.523204988@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Some cpuid features (like xsaveopt) are enumerated using cpuid
subleaves.
Extend init_scattered_cpuid_features() to take subleaf into account.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100719230205.439900717@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, pci, mrst: Add extra sanity check in walking the PCI extended cap chain
x86: Fix x2apic preenabled system with kexec
x86: Force HPET readback_cmp for all ATI chipsets
pavel@suse.cz no longer works, replace it with working address.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In commit f007ea26, the order of the %es and %ds segment registers
got accidentally swapped, so synthesized 'struct pt_regs' frames
have the two values inverted. It's almost sure that these values
never matter, and that they also never differ. But wrong is wrong.
Signed-off-by: Roland McGrath <roland@redhat.com>
Remove the initialization of MMRs
UVH_LB_BAU_SB_ACTIVATION_CONTROL and UVH_BAU_DATA_BROADCAST on
UV hubs that have no active cpus. Such initialization on hubs
with no active cpus would result in a kernel page fault.
This is not of real high priority, because we don't have any
such systems (with UV hubs that have no active cpus). But they
will be coming.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
LKML-Reference: <E1OZmZN-0006cW-RC@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The fixed bar capability structure is searched in PCI extended
configuration space. We need to make sure there is a valid capability
ID to begin with otherwise, the search code may stuck in a infinite
loop which results in boot hang. This patch adds additional check for
cap ID 0, which is also invalid, and indicates end of chain.
End of chain is supposed to have all fields zero, but that doesn't
seem to always be the case in the field.
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
LKML-Reference: <1279306706-27087-1-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Found one x2apic system kexec loop test failed
when CONFIG_NMI_WATCHDOG=y (old) or CONFIG_LOCKUP_DETECTOR=y (current tip)
first kernel can kexec second kernel, but second kernel can not kexec third one.
it can be duplicated on another system with BIOS preenabled x2apic.
First kernel can not kexec second kernel.
It turns out, when kernel boot with pre-enabled x2apic, it will not execute
disable_local_APIC on shutdown path.
when init_apic_mappings() is called in setup_arch, it will skip setting of
apic_phys when x2apic_mode is set. ( x2apic_mode is much early check_x2apic())
Then later, disable_local_APIC() will bail out early because !apic_phys.
So check !x2apic_mode in x2apic_mode in disable_local_APIC with !apic_phys.
another solution could be updating init_apic_mappings() to set apic_phys even
for preenabled x2apic system. Actually even for x2apic system, that lapic
address is mapped already in early stage.
BTW: is there any x2apic preenabled system with apicid of boot cpu > 255?
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4C3EB22B.3000701@kernel.org>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If we fail to assign resources to a PCI BAR, this patch makes us try the
original address from BIOS rather than leaving it disabled.
Linux tries to make sure all PCI device BARs are inside the upstream
PCI host bridge or P2P bridge apertures, reassigning BARs if necessary.
Windows does similar reassignment.
Before this patch, if we could not move a BAR into an aperture, we left
the resource unassigned, i.e., at address zero. Windows leaves such BARs
at the original BIOS addresses, and this patch makes Linux do the same.
This is a bit ugly because we disable the resource long before we try to
reassign it, so we have to keep track of the BIOS BAR address somewhere.
For lack of a better place, I put it in the struct pci_dev.
I think it would be cleaner to attempt the assignment immediately when the
claim fails, so we could easily remember the original address. But we
currently claim motherboard resources in the middle, after attempting to
claim PCI resources and before assigning new PCI resources, and changing
that is a fairly big job.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16263
Reported-by: Andrew <nitr0@seti.kr.ua>
Tested-by: Andrew <nitr0@seti.kr.ua>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Add __NR_prlimit64 syscall numbers to asm-generic. Add them also to
asm-x86, both 32 and 64-bit.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Also needed if pr_<level> becomes a bit more space efficient.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
LKML-Reference: <1277768808.29157.280.camel@Joe-Laptop.home>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
commit 30a564be (x86, hpet: Restrict read back to affected ATI
chipset) restricted the workaround for the HPET bug to SMX00
chipsets. This was reasonable as those were the only ones against
which we ever got a bug report.
Stephan Wolf reported now that this patch breaks his IXP400 based
machine. Though it's confirmed to work on other IXP400 based systems.
To error out on the safe side, we force the HPET readback workaround
for all ATI SMbus class chipsets.
Reported-by: Stephan Wolf <stephan@letzte-bankreihe.de>
LKML-Reference: <alpine.LFD.2.00.1007142134140.3321@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Stephan Wolf <stephan@letzte-bankreihe.de>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
putchar is using early_serial_base to check if port is initialized.
So we only assign it after early_serial_init() is called,
in case we need use VGA to debug early serial console.
Also add display for port addr and baud.
-v2: update to current tip
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4C3E0171.6050008@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
input: i8042 - add runtime check in x86's i8042_platform_init
Revert "Input: fixup X86_MRST selects"
Revert "Input: do not force selecting i8042 on Moorestown"
x86, mrst: Add i8042_detect API for Moorestwon platform
x86: Add i8042 pre-detection hook to x86_platform_ops
x86, platform: Export x86_platform to modules
Make the alternatives-patching code BUG on encountering an invalid CPU
feature number. Should have done this a long time ago.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Yinghai Lu <yinhai@kernel.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <tip-df378ccfc4dd04e263426ad805516915874774aa@git.kernel.org>
Fix a missing case of an 8-bit alternative number, buried inside an
assembly macro.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Reported-by: Yinghai Lu <yinhai@kernel.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <4C3BDDA3.2060900@kernel.org>
Make the boot code also accept the console=uart8250,io,0x2f8,115200n
form of early console.
Also add back simple_guess_base(), otherwise those simple_strtoull(,,0)
are not going to work.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4C3CCE05.4090505@kernel.org>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch adds serial I/O support to the real-mode setup (very early
boot) printf(). It's useful for debugging boot code when running Linux
under KVM, for example. The actual code was lifted from early printk.
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1278835617-11368-1-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
After remove a rmap, we should flush all vcpu's tlb
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Check for normal RAM in x86 ioremap() code seems to not work for the
last page frame in the specified physical address range.
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
LKML-Reference: <4C1AE6CD.1080704@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Current x86 ioremap() doesn't handle physical address higher than
32-bit properly in X86_32 PAE mode. When physical address higher than
32-bit is passed to ioremap(), higher 32-bits in physical address is
cleared wrongly. Due to this bug, ioremap() can map wrong address to
linear address space.
In my case, 64-bit MMIO region was assigned to a PCI device (ioat
device) on my system. Because of the ioremap()'s bug, wrong physical
address (instead of MMIO region) was mapped to linear address space.
Because of this, loading ioatdma driver caused unexpected behavior
(kernel panic, kernel hangup, ...).
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
LKML-Reference: <4C1AE680.7090408@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is needed so that the staging hyperv can properly access this
symbol.
Signed-off-by: K. Y. Srinivasan <ksrinivasan@novell.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In today's linux-next I got this compile warning:
arch/x86/kernel/apic/es7000_32.c:132: warning: 'base' defined but not used
Current patch solves the issue removing the unused variable.
Signed-off-by: Javier Martinez Canillas <martinez.javier@gmail.com>
Cc: Rakib Mullick <rakib.mullick@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Tejun Heo <tj@kernel.org>
LKML-Reference: <1278546719.9020.4.camel@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel has defined CPUID leaf 7 as the next set of feature flags (see
the AVX specification, version 007). Add support for this new feature
flags word.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <tip-*@vger.kernel.org>
It will just return 0 as there is no i8042 controller
Signed-off-by: Feng Tang <feng.tang@intel.com>
LKML-Reference: <1278342202-10973-3-git-send-email-feng.tang@intel.com>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Some x86 platforms like Intel MID platforms don't have i8042 controllers,
and i8042 driver's probe to some legacy IO ports may hang the MID
processor. With this hook, i8042 driver can runtime check and skip the
probe when the pretection fail which also saves some probe time
[ hpa note: this is currently a compile-time check, which breaks the
i386 allyesconfig build. This patch series thus does fix a regression. ]
Signed-off-by: Feng Tang <feng.tang@intel.com>
LKML-Reference: <1278342202-10973-2-git-send-email-feng.tang@intel.com>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Export x86_platform to modules in preparation of using it for i8042
discovery control.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <1278342202-10973-1-git-send-email-feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
We already have cpufeature indicies above 255, so use a 16-bit number
for the alternatives index. This consumes a padding field and so
doesn't add any size, but it means that abusing the padding field to
create assembly errors on overflow no longer works. We can retain the
test simply by redirecting it to the .discard section, however.
[ v3: updated to include open-coded locations ]
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <tip-f88731e3068f9d1392ba71cc9f50f035d26a0d4f@git.kernel.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add support for the newly documented F16C (16-bit floating point
conversions) and RDRND (RDRAND instruction) CPU feature flags.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rbtree: Undo augmented trees performance damage and regression
x86, Calgary: Limit the max PHB number to 256
fxsave/xsave doesn't touch all the bytes in the memory layout used by
these instructions. Specifically SW reserved (bytes 464..511) fields
in the fxsave frame and the reserved fields in the xsave header.
To present a clean context for the signal handling, just clear these fields
instead of clearing the complete fxsave/xsave memory layout, when we dump these
registers directly to the user signal frame.
Also avoid the call to second xrstor (which inits the state not passed
in the signal frame) in restore_user_xstate() if all the state has already
been restored by the first xrstor.
These changes improve the performance of signal handling(by ~3-5% as measured
by the lat_sig).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <1277249017.2847.85.camel@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
enter_lmode() and exit_lmode() modify the guest's EFER.LMA before calling
vmx_set_efer(). However, the latter function depends on the value of EFER.LMA
to determine whether MSR_KERNEL_GS_BASE needs reloading, via
vmx_load_host_state(). With EFER.LMA changing under its feet, it took the
wrong choice and corrupted userspace's %gs.
This causes 32-on-64 host userspace to fault.
Fix not touching EFER.LMA; instead ask vmx_set_efer() to change it.
Signed-off-by: Avi Kivity <avi@redhat.com>
Reimplement augmented RB-trees without sprinkling extra branches
all over the RB-tree code (which lives in the scheduler hot path).
This approach is 'borrowed' from Fabio's BFQ implementation and
relies on traversing the rebalance path after the RB-tree-op to
correct the heap property for insertion/removal and make up for
the damage done by the tree rotations.
For insertion the rebalance path is trivially that from the new
node upwards to the root, for removal it is that from the deepest
node in the path from the to be removed node that will still
be around after the removal.
[ This patch also fixes a video driver regression reported by
Ali Gholami Rudi - the memtype->subtree_max_end was updated
incorrectly. ]
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Ali Gholami Rudi <ali@rudi.ir>
Cc: Fabio Checconi <fabio@gandalf.sssup.it>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1275414172.27810.27961.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To support cache events we have reserved the low 6 bits in
hw_perf_event::config (which is a part of CCCR register
configuration actually).
These bits represent Replay Event mertic enumerated in
enum P4_PEBS_METRIC. The caller should not care about
which exact bits should be set and how -- the caller
just chooses one P4_PEBS_METRIC entity and puts it into
the config. The kernel will track it and set appropriate
additional MSR registers (metrics) when needed.
The reason for this redesign was the PEBS enable bit, which
should not be set until DS (and PEBS sampling) support will
be implemented properly.
TODO
====
- PEBS sampling (note it's tricky and works with _one_ counter only
so for HT machines it will be not that easy to handle both threads)
- tracking of PEBS registers state, a user might need to turn
PEBS off completely (ie no PEBS enable, no UOP_tag) but some
other event may need it, such events clashes and should not
run simultaneously, at moment we just don't support such events
- eventually export user space bits in separate header which will
allow user apps to configure raw events more conveniently.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Lin Ming <ming.m.lin@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1278295769.9540.15.camel@minggr.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, x86: Fix incorrect branches event on AMD CPUs
perf tools: Fix find tids routine by excluding "." and ".."
x86: Send a SIGTRAP for user icebp traps
While doing some performance counter validation tests on some
assembly language programs I noticed that the "branches:u"
count was very wrong on AMD machines.
It looks like the wrong event was selected.
Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <alpine.DEB.2.00.1007011526010.23160@cl320.eecs.utk.edu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch removes the CONFIG_MCORE2 check from around NET_IP_ALIGN. It is
based on a suggestion from Andi Kleen. The assumption is that there are
not any x86 cores where unaligned access is really slow, and this change
would allow for a performance improvement to still exist on configurations
that are not necessarily optimized for Core 2.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
fs/fs-writeback.c
Merge reason: Resolve the conflict
Note, i picked the version from Linus's tree, which effectively reverts
the fs-writeback.c bits of:
b97181f: fs: remove all rcu head initializations, except on_stack initializations
As the upstream changes to this file changed this code heavily and the
first attempt to resolve the conflict resulted in a non-booting kernel.
It's safer to re-try this portion of the commit cleanly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x3950 family can have as many as 256 PCI buses in a single system, so
change the limits to the maximum. Since there can only be 256 PCI buses in one
domain, we no longer need the BUG_ON check.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
LKML-Reference: <20100701004519.GQ15515@tux1.beaverton.ibm.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
x86 architectures can handle unaligned accesses in hardware, and it has
been shown that unaligned DMA accesses can be expensive on Nehalem
architectures. As such we should overwrite NET_IP_ALIGN to resolve
this issue.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before we had a generic breakpoint layer, x86 used to send a
sigtrap for any debug event that happened in userspace,
except if it was caused by lazy dr7 switches.
Currently we only send such signal for single step or breakpoint
events.
However, there are three other kind of debug exceptions:
- debug register access detected: trigger an exception if the
next instruction touches the debug registers. We don't use
it.
- task switch, but we don't use tss.
- icebp/int01 trap. This instruction (0xf1) is undocumented and
generates an int 1 exception. Unlike single step through TF
flag, it doesn't set the single step origin of the exception
in dr6.
icebp then used to be reported in userspace using trap signals
but this have been incidentally broken with the new breakpoint
code. Reenable this. Since this is the only debug event that
doesn't set anything in dr6, this is all we have to check.
This fixes a regression in Wine where World Of Warcraft got broken
as it uses this for software protection checks purposes. And
probably other apps do.
Reported-and-tested-by: Alexandre Julliard <julliard@winehq.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: 2.6.33.x 2.6.34.x <stable@kernel.org>
Fix resume_execution() and is_IF_modifier() to skip x86
instruction prefixes correctly by using x86 instruction
attribute.
Without this fix, resume_execution() can't handle instructions
which have non-REX prefixes (REX prefixes are skipped). This
will cause unexpected kernel panic by hitting bad address when a
kprobe hits on two-byte ret (e.g. "repz ret" generated for
Athlon/K8 optimization), because it just checks "repz" and can't
recognize the "ret" instruction.
These prefixes can be found easily with x86 instruction
attribute. This patch introduces skip_prefixes() and uses it in
resume_execution() and is_IF_modifier() to skip prefixes.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
LKML-Reference: <4C298A6E.8070609@hitachi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Define WQ_MAX_ACTIVE and create keventd with max_active set to half of
it which means that keventd now can process upto WQ_MAX_ACTIVE / 2 - 1
works concurrently. Unless some combination can result in dependency
loop longer than max_active, deadlock won't happen and thus it's
unnecessary to check whether current_is_keventd() before trying to
schedule a work. Kill current_is_keventd().
(Lockdep annotations are broken. We need lock_map_acquire_read_norecurse())
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, Calgary: Increase max PHB number
x86: Fix rebooting on Dell Precision WorkStation T7400
x86: Fix vsyscall on gcc 4.5 with -Os
x86, pat: Proper init of memtype subtree_max_end
um, hweight: Fix UML boot crash due to x86 optimized hweight
x86, setup: Set ax register in boot vga query
percpu, x86: Avoid warnings of unused variables in per cpu
x86, irq: Rename gsi_end gsi_top, and fix off by one errors
x86: use __ASSEMBLY__ rather than __ASSEMBLER__
Newer systems (x3950M2) can have 48 PHBs per chassis and 8
chassis, so bump the limits up and provide an explanation
of the requirements for each class.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Corinna Schultz <cschultz@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
LKML-Reference: <20100624212647.GI15515@tux1.beaverton.ibm.com>
[ v2: Fixed build bug, added back PHBS_PER_CALGARY == 4 ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Instruction breakpoints need to have a specific length of 0 to
be working. Bring this support but also take care the user is not
trying to set an unsupported length, like a range breakpoint for
example.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Instruction breakpoints trigger before the instruction executes,
and returning back from the breakpoint handler brings us again
to the instruction that breakpointed. This naturally bring to
a breakpoint recursion.
To solve this, x86 has the Resume Bit trick. When the cpu flags
have the RF flag set, the next instruction won't trigger any
instruction breakpoint, and once this instruction is executed,
RF is cleared back.
This let's us jump back to the instruction that triggered the
breakpoint without recursion.
Use this when an instruction breakpoint triggers.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andres Salomon <dilinger@queued.net>
Cc: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <20100618174653.7755a39a@dev.queued.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add support for saving OFW's cif, and later calling into it to run OFW
commands. OFW remains resident in memory, living within virtual range
0xff800000 - 0xffc00000. A single page directory entry points to the
pgdir that OFW actually uses, so rather than saving the entire page
table, we grab and install that one entry permanently in the kernel's
page table.
This is currently only used by the OLPC XO. Note that this particular
calling convention breaks PAE and PAT, and so cannot be used on newer
x86 hardware.
Signed-off-by: Andres Salomon <dilinger@queued.net>
LKML-Reference: <20100618174653.7755a39a@dev.queued.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The vdso is a piece of userspace code which is supposed to be fully
self-contained. Any external (undefined) reference is an error, and
should be caught at compile time. This was giving us trouble when
compiling with -Os on gcc 4.5.0, for example (failed inline).
The need to do a buildtime check was pointed out by Andi Kleen.
Reported-by: Andi Kleen <andi@firstfloor.org>
LKML-Reference: <tip-*@vger.kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This fixes the -Os breaks with gcc 4.5 bug. rdtsc_barrier needs to be
force inlined, otherwise user space will jump into kernel space and
kill init.
This also addresses http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44129
I believe.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <20100618210859.GA10913@basil.fritz.box>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@kernel.org>
When initrd is in use and a driver does request_module() in its
module_init (i.e. __initcall or device_initcall), a modprobe process
is created with VDSO mapping. But VDSO is inited even in __initcall,
i.e. on the same level (at the same time), so it may not be inited
yet (link order matters).
Move the VDSO initialization code earlier by switching to something
before rootfs_initcall where initrd is loaded as rootfs. Specifically
to subsys_initcall. Do it for standard 64-bit path (init_vdso_vars)
and for compat (sysenter_setup), just in case people have 32-bit
initrd and ia32 emulation built-in.
i386 (pure 32-bit) is not affected, since sysenter_setup() is called
from check_bugs()->identify_boot_cpu() in start_kernel() before
rest_init()->kernel_thread(kernel_init) where even kernel_init() calls
do_basic_setup()->do_initcalls().
What this patch fixes are early modprobe crashes such as:
Unpacking initramfs...
Freeing initrd memory: 9324k freed
modprobe[368]: segfault at 7fff4429c020 ip 00007fef397e160c \
sp 00007fff442795c0 error 4 in ld-2.11.2.so[7fef397df000+1f000]
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
LKML-Reference: <1276720242-13365-1-git-send-email-jslaby@suse.cz>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Systems using the idle thread from process_32.c and process_64.c
do not generate power_end events which could be traced using
perf. This patch adds the event generation for such systems.
Signed-off-by: Robert Schoene <robert.schoene@tu-dresden.de>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1276515440.5441.45.camel@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
After every iounmap mmiotrace has to free kmmio_fault_pages, but
it can't do it directly, so it defers freeing by RCU.
It usually works, but when mmiotraced code calls ioremap-iounmap
multiple times without sleeping between (so RCU won't kick in
and start freeing) it can be given the same virtual address, so
at every iounmap mmiotrace will schedule the same pages for
release. Obviously it will explode on second free.
Fix it by marking kmmio_fault_pages which are scheduled for
release and not adding them second time.
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Tested-by: Marcin Kocielnicki <koriakin@0x04.net>
Tested-by: Shinpei KATO <shinpei@il.is.s.u-tokyo.ac.jp>
Acked-by: Pekka Paalanen <pq@iki.fi>
Cc: Stuart Bennett <stuart@freedesktop.org>
Cc: Marcin Kocielnicki <koriakin@0x04.net>
Cc: nouveau@lists.freedesktop.org
Cc: <stable@kernel.org>
LKML-Reference: <20100613215654.GA3829@joi.lan>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The new IA32_ENERGY_PERF_BIAS MSR allows system software to give
hardware a hint whether OS policy favors more power saving,
or more performance. This allows the OS to have some influence
on internal hardware power/performance tradeoffs where the OS
has previously had no influence.
The support for this feature is indicated by CPUID.06H.ECX.bit3,
as documented in the Intel Architectures Software Developer's Manual.
This patch discovers support of this feature and displays it
as "epb" in /proc/cpuinfo.
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
LKML-Reference: <alpine.LFD.2.00.1006032310160.6669@localhost.localdomain>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The mce processing applies rcu_dereference_check() to integers used as
array indices. This patch therefore moves mce to the new RCU API
rcu_dereference_index_check() that avoids the sparse processing that
would otherwise result in compiler errors.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
PCI: clear bridge resource range if BIOS assigned bad one
PCI: hotplug/cpqphp, fix NULL dereference
Revert "PCI: create function symlinks in /sys/bus/pci/slots/N/"
PCI: change resource collision messages from KERN_ERR to KERN_INFO
subtree_max_end that was recently added to struct memtype was not getting
properly initialized resulting in
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory
in memtype_rb_augment_cb()
reported here
https://bugzilla.kernel.org/show_bug.cgi?id=16092
This change fixes the problem.
Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Tested-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
LKML-Reference: <1276217101-11515-1-git-send-email-venki@google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Yannick found that video does not work with 2.6.34. The cause of this
bug was that the BIOS had assigned the wrong range to the PCI bridge
above the video device. Before 2.6.34 the kernel would have shrunk
the size of the bridge window, but since
d65245c PCI: don't shrink bridge resources
the kernel will avoid shrinking BIOS ranges.
So zero out the old range if we fail to claim it at boot time; this will
cause us to allocate a new range at startup, restoring the 2.6.34
behavior.
Fixes regression https://bugzilla.kernel.org/show_bug.cgi?id=16009.
Reported-by: Yannick <yannick.roehlly@free.fr>
Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Use HW_ERR printk prefix in MCE handler. To make it more explicit that
this is hardware error instead of software error.
Signed-off-by: Huang Ying <ying.huang@intel.com>
LKML-Reference: <1275978939.3444.668.camel@yhuang-dev.sh.intel.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
It is reported that CMCI is not raised when number of corrected error
reaches preset threshold. After inspection, it is found that
MSR_IA32_MCI_CTL2 threshold field is not setup properly. This patch
fixed it.
Value of MCI_CTL2_CMCI_THRESHOLD_MASK is fixed according to x86_64
Software Developer's Manual too.
Reported-by: Shaohui Zheng <shaohui.zheng@intel.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
LKML-Reference: <1275977350.3444.660.camel@yhuang-dev.sh.intel.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Rename CMCI_EN to MCI_CTL2_CMCI_EN and CMCI_THRESHOLD_MASK to
MCI_CTL2_CMCI_THRESHOLD_MASK to make naming consistent.
Signed-off-by: Huang Ying <ying.huang@intel.com>
LKML-Reference: <1275977348.3444.659.camel@yhuang-dev.sh.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Catch missing conversion to the register structure "glove box" scheme.
Found by gcc 4.6's new warnings.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <20100610111040.F1781B1A2B@basil.firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Saving platform non-volatile state may be required for suspend to RAM as
well as hibernation. Move it to more generic code.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Extend support to future families, and in particular:
* extend direct mapping split of Tseg SMM area.
* extend K8 flavored alternatives (NOPS).
* rep movs* prefix is fast in ucode.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20100602182921.GA21557@aftab>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is in preparation for disabling L3 cache indices after having
received correctable ECCs in the L3 cache. Now we allow for initial
setting of a disabled index slot (write once) and deny writing new
indices to it after it has been disabled. Also, we deny using both slots
to disable one and the same index.
Userspace can restore the previously disabled indices by rewriting those
sysfs entries when booting.
Cleanup and reorganize code while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20100602161840.GI18327@aftab>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The places which call check_for_xstate() only care about zero or
non-zero so this patch doesn't change how the code runs, but it's a
cleanup. The main reason for this patch is that I'm looking for places
which don't return -EFAULT for copy_from_user() failures.
Signed-off-by: Dan Carpenter <error27@gmail.com>
LKML-Reference: <20100603100746.GU5483@bicker>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
When I introduced the global variable gsi_end I thought gsi_end on
io_apics was one past the end of the gsi range for the io_apic. After
it was pointed out the the range on io_apics was inclusive I changed
my global variable to match. That was a big mistake. Inclusive
semantics without a range start cannot describe the case when no gsi's
are allocated. Describing the case where no gsi's are allocated is
important in sfi.c and mpparse.c so that we can assign gsi numbers
instead of blindly copying the gsi assignments the BIOS has done as we
do in the acpi case.
To keep from getting the global variable confused with the gsi range
end rename it gsi_top.
To allow describing the case where no gsi's are allocated have gsi_top
be one place the highest gsi number seen in the system.
This fixes an off by one bug in sfi.c:
Reported-by: jacob pan <jacob.jun.pan@linux.intel.com>
This fixes the same off by one bug in mpparse.c:
This fixes an off unreachable by one bug in acpi/boot.c:irq_to_gsi
Reported-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
LKML-Reference: <m17hm9jre7.fsf_-_@fess.ebiederm.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If cr0.wp=0, we have to allow the guest kernel access to a page with pte.w=0.
We do that by setting spte.w=1, since the host cr0.wp must remain set so the
host can write protect pages. Once we allow write access, we must remove
user access otherwise we mistakenly allow the user to write the page.
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Always invalidate spte and flush TLBs when changing page size, to make
sure different sized translations for the same address are never cached
in a CPU's TLB.
Currently the only case where this occurs is when a non-leaf spte pointer is
overwritten by a leaf, large spte entry. This can happen after dirty
logging is disabled on a memslot, for example.
Noticed by Andrea.
KVM-Stable-Tag
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements a workaround for AMD erratum 383 into
KVM. Without this erratum fix it is possible for a guest to
kill the host machine. This patch implements the suggested
workaround for hypervisors which will be published by the
next revision guide update.
[jan: fix overflow warning on i386]
[xiao: fix unused variable warning]
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch moves handling of the MC vmexits to an earlier
point in the vmexit. The handle_exit function is too late
because the vcpu might alreadry have changed its physical
cpu.
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Cleanup. Factor the common code in save_stack_address() and
save_stack_address_nosched().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20100603193243.GA31534@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
If CONFIG_FRAME_POINTER=n, print_context_stack() shouldn't neglect the
non-reliable addresses on stack, this is all we have if dump_trace(bp)
is called with the wrong or zero bp.
For example, /proc/pid/stack doesn't work if CONFIG_FRAME_POINTER=n.
This patch obviously has no effect if CONFIG_FRAME_POINTER=y, otherwise
it reverts 1650743c "x86: don't save unreliable stack trace entries".
Also, remove the unnecessary type-cast.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <20100603193239.GA31530@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Since now all modification to event->count (and ->prev_count
and ->period_left) are local to a cpu, change then to local64_t so we
avoid the LOCK'ed ops.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On 64bit, local_t is of size long, and thus we make local64_t an alias.
On 32bit, we fall back to atomic64_t. (architecture can provide optimized
32-bit version)
(This new facility is to be used by perf events optimizations.)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-arch@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On Netburst PMU we need a second write to a performance counter
due to cpu erratum.
A simple flag test instead of alternative instructions was choosen
because wrmsrl is already a macro and if virtualization is turned
on will need an additional wrapper call which is more expencise.
nb: we should propably switch to jump-labels as only this facility
reach the mainline.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100602212304.GC5264@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Clarify some of the transactional group scheduling API details
and change it so that a successfull ->commit_txn also closes
the transaction.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1274803086.5882.1752.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Drop this argument now that we always want to rewind only to the
state of the first caller.
It means frame pointers are not necessary anymore to reliably get
the source of an event. But this also means we need this helper
to be a macro now, as an inline function is not an option since
we need to know when to provide a default implentation.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
arch/x86/include/asm/stacktrace.h and arch/x86/kernel/dumpstack.h
declare headers of objects that deal with the same topic.
Actually most of the files that include stacktrace.h also include
dumpstack.h
Although dumpstack.h seems more reserved for internals of stack
traces, those are quite often needed to define specialized stack
trace operations. And perf event arch headers are going to need
access to such low level operations anyway. So don't continue to
bother with dumpstack.h as it's not anymore about isolated deep
internals.
v2: fix struct stack_frame definition conflict in sysprof
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Soeren Sandmann <sandmann@daimi.au.dk>
Streamline the large uv_flush_send_and_wait() function by use of
a couple of helper functions.
And remove some excess comments.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004ay-IH@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the Broadcast Assist Unit driver use the BAU for TLB
shootdowns of cpu's on the local uvhub.
It was previously thought that IPI might be faster to the cpu's
on the local hub. But the IPI operation would have to follow
the completion of the BAU broadcast anyway. So we broadcast to
the local uvhub in all cases except when the current cpu was the
only local cpu in the mask.
This simplifies uv_flush_send_and_wait() in that it returns
either all shootdowns complete, or none.
Adjust the statistics to account for shootdowns on the local
uvhub.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004aq-G7@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Broadcast Assist Unit messages have a regular or retry
message type. The regular type was not being set, but needs to
be, because the lack of a message type is sometimes used to
identify an unused entry in the message queue.
Also removing some excess comments.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004ak-Dy@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove a faulty assumption that a long running BAU request has
encountered a hardware problem and will never finish.
Numalink congestion can make a request appear to have
encountered such a problem, but it is not safe to cancel the
request. If such a cancel is done but a reply is later received
we can miss a TLB shootdown.
We depend upon the max_bau_concurrent 'throttle' to prevent the
stay-busy case from happening.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004ad-BV@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Correct the initialization-time assumption of contigous blade
numbers and of sockets numbered from zero.
There may be hubs present with no cpu's enabled.
There may be disabled sockets such that the active socket is not
number zero.
And assign a 'socket master' by assuming that a socket is a
node. (it is not safe to extract socket number from an apicid)
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004aW-9S@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Correct the acknowledgment and the reset of a BAU
software-acknowledged message.
A retry message should be testing only for timed-out resources
(mask << 8). (And we delete a log message that might cause
unnecessary concern) The acknowledge MMR is
|--timed-out--|---pending--|, each is 8 bits.
The IPI-driven reset of software acknowledge resources frees
both timed out and pending resources.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004aP-7O@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move some structure definitions from the C code to the BAU
header file, and change the organization of that header file a
little.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004aI-54@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use a pointer from the per-cpu BAU control structure to the
per-cpu BAU statistics structure.
We nearly always know the first before needing the second.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004aB-2k@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The numalink network can become so congested that TLB shootdown
using the Broadcast Assist Unit becomes slower than using IPI's.
In that case, disable the use of the BAU for a period of time.
The period is tunable. When the period expires the use of the
BAU is re-enabled. A count of these actions is added to the
statistics file.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNy-0004a4-0a@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make the Broadcast Assist Unit driver's nine tuning values variable by
making them accessible through a read/write debugfs file.
The file will normally be mounted as
/sys/kernel/debug/sgi_uv/bau_tunables. The tunables are kept in each
cpu's per-cpu BAU structure.
The patch also does a little name improvement, and corrects the reset of
two destination timeout counters.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNx-0004Zx-Uo@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Calculate the Broadcast Assist Unit's destination timeout period from the
values in the relevant MMR's.
Store it in each cpu's per-cpu BAU structure so that a destination
timeout can be differentiated from a 'plugged' situation in which all
software ack resources are already allocated and a timeout is pending.
That case returns an immediate destination error.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: gregkh@suse.de
LKML-Reference: <E1OJvNx-0004Zq-RK@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fixes to 'cpuid10_edx' to comply with Intel documentation.
According to the Intel Manual, Volume 2A, Table 3-12, the cpuid for
architecture performance monitoring returns, in EDX, two pieces of
information:
1) Number of fixed-function counters (5 bits, not 4)
2) Width of fixed-function counters (8 bits)
Signed-off-by: Livio Soares <livio@eecg.toronto.edu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As Ingo pointed out in a separate patch, we should be using __ASSEMBLY__.
Make that the case in pgtable headers.
Signed-off-by: Andres Salomon <dilinger@queued.net>
LKML-Reference: <20100605114042.35ac69c1@dev.queued.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Save/restore MISC_ENABLE register on suspend/resume.
This fixes OOPS (invalid opcode) on resume from STR on Asus P4P800-VM,
which wakes up with MWAIT disabled.
Fixes https://bugzilla.kernel.org/show_bug.cgi?id=15385
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Tested-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
A memory region must be physically contiguous in order to be accessed
through DMA. This patch adds xen_create_contiguous_region, which
ensures a region of contiguous virtual memory is also physically
contiguous.
Based on Stephen Tweedie's port of the 2.6.18-xen version.
Remove contiguous_bitmap[] as it's no longer needed.
Ported from linux-2.6.18-xen.hg 707:e410857fd83c
[ Impact: add Xen-internal API to make pages phys-contig ]
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* xen_create_contiguous_region needs access to the balloon lock to
ensure memory doesn't change under its feet, so expose the balloon
lock
* Change the name of the lock to xen_reservation_lock, to imply it's
now less-specific usage.
[ Impact: cleanup ]
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
PV DomU domains are allowed to map hardware MFNs for PCI passthrough,
but are not generally allowed to map raw machine pages. In particular,
various pieces of code try to map DMI and ACPI tables in the ISA ROM
range. We disallow _PAGE_IOMAP for those mappings, so that they are
redirected to a set of local zeroed pages we reserve for that purpose.
[ Impact: prevent passthrough of ISA space, as we only allow PCI ]
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
In a Xen domain, ioremap operates on machine addresses, not
pseudo-physical addresses. We use _PAGE_IOMAP to determine whether a
mapping is intended for machine addresses.
[ Impact: allow Xen domain to map real hardware ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core: (83 commits)
i7core_edac: Better describe the supported devices
Add support for Westmere to i7core_edac driver
i7core_edac: don't free on success
i7core_edac: Add support for X5670
Always call i7core_[ur]dimm_check_mc_ecc_err
i7core_edac: fix memory leak of i7core_dev
EDAC: add __init to i7core_xeon_pci_fixup
i7core_edac: Fix wrong device id for channel 1 devices
i7core: add support for Lynnfield alternate address
i7core_edac: Add initial support for Lynnfield
i7core_edac: do not export static functions
edac: fix i7core build
edac: i7core_edac produces undefined behaviour on 32bit
i7core_edac: Use a more generic approach for probing PCI devices
i7core_edac: PCI device is called NONCORE, instead of NOCORE
i7core_edac: Fix ringbuffer maxsize
i7core_edac: First store, then increment
i7core_edac: Better parse "any" addrmask
i7core_edac: Use a lockless ringbuffer
edac: Create an unique instance for each kobj
...
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, smpboot: Fix cores per node printing on boot
x86/amd-iommu: Fall back to GART if initialization fails
x86/amd-iommu: Fix crash when request_mem_region fails
x86/mm: Remove unused DBG() macro
arch/x86/kernel: Add missing spin_unlock
* 'for-linus/bugfixes' of git://xenbits.xensource.com/people/ianc/linux-2.6:
xen: avoid allocation causing potential swap activity on the resume path
xen: ensure timer tick is resumed even on CPU driving the resume
* 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Fix crash in swevents
perf buildid-list: Fix --with-hits event processing
perf scripts python: Give field dict to unhandled callback
perf hist: fix objdump output parsing
perf-record: Check correct pid when forking
perf: Do the comm inheritance per thread in event__process_task
perf: Use event__process_task from perf sched
perf: Process comm events by tid
blktrace: Fix new kernel-doc warnings
perf_events: Fix unincremented buffer base on partial copy
perf_events: Fix event scheduling issues introduced by transactional API
perf_events, trace: Fix perf_trace_destroy(), mutex went missing
perf_events, trace: Fix probe unregister race
perf_events: Fix races in group composition
perf_events: Fix races and clean up perf_event and perf_mmap_data interaction
The core suspend/resume code is run from stop_machine on CPU0 but
parts of the suspend/resume machinery (including xen_arch_resume) are
run on whichever CPU happened to schedule the xenwatch kernel thread.
As part of the non-core resume code xen_arch_resume is called in order
to restart the timer tick on non-boot processors. The boot processor
itself is taken care of by core timekeeping code.
xen_arch_resume uses smp_call_function which does not call the given
function on the current processor. This means that we can end up with
one CPU not receiving timer ticks if the xenwatch thread happened to
be scheduled on CPU > 0.
Use on_each_cpu instead of smp_call_function to ensure the timer tick
is resumed everywhere.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Stable Kernel <stable@kernel.org> # .32.x
Percpu initialization happens now after booting the cores on the
machine and this causes them all to be displayed as belonging to
node 0:
Jun 8 05:57:21 kepek kernel: [ 0.106999] Booting Node 0,
Processors #1#2#3#4#5#6#7#8#9#10#11#12#13#14#15#16#17#18#19#20#21#22#23 Ok.
Use early_cpu_to_node() to get the correct node of each core
instead.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <20100601190455.GA14237@aftab>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-35' of git://repo.or.cz/linux-kbuild: (81 commits)
kbuild: Revert part of e8d400a to resolve a conflict
kbuild: Fix checking of scm-identifier variable
gconfig: add support to show hidden options that have prompts
menuconfig: add support to show hidden options which have prompts
gconfig: remove show_debug option
gconfig: remove dbg_print_ptype() and dbg_print_stype()
kconfig: fix zconfdump()
kconfig: some small fixes
add random binaries to .gitignore
kbuild: Include gen_initramfs_list.sh and the file list in the .d file
kconfig: recalc symbol value before showing search results
.gitignore: ignore *.lzo files
headerdep: perlcritic warning
scripts/Makefile.lib: Align the output of LZO
kbuild: Generate modules.builtin in make modules_install
Revert "kbuild: specify absolute paths for cscope"
kbuild: Do not unnecessarily regenerate modules.builtin
headers_install: use local file handles
headers_check: fix perl warnings
export_report: fix perl warnings
...
This patch implements a fallback to the GART IOMMU if this
is possible and the AMD IOMMU initialization failed.
Otherwise the fallback would be nommu which is very
problematic on machines with more than 4GB of memory or
swiotlb which hurts io-performance.
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When request_mem_region fails the error path tries to
disable the IOMMUs. This accesses the mmio-region which was
not allocated leading to a kernel crash. This patch fixes
the issue.
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The transactional API patch between the generic and model-specific
code introduced several important bugs with event scheduling, at
least on X86. If you had pinned events, e.g., watchdog, and were
over-committing the PMU, you would get bogus counts. The bug was
showing up on Intel CPU because events would move around more
often that on AMD. But the problem also existed on AMD, though
harder to expose.
The issues were:
- group_sched_in() was missing a cancel_txn() in the error path
- cpuc->n_added was not properly maintained, leading to missing
actions in hw_perf_enable(), i.e., n_running being 0. You cannot
update n_added until you know the transaction has succeeded. In
case of failed transaction n_added was not adjusted back.
- in case of failed transactions, event_sched_out() was called
and eventually invoked x86_disable_event() to touch the HW reg.
But with transactions, on X86, event_sched_in() does not touch
HW registers, it simply collects events into a list. Thus, you
could end up calling x86_disable_event() on a counter which
did not correspond to the current event when idx != -1.
The patch modifies the generic and X86 code to avoid all those problems.
First, we keep track of the number of events added last. In case the
transaction fails, we substract them from n_added. This approach is
necessary (as opposed to delaying updates to n_added) because not all
event updates use the transaction API, e.g., single events.
Second, we encapsulate the event_sched_in() and event_sched_out() in
group_sched_in() inside the transaction. That makes the operations
symmetrical and you can also detect that you are inside a transaction
and skip the HW reg access by checking cpuc->group_flag.
With this patch, you can now overcommit the PMU even with pinned
system-wide events present and still get valid counts.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1274796225.5882.1389.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (47 commits)
mfd: Rename twl5031 sih modules
mfd: Storage class for timberdale should be before const qualifier
mfd: Remove unneeded and dangerous clearing of clientdata
mfd: New AB8500 driver
gpio: Fix inverted rdc321x gpio data out registers
mfd: Change rdc321x resources flags to IORESOURCE_IO
mfd: Move pcf50633 irq related functions to its own file.
mfd: Use threaded irq for pcf50633
mfd: pcf50633-adc: Fix potential race in pcf50633_adc_sync_read
mfd: Fix pcf50633 bitfield logic in interrupt handler
gpio: rdc321x needs to select MFD_CORE
mfd: Use menuconfig for quicker config editing
ARM: AB3550 board configuration and irq for U300
mfd: AB3550 core driver
mfd: AB3100 register access change to abx500 API
mfd: Renamed ab3100.h to abx500.h
gpio: Add TC35892 GPIO driver
mfd: Add Toshiba's TC35892 MFD core
mfd: Delay to mask tsc irq in max8925
mfd: Remove incorrect wm8350 kfree
...
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, cpufeature: Unbreak compile with gcc 3.x
x86, pat: Fix memory leak in free_memtype
x86, k8: Fix section mismatch for powernowk8_exit()
lib/atomic64_test: fix missing include of linux/kernel.h
x86: remove last traces of quicklist usage
x86, setup: Phoenix BIOS fixup is needed on Dell Inspiron Mini 1012
x86: "nosmp" command line option should force the system into UP mode
arch/x86/pci: use kasprintf
x86, apic: ack all pending irqs when crashed/on kexec
This reverts commit 0ac0c0d0f8, which
caused cross-architecture build problems for all the wrong reasons.
IA64 already added its own version of __node_random(), but the fact is,
there is nothing architectural about the function, and the original
commit was just badly done. Revert it, since no fix is forthcoming.
Requested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
intel_idle: native hardware cpuidle driver for latest Intel processors
ACPI: acpi_idle: touch TS_POLLING only in the non-MWAIT case
acpi_pad: uses MONITOR/MWAIT, so it doesn't need to clear TS_POLLING
sched: clarify commment for TS_POLLING
ACPI: allow a native cpuidle driver to displace ACPI
cpuidle: make cpuidle_curr_driver static
cpuidle: add cpuidle_unregister_driver() error check
cpuidle: fail to register if !CONFIG_CPU_IDLE
TS_POLLING set tells the scheduler an idle_task will poll
need_resched() to look for work.
TS_POLLING clear tells resched_task() and wake_up_idle_cpu()
that the remote CPU's idle_task is now sleeping in idle,
and thus requires a reschedule interrupt notice work.
Update the description of TS_POLLING to reflect how it works.
"idle task polling need_resched, skip sending interrupt"
Wordsmithing-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
This file is replaced by a cleaner version with the adding of a MFD driver for
the southbridge.
Signed-off-by: Florian Fainelli <florian@openwrt.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (61 commits)
tracing: Add __used annotation to event variable
perf, trace: Fix !x86 build bug
perf report: Support multiple events on the TUI
perf annotate: Fix up usage of the build id cache
x86/mmiotrace: Remove redundant instruction prefix checks
perf annotate: Add TUI interface
perf tui: Remove annotate from popup menu after failure
perf report: Don't start the TUI if -D is used
perf: Fix getline undeclared
perf: Optimize perf_tp_event_match()
perf: Remove more code from the fastpath
perf: Optimize the !vmalloc backed buffer
perf: Optimize perf_output_copy()
perf: Fix wakeup storm for RO mmap()s
perf-record: Share per-cpu buffers
perf-record: Remove -M
perf: Ensure that IOC_OUTPUT isn't used to create multi-writer buffers
perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events
perf, trace: Optimize tracepoints by removing IRQ-disable from perf/tracepoint interaction
perf tui: Allow disabling the TUI on a per command basis in ~/.perfconfig
...
gcc 3 is too braindamaged to be able to compile static_cpu_has() --
apparently it can't tell that a constant passed to an inline function
is still a constant -- so if we're using gcc 3, just use the dynamic
test. This is bad for performance, but if you care about performance,
don't use an ancient, known-to-optimize-poorly compiler.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
LKML-Reference: <4BF2FF82.7090005@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
x86 arch specific changes to use generic numa_node_id() based on generic
percpu variable infrastructure. Back out x86's custom version of
numa_node_id()
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rework the generic version of the numa_node_id() function to use the new
generic percpu variable infrastructure.
Guard the new implementation with a new config option:
CONFIG_USE_PERCPU_NUMA_NODE_ID.
Archs which support this new implemention will default this option to 'y'
when NUMA is configured. This config option could be removed if/when all
archs switch over to the generic percpu implementation of numa_node_id().
Arch support involves:
1) converting any existing per cpu variable implementations to use
this implementation. x86_64 is an instance of such an arch.
2) archs that don't use a per cpu variable for numa_node_id() will
need to initialize the new per cpu variable "numa_node" as cpus
are brought on-line. ia64 is an example.
3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
when NUMA is configured. This is required because I have
retained the old implementation by default to allow archs to
be modified incrementally, as desired.
Subsequent patches will convert x86_64 and ia64 to use this implemenation.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are more architectures that don't support ARCH_HAS_SG_CHAIN than
those that support it. This removes removes ARCH_HAS_SG_CHAIN in
asm-generic/scatterlist.h and lets arhictectures to define it.
It's clearer than defining ARCH_HAS_SG_CHAIN asm-generic/scatterlist.h and
undefing it in arhictectures that don't support it.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are only two ways to define sg_dma_len(); use sg->dma_length or
sg->length. This patch introduces NEED_SG_DMA_LENGTH that enables
architectures to choose sg->dma_length or sg->length.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sync_single_range_for_cpu and sync_single_range_for_device hooks in
swiotlb_dma_ops are unnecessary because sync_single_for_cpu and
sync_single_for_device are used there.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By the previous modification, the cpu notifier can return encapsulate
errno value. This converts the cpu notifiers for msr, cpuid, and
therm_throt.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems). Part of the reason is that
the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
node 0 for newly created tasks.
This patch changes the rotor to be initialized to a random node number of
the cpuset.
[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Menage <menage@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a spin_unlock missing on the error path. The locks and unlocks are
balanced in other functions, so it seems that the same should be the case
here.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression E1;
@@
* spin_lock(E1,...);
<+... when != E1
if (...) {
... when != E1
* return ...;
}
...+>
* spin_unlock(E1,...);
// </smpl>
Cc: stable@kernel.org
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Reserve_memtype will allocate memory for new memtype, but
in free_memtype, after the memtype erased from rbtree, the
memory is not freed.
Changes since V1:
make rbt_memtype_erase return erased memtype so that
it can be freed in free_memtype.
[ hpa: not for -stable: 2.6.34 and earlier not affected ]
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
LKML-Reference: <1274838670-8731-1-git-send-email-dfeng@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Jack Steiner <steiner@sgi.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This reverts commit b3b77c8cae, which was
also totally broken (see commit 0d2daf5cc8 that reverted the crc32
version of it). As reported by Stephen Rothwell, it causes problems on
big-endian machines:
> In file included from fs/jfs/jfs_types.h:33,
> from fs/jfs/jfs_incore.h:26,
> from fs/jfs/file.c:22:
> fs/jfs/endian24.h:36:101: warning: "__LITTLE_ENDIAN" is not defined
The kernel has never had that crazy "__BYTE_ORDER == __LITTLE_ENDIAN"
model. It's not how we do things, and it isn't how we _should_ do
things. So don't go there.
Requested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following warning:
"WARNING: arch/x86/kernel/built-in.o(.exit.text+0x72):
Section mismatch in reference from the function powernowk8_exit() to the variable .cpuinit.data:cpb_nb
The function __exit powernowk8_exit() references a variable
__cpuinitdata cpb_nb. This is often seen when error handling in the exit
function uses functionality in the init path. The fix is often to remove
the __cpuinitdata annotation of cpb_nb so it may be used outside an init
section."
Cc: <stable@kernel.org>
Reported-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20100525152858.GA24836@aftab>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This adds:
alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.
Ideally all these modules would be compiled-in, but distros seems too
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.
The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.
Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.
This facility is to hide the mess distros are creating with too modualized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The MSR IA32_TEMPERATURE_TARGET contains the TjMax value in the newer
Intel processors.
Signed-off-by: Huaxu Wan <huaxu.wan@linux.intel.com>
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Cc: Jean Delvare <khali@linux-fr.org>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Yong Wang <yong.y.wang@linux.intel.com>
Cc: Rudolf Marek <r.marek@assembler.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linux does not define __BYTE_ORDER in its endian header files which makes
some header files bend backwards to get at the current endian. Lets
#define __BYTE_ORDER in big_endian.h/litte_endian.h to make it easier for
header files that are used in user space too.
In userspace the convention is that
1. _both_ __LITTLE_ENDIAN and __BIG_ENDIAN are defined,
2. you have to test for e.g. __BYTE_ORDER == __BIG_ENDIAN.
Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch b7e2ecef92 (perf, trace: Optimize tracepoints by removing
IRQ-disable from perf/tracepoint interaction) made the
unfortunate mistake of assuming the world is x86 only, correct
this.
The problem was that perf_fetch_caller_regs() did
local_save_flags() into regs->flags, and I re-used that to
remove another local_save_flags(), forgetting !x86 doesn't have
regs->flags.
Do the reverse, remove the local_save_flags() from
perf_fetch_caller_regs() and let the ftrace site do the
local_save_flags() instead.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: acme@redhat.com
Cc: efault@gmx.de
Cc: fweisbec@gmail.com
Cc: rostedt@goodmis.org
LKML-Reference: <1274778175.5882.623.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We still have a stray quicklist header included even though we axed
quicklist usage quite a while back.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <201005241913.o4OJDJe9010881@imap1.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The low-memory corruption checker triggers during suspend/resume, so we
need to reserve the low 64k. Don't be fooled that the BIOS identifies
itself as "Dell Inc.", it's still Phoenix BIOS.
[ hpa: I think we blacklist almost every BIOS in existence. We should
either change this to a whitelist or just make it unconditional. ]
Signed-off-by: Gabor Gombas <gombasg@digikabel.hu>
LKML-Reference: <201005241913.o4OJDIMM010877@imap1.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@kernel.org>
Bits set in cpu_possible_mask prior to the execution of
prefill_possible_map() (i.e. when parsing ACPI or MPS tables) would
prevent the SMP alternatives logic from switching to UP mode, plus
unnecessary setup of per-CPU data for CPUs that can never come online.
Additionally, without CONFIG_HOTPLUG_CPU disabled CPUs can never come
online, and hence setting cpu_possible_mask bits for them is again a
simple waste of resources.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <201005241913.o4OJDH3Z010874@imap1.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
kasprintf combines kmalloc and sprintf, and takes care of the size
calculation itself.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression a,flag;
expression list args;
statement S;
@@
a =
- \(kmalloc\|kzalloc\)(...,flag)
+ kasprintf(flag,args)
<... when != a
if (a == NULL || ...) S
...>
- sprintf(a,args);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
LKML-Reference: <201005241913.o4OJDG3R010871@imap1.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
When the SMP kernel decides to crash_kexec() the local APICs may have
pending interrupts in their vector tables.
The setup routine for the local APIC has a deficient mechanism for
clearing these interrupts, it only handles interrupts that has already
been dispatched to the local core for servicing (the ISR register) safely,
it doesn't consider lower prioritized queued interrupts stored in the IRR
register.
If you have more than one pending interrupt within the same 32 bit word in
the LAPIC vector table registers you may find yourself entering the IO
APIC setup with pending interrupts left in the LAPIC. This is a situation
for wich the IO APIC setup is not prepared. Depending of what/which
interrupt vector/vectors are stuck in the APIC tables your system may show
various degrees of malfunctioning. That was the reason why the
check_timer() failed in our system, the timer interrupts was blocked by
pending interrupts from the old kernel when routed trough the IO APIC.
Additional comment from Jiri Bohac:
==============
If this should go into stable release,
I'd add some kind of limit on the number of iterations, just to be safe from
hard to debug lock-ups:
+if (loops++ > MAX_LOOPS) {
+ printk("LAPIC pending clean-up")
+ break;
+}
while (queued);
with MAX_LOOPS something like 1E9 this would leave plenty of time for the
pending IRQs to be cleared and would and still cause at most a second of delay
if the loop were to lock-up for whatever reason.
[trenn@suse.de:
V2: Use tsc if avail to bail out after 1 sec due to possible virtual
apic_read calls which may take rather long (suggested by: Avi Kivity
<avi@redhat.com>) If no tsc is available bail out quickly after
cpu_khz, if we broke out too early and still have irqs pending (which
should never happen?) we still get a WARN_ON...
V3: - Fixed indentation -> checkpatch clean
- max_loops must be signed
V4: - Fix typo, mixed up tsc and ntsc in first rdtscll() call
V5: Adjust WARN_ON() condition to also catch error in cpu_has_tsc case]
Cc: <jbohac@novell.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Kerstin Jonsson <kerstin.jonsson@ericsson.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Tested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Thomas Renninger <trenn@suse.de>
LKML-Reference: <201005241913.o4OJDGWM010865@imap1.linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Get rid of the duplicated entries in prefix_codes[]
to eliminate redundant checks by skip_prefix().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Pekka Paalanen <pq@iki.fi>
LKML-Reference: <1274140110-5841-1-git-send-email-akinobu.mita@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (269 commits)
KVM: x86: Add missing locking to arch specific vcpu ioctls
KVM: PPC: Add missing vcpu_load()/vcpu_put() in vcpu ioctls
KVM: MMU: Segregate shadow pages with different cr0.wp
KVM: x86: Check LMA bit before set_efer
KVM: Don't allow lmsw to clear cr0.pe
KVM: Add cpuid.txt file
KVM: x86: Tell the guest we'll warn it about tsc stability
x86, paravirt: don't compute pvclock adjustments if we trust the tsc
x86: KVM guest: Try using new kvm clock msrs
KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
KVM: x86: add new KVMCLOCK cpuid feature
KVM: x86: change msr numbers for kvmclock
x86, paravirt: Add a global synchronization point for pvclock
x86, paravirt: Enable pvclock flags in vcpu_time_info structure
KVM: x86: Inject #GP with the right rip on efer writes
KVM: SVM: Don't allow nested guest to VMMCALL into host
KVM: x86: Fix exception reinjection forced to true
KVM: Fix wallclock version writing race
KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots
KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
...
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mjg59/platform-drivers-x86: (32 commits)
Move N014, N051 and CR620 dmi information to load scm dmi table
drivers/platform/x86/eeepc-wmi.c: fix build warning
X86 platfrom wmi: Add debug facility to dump WMI data in a readable way
X86 platform wmi: Also log GUID string when an event happens and debug is set
X86 platform wmi: Introduce debug param to log all WMI events
Clean up all objects used by scm model when driver initial fail or exit
msi-laptop: fix up some coding style issues found by checkpatch
msi-laptop: Add i8042 filter to sync sw state with BIOS when function key pressed
msi-laptop: Set rfkill init state when msi-laptop intiial
msi-laptop: Add MSI CR620 notebook dmi information to scm models table
msi-laptop: Add N014 N051 dmi information to scm models table
drivers/platform/x86: Use kmemdup
drivers/platform/x86: Use kzalloc
drivers/platform/x86: Clarify the MRST IPC driver description slightly
eeepc-wmi: depends on BACKLIGHT_CLASS_DEVICE
IPC driver for Intel Mobile Internet Device (MID) platforms
classmate-laptop: Add RFKILL support.
thinkpad-acpi: document backlight level writeback at driver init
thinkpad-acpi: clean up ACPI handles handling
thinkpad-acpi: don't depend on led_path for led firmware type (v2)
...
Read the memory ranges behind the Broadcom CNB20LE host bridge out of the
hardware. This allows PCI hotplugging to work, since we know which memory
range to allocate PCI BAR's from.
The x86 PCI code automatically prefers the ACPI _CRS information when it is
available. In that case, this information is not used.
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
* 'drm-for-2.6.35' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (207 commits)
drm/radeon/kms/pm/r600: select the mid clock mode for single head low profile
drm/radeon: fix power supply kconfig interaction.
drm/radeon/kms: record object that have been list reserved
drm/radeon: AGP memory is only I/O if the aperture can be mapped by the CPU.
drm/radeon/kms: don't default display priority to high on rs4xx
drm/edid: fix typo in 1600x1200@75 mode
drm/nouveau: fix i2c-related init table handlers
drm/nouveau: support init table i2c device identifier 0x81
drm/nouveau: ensure we've parsed i2c table entry for INIT_*I2C* handlers
drm/nouveau: display error message for any failed init table opcode
drm/nouveau: fix init table handlers to return proper error codes
drm/nv50: support fractional feedback divider on newer chips
drm/nv50: fix monitor detection on certain chipsets
drm/nv50: store full dcb i2c entry from vbios
drm/nv50: fix suspend/resume with DP outputs
drm/nv50: output calculated crtc pll when debugging on
drm/nouveau: dump pll limits entries when debugging is on
drm/nouveau: bios parser fixes for eDP boards
drm/nouveau: fix a nouveau_bo dereference after it's been destroyed
drm/nv40: remove some completed ctxprog TODOs
...
Allow kdb to work properly with with earlyprintk=vga by interpreting
the backspace and carriage return output characters. These
interpretation of these characters is used for simple line editing
provided in the kdb shell.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: x86@kernel.org
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
If the kernel debugger was configured, attached and started with
kgdbwait, the hardware breakpoint registers should get restored by the
kgdb code which is managing the dr registers.
CC: x86@kernel.org
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
It is not possible to use the hw_breakpoint.c API prior to mm_init(),
but it is possible to use hardware breakpoints with the kernel
debugger.
Prior to smp_init() it is possible to simply write to the dr registers
of the boot cpu directly. This can be used up until the
kgdb_arch_late() is invoked, at which point the standard hw_breakpoint.c
API will get used.
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The kernel debugger can operate well before mm_init(), but the x86
hardware breakpoint code which uses the perf api requires that the
kernel allocators are initialized.
This means the kernel debug core needs to provide an optional arch
specific call back to allow the initialization functions to run after
the kernel has been further initialized.
The kdb shell already had a similar restriction with an early
initialization and late initialization. The kdb_init() was moved into
the debug core's version of the late init which is called
dbg_late_init();
CC: kgdb-bugreport@lists.sourceforge.net
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Allow the x86 arch to have early exception processing for the purpose
of debugging via the kgdb.
Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
The only way the debugger can handle a trap in inside rcu_lock,
notify_die, or atomic_notifier_call_chain without a triple fault is
to have a low level "first opportunity handler" in the int3 exception
handler.
Generally this will be something the vast majority of folks will not
need, but for those who need it, it is added as a kernel .config
option called KGDB_LOW_LEVEL_TRAP.
CC: Ingo Molnar <mingo@elte.hu>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: H. Peter Anvin <hpa@zytor.com>
CC: x86@kernel.org
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Remove all the references to the kgdb_post_primary_code. This
function serves no useful purpose because you can obtain the same
information from the "struct kgdb_state *ks" from with in the
debugger, if for some reason you want the data.
Also remove the unintentional duplicate assignment for ks->ex_vector.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
These are the minimum changes to the kgdb core in order to enable an
API to connect a new front end (kdb) to the debug core.
This patch introduces the dbg_kdb_mode variable controls where the
user level I/O is routed. It will be routed to the gdbstub (kgdb) or
to the kdb front end which is a simple shell available over the kgdboc
connection.
You can switch back and forth between kdb or the gdb stub mode of
operation dynamically. From gdb stub mode you can blindly type
"$3#33", or from the kdb mode you can enter "kgdb" to switch to the
gdb stub.
The logic in the debug core depends on kdb to look for the typical gdb
connection sequences and return immediately with KGDB_PASS_EVENT if a
gdb serial command sequence is detected. That should allow a
reasonably seamless transition between kdb -> gdb without leaving the
kernel exception state. The two gdb serial queries that kdb is
responsible for detecting are the "?" and "qSupported" packets.
CC: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Martin Hicks <mort@sgi.com>
* 'acpica' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (22 commits)
ACPI: fix early DSDT dmi check warnings on ia64
ACPICA: Update version to 20100428.
ACPICA: Update/clarify some parameter names associated with acpi_handle
ACPICA: Rename acpi_ex_system_do_suspend->acpi_ex_system_do_sleep
ACPICA: Prevent possible allocation overrun during object copy
ACPICA: Split large file, evgpeblk
ACPICA: Add GPE support for dynamically loaded ACPI tables
ACPICA: Clarify/rename some root table descriptor fields
ACPICA: Update version to 20100331.
ACPICA: Minimize the differences between linux GPE code and ACPICA code base
ACPI: add boot option acpi=copy_dsdt to fix corrupt DSDT
ACPICA: Update DSDT copy/detection.
ACPICA: Add subsystem option to force copy of DSDT to local memory
ACPICA: Add detection of corrupted/replaced DSDT
ACPICA: Add write support for DataTable operation regions
ACPICA: Fix for acpi_reallocate_root_table for incorrect root table copy
ACPICA: Update comments/headers, no functional change
ACPICA: Update version to 20100304
ACPICA: Fix for possible fault in acpi_ex_release_mutex
ACPICA: Standardize integer output for ACPICA warnings/errors
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
vlynq: make whole Kconfig-menu dependant on architecture
add descriptive comment for TIF_MEMDIE task flag declaration.
EEPROM: max6875: Header file cleanup
EEPROM: 93cx6: Header file cleanup
EEPROM: Header file cleanup
agp: use NULL instead of 0 when pointer is needed
rtc-v3020: make bitfield unsigned
PCI: make bitfield unsigned
jbd2: use NULL instead of 0 when pointer is needed
cciss: fix shadows sparse warning
doc: inode uses a mutex instead of a semaphore.
uml: i386: Avoid redefinition of NR_syscalls
fix "seperate" typos in comments
cocbalt_lcdfb: correct sections
doc: Change urls for sparse
Powerpc: wii: Fix typo in comment
i2o: cleanup some exit paths
Documentation/: it's -> its where appropriate
UML: Fix compiler warning due to missing task_struct declaration
UML: add kernel.h include to signal.c
...
Traditionally, fatal MCE will cause Linux print error log to console
then reboot. Because MCE registers will preserve their content after
warm reboot, the hardware error can be logged to disk or network after
reboot. But system may fail to warm reboot, then you may lose the
hardware error log. ERST can help here. Through saving the hardware
error log into flash via ERST before go panic, the hardware error log
can be gotten from the flash after system boot successful again.
The fatal MCE processing procedure with ERST involved is as follow:
- Hardware detect error, MCE raised
- MCE read MCE registers, check error severity (fatal), prepare error record
- Write MCE error record into flash via ERST
- Go panic, then trigger system reboot
- System reboot, /sbin/mcelog run, it reads /dev/mcelog to check flash
for error record of previous boot via ERST, and output and clear
them if available
- /sbin/mcelog logs error records into disk or network
ERST only accepts CPER record format, but there is no pre-defined CPER
section can accommodate all information in struct mce, so a customized
section type is defined to hold struct mce inside a CPER record as an
error section.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Generic Hardware Error Source provides a way to report platform
hardware errors (such as that from chipset). It works in so called
"Firmware First" mode, that is, hardware errors are reported to
firmware firstly, then reported to Linux by firmware. This way, some
non-standard hardware error registers or non-standard hardware link
can be checked by firmware to produce more valuable hardware error
information for Linux.
Now, only SCI notification type and memory errors are supported. More
notification type and hardware error type will be added later. These
memory errors are reported to user space through /dev/mcelog via
faking a corrected Machine Check, so that the error memory page can be
offlined by /sbin/mcelog if the error count for one page is beyond the
threshold.
On some machines, Machine Check can not report physical address for
some corrected memory errors, but GHES can do that. So this simplified
GHES is implemented firstly.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
* 'timers-for-linus-hpet' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, hpet: Add reference to chipset erratum documentation for disable-hpet-msi-quirk
x86, hpet: Restrict read back to affected ATI chipsets
* 'timers-for-linus-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
avr32: Fix typo in read_persistent_clock()
sparc: Convert sparc to use read/update_persistent_clock
cris: Convert cris to use read/update_persistent_clock
m68k: Convert m68k to use read/update_persistent_clock
m32r: Convert m32r to use read/update_peristent_clock
blackfin: Convert blackfin to use read/update_persistent_clock
ia64: Convert ia64 to use read/update_persistent_clock
avr32: Convert avr32 to use read/update_persistent_clock
h8300: Convert h8300 to use read/update_persistent_clock
frv: Convert frv to use read/update_persistent_clock
mn10300: Convert mn10300 to use read/update_persistent_clock
alpha: Convert alpha to use read/update_persistent_clock
xtensa: Fix unnecessary setting of xtime
time: Clean up direct xtime usage in xen
We have an enum mrst_timer_options, use it so that the kernel knows if
we're missing something from a switch statement or equivalent.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <1274295685-6774-4-git-send-email-jacob.jun.pan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
We have an enum, might as well use it. While we're at it, make it an
inline... there is really no point in calling a function for this
stuff.
LKML-Reference: <1274295685-6774-3-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Always-on local APIC timer (ARAT) has been introduced to Medfield, along
with the platform APB timers we have more timer configuration options
between Moorestown and Medfield.
This patch adds run-time detection of avaiable timer features so that
we can treat Medfield as a variant of Moorestown and set up the optimal
timer options for each platform. i.e.
Medfield: per cpu always-on local APIC timer
Moorestown: per cpu APB timer
Manual override is possible via cmdline option x86_mrst_timer.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
LKML-Reference: <1274295685-6774-4-git-send-email-jacob.jun.pan@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Medfield is the follow-up of Moorestown, it is treated under the same
HW sub-architecture. However, we do need to know the CPU type in order
for some of the driver to act accordingly.
We also have different optimal clock configuration for each CPU type.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
LKML-Reference: <1274295685-6774-3-git-send-email-jacob.jun.pan@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Some extra CPU features such as ARAT is needed in early boot so
that x86_init function pointers can be set up properly.
http://lkml.org/lkml/2010/5/18/519
At start_kernel() level, this patch moves init_scattered_cpuid_features()
from check_bugs() to setup_arch() -> early_cpu_init() which is earlier than
platform specific x86_init layer setup. Suggested by HPA.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
LKML-Reference: <1274295685-6774-2-git-send-email-jacob.jun.pan@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
When cr0.wp=0, we may shadow a gpte having u/s=1 and r/w=0 with an spte
having u/s=0 and r/w=1. This allows excessive access if the guest sets
cr0.wp=1 and accesses through this spte.
Fix by making cr0.wp part of the base role; we'll have different sptes for
the two cases and the problem disappears.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
kvm_x86_ops->set_efer() would execute vcpu->arch.efer = efer, so the
checking of LMA bit didn't work.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The current lmsw implementation allows the guest to clear cr0.pe, contrary
to the manual, which breaks EMM386.EXE.
Fix by ORing the old cr0.pe with lmsw's operand.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch puts up the flag that tells the guest that we'll warn it
about the tsc being trustworthy or not. By now, we also say
it is not.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the HV told us we can fully trust the TSC, skip any
correction
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We now added a new set of clock-related msrs in replacement of the old
ones. In theory, we could just try to use them and get a return value
indicating they do not exist, due to our use of kvm_write_msr_save.
However, kvm clock registration happens very early, and if we ever
try to write to a non-existant MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.
So this patch tests for a cpuid feature exported by the host to
decide which set of msrs are supported.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Right now, we were using individual KVM_CAP entities to communicate
userspace about which cpuids we support. This is suboptimal, since it
generates a delay between the feature arriving in the host, and
being available at the guest.
A much better mechanism is to list para features in KVM_GET_SUPPORTED_CPUID.
This makes userspace automatically aware of what we provide. And if we
ever add a new cpuid bit in the future, we have to do that again,
which create some complexity and delay in feature adoption.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This cpuid, KVM_CPUID_CLOCKSOURCE2, will indicate to the guest
that kvmclock is available through a new set of MSRs. The old ones
are deprecated.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Avi pointed out a while ago that those MSRs falls into the pentium
PMU range. So the idea here is to add new ones, and after a while,
deprecate the old ones.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In recent stress tests, it was found that pvclock-based systems
could seriously warp in smp systems. Using ingo's time-warp-test.c,
I could trigger a scenario as bad as 1.5mi warps a minute in some systems.
(to be fair, it wasn't that bad in most of them). Investigating further, I
found out that such warps were caused by the very offset-based calculation
pvclock is based on.
This happens even on some machines that report constant_tsc in its tsc flags,
specially on multi-socket ones.
Two reads of the same kernel timestamp at approx the same time, will likely
have tsc timestamped in different occasions too. This means the delta we
calculate is unpredictable at best, and can probably be smaller in a cpu
that is legitimately reading clock in a forward ocasion.
Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.
A while ago, I though about using a shared variable anyway, to hold clock
last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.
We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detected that our
clock timestamp is less than the current global, we know that we need to
return a higher one, even though it is not exactly the one we compared to.
OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new readings. This do causes contention on big
boxes (but big here means *BIG*), but it seems like a good trade off, since
it provide us with a time source guaranteed to be stable wrt time warps.
After this patch is applied, I don't see a single warp in time during 5 days
of execution, in any of the machines I saw them before.
Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Avi Kivity <avi@redhat.com>
CC: Marcelo Tosatti <mtosatti@redhat.com>
CC: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>