This facility provides three entry points:
ilog2() Log base 2 of unsigned long
ilog2_u32() Log base 2 of u32
ilog2_u64() Log base 2 of u64
These facilities can either be used inside functions on dynamic data:
int do_something(long q)
{
...;
y = ilog2(x)
...;
}
Or can be used to statically initialise global variables with constant values:
unsigned n = ilog2(27);
When performing static initialisation, the compiler will report "error:
initializer element is not constant" if asked to take a log of zero or of
something not reducible to a constant. They treat negative numbers as
unsigned.
When not dealing with a constant, they fall back to using fls() which permits
them to use arch-specific log calculation instructions - such as BSR on
x86/x86_64 or SCAN on FRV - if available.
[akpm@osdl.org: MMC fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Howells <dhowells@redhat.com>
Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change all the uses of f_{dentry,vfsmnt} to f_path.{dentry,mnt} in the i386
arch code.
Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes i386 use the generic BUG machinery. There are no functional
changes from the old i386 implementation.
The main advantage in using the generic BUG machinery for i386 is that the
inlined overhead of BUG is just the ud2a instruction; the file+line(+function)
information are no longer inlined into the instruction stream. This reduces
cache pollution, and makes disassembly work properly.
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Hugh Dickens <hugh@veritas.com>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The elf note saving code is currently duplicated over several
architectures. This cleanup patch simply adds code to a common file and
then replaces the arch-specific code with calls to the newly added code.
The only drawback with this approach is that s390 doesn't fully support
kexec-on-panic which for that arch leads to introduction of unused code.
Signed-off-by: Magnus Damm <magnus@valinux.co.jp>
Cc: Vivek Goyal <vgoyal@in.ibm.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Every file should #include the headers containing the prototypes for
its global functions.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn,
prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus
generating compiler warnings of unused symbols, hence forcing people to add
#ifdefs.
the compiler can skip truly unused functions just fine:
text data bss dec hex filename
1624412 728710 3674856 6027978 5bfaca vmlinux.before
1624412 728710 3674856 6027978 5bfaca vmlinux.after
[akpm@osdl.org: topology.c fix]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
smp_call_function_single() can deadlock if the caller disabled local
interrupts (the target CPU could be spinning on call_lock). Check for that.
Why on earth do these functions use spin_lock_bh()??
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Andi Kleen <ak@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When we are unregistering a kprobe-booster, we can't release its
instruction buffer immediately on the preemptive kernel, because some
processes might be preempted on the buffer. The freeze_processes() and
thaw_processes() functions can clean most of processes up from the buffer.
There are still some non-frozen threads who have the PF_NOFREEZE flag. If
those threads are sleeping (not preempted) at the known place outside the
buffer, we can ensure safety of freeing.
However, the processing of this check routine takes a long time. So, this
patch introduces the garbage collection mechanism of insn_slot. It also
introduces the "dirty" flag to free_insn_slot because of efficiency.
The "clean" instruction slots (dirty flag is cleared) are released
immediately. But the "dirty" slots which are used by boosted kprobes, are
marked as garbages. collect_garbage_slots() will be invoked to release
"dirty" slots if there are more than INSNS_PER_PAGE garbage slots or if
there are no unused slots.
Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "bibo,mao" <bibo.mao@intel.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Yumiko Sugita <yumiko.sugita.yf@hitachi.com>
Cc: Satoshi Oshima <soshima@redhat.com>
Cc: Hideo Aoki <haoki@redhat.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
After LOADER_TYPE && INITRD_START are true, the short if-condition
for INITRD_START can never be false.
Remove unused code from the else condition.
Signed-off-by: Henry Nestler <henry.ne@arcor.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Name some of the remaning 'old_style_spin_init' locks
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make swsusp support i386 systems with PAE or without PSE.
This is done by creating temporary page tables located in resume-safe page
frames before the suspend image is restored in the same way as x86_64 does
it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Andi Kleen <ak@suse.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Nigel Cunningham <ncunningham@linuxmail.org>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Move process freezing functions from include/linux/sched.h to freezer.h, so
that modifications to the freezer or the kernel configuration don't require
recompiling just about everything.
[akpm@osdl.org: fix ueagle driver]
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SLAB_KERNEL is an alias of GFP_KERNEL.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The patch (as824b) makes percpu_free() ignore NULL arguments, as one would
expect for a deallocation routine. (Note that free_percpu is #defined as
percpu_free in include/linux/percpu.h.) A few callers are updated to remove
now-unneeded tests for NULL. A few other callers already seem to assume
that passing a NULL pointer to percpu_free() is okay!
The patch also removes an unnecessary NULL check in percpu_depopulate().
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kunmap_atomic() will call kpte_clear_flush with vaddr/ptep arguments which
don't correspond if the vaddr is just a normal lowmem address (ie, not in
the KMAP area). This patch makes sure that the pte is only cleared if kmap
area was actually used for the mapping.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce pagefault_{disable,enable}() and use these where previously we did
manual preempt increments/decrements to make the pagefault handler do the
atomic thing.
Currently they still rely on the increased preempt count, but do not rely on
the disabled preemption, this might go away in the future.
(NOTE: the extra barrier() in pagefault_disable might fix some holes on
machines which have too many registers for their own good)
[heiko.carstens@de.ibm.com: s390 fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Following up with the work on shared page table done by Dave McCracken. This
set of patch target shared page table for hugetlb memory only.
The shared page table is particular useful in the situation of large number of
independent processes sharing large shared memory segments. In the normal
page case, the amount of memory saved from process' page table is quite
significant. For hugetlb, the saving on page table memory is not the primary
objective (as hugetlb itself already cuts down page table overhead
significantly), instead, the purpose of using shared page table on hugetlb is
to allow faster TLB refill and smaller cache pollution upon TLB miss.
With PT sharing, pte entries are shared among hundreds of processes, the cache
consumption used by all the page table is smaller and in return, application
gets much higher cache hit ratio. One other effect is that cache hit ratio
with hardware page walker hitting on pte in cache will be higher and this
helps to reduce tlb miss latency. These two effects contribute to higher
application performance.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Dave McCracken <dmccr@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- remove the write-only local variable "bandwidth"
- don't set "max_cache_size" in the (cachesize < 0) case:
that's already handled in kernel/sched.c:measure_migration_cost()
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
The .eh_frame section contents is never written to, so it can as well
benefit from CONFIG_DEBUG_RODATA.
Diff-ed against firstfloor tree.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
We don't need to setup _irq_regs in smp_xxx_interrupt (except apic timer).
These handlers run with irqs disabled and do not call functions which need
"struct pt_regs".
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <ak@suse.de>
Acked-By: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Remove unused variable in msr_write().
Reported by D Binderman <dcb314@hotmail.com>.
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: David Rientjes <rientjes@cs.washington.edu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Tighten the requirements on both input to and output from the Dwarf2
unwinder.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
-mregparm=3 has been enabled by default for some time on i386, and AFAIK
there aren't any problems with it left.
This patch removes the REGPARM config option and sets -mregparm=3
unconditionally.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Sometimes the soft watchdog fires after we're done oopsing.
See http://projects.info-pull.com/mokb/MOKB-25-11-2006.html for an example.
AK: changed to touch_nmi_watchdog()
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Add sysctl for kstack_depth_to_print. This lets users change
the amount of raw stack data printed in dump_stack() without
having to reboot.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
We do the exact same printk about a dozen lines above
with no intermediate printk's.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
When using memmap kernel parameter in EFI boot we should also add to memory map
memory regions of runtime services to enable their mapping later.
AK: merged and cleaned up the patch
Signed-off-by: Artiom Myaskouvskey <artiom.myaskouvskey@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Read/Write APIC_LVTPC and APIC_LVTTHMR only,
if get_maxlvt() returns certain values.
This is done like everywhere else in i386/kernel/apic.c,
so I guess its correct.
Suspends/Resumes to disk fine and eleminates an smp_error_interrupt()
here on a K8.
AK: ported to x86-64 too
Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Signed-off-by: Andi Kleen <ak@suse.de>
There are two consumers of apic=: the apic debug level and the low
level generic architecture code. early_param would warn when the
low level code rejected "debug". Avoid this.
Signed-off-by: Andi Kleen <ak@suse.de>
Here is a small patch for i386 which adds a cpufeature flag and
detection code for Intel's Branch Trace Store (BTS) feature. This
feature can be found on Intel P4 and Core 2 processors among others.
It can also be used by perfmon.
changelog:
- add CPU_FEATURE_BTS
- add Branch Trace Store detection
signed-off-by: stephane eranian <eranian@hpl.hp.com>
Signed-off-by: Andi Kleen <ak@suse.de>
irq_vector[] can now become static.
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
The Coverity checker noted that bad things might happen if
find_isa_irq_apic() returned -1.
[akpm@osdl.org: add debugging checks]
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Function efi_get_time called not only during init kernel phase but also
during suspend (from get_cmos_time).
When it is called from get_cmos_time the corresponding runtime service
should be called in virtual and not in physical mode.
Signed-off-by: Artiom Myaskouvskey <artiom.myaskouvskey@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: "Narayanan, Chandramouli" <chandramouli.narayanan@intel.com>
Cc: "Jiossy, Rami" <rami.jiossy@intel.com>
Cc: "Satt, Shai" <shai.satt@intel.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Move the irqbalance quirks for E7320/E7520/E7525(Errata 23 in
http://download.intel.com/design/chipsets/specupdt/30304203.pdf) to early
quirks.
And add a PCI quirk for these platforms to check(which happens very late
during the boot) if the APIC routing is indeed set to default flat mode.
This fixes the breakage(in x86_64) of this quirk due to cpu hotplug which
selects physical mode instead of the logical flat(as needed for this errata
workaround).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: "Li, Shaohua" <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Change the 'no_control' field in the cpu struct to a more positive
and better term 'hotpluggable'. And change(/cleanup) the logic accordingly.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: "Li, Shaohua" <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Add 'enable_cpu_hotplug' flag and when cleared, the hotplug control file
("online") will not be added under /sys/devices/system/cpu/cpuX/
Next patch doing PCI quirks will use this.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: "Li, Shaohua" <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Mechanism of selecting physical mode in genapic when cpu hotplug is enabled on
x86_64, broke the quirk(quirk_intel_irqbalance()) introduced for working
around the transposing interrupt message errata in E7520/E7320/E7525 (revision
ID 0x9 and below. errata #23 in
http://download.intel.com/design/chipsets/specupdt/30304203.pdf).
This errata requires the mode to be in logical flat, so that interrupts can be
directed to more than one cpu(and thus use hardware IRQ balancing enabled by
BIOS on these platforms).
Following four patches fixes this by moving the quirk to early quirk and
forcing the x86_64 genapic selection to logical flat on these platforms.
Thanks to Shaohua for pointing out the breakage.
This patch:
Add write_pci_config_byte() to direct PCI access routines
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Cc: "Li, Shaohua" <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
o Convert more absolute symbols to section relative to keep the theme in
vmlinux.lds.S file and to avoid problem if kernel is relocated.
o Also put a message so that in future people can be aware of it and
avoid introducing absolute symbols.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Make the needlessly global alloc_gdt() static.
(against) pda-percpu-init
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@muc.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Until not so long ago, there were system log messages pointing to
inconsistent MTRR setup of the video frame buffer caused by the way vesafb
and X worked. While vesafb was fixed meanwhile, I believe fixing it there
only hides a shortcoming in the MTRR code itself, in that that code is not
symmetric with respect to the ordering of attempts to set up two (or more)
regions where one contains the other. In the current shape, it permits
only setting up sub-regions of pre-exisiting ones. The patch below makes
this symmetric.
While working on that I noticed a few more inconsistencies in that code,
namely
- use of 'unsigned int' for sizes in many, but not all places (the patch
is converting this to use 'unsigned long' everywhere, which specifically
might be necessary for x86-64 once a processor supporting more than 44
physical address bits would become available)
- the code to correct inconsistent settings during secondary processor
startup tried (if necessary) to correct, among other things, the value
in IA32_MTRR_DEF_TYPE, however the newly computed value would never get
used (i.e. stored in the respective MSR)
- the generic range validation code checked that the end of the
to-be-added range would be above 1MB; the value checked should have been
the start of the range
- when contained regions are detected, previously this was allowed only
when the old region was uncacheable; this can be symmetric (i.e. the new
region can also be uncacheable) and even further as per Intel's
documentation write-trough and write-back for either region is also
compatible with the respective opposite in the other
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Just like on x86-64, don't touch foreign CPUs' memory if the watchdog
isn't enabled at all.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
While not strictly required with the current code (as the upper half of
page table entries generated by __set_fixmap() cannot be non-zero due
to the second parameter of this function being 'unsigned long'), the
use of set_pte() in __set_fixmap() in the context of clear_fixmap() is
still improper with CONFIG_X86_PAE (see the respective comment in
include/asm-i386/pgtable-3level.h) and would turn into a bug if that
second parameter ever gets changed to a 64-bit type.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
gcc doesn't support -mtune=core2 yet, but will be soon. Use -mtune=generic or -mtune=i686
as fallback
TBD need benchmarking for INTEL_USERCOPY etc. So far I used the same defaults as MPENTIUMM
Signed-off-by: Andi Kleen <ak@suse.de>
Add a way to disable the timer IRQ routing check via a boot option. The
VMI timer code uses this to avoid triggering the pester Mingo code, which
probes for some very unusual and broken motherboard routings. It fires
100% of the time when using a paravirtual delay mechanism instead of using
a realtime delay, since there is no elapsed real time, and the 4 timer IRQs
have not yet been delivered.
In addition, it is entirely possible, though improbable, that this bug
could surface on real hardware which picks a particularly bad time to enter
SMM mode, causing a long latency during one of the timer IRQs.
While here, make check_timer be __init.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
[chrisw: use no_timer_check to bring inline with x86_64 as per Andi's request]
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
BIOS ROM areas may not be mapped into the guest address space, so be careful
when touching those addresses to make sure they appear to be mapped.
[akpm@osdl.org: fix unused var warning]
AK: Changed __get_user to probe_kernel_address
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Add the three bare TLB accessor functions to paravirt-ops. Most amusingly,
flush_tlb is redefined on SMP, so I can't call the paravirt op flush_tlb.
Instead, I chose to indicate the actual flush type, kernel (global) vs. user
(non-global). Global in this sense means using the global bit in the page
table entry, which makes TLB entries persistent across CR3 reloads, not
global as in the SMP sense of invoking remote shootdowns, so the term is
confusingly overloaded.
AK: folded in fix from Zach for PAE compilation
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Add APIC accessors to paravirt-ops. Unfortunately, we need two write
functions, as some older broken hardware requires workarounds for
Pentium APIC errata - this is the purpose of apic_write_atomic.
AK: replaced __inline with inline
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Two legacy power management modes are much easier to just explicitly disable
when running in paravirtualized mode - neither APM nor PnP is still relevant.
The status of ACPI is still debatable, and noacpi is still a common enough
boot parameter that it is not necessary to explicitly disable ACPI.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Allow selected bug checks to be skipped by paravirt kernels. The two most
important are the F00F workaround (which is either done by the hypervisor,
or not required), and the 'hlt' instruction check, which can break under
some hypervisors.
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
1) Each hypervisor writes a probe function to detect whether we are
running under that hypervisor. paravirt_probe() registers this
function.
2) If vmlinux is booted with ring != 0, we call all the probe
functions (with registers except %esp intact) in link order: the
winner will not return.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Both lhype and Xen want to call the core of the x86 cpu detect code before
calling start_kernel.
(extracted from larger patch)
AK: folded in start_kernel header patch
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
It turns out that the most called ops, by several orders of magnitude,
are the interrupt manipulation ops. These are obvious candidates for
patching, so mark them up and create infrastructure for it.
The method used is that the ops structure has a patch function, which
is called for each place which needs to be patched: this returns a
number of instructions (the rest are NOP-padded).
Usually we can spare a register (%eax) for the binary patched code to
use, but in a couple of critical places in entry.S we can't: we make
the clobbers explicit at the call site, and manually clobber the
allowed registers in debug mode as an extra check.
And:
Don't abuse CONFIG_DEBUG_KERNEL, add CONFIG_DEBUG_PARAVIRT.
And:
AK: Fix warnings in x86-64 alternative.c build
And:
AK: Fix compilation with defconfig
And:
^From: Andrew Morton <akpm@osdl.org>
Some binutlises still like to emit references to __stop_parainstructions and
__start_parainstructions.
And:
AK: Fix warnings about unused variables when PARAVIRT is disabled.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Create a paravirt.h header for all the critical operations which need to be
replaced with hypervisor calls, and include that instead of defining native
operations, when CONFIG_PARAVIRT.
This patch does the dumbest possible replacement of paravirtualized
instructions: calls through a "paravirt_ops" structure. Currently these are
function implementations of native hardware: hypervisors will override the ops
structure with their own variants.
All the pv-ops functions are declared "fastcall" so that a specific
register-based ABI is used, to make inlining assember easier.
And:
+From: Andy Whitcroft <apw@shadowen.org>
The paravirt ops introduce a 'weak' attribute onto memory_setup().
Code ordering leads to the following warnings on x86:
arch/i386/kernel/setup.c:651: warning: weak declaration of
`memory_setup' after first use results in unspecified behavior
Move memory_setup() to avoid this.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Zachary Amsden <zach@vmware.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
IOPL is implicitly saved and restored on task switch,
so explicit check is no longer needed.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Makes the intention of the code cleaner to read and avoids
a potential deadlock on mmap_sem. Also change the types of
the arguments to not include __user because they're really
not user addresses.
Signed-off-by: Andi Kleen <ak@suse.de>
The entry.S code at work_notifysig is surely wrong. It drops into unrelated
code if the branch to work_notifysig_v86 is taken, and CONFIG_VM86=n.
[PATCH] Make vm86 support optional
tree 9b5daef528
pushed to git Jan 8, 2006, and first appears in 2.6.16
The 'fix' here is to also compile out the vm86 test & branch when
CONFIG_VM86=n.
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Extend bzImage protocol to enable bootloaders to load a completely relocatable
bzImage. Now protected mode component of kernel is also relocatable and a
boot-loader can load the protected mode component at a differnt physical
address than 1MB. (If kernel was built with CONFIG_RELOCATABLE)
Kexec can make use of it to load this kernel at a different physical address
to capture kernel crash dumps.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
o Now CONFIG_PHYSICAL_START is being replaced with CONFIG_PHYSICAL_ALIGN.
Hardcoding the kernel physical start value creates a problem in relocatable
kernel context due to boot loader limitations. For ex, if somebody
compiles a relocatable kernel to be run from address 4MB, but this kernel
will run from location 1MB as grub loads the kernel at physical address
1MB. Kernel thinks that I am a relocatable kernel and I should run from
the address I have been loaded at. So somebody wanting to run kernel
from 4MB alignment location (for improved performance regions) can't do
that.
o Hence, Eric proposed that probably CONFIG_PHYSICAL_ALIGN will make
more sense in relocatable kernel context. At run time kernel will move
itself to a physical addr location which meets user specified alignment
restrictions.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
o Relocations generated w.r.t absolute symbols are not processed as by
definition, absolute symbols are not to be relocated. Explicitly warn
user about absolutions relocations present at compile time.
o These relocations get introduced either due to linker optimizations or
some programming oversights.
o Also create a list of symbols which have been audited to be safe and
don't emit warnings for these.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
This patch modifies the i386 kernel so that if CONFIG_RELOCATABLE is
selected it will be able to be loaded at any 4K aligned address below
1G. The technique used is to compile the decompressor with -fPIC and
modify it so the decompressor is fully relocatable. For the main
kernel relocations are generated. Resulting in a kernel that is relocatable
with no runtime overhead and no need to modify the source code.
A reserved 32bit word in the parameters has been assigned
to serve as a stack so we figure out where are running.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Defining __PHYSICAL_START and __KERNEL_START in asm-i386/page.h works but
it triggers a full kernel rebuild for the silliest of reasons. This
modifies the users to directly use CONFIG_PHYSICAL_START and linux/config.h
which prevents the full rebuild problem, which makes the code much
more maintainer and hopefully user friendly.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Currently when we are reserving the memory the kernel text
resides in we start at __PHYSICAL_START which happens to be
correct but not very obvious. In addition when we start relocating
the kernel __PHYSICAL_START is the wrong value, as it is an
absolute symbol that does not get relocated.
By starting the reservation at __pa_symbol(_text)
the code is clearer and will be correct when relocated.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Ld knows about 2 kinds of symbols, absolute and section
relative. Section relative symbols symbols change value
when a section is moved and absolute symbols do not.
Currently in the linker script we have several labels
marking the beginning and ending of sections that
are outside of sections, making them absolute symbols.
Having a mixture of absolute and section relative
symbols refereing to the same data is currently harmless
but it is confusing.
This must be done carefully as newer revs of ld do not place
symbols that appear in sections without data and instead
ld makes those symbols global :(
My ultimate goal is to build a relocatable kernel. The
safest and least intrusive technique is to generate
relocation entries so the kernel can be relocated at load
time. The only penalty would be an increase in the size
of the kernel binary. The problem is that if absolute and
relocatable symbols are not properly specified absolute symbols
will be relocated or section relative symbols won't be, which
is fatal.
The practical motivation is that when generating kernels that
will run from a reserved area for analyzing what caused
a kernel panic, it is simpler if you don't need to hard code
the physical memory location they will run at, especially
for the distributions.
[AK: and merged:]
o Also put a message so that in future people can be aware of it and
avoid introducing absolute symbols.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
On modern systems RAM errors don't cause NMIs, but it's usually
caused by PCI SERR. Mention PCI instead of RAM in the printk.
Reported by r_hayashi@ctc-g.co.jp (Ryutaro Hayashi)
Cc: r_hayashi@ctc-g.co.jp
Signed-off-by: Andi Kleen <ak@suse.de>
Resending as I believe the discussion about them established they were
correct.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Currently the idle loop has two nested loops -- one high level
in cpu_idle and in some low level idle functions another one.
Looping in the low level idle functions breaks the idle notifiers
because interrupts waking up sleep states need to execute
exit_idle() which is only in cpu_idle().
So don't do that, only loop in cpu_idle(). This only removes
code.
In some cases e.g. poll_idle the idle loop is a little longer
now because cpu_idle checks more things. I hope that isn't a problem
ACPI idle doesn't change behaviour because it never looped anyways.
Cc: len.brown@intel.com
Cc: eranian@hpl.hp.com
Signed-off-by: Andi Kleen <ak@suse.de>
Use the pcurrent field in the PDA to implement the "current" macro. This ends
up compiling down to a single instruction to get the current task.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Use the cpu_number in the PDA to implement raw_smp_processor_id. This is a
little simpler than using thread_info, though the cpu field in thread_info
cannot be removed since it is used for things other than getting the current
CPU in common code.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
sys_vm86 uses a struct kernel_vm86_regs, which is identical to pt_regs, but
adds an extra space for all the segment registers. Previously this structure
was completely independent, so changes in pt_regs had to be reflected in
kernel_vm86_regs. This changes just embeds pt_regs in kernel_vm86_regs, and
makes the appropriate changes to vm86.c to deal with the new naming.
Also, since %gs is dealt with differently in the kernel, this change adjusts
vm86.c to reflect this.
While making these changes, I also cleaned up some frankly bizarre code which
was added when auditing was added to sys_vm86.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
There are a few places where the change in struct pt_regs and the use of %gs
affect the userspace ABI. These are primarily debugging interfaces where
thread state can be inspected or extracted.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
This patch is the meat of the PDA change. This patch makes several related
changes:
1: Most significantly, %gs is now used in the kernel. This means that on
entry, the old value of %gs is saved away, and it is reloaded with
__KERNEL_PDA.
2: entry.S constructs the stack in the shape of struct pt_regs, and this
is passed around the kernel so that the process's saved register
state can be accessed.
Unfortunately struct pt_regs doesn't currently have space for %gs
(or %fs). This patch extends pt_regs to add space for gs (no space
is allocated for %fs, since it won't be used, and it would just
complicate the code in entry.S to work around the space).
3: Because %gs is now saved on the stack like %ds, %es and the integer
registers, there are a number of places where it no longer needs to
be handled specially; namely context switch, and saving/restoring the
register state in a signal context.
4: And since kernel threads run in kernel space and call normal kernel
code, they need to be created with their %gs == __KERNEL_PDA.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
When a CPU is brought up, a PDA and GDT are allocated for it. The GDT's
__KERNEL_PDA entry is pointed to the allocated PDA memory, so that all
references using this segment descriptor will refer to the PDA.
This patch rearranges CPU initialization a bit, so that the GDT/PDA are set up
as early as possible in cpu_init(). Also for secondary CPUs, GDT+PDA are
preallocated and initialized so all the secondary CPU needs to do is set up
the ldt and load %gs. This will be important once smp_processor_id() and
current use the PDA.
In all cases, the PDA is set up in head.S, before a CPU starts running C code,
so the PDA is always available.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Cc: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Matt Tolentino <matthew.e.tolentino@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
This patch has the basic definitions of struct i386_pda, and the segment
selector in the GDT.
asm-i386/pda.h is more or less a direct copy of asm-x86_64/pda.h. The most
interesting difference is the use of _proxy_pda, which is used to give gcc a
model for the actual memory operations on the real pda structure. No actual
reference is ever made to _proxy_pda, so it is never defined.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Zachary Amsden <zach@vmware.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Use asm-offsets for the offsets of registers into the pt_regs struct, rather
than having hard-coded constants
I left the constants in the comments of entry.S because they're useful for
reference; the code in entry.S is very dependent on the layout of pt_regs,
even when using asm-offsets.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Keith Owens <kaos@ocs.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
ioremap must be balanced by an iounmap and failing to do so can result
in a memory leak.
Tested (compilation only):
- using allmodconfig
- making sure the files are compiling without any warning/error due to
new changes
Signed-off-by: Amol Lad <amol@verismonetworks.com>
Signed-off-by: Andi Kleen <ak@suse.de>
i386 port of the sLeAZY-fpu feature. Chuck reports that this gives him a +/-
0.4% improvement on his simple benchmark
x86_64 description follows:
Right now the kernel on x86-64 has a 100% lazy fpu behavior: after *every*
context switch a trap is taken for the first FPU use to restore the FPU
context lazily. This is of course great for applications that have very
sporadic or no FPU use (since then you avoid doing the expensive save/restore
all the time). However for very frequent FPU users... you take an extra trap
every context switch.
The patch below adds a simple heuristic to this code: After 5 consecutive
context switches of FPU use, the lazy behavior is disabled and the context
gets restored every context switch. If the app indeed uses the FPU, the trap
is avoided. (the chance of the 6th time slice using FPU after the previous 5
having done so are quite high obviously).
After 256 switches, this is reset and lazy behavior is returned (until there
are 5 consecutive ones again). The reason for this is to give apps that do
longer bursts of FPU use still the lazy behavior back after some time.
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Clean up the espfix code:
- Introduced PER_CPU() macro to be used from asm
- Introduced GET_DESC_BASE() macro to be used from asm
- Rewrote the fixup code in asm, as calling a C code with the altered %ss
appeared to be unsafe
- No longer altering the stack from a .fixup section
- 16bit per-cpu stack is no longer used, instead the stack segment base
is patched the way so that the high word of the kernel and user %esp
are the same.
- Added the limit-patching for the espfix segment. (Chuck Ebbert)
[jeremy@goop.org: use the x86 scaling addressing mode rather than shifting]
Signed-off-by: Stas Sergeev <stsp@aknet.ru>
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Zachary Amsden <zach@vmware.com>
Acked-by: Chuck Ebbert <76306.1226@compuserve.com>
Acked-by: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>