Impact: Fix boot failure on EFI system with large runtime memory range
Brian Maly reported that some EFI system with large runtime memory
range can not boot. Because the FIX_MAP used to map runtime memory
range is smaller than run time memory range.
This patch fixes this issue by re-implement efi_ioremap() with
init_memory_mapping().
Reported-and-tested-by: Brian Maly <bmaly@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Cc: Brian Maly <bmaly@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236135513.6204.306.camel@yhuang-dev.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reactivate DMI quirks on EFI hardware
DMI tables are loaded by EFI, so the dmi calls must happen after
efi_init() and not before.
Currently Apple hardware uses DMI to determine the framebuffer mappings
for efifb. Without DMI working you also have no video on MacBook Pro.
This patch resolves the DMI issue for EFI hardware (DMI is now properly
detected at boot), and additionally efifb now loads on Apple hardware
(i.e. video works).
Signed-off-by: Brian Maly <bmaly@redhat>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: ying.huang@intel.com
LKML-Reference: <49ADEDA3.1030406@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/setup.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
Impact: build fix
The APIC code rewrite in the x86 tree broke the x86/mce branch:
arch/x86/kernel/cpu/mcheck/threshold.c: In function ‘mce_threshold_interrupt’:
arch/x86/kernel/cpu/mcheck/threshold.c:24: error: implicit declaration of function ‘ack_APIC_irq’
Also tidy up the file a bit while at it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix bad frame in rt_sigreturn on 64-bit
After commit 97286a2b64 some applications
fail to return from signal handler:
[ 145.150133] firefox[3250] bad frame in rt_sigreturn frame:00007f902b44eb28 ip:352e80b307 sp:7f902b44ef70 orax:ffffffffffffffff in libpthread-2.9.so[352e800000+17000]
[ 665.519017] firefox[5420] bad frame in rt_sigreturn frame:00007faa8deaeb28 ip:352e80b307 sp:7faa8deaef70 orax:ffffffffffffffff in libpthread-2.9.so[352e800000+17000]
The root cause is forgetting to keep 64 byte aligned value of
fpstate for next stack pointer calculation.
Reported-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
LKML-Reference: <49AC85C1.7060600@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On x86-64, a 32-bit process (TIF_IA32) can switch to 64-bit mode with
ljmp, and then use the "syscall" instruction to make a 64-bit system
call. A 64-bit process make a 32-bit system call with int $0x80.
In both these cases, audit_syscall_entry() will use the wrong system
call number table and the wrong system call argument registers. This
could be used to circumvent a syscall audit configuration that filters
based on the syscall numbers or argument details.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With x86-32 and -64 using the same mechanism for managing the
tss io permissions bitmap, large chunks of process*.c are
trivially unifyable, including:
- exit_thread
- flush_thread
- __switch_to_xtra (along with tsc enable/disable)
and as bonus pickups:
- sys_fork
- sys_vfork
(Note: asmlinkage expands to empty on x86-64)
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove 32-bit optimization to prepare unification
x86-32 and -64 differ in the way they context-switch tasks
with io permission bitmaps. x86-64 simply copies the next
tasks io bitmap into place (if any) on context switch. x86-32
invalidates the bitmap on context switch, so that the next
IO instruction will fault; at that point it installs the
appropriate IO bitmap.
This makes context switching IO-bitmap-using tasks a bit more
less expensive, at the cost of making the next IO instruction
slower due to the extra fault. This tradeoff only makes sense
if IO-bitmap-using processes are relatively common, but they
don't actually use IO instructions very often.
However, in a typical desktop system, the only process likely
to be using IO bitmaps is the X server, and nothing at all on
a server. Therefore the lazy context switch doesn't really win
all that much, and its just a gratuitious difference from
64-bit code.
This patch removes the lazy context switch, with a view to
unifying this code in a later change.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove __cpuinitdata section placement for translation_table
structure, since it is referenced from a functions within .text.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Remove __init section placement for some functions/data, so that
we don't get section mismatch warnings.
Also make inline function instead of empty setup_summit macro.
[v2]
One of them was not caught by
DEBUG_SECTION_MISMATCH=y
magic. Fix it.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Remove __init section placement for some functions, so that we don't
get section mismatch warnings.
[v2]:
2 of them were not caught by
DEBUG_SECTION_MISMATCH=y
magic. Fix it.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Perform same-cluster checking even for masks with all (nr_cpu_ids)
bits set and report correct apicid on success instead.
While at it, convert it to for_each_cpu and newer cpumask api.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Perform same-cluster checking even for masks with all (nr_cpu_ids)
bits set and report BAD_APICID on failure.
While at it, convert it to for_each_cpu.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove es7000_cpu_mask_to_apicid_cluster completely, because it's
almost the same as es7000_cpu_mask_to_apicid except 2 code paths.
One of them is about to be removed soon, the another should be
BAD_APICID (it's a fail path).
The _cluster one was not invoked on apic->cpu_mask_to_apicid_and
anyway, since there was no _cluster_and variant.
Also use newer cpumask functions.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The ones which go only into struct apic are de-inlined
by compiler anyway, so remove the inline specifier from them.
Afterwards, remove bigsmp_setup_portio_remap completely as it
is unused.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: unification
show_cpuinfo_core is identical for 32 and 64 bit and can be unified,
and CONFIG_X86_HT inherently depends on CONFIG_X86_SMP.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
Add missing __user annotation to the parameter of get_sigframe().
Also change cast type to void __user * of *fpstate.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix this warning:
arch/x86/kernel/cpu/intel_cacheinfo.c:139: warning: ‘k8_nb_id’ defined but not used
arch/x86/kernel/cpu/intel_cacheinfo.c:527: warning: ‘free_cache_attributes’ defined but not used
arch/x86/kernel/cpu/intel_cacheinfo.c:538: warning: ‘detect_cache_attributes’ defined but not used
Unused variables in the !CONFIG_SYSCTL case.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/kernel/apic/es7000_32.c:702: error: 'es7000_acpi_madt_oem_check_cluster' undeclared here (not in a function)
Provide a es7000_acpi_madt_oem_check_cluster() definition in the !ACPI
case too.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
init_deasserted is only available on SMP. Make the secondary-wakeup
function conditional on SMP.
Also clean up the file some.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
- rename apic->wakeup_cpu to apic->wakeup_secondary_cpu, to
make it apparent that this is an SMP-only method
- handle NULL ->wakeup_secondary_cpus to mean the default INIT
wakeup sequence - this allows simplification of the APIC
driver templates.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
that is only needed when CONFIG_X86_VSMP is defined with 64bit
also remove dead code about PCI, because CONFIG_X86_VSMP depends on PCI
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
x86_quirks->update_apic() calling looks crazy. so try to remove it:
1. every apic take wakeup_cpu member directly
2. separate es7000_apic to es7000_apic_cluster
3. use uv_wakeup_cpu directly
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Recent changes in setup_percpu.c made a now meaningless DBG()
statement fail to compile and introduced a
comparison-of-different-types warning. Fix them.
Compile failure is reported by Ingo Molnar.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
Impact: Fix marginal race condition
One the first CPU the machine checks are enabled early before
the local APIC is enabled. This could in theory lead
to some lost CMCI events very early during boot because
CMCIs cannot be delivered with disabled LAPIC.
The poller also doesn't recover from this because it doesn't
check CMCI banks.
Add an explicit CMCI banks check after the LAPIC is enabled.
This is only done for CPU #0, the other CPUs only initialize
machine checks after the LAPIC is on.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Avoids confusing other OSes.
Disable the CMCI vector on reboot to avoid confusing other OS.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix on UP
The MCE code is reinitialized from resume, so we can't use
__cpuinit/__cpuexit for most of the code. Remove those annotations
for anything downstream of mce_init().
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Major new feature
Intel CMCI (Corrected Machine Check Interrupt) is a new
feature on Nehalem CPUs. It allows the CPU to trigger
interrupts on corrected events, which allows faster
reaction to them instead of with the traditional
polling timer.
Also use CMCI to discover shared banks. Machine check banks
can be shared by CPU threads or even cores. Using the CMCI enable
bit it is possible to detect the fact that another CPU already
saw a specific bank. Use this to assign shared banks only
to one CPU to avoid reporting duplicated events.
On CPU hot unplug bank sharing is re discovered. This is done
using a thread that cycles through all the CPUs.
To avoid races between the poller and CMCI we only poll
for banks that are not CMCI capable and only check CMCI
owned banks on a interrupt.
The shared banks ownership information is currently only used for
CMCI interrupts, not polled banks.
The sharing discovery code follows the algorithm recommended in the
IA32 SDM Vol3a 14.5.2.1
The CMCI interrupt handler just calls the machine check poller to
pick up the machine check event that caused the interrupt.
I decided not to implement a separate threshold event like
the AMD version has, because the threshold is always one currently
and adding another event didn't seem to add any value.
Some code inspired by Yunhong Jiang's Xen implementation,
which was in term inspired by a earlier CMCI implementation
by me.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Define a per cpu bitmap that contains the banks polled by the machine
check poller. This is needed for the CMCI code in the next patches
to be able to disable polling on specific banks.
The bank by default contains all banks, so there is no behaviour
change. Only future code will remove some banks from the polling
set.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: behavior change, use common code
Use a standard leaky bucket ratelimit for the machine check
warning print interval instead of waiting every check_interval.
Also decrease the limit to twice per minute.
This interacts better with threshold interrupts because
they can happen more often than check_interval.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: minor bugfix
The threshold handler on AMD (and soon on Intel) could be theoretically
reentered by the hardware. This could lead to corrupted events
because the machine check poll code assumes it is not reentered.
Move the APIC ACK to the end of the interrupt handler to let
the hardware avoid that.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup; preparation for feature
The mce_amd_64 code has an own private MC threshold vector with an own
interrupt handler. Since Intel needs a similar handler
it makes sense to share the vector because both can not
be active at the same time.
I factored the common APIC handler code into a separate file which can
be used by both the Intel or AMD MC code.
This is needed for the next patch which adds an Intel specific
CMCI handler.
This patch should be a nop for AMD, it just moves some code
around.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Cleanup (code movement)
Move MAX_NR_BANKS into mce.h because it's needed there
for followup patches.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The ones which go only into struct genapic are de-inlined
by compiler anyway, so remove the inline specifier from them.
Afterwards, remove summit_setup_portio_remap completely as it
is unused.
Remove inline also from summit_cpu_mask_to_apicid, since it's
not worth it (it is used in struct genapic too).
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use BAD_APICID instead of 0xFF constants in summit_cpu_mask_to_apicid.
Also remove bogus comments about what we actually return.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
native_usergs_sysret64 is described as
extern void native_usergs_sysret64(void)
so lets add ENDPROC here
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: heukelum@fastmail.fm
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
NEXT_PAGE already has 'balign' so no
need to keep this redundant one.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: heukelum@fastmail.fm
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add better first percpu allocation for NUMA
On NUMA, embedding allocator can't be used as different units can't be
made to fall in the correct NUMA nodes. To use large page mapping,
each unit needs to be remapped. However, percpu areas are usually
much smaller than large page size and unused space hurts a lot as the
number of cpus grow. This allocator remaps large pages for each chunk
but gives back unused part to the bootmem allocator making the large
pages mapped twice.
This adds slightly to the TLB pressure but is much better than using
4k mappings while still being NUMA-friendly.
Ingo suggested that this would be the correct approach for NUMA.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Impact: add better first percpu allocation for !NUMA
On !NUMA, we can simply allocate contiguous memory and use it for the
first chunk without mapping it into vmalloc area. As the memory area
is covered by the large page physical memory mapping, it allows the
dynamic perpcu allocator to not add any TLB overhead for the static
percpu area and whatever falls into the first chunk and the
implementation is very simple too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: modularize percpu first chunk allocation
x86 is gonna have a few different strategies for the first chunk
allocation. Modularize it by separating out the current allocation
mechanism into pcpu_alloc_bootmem() and setup_pcpu_4k().
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: more latitude for first percpu chunk allocation
The first percpu chunk serves the kernel static percpu area and may or
may not contain extra room for further dynamic allocation.
Initialization of the first chunk needs to be done before normal
memory allocation service is up, so it has its own init path -
pcpu_setup_static().
It seems archs need more latitude while initializing the first chunk
for example to take advantage of large page mapping. This patch makes
the following changes to allow this.
* Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space
to reserve in the first chunk for further dynamic allocation.
* Rename pcpu_setup_static() to pcpu_setup_first_chunk().
* Make pcpu_setup_first_chunk() much more flexible by fetching page
pointer by callback and adding optional @unit_size, @free_size and
@base_addr arguments which allow archs to selectively part of chunk
initialization to their likings.
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: minor change to populate_extra_pte() and addition of pmd flavor
Update populate_extra_pte() to return pointer to the pte_t for the
specified address and add populate_extra_pmd() which only populates
till the pmd and returns pointer to the pmd entry for the address.
For 64bit, pud/pmd/pte fill functions are separated out from
set_pte_vaddr[_pud]() and used for set_pte_vaddr[_pud]() and
populate_extra_{pte|pmd}().
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: Bug fix when CPU hotplug is disabled
Correct the following broken __cpuinit/__cpuexit annotations:
- mce_cpu_features() is called from mce_resume(), and so cannot be
__cpuinit.
- mce_disable_cpu() and mce_reenable_cpu() are called from
mce_cpu_callback(), and so cannot be __cpuexit().
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: Cleanup
Checkin be44d2aabc eliminates the use of
a 16-bit stack for espfix. However, at least one instruction remained
that only operated on the low 16 bits of %esp.
This is not a bug per se because the kernel stack is always an aligned
4K or 8K block. Therefore it cannot cross 64K boundaries; this code,
in fact, relies strictly on that fact.
However, it's a lot cleaner (and, for that matter, smaller) to operate
on the entire 32-bit register.
Signed-off-by: Stas Sergeev <stsp@aknet.ru>
CC: Zachary Amsden <zach@vmware.com>
CC: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: fix early crash on LinuxBIOS systems
Kevin O'Connor reported that Coreboot aka LinuxBIOS tries to put
mptable somewhere very high, well above max_low_pfn (below which
BIOSes generally put the mptable), causing a panic.
The BIOS will probably be changed to be compatible with older
Linus versions, but nevertheless the MP-spec does not forbid
an MP-table in arbitrary system RAM, so make sure it all
works even if the table is in an unexpected place.
Check physptr with max_low_pfn * PAGE_SIZE.
Reported-by: Kevin O'Connor <kevin@koconnor.net>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Stefan Reinauer <stepan@coresystems.de>
Cc: coreboot@coreboot.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Make x86_quirks support more transparent. The highlevel
methods are now named:
extern void x86_quirk_pre_intr_init(void);
extern void x86_quirk_intr_init(void);
extern void x86_quirk_trap_init(void);
extern void x86_quirk_pre_time_init(void);
extern void x86_quirk_time_init(void);
This makes it clear that if some platform extension has to
do something here that it is considered ... weird, and is
discouraged.
Also remove arch_hooks.h and move it into setup.h (and other
header files where appropriate).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove dead code
Remove:
- pre_setup_arch_hook()
- mca_nmi_hook()
If needed they can be added back via an x86_quirk handler.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the sysdev_suspend/resume from the callee to the callers, with
no real change in semantics, so that we can rework the disabling of
interrupts during suspend/hibernation.
This is based on an earlier patch from Linus.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Right now nobody cares, but the suspend/resume code will eventually want
to suspend device interrupts without suspending the timer, and will
depend on this flag to know.
The modern x86 timer infrastructure uses the local APIC timers and never
shows up as a device interrupt at all, so it isn't affected and doesn't
need any of this.
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If BIOS hands over the control to OS in legacy xapic mode, select
legacy xapic related ops in the early apic probe and shift to x2apic
ops later in the boot sequence, only after enabling x2apic mode.
If BIOS hands over the control in x2apic mode, select x2apic related
ops in the early apic probe.
This fixes the early boot panic, where we were selecting x2apic ops,
while the cpu is still in legacy xapic mode.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix these sparse warnings:
arch/x86/kernel/machine_kexec_32.c:124:22: warning: Using plain integer as NULL pointer
arch/x86/kernel/traps.c:950:24: warning: Using plain integer as NULL pointer
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As acpi_enter_sleep_state can fail, take this into account in
do_suspend_lowlevel and don't return to the do_suspend_lowlevel's
caller. This would break (currently) fpu status and preempt count.
Technically, this means use `call' instead of `jmp' and `jmp' to
the `resume_point' after the `call' (i.e. if
acpi_enter_sleep_state returns=fails). `resume_point' will handle
the restore of fpu and preempt count gracefully.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
- remove %ds re-set, it's already set in wakeup_long64
- remove double labels and alignment (ENTRY already adds both)
- use meaningful resume point labelname
- skip alignment while jumping from wakeup_long64 to the resume point
- remove .size, .type and unused labels
[v2]
- added ENDPROCs
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
Impact: Bug fix on UP
Checkin 6ec68bff3c:
x86, mce: reinitialize per cpu features on resume
introduced a call to mce_cpu_features() in the resume path, in order
for the MCE machinery to get properly reinitialized after a resume.
However, this function (and its successors) was flagged __cpuinit,
which becomes __init on UP configurations (on SMP suspend/resume
requires CPU hotplug and so this would not be seen.)
Remove the offending __cpuinit annotations for mce_cpu_features() and
its successor functions.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: cleanup
Rename TASK_SIZE64 to TASK_SIZE_MAX, and provide the
define on 32-bit too. (mapped to TASK_SIZE)
This allows 32-bit code to make use of the (former-) TASK_SIZE64
symbol as well, in a clean way.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
clean up vmi_read_cycles to use max()
Reported-b: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Zach Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new dynamic allocator, unified access to static/dynamic
percpu memory
Convert to the new dynamic percpu allocator.
* implement populate_extra_pte() for both 32 and 64
* update setup_per_cpu_areas() to use pcpu_setup_static()
* define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr()
* define config HAVE_DYNAMIC_PER_CPU_AREA
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: cleanup
There are two allocated per-cpu accessor macros with almost identical
spelling. The original and far more popular is per_cpu_ptr (44
files), so change over the other 4 files.
tj: kill percpu_ptr() and update UP too
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: mingo@redhat.com
Cc: lenb@kernel.org
Cc: cpufreq@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: economize memory for large NR_CPUS
percpu data is setup earlier than irq, we can use percpu data
to economize memory.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Impact: fix time warps under vmware
Similar to the check for TSC going backwards in the TSC clocksource,
we also need this check for VMI clocksource.
Signed-off-by: Alok N Kataria <akataria@vmware.com>
Cc: Zachary Amsden <zach@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Impact: Cleanup
The standard spelling of a printf pattern for long long is "ll", not
"L", which is for long double.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup, performance enhancement
The machine check poller is diverging more and more from the fatal
exception handler. Instead of adding more special cases separate the code
paths completely. The corrected poll path is actually quite simple,
and this doesn't result in much code duplication.
This makes both handlers much easier to read and results in
cleaner code flow. The exception handler now only needs to care
about uncorrected errors, which also simplifies the handling of multiple
errors. The corrected poller also now always runs in standard interrupt
context and does not need to do anything special to handle NMI context.
Minor behaviour changes:
- MCG status is now not cleared on polling.
- Only the banks which had corrected errors get cleared on polling
- The exception handler only clears banks with errors now
v2: Forward port to new patch order. Add "uc" argument.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup
This merely factors out duplicated code to set up
the initial struct mce state into a single function.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: cleanup; making code future proof; memory saving on small systems
This patch replaces the hardcoded max number of machine check banks with
dynamic allocation depending on what the CPU reports. The sysfs
data structures and the banks array are dynamically allocated.
There is still a hard bank limit (128) because the mcelog protocol uses
banks >= 128 as pseudo banks to escape other events. But we expect
that 128 banks is beyond any reasonable CPU for now.
This supersedes an earlier patch by Venki, but it solves the problem
more completely by making the limit fully dynamic (up to the 128
boundary).
This saves some memory on machines with less than 6 banks because
they won't need sysdevs for unused ones and also allows to
use sysfs to control these banks on possible future CPUs with
more than 6 banks.
This is an updated patch addressing Venki's comments. I also added in
another patch from Thomas which fixed the error allocation path (that
patch was previously separated)
Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, mce: fix ifdef for 64bit thermal apic vector clear on shutdown
x86, mce: use force_sig_info to kill process in machine check
x86, mce: reinitialize per cpu features on resume
x86, rcu: fix strange load average and ksoftirqd behavior
Impact: bugfix
Considering the situation as follow:
before: mcelog.next == 1, mcelog.entry[0].finished = 1
+--------------------------------------------------------------------------
R W1 W2 W3
read mcelog.next (1)
mcelog.next++ (2)
(working on entry 1,
finished == 0)
mcelog.next = 0
mcelog.next++ (1)
(working on entry 0)
mcelog.next++ (2)
(working on entry 1)
<----------------- race ---------------->
(done on entry 1,
finished = 1)
(done on entry 1,
finished = 1)
To fix the race condition, a cmpxchg loop is added to mce_read() to
ensure no new MCE record can be added between mcelog.next reading and
mcelog.next = 0.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Lower priority bug fix
Offlined CPUs could still get machine checks, but the machine check handler
cannot handle them properly, leading to an unconditional crash. Disable
machine checks on CPUs that are going down.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: bug fix, in this case the resume handler shouldn't run which
avoids incorrectly reenabling machine checks on resume
When MCEs are completely disabled on the command line don't set
up the sysdev devices for them either.
Includes a comment fix from Thomas Gleixner.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Higher priority bug fix
The machine check poller runs a single timer and then broadcasted an
IPI to all CPUs to check them. This leads to unnecessary
synchronization between CPUs. The original CPU running the timer has
to wait potentially a long time for all other CPUs answering. This is
also real time unfriendly and in general inefficient.
This was especially a problem on systems with a lot of events where
the poller run with a higher frequency after processing some events.
There could be more and more CPU time wasted with this, to
the point of significantly slowing down machines.
The machine check polling is actually fully independent per CPU, so
there's no reason to not just do this all with per CPU timers. This
patch implements that.
Also switch the poller also to use standard timers instead of work
queues. It was using work queues to be able to execute a user program
on a event, but mce_notify_user() handles this case now with a
separate callback. So instead always run the poll code in in a
standard per CPU timer, which means that in the common case of not
having to execute a trigger there will be less overhead.
This allows to clean up the initialization significantly, because
standard timers are already up when machine checks get init'ed. No
multiple initialization functions.
Thanks to Thomas Gleixner for some help.
Cc: thockin@google.com
v2: Use del_timer_sync() on cpu shutdown and don't try to handle
migrated timers.
v3: Add WARN_ON for timer running on unexpected CPU
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Needed for bug fix in next patch
This relaxes the requirement that mce_notify_user has to run in process
context. Useful for future changes, but also leads to cleaner
behaviour now. Now instead mce_notify_user can be called directly
from interrupt (but not NMI) context.
The work queue only uses a single global work struct, which can be done safely
because it is always free to reuse before the trigger function is executed.
This way no events can be lost.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: low priority bug fix
This removes part of a a patch I added myself some time ago. After some
consideration the patch was a bad idea. In particular it stopped machine check
exceptions during code patching.
To quote the comment:
* MCEs only happen when something got corrupted and in this
* case we must do something about the corruption.
* Ignoring it is worse than a unlikely patching race.
* Also machine checks tend to be broadcast and if one CPU
* goes into machine check the others follow quickly, so we don't
* expect a machine check to cause undue problems during to code
* patching.
So undo the machine check related parts of
8f4e956b31 NMIs are still disabled.
This only removes code, the only additions are a new comment.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix
During suspend it is not reliable to process machine check
exceptions, because CPUs disappear but can still get machine check
broadcasts. Also the system is slightly more likely to
machine check them, but the handler is typically not a position
to handle them in a meaningfull way.
So disable them during suspend and enable them during resume.
Also make sure they are always disabled on hot-unplugged CPUs.
This new code assumes that suspend always hotunplugs all
non BP CPUs.
v2: Remove the WARN_ONs Thomas objected to.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bugfix
The ifdef for the apic clear on shutdown for the 64bit intel thermal
vector was incorrect and never triggered. Fix that.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: bug fix (with tolerant == 3)
do_exit cannot be called directly from the exception handler because
it can sleep and the exception handler runs on the exception stack.
Use force_sig() instead.
Based on a earlier patch by Ying Huang who debugged the problem.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Bug fix
This fixes a long standing bug in the machine check code. On resume the
boot CPU wouldn't get its vendor specific state like thermal handling
reinitialized. This means the boot cpu wouldn't ever get any thermal
events reported again.
Call the respective initialization functions on resume
v2: Remove ancient init because they don't have a resume device anyways.
Pointed out by Thomas Gleixner.
v3: Now fix the Subject too to reflect v2 change
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, vm86: fix preemption bug
x86, olpc: fix model detection without OFW
x86, hpet: fix for LS21 + HPET = boot hang
x86: CPA avoid repeated lazy mmu flush
x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
x86/cpa: make sure cpa is safe to call in lazy mmu mode
x86, ptrace, mm: fix double-free on race
Impact: build fix, cleanup
A couple of arch setup callbacks were mistakenly in apic_32.c, breaking
the build.
Also simplify the code a bit.
Signed-off-by: Ingo Molnar <mingo@elte.hu>