Impact: fix crash during hibernation on 32-bit NUMA
The NUMA code on x86_32 creates special memory mapping that allows
each node's pgdat to be located in this node's memory. For this
purpose it allocates a memory area at the end of each node's memory
and maps this area so that it is accessible with virtual addresses
belonging to low memory. As a result, if there is high memory,
these NUMA-allocated areas are physically located in high memory,
although they are mapped to low memory addresses.
Our hibernation code does not take that into account and for this
reason hibernation fails on all x86_32 systems with CONFIG_NUMA=y and
with high memory present. Fix this by adding a special mapping for
the NUMA-allocated memory areas to the temporary page tables created
during the last phase of resume.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "x86: default to reboot via ACPI"
x86: align DirectMap in /proc/meminfo
AMD IOMMU: fix lazy IO/TLB flushing in unmap path
x86: add smp_mb() before sending INVALIDATE_TLB_VECTOR
x86: remove VISWS and PARAVIRT around NR_IRQS puzzle
x86: mention ACPI in top-level Kconfig menu
x86: size NR_IRQS on 32-bit systems the same way as 64-bit
x86: don't allow nr_irqs > NR_IRQS
x86/docs: remove noirqbalance param docs
x86: don't use tsc_khz to calculate lpj if notsc is passed
x86, voyager: fix smp_intr_init() compile breakage
AMD IOMMU: fix detection of NP capable IOMMUs
Impact: right-align /proc/meminfo consistent with other fields
When the split-LRU patches added Inactive(anon) and Inactive(file) lines
to /proc/meminfo, all counts were moved two columns rightwards to fit in.
Now move x86's DirectMap lines two columns rightwards to line up.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: introduce new APIs, separate kmap code from CONFIG_HIGHMEM
This takes the code used for CONFIG_HIGHMEM memory mappings except that
it's designed for dynamic IO resource mapping.
These fixmaps are available even with CONFIG_HIGHMEM turned off.
Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: allow /dev/mem mmaps on non-PAT CPUs/platforms
Fix mmap to /dev/mem when CONFIG_X86_PAT is off and CONFIG_STRICT_DEVMEM is
off
mmap to /dev/mem on kernel memory has been failing since the
introduction of PAT (CONFIG_STRICT_DEVMEM=n case). Seems like
the check to avoid cache aliasing with PAT is kicking in even
when PAT is disabled. The bug seems to have crept in 2.6.26.
This patch makes sure that the mmap to regular
kernel memory succeeds if CONFIG_STRICT_DEVMEM=n and
PAT is disabled, and the checks to avoid cache aliasing
still happens if PAT is enabled.
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Tested-by: Tim Sirianni <tim@scalemp.com>
Cc: <stable@kernel.org>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: remove incorrect WARN_ON(1)
Gets rid of dmesg spam created during physical memory hot-add which
will very likely confuse users. The change removes what appears to
be debugging code which I assume was unintentionally included in:
x86: arch/x86/mm/init_64.c printk fixes
commit 10f22dde55
Signed-off-by: Gary Hade <garyhade@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: some new sparse warnings in e820.c etc, but no functional change.
As with regular ioremap, iounmap etc, annotate with __iomem.
Fixes the following sparse warnings, will produce some new ones
elsewhere in arch/x86 that will get worked out over time.
arch/x86/mm/ioremap.c:402:9: warning: cast removes address space of expression
arch/x86/mm/ioremap.c:406:10: warning: cast adds address space to expression (<asn:2>)
arch/x86/mm/ioremap.c:782:19: warning: Using plain integer as NULL pointer
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so users are not confused with memhole causing big total ram
we don't need to worry about 32 bit, because memhole is always
above max_low_pfn.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crash with memory hotplug
Shuahua Li found:
| I just did some experiments on a desktop for memory hotplug and this bug
| triggered a crash in my test.
|
| Yinghai's suggestion also fixed the bug.
We don't need to round it, just remove that extra -1
Signed-off-by: Yinghai <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix section mismatch warning - apic_x2apic_phys
x86: fix section mismatch warning - apic_x2apic_cluster
x86: fix section mismatch warning - apic_x2apic_uv_x
x86: fix section mismatch warning - apic_physflat
x86: fix section mismatch warning - apic_flat
x86: memtest fix use of reserve_early()
x86 syscall.h: fix argument order
x86/tlb_uv: remove strange mc146818rtc include
x86: remove redundant KERN_DEBUG on pr_debug
x86: do_boot_cpu - check if we have ESR register
x86: MAINTAINERS change for AMD microcode patch loader
x86/proc: fix /proc/cpuinfo cpu offline bug
x86: call dmi-quirks for HP Laptops after early-quirks are executed
x86, kexec: fix hang on i386 when panic occurs while console_sem is held
MCE: Don't run 32bit machine checks with interrupts on
x86: SB600: skip IRQ0 override if it is not routed to INT2 of IOAPIC
x86: make variables static
Hi all,
Wrong usage of 2nd parameter in reserve_early call.
66/75: reserve_early(start_bad, last_bad - start_bad, "BAD RAM");
^^^^^^^^^^^^^^^^^^^^
The correct way is to use 'end' address and not 'size'.
As a bonus a fix to the printk format.
Signed-off-by: Daniele Calore <orkaan@orkaan.org>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).
The biggest problem with vmap is actually vunmap. Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache. This is all done under a global lock. As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
This gives terrible quadratic scalability characteristics.
Another problem is that the entire vmap subsystem works under a single
lock. It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.
This is a rewrite of vmap subsystem to solve those problems. The existing
vmalloc API is implemented on top of the rewritten subsystem.
The TLB flushing problem is solved by using lazy TLB unmapping. vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated. So the addresses aren't allocated again until
a subsequent TLB flush. A single TLB flush then can flush multiple
vunmaps from each CPU.
XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.
The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.
There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.
To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap. Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).
As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages. Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron. Results are
in nanoseconds per map+touch+unmap.
threads vanilla vmap rewrite
1 14700 2900
2 33600 3000
4 49500 2800
8 70631 2900
So with a 8 cores, the rewritten version is already 25x faster.
In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram... along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system. I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now. vmap is pretty well blown off the
profiles.
Before:
1352059 total 0.1401
798784 _write_lock 8320.6667 <- vmlist_lock
529313 default_idle 1181.5022
15242 smp_call_function 15.8771 <- vmap tlb flushing
2472 __get_vm_area_node 1.9312 <- vmap
1762 remove_vm_area 4.5885 <- vunmap
316 map_vm_area 0.2297 <- vmap
312 kfree 0.1950
300 _spin_lock 3.1250
252 sn_send_IPI_phys 0.4375 <- tlb flushing
238 vmap 0.8264 <- vmap
216 find_lock_page 0.5192
196 find_next_bit 0.3603
136 sn2_send_IPI 0.2024
130 pio_phys_write_mmr 2.0312
118 unmap_kernel_range 0.1229
After:
78406 total 0.0081
40053 default_idle 89.4040
33576 ia64_spinlock_contention 349.7500
1650 _spin_lock 17.1875
319 __reg_op 0.5538
281 _atomic_dec_and_lock 1.0977
153 mutex_unlock 1.5938
123 iget_locked 0.1671
117 xfs_dir_lookup 0.1662
117 dput 0.1406
114 xfs_iget_core 0.0268
92 xfs_da_hashname 0.1917
75 d_alloc 0.0670
68 vmap_page_range 0.0462 <- vmap
58 kmem_cache_alloc 0.0604
57 memset 0.0540
52 rb_next 0.1625
50 __copy_user 0.0208
49 bitmap_find_free_region 0.2188 <- vmap
46 ia64_sn_udelay 0.1106
45 find_inode_fast 0.1406
42 memcmp 0.2188
42 finish_task_switch 0.1094
42 __d_lookup 0.0410
40 radix_tree_lookup_slot 0.1250
37 _spin_unlock_irqrestore 0.3854
36 xfs_bmapi 0.0050
36 kmem_cache_free 0.0256
35 xfs_vn_getattr 0.0322
34 radix_tree_lookup 0.1062
33 __link_path_walk 0.0035
31 xfs_da_do_buf 0.0091
30 _xfs_buf_find 0.0204
28 find_get_page 0.0875
27 xfs_iread 0.0241
27 __strncpy_from_user 0.2812
26 _xfs_buf_initialize 0.0406
24 _xfs_buf_lookup_pages 0.0179
24 vunmap_page_range 0.0250 <- vunmap
23 find_lock_page 0.0799
22 vm_map_ram 0.0087 <- vmap
20 kfree 0.0125
19 put_page 0.0330
18 __kmalloc 0.0176
17 xfs_da_node_lookup_int 0.0086
17 _read_lock 0.0885
17 page_waitqueue 0.0664
vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.
[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The driver would like to map IO space directly for copying data in when
appropriate, to avoid CPU cache flushing for streaming writes.
kmap_atomic_pfn lets us avoid IPIs associated with ioremap for this process.
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Dave Airlie <airlied@redhat.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix compat-vdso
x86/mm: unify init task OOM handling
x86/mm: do not trigger a kernel warning if user-space disables interrupts and generates a page fault
Offer mmiotrace users a function to inject markers from inside the kernel.
This depends on the trace_vprintk() patch.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When SIL, DIL, BPL or SPL registers were used in MMIO, the datum
was extracted from AH, BH, CH, or DH, which are incorrect.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
Cc: "Steven Rostedt" <srostedt@redhat.com>
Cc: proski@gnu.org
Cc: "Pekka Enberg"
<penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus noticed that the "again:" versus "survive:" OOM logic for
the init task was arbitrarily different.
The 64-bit codepath is the better one, because it correctly re-lookups
the vma after having dropped the ->mmap_sem.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Arjan reported a spike in the following bug pattern in v2.6.27:
http://www.kerneloops.org/searchweek.php?search=lock_page
which happens because hwclock started triggering warnings due to
a (correct) might_sleep() check in the MM code.
The warning occurs because hwclock uses this dubious sequence of
code to run "atomic" code:
static unsigned long
atomic(const char *name, unsigned long (*op)(unsigned long),
unsigned long arg)
{
unsigned long v;
__asm__ volatile ("cli");
v = (*op)(arg);
__asm__ volatile ("sti");
return v;
}
Then it pagefaults in that "atomic" section, triggering the warning.
There is no way the kernel could provide "atomicity" in this path,
a page fault is a cannot-continue machine event so the kernel has to
wait for the page to be filled in.
Even if it was just a minor fault we'd have to take locks and might have
to spend quite a bit of time with interrupts disabled - not nice to irq
latencies in general.
So instead just enable interrupts in the pagefault path unconditionally
if we come from user-space, and handle the fault.
Also, while touching this code, unify some trivial parts of the x86
VM paths at the same time.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so we could remove the requirement that one needs to call
early_iounmap() in exactly reverse order of early_ioremap().
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
virt_addr_valid() calls __pa(), which calls __phys_addr(). With
CONFIG_DEBUG_VIRTUAL=y, __phys_addr() will kill the kernel if the
address *isn't* valid. That's clearly wrong for virt_addr_valid().
We also incorporate the debugging checks into virt_addr_valid().
Signed-off-by: Vegard Nossum <vegardno@ben.ifi.uio.no>
Acked-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The last use of trace_hardirqs_fixup is unnecessary, because the
trap is taken with interrupt off on i386 as well as x86_64, and
the irq-tracer is notified of this from the assembly code.
trace_hardirqs_fixup and trace_hardirqs_fixup_flags are removed
from include/asm-x86/irqflags.h as they are no longer used.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Portions of the ACPI code needs to know if a system is a UV system prior
to genapic initialization. This patch adds a call early_acpi_boot_init()
so that the apic type is discovered earlier.
V2 of the patch adding fixes from Yinghai Lu.
Much cleaner and smaller.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since pte_flags() is much cheaper than pte_val() in some virtualized
environments (namely, Xen), use the former whereever possible.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: "Nick Piggin" <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When nr_range gets decremented, the same slot must be considered for
coalescing with its new successor again.
The issue is apparently pretty benign to native code, but surfaces as a
boot time crash in our forward ported Xen tree (where the page table
setup overall works differently than in native).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The check prevents flags on mappings from being changed, which is not
desireable. There's no need to check for replacing a mapping, and
x86-32 does not do this check.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
early_ioremap() is also used to map normal memory when constructing
the linear memory mapping. However, since we sometimes need to be able
to distinguish between actual IO mappings and normal memory mappings,
add a early_memremap() call, which maps with PAGE_KERNEL (as opposed
to PAGE_KERNEL_IO for early_ioremap()), and use it when constructing
pagetables.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use one of the software-defined PTE bits to indicate that a mapping is
intended for an IO address. On native hardware this is irrelevent,
since a physical address is a physical address. But in a virtual
environment, physical addresses are also virtualized, so there needs
to be some way to distinguish between pseudo-physical addresses and
actual hardware addresses; _PAGE_IOMAP indicates this intent.
By default, __supported_pte_mask masks out _PAGE_IOMAP, so it doesn't
even appear in the final pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the prototypes from the generic kernel.h header to the more
appropriate include/asm-x86/bios_ebda.h header file.
Also, remove the check from the power management code - this is a
pure x86 matter for now.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86 implementation of early_ioremap has an off by one error. If we get
an object which ends on the first byte of a page we undermap by one page and
this causes a crash on boot with the ASUS P5QL whose DMI table happens to fit
this alignment.
The size computation is currently
last_addr = phys_addr + size - 1;
npages = (PAGE_ALIGN(last_addr) - phys_addr)
(Consider a request for 1 byte at alignment 0...)
Closes#11693
Debugging work by Ian Campbell/Felix Geyer
Signed-off-by: Alan Cox <alan@rehat.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeremy Fitzhardinge wrote:
> I'd noticed that current tip/master hasn't been booting under Xen, and I
> just got around to bisecting it down to this change.
>
> commit 065ae73c5462d42e9761afb76f2b52965ff45bd6
> Author: Suresh Siddha <suresh.b.siddha@intel.com>
>
> x86, cpa: make the kernel physical mapping initialization a two pass sequence
>
> This patch is causing Xen to fail various pagetable updates because it
> ends up remapping pagetables to RW, which Xen explicitly prohibits (as
> that would allow guests to make arbitrary changes to pagetables, rather
> than have them mediated by the hypervisor).
Instead of making init a two pass sequence, to satisfy the Intel's TLB
Application note (developer.intel.com/design/processor/applnots/317080.pdf
Section 6 page 26), we preserve the original page permissions
when fragmenting the large mappings and don't touch the existing memory
mapping (which satisfies Xen's requirements).
Only open issue is: on a native linux kernel, we will go back to mapping
the first 0-1GB kernel identity mapping as executable (because of the
static mapping setup in head_64.S). We can fix this in a different
patch if needed.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Track the memtype for RAM pages in page struct instead of using the
memtype list. This avoids the explosion in the number of entries in
memtype list (of the order of 20,000 with AGP) and makes the PAT
tracking simpler.
We are using PG_arch_1 bit in page->flags.
We still use the memtype list for non RAM pages.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do a global flush tlb after splitting the large page and before we do the
actual change page attribute in the PTE.
With out this, we violate the TLB application note, which says
"The TLBs may contain both ordinary and large-page translations for
a 4-KByte range of linear addresses. This may occur if software
modifies the paging structures so that the page size used for the
address range changes. If the two translations differ with respect
to page frame or attributes (e.g., permissions), processor behavior
is undefined and may be implementation-specific."
And also serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity
mappings) using cpa_lock. So that we don't allow any other cpu, with stale
large tlb entries change the page attribute in parallel to some other cpu
splitting a large page entry along with changing the attribute.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: arjan@linux.intel.com
Cc: venkatesh.pallipadi@intel.com
Cc: jeremy@goop.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>