reserve_mem() and stabs_alloc() are both called only from other __init
routines, so can be marked __init.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The current get_unmapped_area code calls the f_ops->get_unmapped_area or the
arch one (via the mm) only when MAP_FIXED is not passed. That makes it
impossible for archs to impose proper constraints on regions of the virtual
address space. To work around that, get_unmapped_area() then calls some
hugetlbfs specific hacks.
This cause several problems, among others:
- It makes it impossible for a driver or filesystem to do the same thing
that hugetlbfs does (for example, to allow a driver to use larger page sizes
to map external hardware) if that requires applying a constraint on the
addresses (constraining that mapping in certain regions and other mappings
out of those regions).
- Some archs like arm, mips, sparc, sparc64, sh and sh64 already want
MAP_FIXED to be passed down in order to deal with aliasing issues. The code
is there to handle it... but is never called.
This series of patches moves the logic to handle MAP_FIXED down to the various
arch/driver get_unmapped_area() implementations, and then changes the generic
code to always call them. The hugetlbfs hacks then disappear from the generic
code.
Since I need to do some special 64K pages mappings for SPEs on cell, I need to
work around the first problem at least. I have further patches thus
implementing a "slices" layer that handles multiple page sizes through slices
of the address space for use by hugetlbfs, the SPE code, and possibly others,
but it requires that serie of patches first/
There is still a potential (but not practical) issue due to the fact that
filesystems/drivers implemeting g_u_a will effectively bypass all arch checks.
This is not an issue in practice as the only filesystems/drivers using that
hook are doing so for arch specific purposes in the first place.
There is also a problem with mremap that will completely bypass all arch
checks. I'll try to address that separately, I'm not 100% certain yet how,
possibly by making it not work when the vma has a file whose f_ops has a
get_unmapped_area callback, and by making it use is_hugepage_only_range()
before expanding into a new area.
Also, I want to turn is_hugepage_only_range() into a more generic
is_normal_page_range() as that's really what it will end up meaning when used
in stack grow, brk grow and mremap.
None of the above "issues" however are introduced by this patch, they are
already there, so I think the patch can go ini for 2.6.22.
This patch:
Handle MAP_FIXED in powerpc's arch_get_unmapped_area() in all 3
implementations of it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: William Irwin <bill.irwin@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Matthew Wilcox <willy@debian.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is not necessary to tell the slab allocators to align to a cacheline
if an explicit alignment was already specified. It is rather confusing
to specify multiple alignments.
Make sure that the call sites only use one form of alignment.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch was recently posted to lkml and acked by Pekka.
The flag SLAB_MUST_HWCACHE_ALIGN is
1. Never checked by SLAB at all.
2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB
3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.
The only remaining use is in sparc64 and ppc64 and their use there
reflects some earlier role that the slab flag once may have had. If
its specified then SLAB_HWCACHE_ALIGN is also specified.
The flag is confusing, inconsistent and has no purpose.
Remove it.
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes a couple of kexec problems related to 64K page
support in the kernel. kexec issues a tlbie for each pte. The
parameters for the tlbie are the page size and the virtual address.
Support was missing for the computation of these two parameters
for 64K pages. This adds that support.
Signed-off-by: Luke Browning <lukebrowning@us.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Call the kprobes pagefault handler directly instead of going through
the complex notifier chain.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch takes the definitions for the PPC44x MMU (a software loaded
TLB) from asm-ppc/mmu.h, cleans them up of things no longer necessary
in arch/powerpc and puts them in a new asm-powerpc/mmu_44x.h file. It
also substantially simplifies arch/powerpc/mm/44x_mmu.c and makes a
couple of small fixes necessary for the 44x MMU code to build and work
properly in arch/powerpc.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
arch/powerpc/mm/mem.c exports page_is_ram, which is not used anywhere
that could be modular.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
32-bit powerpc systems define a macro, PHYS_FMT, giving a printf
format string fragment for displaying physical addresses, since most
32-bit powerpc platforms use 32-bit physical addresses but a few use
64-bit physical addresses.
This macro is used in exactly one place, a rare error message, where
we can solve the problem more simply by just unconditionally casting
the address up to 64-bit quantity before formatting it.
This patch does so, meaning that as we bring MMU definitions from
asm-ppc over to asm-powerpc, cleaning them up in the process, we don't
need to implement this ugly macro (which additionally has a very bad
name for something global).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
BenH's commit a741e67969 in powerpc.git,
although (AFAICT) only intended to affect ppc64, also has side-effects
which break 44x. I think 40x, 8xx and Freescale Book E are also
affected, though I haven't tested them.
The problem lies in unconditionally removing flush_tlb_pending() from
the versions of flush_tlb_mm(), flush_tlb_range() and
flush_tlb_kernel_range() used on ppc64 - which are also used the
embedded platforms mentioned above.
The patch below cleans up the convoluted #ifdef logic in tlbflush.h,
in the process restoring the necessary flushes for the software TLB
platforms. There are three sets of definitions for the flushing
hooks: the software TLB versions (revised to avoid using names which
appear to related to TLB batching), the 32-bit hash based versions
(external functions) amd the 64-bit hash based versions (which
implement batching).
It also moves the declaration of update_mmu_cache() to always be in
tlbflush.h (previously it was in tlbflush.h except for PPC64, where it
was in pgtable.h).
Booted on Ebony (440GP) and compiled for 64-bit and 32-bit
multiplatform.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Here's an implementation of DEBUG_PAGEALLOC for 64 bits powerpc.
It applies on top of the 32 bits patch.
Unlike Anton's previous attempt, I'm not using updatepp. I'm removing
the hash entries from the bolted mapping (using a map in RAM of all the
slots). Expensive but it doesn't really matter, does it ? :-)
Memory hot-added doesn't benefit from this unless it's added at an
address that is below end_of_DRAM() as calculated at boot time.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch/powerpc/Kconfig.debug | 2
arch/powerpc/mm/hash_utils_64.c | 84 ++++++++++++++++++++++++++++++++++++++--
2 files changed, 82 insertions(+), 4 deletions(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
Here's an implementation of DEBUG_PAGEALLOC for ppc32. It disables BAT
mapping and is only tested with Hash table based processor though it
shouldn't be too hard to adapt it to others.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch/powerpc/Kconfig.debug | 9 ++++++
arch/powerpc/mm/init_32.c | 4 +++
arch/powerpc/mm/pgtable_32.c | 52 +++++++++++++++++++++++++++++++++++++++
arch/powerpc/mm/ppc_mmu_32.c | 4 ++-
include/asm-powerpc/cacheflush.h | 6 ++++
5 files changed, 74 insertions(+), 1 deletion(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
On hash table based 32 bits powerpc's, the hash management code runs with
a big spinlock. It's thus important that it never causes itself a hash
fault. That code is generally safe (it does memory accesses in real mode
among other things) with the exception of the actual access to the code
itself. That is, the kernel text needs to be accessible without taking
a hash miss exceptions.
This is currently guaranteed by having a BAT register mapping part of the
linear mapping permanently, which includes the kernel text. But this is
not true if using the "nobats" kernel command line option (which can be
useful for debugging) and will not be true when using DEBUG_PAGEALLOC
implemented in a subsequent patch.
This patch fixes this by pre-faulting in the hash table pages that hit
the kernel text, and making sure we never evict such a page under hash
pressure.
Signed-off-by: Benjamin Herrenchmidt <benh@kernel.crashing.org>
arch/powerpc/mm/hash_low_32.S | 22 ++++++++++++++++++++--
arch/powerpc/mm/mem.c | 3 ---
arch/powerpc/mm/mmu_decl.h | 4 ++++
arch/powerpc/mm/pgtable_32.c | 11 +++++++----
4 files changed, 31 insertions(+), 9 deletions(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
The 32 bits map_page() function is used internally by the mm code
for early mmu mappings and for ioremap. It should never be called
for an address that already has a valid PTE or hash entry, so we
add a BUG_ON for that and remove the useless flush_HPTE call.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch/powerpc/mm/pgtable_32.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
Signed-off-by: Paul Mackerras <paulus@samba.org>
The current tlb flush code on powerpc 64 bits has a subtle race since we
lost the page table lock due to the possible faulting in of new PTEs
after a previous one has been removed but before the corresponding hash
entry has been evicted, which can leads to all sort of fatal problems.
This patch reworks the batch code completely. It doesn't use the mmu_gather
stuff anymore. Instead, we use the lazy mmu hooks that were added by the
paravirt code. They have the nice property that the enter/leave lazy mmu
mode pair is always fully contained by the PTE lock for a given range
of PTEs. Thus we can guarantee that all batches are flushed on a given
CPU before it drops that lock.
We also generalize batching for any PTE update that require a flush.
Batching is now enabled on a CPU by arch_enter_lazy_mmu_mode() and
disabled by arch_leave_lazy_mmu_mode(). The code epects that this is
always contained within a PTE lock section so no preemption can happen
and no PTE insertion in that range from another CPU. When batching
is enabled on a CPU, every PTE updates that need a hash flush will
use the batch for that flush.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some drivers have resources that they want to be able to map into
userspace that are 4k in size. On a kernel configured with 64k pages
we currently end up mapping the 4k we want plus another 60k of
physical address space, which could contain anything. This can
introduce security problems, for example in the case of an infiniband
adaptor where the other 60k could contain registers that some other
program is using for its communications.
This patch adds a new function, remap_4k_pfn, which drivers can use to
map a single 4k page to userspace regardless of whether the kernel is
using a 4k or a 64k page size. Like remap_pfn_range, it would
typically be called in a driver's mmap function. It only maps a
single 4k page, which on a 64k page kernel appears replicated 16 times
throughout a 64k page. On a 4k page kernel it reduces to a call to
remap_pfn_range.
The way this works on a 64k kernel is that a new bit, _PAGE_4K_PFN,
gets set on the linux PTE. This alters the way that __hash_page_4K
computes the real address to put in the HPTE. The RPN field of the
linux PTE becomes the 4k RPN directly rather than being interpreted as
a 64k RPN. Since the RPN field is 32 bits, this means that physical
addresses being mapped with remap_4k_pfn have to be below 2^44,
i.e. 0x100000000000.
The patch also factors out the code in arch/powerpc/mm/hash_utils_64.c
that deals with demoting a process to use 4k pages into one function
that gets called in the various different places where we need to do
that. There were some discrepancies between exactly what was done in
the various places, such as a call to spu_flush_all_slbs in one case
but not in others.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is more consistent and gets us closer to the Sparc code.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is more consistent and gets us closer to the Sparc code.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The SPU code doesn't properly invalidate SPUs SLBs when necessary,
for example when changing a segment size from the hugetlbfs code. In
addition, it saves and restores the SLB content on context switches
which makes it harder to properly handle those invalidations.
This patch removes the saving & restoring for now, something more
efficient might be found later on. It also adds a spu_flush_all_slbs(mm)
that can be used by the core mm code to flush the SLBs of all SPEs that
are running a given mm at the time of the flush.
In order to do that, it adds a spinlock to the list of all SPEs and move
some bits & pieces from spufs to spu_base.c
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
At present calling lmb_reserve() (and hence lmb_add_region()) twice
for exactly the same memory region will cause strange behaviour.
This makes life difficult when booting from a flat device tree with
memory reserve map. Which regions are automatically reserved by the
kernel has changed over time, so it's quite possible a newer kernel
could attempt to auto-reserve a region which is also explicitly listed
in the device tree's reserve map, leading to trouble.
This patch avoids the problem by making lmb_reserve() ignore a call to
reserve a previously reserved region. It also removes a now redundant
test designed to avoid one specific case of the problem noted above.
At present, this patch deals only with duplicate reservations of an
identical region. Attempting to reserve two different, but
overlapping regions will still cause problems. I might post another
patch later dealing with this case, but I'm avoiding it now since it
is substantially more complicated to deal with, less likely to occur
and more likely to indicate a genuine bug elsewhere if it does occur.
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
Documentation/kernel-docs.txt update.
arch/cris: typo in KERN_INFO
Storage class should be before const qualifier
kernel/printk.c: comment fix
update I/O sched Kconfig help texts - CFQ is now default, not AS.
Remove duplicate listing of Cris arch from README
kbuild: more doc. cleanups
doc: make doc. for maxcpus= more visible
drivers/net/eexpress.c: remove duplicate comment
add a help text for BLK_DEV_GENERIC
correct a dead URL in the IP_MULTICAST help text
fix the BAYCOM_SER_HDX help text
fix SCSI_SCAN_ASYNC help text
trivial documentation patch for platform.txt
Fix typos concerning hierarchy
Fix comment typo "spin_lock_irqrestore".
Fix misspellings of "agressive".
drivers/scsi/a100u2w.c: trivial typo patch
Correct trivial typo in log2.h.
Remove useless FIND_FIRST_BIT() macro from cardbus.c.
...
The code for bolting hash entries for ioremap done before proper
mm initialization has a grown a bug when using 64K pages on a
machine where non-cacheable mappings are demoted to 4K HW pages.
The wrong page size index is being passed to the hash table mapping
functions causing a crash at boot on some pSeries machines using
bare metal linux. This fixes it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The recent vDSO consolidation patches broke powerpc due to a mistake
in the definition of MAXPAGES constants. This fixes it by moving to
a dynamically allocated array of pages instead as I don't like much
hard coded size limits. Also move the vdso initialisation to an initcall
since it doesn't really need to be done -that- early.
Applogies for not catching the breakage earlier, Roland _did_ CC me on
his patches a while ago, I got busy with other things and forgot to test
them.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When building an 85xx kernel we get:
CC arch/powerpc/mm/pgtable_32.o
arch/powerpc/mm/pgtable_32.c: In function 'io_block_mapping':
arch/powerpc/mm/pgtable_32.c:330: error: expected identifier before '(' token
arch/powerpc/mm/pgtable_32.c:330: error: expected statement before ')' token
The is_power_of_2(x) fixup patch left an extra ')' on the is_power_of_4 macro.
There is a similiar issue on the arch/ppc side.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
arch/powerpc/mm/mem.c states that page_is_ram is called by the code that
implements /dev/mem, which isn't true. Remove the comment.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Add the inline function "is_power_of_2()" to log2.h, where the value
zero is *not* considered to be a power of two.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Acked-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is just a straight port of the same done in arch/ppc
by Marcelo Tosatti. One used to be
[PATCH] ppc32 8xx: update_mmu_cache() needs unconditional tlbie,
commit eb07d964b4
In a nutshell, the board is nearly stuck without this, yet without any
visible failure - being just very slow.
Signed-off-by: Vitaly Bordug <vbordug@ru.mvista.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch changes handling return value of ppc_md.hpte_insert() into
the same way as __hash_page_*().
Signed-off-by: Kou Ishizaki <kou.ishizaki@toshiba.co.jp>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The powerpc specific version of hugetlb_get_unmapped_area() makes some
unwarranted assumptions about what checks have been made to its
parameters by its callers. This will lead to a BUG_ON() if a 32-bit
process attempts to make a hugepage mapping which extends above
TASK_SIZE (4GB).
I'm not sure if these assumptions came about because they were valid
with earlier versions of the get_unmapped_area() path, or if it was
always broken. Nonetheless this patch fixes the logic, and removes
the crash.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Run this:
#!/bin/sh
for f in $(grep -Erl "\([^\)]*\) *k[cmz]alloc" *) ; do
echo "De-casting $f..."
perl -pi -e "s/ ?= ?\([^\)]*\) *(k[cmz]alloc) *\(/ = \1\(/" $f
done
And then go through and reinstate those cases where code is casting pointers
to non-pointers.
And then drop a few hunks which conflicted with outstanding work.
Cc: Russell King <rmk@arm.linux.org.uk>, Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Paul Fulghum <paulkf@microgate.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Karsten Keil <kkeil@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Jeff Garzik <jeff@garzik.org>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Steven French <sfrench@us.ibm.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
For PAPR partitions with large amounts of memory, the firmware has an
alternative, more compact representation for the information about the
memory in the partition and its NUMA associativity information. This
adds the code to the kernel to parse this alternative representation.
The other part of this patch is telling the firmware that we can
handle the alternative representation. There is however a subtlety
here, because the firmware will invoke a reboot if the memory
representation we request is different from the representation that
firmware is currently using. This is because firmware can't change
the representation on the fly. Further, some firmware versions used
on POWER5+ machines have a bug where this reboot leaves the machine
with an altered value of load-base, which will prevent any kernel
booting until it is reset to the normal value (0x4000). Because of
this bug, we do NOT set fake_elf.rpanote.new_mem_def = 1, and thus we
do not request the new representation on POWER5+ and earlier machines.
We do request the new representation on POWER6, which uses the
ibm,client-architecture-support call.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Replace all uses of kmem_cache_t with struct kmem_cache.
The patch was generated using the following script:
#!/bin/sh
#
# Replace one string by another in all the kernel sources.
#
set -e
for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
quilt add $file
sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
mv /tmp/$$ $file
quilt refresh
done
The script was run like this
sh replace kmem_cache_t "struct kmem_cache"
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Following up with the work on shared page table done by Dave McCracken. This
set of patch target shared page table for hugetlb memory only.
The shared page table is particular useful in the situation of large number of
independent processes sharing large shared memory segments. In the normal
page case, the amount of memory saved from process' page table is quite
significant. For hugetlb, the saving on page table memory is not the primary
objective (as hugetlb itself already cuts down page table overhead
significantly), instead, the purpose of using shared page table on hugetlb is
to allow faster TLB refill and smaller cache pollution upon TLB miss.
With PT sharing, pte entries are shared among hundreds of processes, the cache
consumption used by all the page table is smaller and in return, application
gets much higher cache hit ratio. One other effect is that cache hit ratio
with hardware page walker hitting on pte in cache will be higher and this
helps to reduce tlb miss latency. These two effects contribute to higher
application performance.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Dave McCracken <dmccr@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove CPU_FTR_16M_PAGE from the cupfeatures mask at runtime on iSeries.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
A few code paths need to check whether or not they are running
on the PS3's LV1 hypervisor before making hcalls. This introduces
a new firmware feature bit for this, FW_FEATURE_PS3_LV1.
Now when both PS3 and IBM_CELL_BLADE are enabled, but not PSERIES,
FW_FEATURE_PS3_LV1 and FW_FEATURE_LPAR get enabled at compile time,
which is a bug. The same problem can also happen for (PPC_ISERIES &&
!PPC_PSERIES && PPC_SOMETHING_ELSE). In order to solve this, I
introduce a new CONFIG_PPC_NATIVE option that is set when at least
one platform is selected that can run without a hypervisor and then
turns the firmware feature check into a run-time option.
The new cell oprofile support that was recently merged does not
work on hypervisor based platforms like the PS3, therefore make
it depend on PPC_CELL_NATIVE instead of PPC_CELL. This may change
if we get oprofile support for PS3.
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
powerpc: Merge 32 and 64 bits asm-powerpc/io.h
The rework on io.h done for the new hookable accessors made it easier,
so I just finished the work and merged 32 and 64 bits io.h for arch/powerpc.
arch/ppc still uses the old version in asm-ppc, there is just too much gunk
in there that I really can't be bothered trying to cleanup.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
In order to suppose platforms with devices above 4Gb on 32 bits platforms
with a >32 bits physical address space, we used to have a special ioremap64
along with a fixup routine fixup_bigphys_addr.
This shouldn't be necessary anymore as struct resource now supports 64 bits
addresses even on 32 bits archs. This patch enables that option when
CONFIG_PHYS_64BIT is set and removes ioremap64 and fixup_bigphys_addr.
This is a preliminary work for the upcoming merge of 32 and 64 bits io.h
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch reworks the way iSeries hooks on PCI IO operations (both MMIO
and PIO) and provides a generic way for other platforms to do so (we
have need to do that for various other platforms).
While reworking the IO ops, I ended up doing some spring cleaning in
io.h and eeh.h which I might want to split into 2 or 3 patches (among
others, eeh.h had a lot of useless stuff in it).
A side effect is that EEH for PIO should work now (it used to pass IO
ports down to the eeh address check functions which is bogus).
Also, new are MMIO "repeat" ops, which other archs like ARM already had,
and that we have too now: readsb, readsw, readsl, writesb, writesw,
writesl.
In the long run, I might also make EEH use the hooks instead
of wrapping at the toplevel, which would make things even cleaner and
relegate EEH completely in platforms/iseries, but we have to measure the
performance impact there (though it's really only on MMIO reads)
Since I also need to hook on ioremap, I shuffled the functions a bit
there. I introduced ioremap_flags() to use by drivers who want to pass
explicit flags to ioremap (and it can be hooked). The old __ioremap() is
still there as a low level and cannot be hooked, thus drivers who use it
should migrate unless they know they want the low level version.
The patch "arch provides generic iomap missing accessors" (should be
number 4 in this series) is a pre-requisite to provide full iomap
API support with this patch.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(David:)
If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
because the given file offset is not hugepage aligned - then do_mmap_pgoff
will go to the unmap_and_free_vma backout path.
But at this stage the vma hasn't been marked as hugepage, and the backout path
will call unmap_region() on it. That will eventually call down to the
non-hugepage version of unmap_page_range(). On ppc64, at least, that will
cause serious problems if there are any existing hugepage pagetable entries in
the vicinity - for example if there are any other hugepage mappings under the
same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
entries. I suspect this will also cause bad problems on ia64, though I don't
have a machine to test it on.
(Hugh:)
prepare_hugepage_range() should check file offset alignment when it checks
virtual address and length, to stop MAP_FIXED with a bad huge offset from
unmapping before it fails further down. PowerPC should apply the same
prepare_hugepage_range alignment checks as ia64 and all the others do.
Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
is the check for too small a mapping); but even so, move up setting of
VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
when unwinding from error will go the non-huge way, which may cause bad
behaviour on architectures (powerpc and ia64) which segregate their huge
mappings into a separate region of the address space.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Adam Litke <agl@us.ibm.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
bad_page_fault() prints a message telling the user what type of bad
fault we took. The first line of this message is currently implemented
as two separate printks. This has the unfortunate effect that if
several cpus simultaneously take a bad fault, the first and second parts
of the printk get jumbled up, which looks dodge and is hard to read.
So do a single one-line printk for each fault type.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Checking source for other get_paca()->field preemption dangers found that
open_high_hpage_areas does a structure copy into its paca while preemption
is enabled: unsafe however gcc accomplishes it. Just remove that copy:
it's done safely afterwards by on_each_cpu, as in open_low_hpage_areas.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Change the powerpc hpte_insert routines now called through ppc_md to
static scope.
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Arch-independent zone-sizing is using indices instead of symbolic names to
offset within an array related to zones (max_zone_pfns). The unintended
impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead
of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set. As a result, the
the machine fails to boot but will boot with CONFIG_HIGHMEM turned off.
The following patch properly initialises the max_zone_pfns[] array and uses
symbolic names instead of indices in each architecture using
arch-independent zone-sizing. Two users have successfully booted their
powerpcs with it (one an ibook G4). It has also been boot tested on x86,
x86_64, ppc64 and ia64. Please merge for 2.6.19-rc2.
Credit to Benjamin Herrenschmidt for identifying the bug and rolling the
first fix. Additional credit to Johannes Berg and Andreas Schwab for
reporting the problem and testing on powerpc.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is an updated version of Eric Biederman's is_init() patch.
(http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and
replaces a few more instances of ->pid == 1 with is_init().
Further, is_init() checks pid and thus removes dependency on Eric's other
patches for now.
Eric's original description:
There are a lot of places in the kernel where we test for init
because we give it special properties. Most significantly init
must not die. This results in code all over the kernel test
->pid == 1.
Introduce is_init to capture this case.
With multiple pid spaces for all of the cases affected we are
looking for only the first process on the system, not some other
process that has pid == 1.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: <lxc-devel@lists.sourceforge.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make PROT_WRITE imply PROT_READ for a number of architectures which don't
support write only in hardware.
While looking at this, I noticed that some architectures which do not
support write only mappings already take the exact same approach. For
example, in arch/alpha/mm/fault.c:
"
if (cause < 0) {
if (!(vma->vm_flags & VM_EXEC))
goto bad_area;
} else if (!cause) {
/* Allow reads even for write-only mappings */
if (!(vma->vm_flags & (VM_READ | VM_WRITE)))
goto bad_area;
} else {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
}
"
Thus, this patch brings other architectures which do not support write only
mappings in-line and consistent with the rest. I've verified the patch on
ia64, x86_64 and x86.
Additional discussion:
Several architectures, including x86, can not support write-only mappings.
The pte for x86 reserves a single bit for protection and its two states are
read only or read/write. Thus, write only is not supported in h/w.
Currently, if i 'mmap' a page write-only, the first read attempt on that page
creates a page fault and will SEGV. That check is enforced in
arch/blah/mm/fault.c. However, if i first write that page it will fault in
and the pte will be set to read/write. Thus, any subsequent reads to the page
will succeed. It is this inconsistency in behavior that this patch is
attempting to address. Furthermore, if the page is swapped out, and then
brought back the first read will also cause a SEGV. Thus, any arbitrary read
on a page can potentially result in a SEGV.
According to the SuSv3 spec, "if the application requests only PROT_WRITE, the
implementation may also allow read access." Also as mentioned, some
archtectures, such as alpha, shown above already take the approach that i am
suggesting.
The counter-argument to this raised by Arjan, is that the kernel is enforcing
the write only mapping the best it can given the h/w limitations. This is
true, however Alan Cox, and myself would argue that the inconsitency in
behavior, that is applications can sometimes work/sometimes fails is highly
undesireable. If you read through the thread, i think people, came to an
agreement on the last patch i posted, as nobody has objected to it...
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Andi Kleen <ak@muc.de>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Ian Molton <spyro@f2s.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The PIN_SIZE definition name changed, update 44x_mmu.c accordingly.
Signed-off-by: Matt Porter <mporter@embeddedalley.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On Tue, 2006-08-15 at 08:22 -0700, Dave Hansen wrote:
> kernel BUG in cache_free_debugcheck at mm/slab.c:2748!
Alright, this one is only triggered when slab debugging is enabled. The
slabs are assumed to be aligned on a HUGEPTE_TABLE_SIZE boundary. The free
path makes use of this assumption and uses the lowest nibble to pass around
an index into an array of kmem_cache pointers. With slab debugging turned
on, the slab is still aligned, but the "working" object pointer is not.
This would break the assumption above that a full nibble is available for
the PGF_CACHENUM_MASK.
The following patch reduces PGF_CACHENUM_MASK to cover only the two least
significant bits, which is enough to cover the current number of 4 pgtable
cache types. Then use this constant to mask out the appropriate part of
the huge pte pointer.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds a shadow buffer for the SLBs and regsiters it with PHYP.
Only the bolted SLB entries (top 3) are shadowed.
The SLB shadow buffer tells the hypervisor what the kernel needs to
have in the SLB for the kernel to be able to function. The hypervisor
can use this information to speed up partition context switches.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The PIN_SIZE definition name changed, update 44x_mmu.c accordingly.
Signed-off-by: Matt Porter <mporter@embeddedalley.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Now that get_property() returns a void *, there's no need to cast its
return value. Also, treat the return value as const, so we can
constify get_property later.
powerpc core changes.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There's a bug in my cleaned up mem= handling, if the memory limit is
larger than the RMO size we'll erroneously enlarge the RMO size.
Fix is to only change the RMO size if the memory limit is less than
the current RMO value.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The compiler doesn't understand that BUG() never returns, so complains that
psize isn't set. Just set it to the normal value, which seems to produce nice
code and keeps gcc happy.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
There's a bug in my cleaned up mem= handling, if the memory limit is
larger than the RMO size we'll erroneously enlarge the RMO size.
Fix is to only change the RMO size if the memory limit is less than
the current RMO value.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (43 commits)
[POWERPC] Use little-endian bit from firmware ibm,pa-features property
[POWERPC] Make sure smp_processor_id works very early in boot
[POWERPC] U4 DART improvements
[POWERPC] todc: add support for Time-Of-Day-Clock
[POWERPC] Make lparcfg.c work when both iseries and pseries are selected
[POWERPC] Fix idr locking in init_new_context
[POWERPC] mpc7448hpc2 (taiga) board config file
[POWERPC] Add tsi108 pci and platform device data register function
[POWERPC] Add general support for mpc7448hpc2 (Taiga) platform
[POWERPC] Correct the MAX_CONTEXT definition
powerpc: minor cleanups for mpc86xx
[POWERPC] Make sure we select CONFIG_NEW_LEDS if ADB_PMU_LED is set
[POWERPC] Simplify the code defining the 64-bit CPU features
[POWERPC] powerpc: kconfig warning fix
[POWERPC] Consolidate some of kernel/misc*.S
[POWERPC] Remove unused function call_with_mmu_off
[POWERPC] update asm-powerpc/time.h
[POWERPC] Clean up it_lp_queue.h
[POWERPC] Skip the "copy down" of the kernel if it is already at zero.
[POWERPC] Add the use of the firmware soft-reset-nmi to kdump.
...
We always need to serialize accesses to mmu_context_idr.
I hit this bug when testing with a small number of mmu contexts.
Signed-off-by: Sonny Rao <sonny@burdell.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
With the ppc_md htab pointers setup earlier, we can use ppc_md.hpte_insert
in htab_bolt_mapping(), rather than deciding which version to call by hand.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Initialise the ppc_md htab callbacks earlier, in the probe routines. This
allows us to call htab_finish_init() from htab_initialize(), and makes it
private to hash_utils_64.c. Move htab_finish_init() and make_bl() above
htab_initialize() to avoid forward declarations.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Make notifier_blocks associated with cpu_notifier as __cpuinitdata.
__cpuinitdata makes sure that the data is init time only unless
CONFIG_HOTPLUG_CPU is defined.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Localize poison values into one header file for better documentation and
easier/quicker debugging and so that the same values won't be used for
multiple purposes.
Use these constants in core arch., mm, driver, and fs code.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change the name of old add_memory() to arch_add_memory. And use node id to
get pgdat for the node at NODE_DATA().
Note: Powerpc's old add_memory() is defined as __devinit. However,
add_memory() is usually called only after bootup.
I suppose it may be redundant. But, I'm not well known about powerpc.
So, I keep it. (But, __meminit is better at least.)
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Overloading of page fault notification with the notify_die() has performance
issues(since the only interested components for page fault is kprobes and/or
kdb) and hence this patch introduces the new notifier call chain exclusively
for page fault notifications their by avoiding notifying unnecessary
components in the do_page_fault() code path.
Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits)
[POWERPC] re-enable OProfile for iSeries, using timer interrupt
[POWERPC] support ibm,extended-*-frequency properties
[POWERPC] Extra sanity check in EEH code
[POWERPC] Dont look for class-code in pci children
[POWERPC] Fix mdelay badness on shared processor partitions
[POWERPC] disable floating point exceptions for init
[POWERPC] Unify ppc syscall tables
[POWERPC] mpic: add support for serial mode interrupts
[POWERPC] pseries: Print PCI slot location code on failure
[POWERPC] spufs: one more fix for 64k pages
[POWERPC] spufs: fail spu_create with invalid flags
[POWERPC] spufs: clear class2 interrupt status before wakeup
[POWERPC] spufs: fix Makefile for "make clean"
[POWERPC] spufs: remove stop_code from struct spu
[POWERPC] spufs: fix spu irq affinity setting
[POWERPC] spufs: further abstract priv1 register access
[POWERPC] spufs: split the Cell BE support into generic and platform dependant parts
[POWERPC] spufs: dont try to access SPE channel 1 count
[POWERPC] spufs: use kzalloc in create_spu
[POWERPC] spufs: fix initial state of wbox file
...
Manually resolved conflicts in:
drivers/net/phy/Makefile
include/asm-powerpc/spu.h
Clear the high BATS during load_up_mmu if FTR_HAS_HIGH_BATS.
Allow just a bit more time for secondary CPUs to phone home.
Signed-off-by: Wei Zhang <Wei.Zhang@freescale.com>
Signed-off-by: Haiying Wang <Haiying.Wang@freescale.com>
Signed-off-by: Jon Loeliger <jdl@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The page size encoding passed to tlbie is incorrect for new-style
large pages. This fixes it. This doesn't affect anything on older
machines because mmu_psize_defs[psize].penc (the page size encoding)
is 0 for 4k and 16M pages (the two are distinguished by a separate "is
a large page" bit).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove some stale POWER3/POWER4/970 on 32bit kernel support.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some POWER5+ machines can do 64k hardware pages for normal memory but
not for cache-inhibited pages. This patch lets us use 64k hardware
pages for most user processes on such machines (assuming the kernel
has been configured with CONFIG_PPC_64K_PAGES=y). User processes
start out using 64k pages and get switched to 4k pages if they use any
non-cacheable mappings.
With this, we use 64k pages for the vmalloc region and 4k pages for
the imalloc region. If anything creates a non-cacheable mapping in
the vmalloc region, the vmalloc region will get switched to 4k pages.
I don't know of any driver other than the DRM that would do this,
though, and these machines don't have AGP.
When a region gets switched from 64k pages to 4k pages, we do not have
to clear out all the 64k HPTEs from the hash table immediately. We
use the _PAGE_COMBO bit in the Linux PTE to indicate whether the page
was hashed in as a 64k page or a set of 4k pages. If hash_page is
trying to insert a 4k page for a Linux PTE and it sees that it has
already been inserted as a 64k page, it first invalidates the 64k HPTE
before inserting the 4k HPTE. The hash invalidation routines also use
the _PAGE_COMBO bit, to determine whether to look for a 64k HPTE or a
set of 4k HPTEs to remove. With those two changes, we can tolerate a
mix of 4k and 64k HPTEs in the hash table, and they will all get
removed when the address space is torn down.
Signed-off-by: Paul Mackerras <paulus@samba.org>
The pgdir field in the paca was a leftover from the dynamic VSIDs
patch, and is not used in the current kernel code. This removes it.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds a vdso_base element to the mm_context_t for 32-bit compiles
(both for ARCH=powerpc and ARCH=ppc). This fixes the compile errors
that have been reported in arch/powerpc/kernel/signal_32.c.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Our MMU hash management code would not set the "C" bit (changed bit) in
the hardware PTE when updating a RO PTE into a RW PTE. That would cause
the hardware to possibly to a write back to the hash table to set it on
the first store access, which in addition to being a performance issue,
might also hit a bug when running with native hash management (non-HV)
as our code is specifically optimized for the case where no write back
happens.
Thus there is a very small therocial window were a hash PTE can become
corrupted if that HPTE has just been upgraded to read write, a store
access happens on it, and that races with another processor evicting
that same slot. Since eviction (caused by an almost full hash) is
extremely rare, the bug is very unlikely to happen fortunately.
This fixes by allowing the updating of the protection bits in the native
hash handling to also set (but not clear) the "C" bit, and, in order to
also improve performances in the general case, by always setting that
bit on newly inserted hash PTE so that writeback really never happens.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We currently do mem= handling in three seperate places. And as benh pointed out
I wrote two of them. Now that we parse command line parameters earlier we can
clean this mess up.
Moving the parsing out of prom_init means the device tree might be allocated
above the memory limit. If that happens we'd have to move it. As it happens
we already have logic to do that for kdump, so just genericise it.
This also means we might have reserved regions above the memory limit, if we
do the bootmem allocator will blow up, so we have to modify
lmb_enforce_memory_limit() to truncate the reserves as well.
Tested on P5 LPAR, iSeries, F50, 44p. Tested moving device tree on P5 and
44p and F50.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Change of_node_to_nid() to traverse the device tree, looking for a numa id.
Cell uses this to assign ids to SPUs, which are children of the CPU node.
Existing users of of_node_to_nid() are altered to use of_node_to_nid_single(),
which doesn't do the traversal.
Export an attach_sysdev_to_node() function, allowing system devices (eg.
SPUs) to link themselves into the numa topology in sysfs.
Signed-off-by: Arnd Bergmann <arnd.bergmann@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At present, ARCH=powerpc kernels can waste considerable space in
pagetables when making large hugepage mappings. Hugepage PTEs go in
PMD pages, but each PMD page maps 256M and so contains only 16
hugepage PTEs (128 bytes of data), but takes up a 1024 byte
allocation. With CONFIG_PPC_64K_PAGES enabled (64k base page size),
the situation is worse. Now hugepage PTEs are at the PTE page level
(also mapping 256M), so we store 16 hugepage PTEs in a 64k allocation.
The PowerPC MMU already means that any 256M region is either all
hugepage, or all normal pages. Thus, with some care, we can use a
different allocation for the hugepage PTE tables and only allocate the
128 bytes necessary.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Quieten some of the debug ram config output. we already print out available
memory at KERN_INFO level.
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
for_each_cpu() actually iterates across all possible CPUs. We've had mistakes
in the past where people were using for_each_cpu() where they should have been
iterating across only online or present CPUs. This is inefficient and
possibly buggy.
We're renaming for_each_cpu() to for_each_possible_cpu() to avoid this in the
future.
This patch replaces for_each_cpu with for_each_possible_cpu.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Fix 44x and BookE page fault handler to correctly lock PTE before
trying to pte_update() it, otherwise this PTE might be swapped out
after pte_present() check but before pte_uptdate() call, resulting in
corrupted PTE. This can happen with enabled preemption and low memory
condition.
Signed-off-by: Eugene Surovegin <ebs@ebshome.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This removes statically assigned platform numbers and reworks the
powerpc platform probe code to use a better mechanism. With this,
board support files can simply declare a new machine type with a
macro, and implement a probe() function that uses the flattened
device-tree to detect if they apply for a given machine.
We now have a machine_is() macro that replaces the comparisons of
_machine with the various PLATFORM_* constants. This commit also
changes various drivers to use the new macro instead of looking at
_machine.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We were printing node ids in hex in one spot. Lets be consistent and
always print them in decimal.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The return statement is to prevent `warning: 'nid' might be used uninitialized
in this function'.
Cc: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Semaphore to mutex conversion.
The conversion was generated via scripts, and the result was validated
automatically via a script as well.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jens Axboe <axboe@suse.de>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Acked-by: Alasdair G Kergon <agk@redhat.com>
Cc: Greg KH <greg@kroah.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Adam Belay <ambx1@neo.rr.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
arch/powerpc/mm/mem.c: In function `add_memory':
arch/powerpc/mm/mem.c:128: warning: assignment makes integer from pointer without a cast
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Quite a long time back, prepare_hugepage_range() replaced
is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to
verify if an address range is suitable for a hugepage mapping.
is_aligned_hugepage_range() stuck around, but only to implement
prepare_hugepage_range() on archs which didn't implement their own.
Most archs (everything except ia64 and powerpc) used the same
implementation of is_aligned_hugepage_range(). On powerpc, which
implements its own prepare_hugepage_range(), the custom version was never
used.
In addition, "is_aligned_hugepage_range()" was a bad name, because it
suggests it returns true iff the given range is a good hugepage range,
whereas in fact it returns 0-or-error (so the sense is reversed).
This patch cleans up by abolishing is_aligned_hugepage_range(). Instead
prepare_hugepage_range() is defined directly. Most archs use the default
version, which simply checks the given region is aligned to the size of a
hugepage. ia64 and powerpc define custom versions. The ia64 one simply
checks that the range is in the correct address space region in addition to
being suitably aligned. The powerpc version (just as previously) checks
for suitable addresses, and if necessary performs low-level MMU frobbing to
set up new areas for use by hugepages.
No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
set_page_count usage outside mm/ is limited to setting the refcount to 1.
Remove set_page_count from outside mm/, and replace those users with
init_page_count() and set_page_refcounted().
This allows more debug checking, and tighter control on how code is allowed
to play around with page->_count.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A couple of places set_page_count(page, 1) that don't need to.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In mm_init_ppc64() we calculate the location of the "IO hole", but then
no one ever looks at the value. So don't bother.
That's actually all mm_init_ppc64() does, so get rid of it too.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
It has been decreed that platform numbers are evil, so as a step in that
direction, replace platform_is_lpar() with a FW_FEATURE_LPAR bit.
Currently FW_FEATURE_LPAR really means i/pSeries LPAR, in the future we might
have to clean that up if we need to be more specific about what LPAR actually
means. But that's another patch ...
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
htab_bolt_mapping() takes a vstart and pstart parameter, but all but one of
its callers actually pass it vstart and vstart. Luckily before it passes
paddr (calculated from paddr) to the hpte_insert routines it calls
virt_to_abs() (aka. __pa()) on the address, so there isn't actually a bug.
map_io_page() however does pass pstart properly, so currently it's broken
AFAICT because we're calling __pa(paddr) which will get us something very
large. Presumably no one's calling map_io_page() in the right context.
Anyway, change htab_bolt_mapping() callers to properly pass pstart, and then
use it properly in htab_bolt_mapping(), ie. don't call __pa() on it again.
Booted on p5 LPAR, iSeries and Power3.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We can plug the boot cpu into its node independently of whether numa
topology is detected. And numa_setup_cpu does the right thing for all
cases now, so remove special-casing for non-numa from the cpu hotplug
callback.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The powerpc numa code unconditionally onlines all nodes from 0 to the
highest node id found, regardless of whether cpus or memory are
present in the nodes. This wastes 8K per node and complicates some
cpu and memory hotplug situations, such as adding a resource that
doesn't map to one of the nodes discovered at boot.
Set nodes online as resources are scanned. Fall back to node 0 only
when we're sure this isn't a NUMA machine.
Instead of defaulting to node 0 for cases of hot-adding a resource
which doesn't belong to any initialized node, assign it to the first
online node.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Code to handle Power4's invalid node id (0xffff) is duplicated for cpu
and memory. Better to handle this case in one place --
of_node_to_nid. Overall behavior should be unchanged.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since we effectively treat the domain ids given to us by firmare as
logical node ids, make this explicit (basically s/numa_domain/nid/).
No functional changes, only variable and function names are modified.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
map_cpu_to_node does not need to be inline, it is never called in a
hot path.
map_cpu_to_node, numa_setup_cpu, and find_cpu_node can be marked
__cpuinit, as they are never used after boot if CONFIG_HOTPLUG_CPU=n.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Add debug statement for map_cpu_to_node; it's useful for cpu hotplug.
Clarify debug statement about not finding the numa reference points
property.
Don't print a meaningless associativity depth (-1) on non-numa systems.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
At boot, the numa code is assigning boot_cpuid to node 0
unconditionally. Basically, numa_setup_cpu is being stupid about it,
but this is the minimal fix -- just call numa_setup_cpu(boot_cpuid)
later, after all nodes have been set online.
Signed-off-by: Nathan Lynch <nathanl@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
My patch (d7a5b2ffa1) to always panic if
lmb_alloc() fails is broken because it checks alloc < 0, but should be
checking alloc == 0.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
remove warnings when building a 64bit kernel.
smp_call_function triggers also with 32bit kernel.
WARNING: vmlinux: duplicate symbol 'smp_call_function' previous definition was in vmlinux
arch/powerpc/kernel/ppc_ksyms.c:164:EXPORT_SYMBOL(smp_call_function);
arch/powerpc/kernel/smp.c:300:EXPORT_SYMBOL(smp_call_function);
WARNING: vmlinux: duplicate symbol 'ioremap' previous definition was in vmlinux
arch/powerpc/kernel/ppc_ksyms.c:113:EXPORT_SYMBOL(ioremap);
arch/powerpc/mm/pgtable_64.c:321:EXPORT_SYMBOL(ioremap);
WARNING: vmlinux: duplicate symbol '__ioremap' previous definition was in vmlinux
arch/powerpc/kernel/ppc_ksyms.c:117:EXPORT_SYMBOL(__ioremap);
arch/powerpc/mm/pgtable_64.c:322:EXPORT_SYMBOL(__ioremap);
WARNING: vmlinux: duplicate symbol 'iounmap' previous definition was in vmlinux
arch/powerpc/kernel/ppc_ksyms.c:118:EXPORT_SYMBOL(iounmap);
arch/powerpc/mm/pgtable_64.c:323:EXPORT_SYMBOL(iounmap);
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Before the merge I updated create_pte_mapping() to work for iSeries, by
calling iSeries_hpte_bolt_or_insert. (4c55130b2a)
Later we changed iSeries_hpte_insert to cope with the bolting case, and called
that instead from create_pte_mapping() (which was renamed to htab_bolt_mapping)
(3c726f8dee).
Unfortunately that change introduced a subtle bug, where we pass an absolute
address to iSeries_hpte_insert() where it expects a physical address. This
leads to us calling phys_to_abs() twice on the physical address, which is
seriously bogus.
This only causes a problem if the absolute address from the first translation
can be looked up again in the chunk_map, which depends on the size and layout
of memory. I've seen it fail on one box, but not others.
The minimal fix is to pass the physical address to iSeries_hpte_insert(). For
2.6.17 we should make phys_to_abs() BUG if we try to double-translate an
address.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
native_hpte_clear has a spinlock recursion problem with the native_tlbie_lock
being called twice, once in native_hpte_clear() and once within tlbie().
Fix the problem by changing the call to tlbie() in native_hpte_clear() to
__tlbie(). It still supports only 4k pages for now.
Signed-off-by: R Sharada <sharada@in.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
For kexec we need to know the size of the MMU hash table.
Currently we calculate the size once in the htab code, and then twice more in
the kexec code, once using htab_hash_mask and once using ppc64_pft_size.
On some machines the ppc64_pft_size calculation is broken because
ppc64_pft_size is not set.
So we need to fix the second calculation, but better still we should just
calculate the size once and use it everywhere else.
Tested on Power5 LPAR, Power4 non-LPAR and Power3.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch removes all self references and fixes references to files
in the now defunct arch/ppc64 tree. I think this accomplises
everything wanted, though there might be a few references I missed.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
LMB_ALLOC_ANYWHERE doesn't need to be part of the API, it's only used in
lmb.c - so move it out of the header file.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Currently most callers of lmb_alloc() don't check if it worked or not, if it
ever does weird bad things will probably happen. The few callers who do check
just panic or BUG_ON.
So make lmb_alloc() panic internally, to catch bugs at the source. The few
callers who did check the result no longer need to.
The only caller that did anything interesting with the return result was
careful_allocation(). For it we create __lmb_alloc_base() which _doesn't_ panic
automatically, a little messy, but passable.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The code to mark a page as icache dirty (so that it will later be
icache-dcache flushed when we try to execute from it) is duplicated in
three places: flush_dcache_page() does this marking and nothing else,
but clear_user_page() and copy_user_page() duplicate it, since those
functions make the page icache dirty themselves.
This patch makes those other functions call flush_dcache_page()
instead, so the logic's all in one place. This will make life less
confusing if we ever need to tweak the details of the the lazy icache
flush mechanism.
arch/powerpc/mm/mem.c | 14 ++------------
1 file changed, 2 insertions(+), 12 deletions(-)
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
To prevent problems later in boot, make sure we don't create zero-size lmb
regions.
I've checked all the callers, and at the moment no one should ever hit this.
All callers use a constant size, or they check the computed size before they
call us.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
240-ioremap-null-ptr-test.patch
Under highly unusual circumstances, a buggy driver will ask a null ptr
to be ioremapped, an operation that curently succeeds but leads to
later trouble. Instead, refuse to remap the null pointer.
Signed-off-by: Linas Vepstas <linas@austin.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from e71d9e598533c1889e7162f5f4647e5d378c102c commit)
In add_memory() we should be using __va() to get a virtual address.
Spotted by Mike Kravetz.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When taking a DABR exception we were reporting the PC. It makes more
sense to report the address that caused the exception, and the gdb guys
would like it that way.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The system will oops if an attempt is made to add memory to an
empty node/zone. This patch prevents adding memory to an empty
node. The code to dynamically add a node/zone is non-trivial.
This patch is temporary and will be removed when the ability
to dynamically add a node/zone is complete.
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Building the arch/powerpc tree currently gives me
two warnings with gcc-4.0:
arch/powerpc/mm/imalloc.c: In function '__im_get_area':
arch/powerpc/mm/imalloc.c:225: warning: 'tmp' may be used uninitialized in this function
arch/powerpc/mm/hugetlbpage.c: In function 'hugetlb_get_unmapped_area':
arch/powerpc/mm/hugetlbpage.c:608: warning: unused variable 'vma'
both fixes are trivial.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On ppc64, we independently define VMALLOCBASE and VMALLOC_START to be
the same thing: the start of the vmalloc() area at 0xd000000000000000.
VMALLOC_START is used much more widely, including in generic code, so
this patch gets rid of the extraneous VMALLOCBASE.
This does require moving the definitions of region IDs from page_64.h
to pgtable.h, but they don't clearly belong in the former rather than
the latter, anyway. While we're moving them, clean up the definitions
of the REGION_IDs:
- Abolish REGION_SIZE, it was only used once, to define
REGION_MASK anyway
- Define the specific region ids in terms of the REGION_ID()
macro.
- Define KERNEL_REGION_ID in terms of PAGE_OFFSET rather than
KERNELBASE. It amounts to the same thing, but conceptually this is
about the region of the linear mapping (which starts at PAGE_OFFSET)
rather than of the kernel text itself (which is at KERNELBASE).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The pre-parsed addrs/n_addrs fields in struct device_node are finally
gone. Remove the dodgy heuristics that did that parsing at boot and
remove the fields themselves since we now have a good replacement with
the new OF parsing code. This patch also fixes a bunch of drivers to use
the new code instead, so that at least pmac32, pseries, iseries and g5
defconfigs build.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We used to print a NUMA cpu summary at boot before the hotplug cpu code
was added. This has been useful for catching machine configuration as
well as firmware bugs in the past.
This patch restores that functionality. An example of the output is:
Node 0 CPUs: 0-7
Node 1 CPUs: 8-15
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Milton has proposed that we should support a "linux,usable-memory" property
on memory nodes which describes, in preference to "reg", the regions of memory
Linux should use.
This facility is required for kdump to inform the second kernel which memory
it should use.
Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This places dynamically added memory within the appropriate
numa node. A new routine hot_add_scn_to_nid() replicates most of
the memory scanning code in parse_numa_properties().
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch separates usage of KERNELBASE and PAGE_OFFSET. I haven't
looked at any of the PPC32 code, if we ever want to support Kdump on
PPC we'll have to do another audit, ditto for iSeries.
This patch makes PAGE_OFFSET the constant, it'll always be 0xC * 1
gazillion for 64-bit.
To get a physical address from a virtual one you subtract PAGE_OFFSET,
_not_ KERNELBASE.
KERNELBASE is the virtual address of the start of the kernel, it's
often the same as PAGE_OFFSET, but _might not be_.
If you want to know something's offset from the start of the kernel
you should subtract KERNELBASE.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There's a bunch of code that compares an address with KERNELBASE to see if
it's a "kernel address", ie. >= KERNELBASE. The proper test is actually to
compare with PAGE_OFFSET, since we're going to change KERNELBASE soon.
So replace all of them with an is_kernel_addr() macro that does that.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Here is an updated version of the patch that panics if no memory is
found as Nathan suggested. I'm still concerned that panic strings
(not just the one added here) at this stage of booting do not show
up on my system. But, that is an issue separate from this patch.
Combine get_mem_*_cells() routines to avoid multiple memory node
lookups. Added missing of_node_put() call. Changed variable names
to help with some confusion as to meaning.
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
I started to add missing of_node_put() calls to the routines that
determine the number of cells for memory. Decided to combine the
routines instead of making separate node lookups. Changed variable
names to help with some confusion as to meaning.
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Currently, the powerpc version of hugetlb_get_unmapped_area() entirely
ignores the hint address. The only way to get a hugepage mapping at a
specified address is with MAP_FIXED, in which case there's no way
(short of parsing /proc/self/maps) for userspace to tell if it will
clobber an existing mapping. This is inconvenient, so the patch below
makes hugepage mappings use the given hint address if possible.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch unifies udbg for both ppc32 and ppc64 when building the
merged achitecture. xmon now has a single "back end". The powermac udbg
stuff gets enriched with some ADB capabilities and btext output. In
addition, the early_init callback is now called on ppc32 as well,
approx. in the same order as ppc64 regarding device-tree manipulations.
The init sequences of ppc32 and ppc64 are getting closer, I'll unify
them in a later patch.
For now, you can force udbg to the scc using "sccdbg" or to btext using
"btextdbg" on powermacs. I'll implement a cleaner way of forcing udbg
output to something else than the autodetected OF output device in a
later patch.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This is the current version of the spu file system, used
for driving SPEs on the Cell Broadband Engine.
This release is almost identical to the version for the
2.6.14 kernel posted earlier, which is available as part
of the Cell BE Linux distribution from
http://www.bsc.es/projects/deepcomputing/linuxoncell/.
The first patch provides all the interfaces for running
spu application, but does not have any support for
debugging SPU tasks or for scheduling. Both these
functionalities are added in the subsequent patches.
See Documentation/filesystems/spufs.txt on how to use
spufs.
Signed-off-by: Arnd Bergmann <arndb@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Sonny has noticed hotplug CPU on ppc64 is broken in 2.6.15-*. One of the
problems is that htab_initialize_secondary is called when a cpu is being
brought up, but it is marked __init.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On ppc64, when opening a new hugepage region, we need to make sure any
old normal-page SLBs for the area are flushed on all CPUs. There was
a bug in this logic - after putting the new hugepage area masks into
the thread structure, we copied it into the paca (read by the SLB miss
handler) only on one CPU, not on all. This could cause incorrect SLB
entries to be loaded when a multithreaded program was running
simultaneously on several CPUs. This patch corrects the error,
copying the context information into the PACA on all CPUs using the mm
in question before flushing any existing SLB entries.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On most powerpc CPUs, the dcache and icache are not coherent so
between writing and executing a page, the caches must be flushed.
Userspace programs assume pages given to them by the kernel are icache
clean, so we must do this flush between the kernel clearing a page and
it being mapped into userspace for execute. We were not doing this
for hugepages, this patch corrects the situation.
We use the same lazy mechanism as we use for normal pages, delaying
the flush until userspace actually attempts to execute from the page
in question.
Tested on G5.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The 64k pages patch changed the meaning of one argument passed to the
low level hash functions (from "large" it became "psize" or page size
index), but one of the call sites wasn't properly updated, causing
potential random weird problems with huge pages. This fixes it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This bug exists in the current code and prevents machines from booting
with numa enabled if there is a node that does not contain memory.
Workaround is to boot with 'numa=off'. Looks like a simple typo.
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
There's never been a hardware platform that has both pSeries/RPA LPAR
hypervisor and stab (pre-POWER4 segment management). This removes
the redundant code in stab_initalize().
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Blah. The patch [0] I recently sent fixing errors with
in_hugepage_area() and prepare_hugepage_range() for powerpc itself has
an off-by-one bug. Furthermore, the related functions
touches_hugepage_*_range() and within_hugepage_*_range() are also
buggy. Some of the bugs, like those addressed in [0] originated with
commit 7d24f0b8a5 where we tweaked the
semantics of where hugepages are allowed. Other bugs have been there
essentially forever, and are due to the undefined behaviour of '<<'
with shift counts greater than the type width (LOW_ESID_MASK could
return non-zero for high ranges with the right congruences).
The good news is that I now have a testsuite which should pick up
things like this if they creep in again.
[0] "powerpc-fix-for-hugepage-areas-straddling-4gb-boundary"
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Commit 7d24f0b8a5 fixed bugs in the ppc64 SLB
miss handler with respect to hugepage handling, and in the process tweaked
the semantics of the hugepage address masks in mm_context_t.
Unfortunately, it left out a couple of necessary changes to go with that
change. First, the in_hugepage_area() macro was not updated to match,
second prepare_hugepage_range() was not updated to correctly handle
hugepages regions which straddled the 4GB point.
The latter appears only to cause process-hangs when attempting to map such
a region, but the former can cause oopses if a get_user_pages() is
triggered at the wrong point. This patch addresses both bugs.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Restore an earlier mod which went missing in the powerpc reshuffle: the 4xx
mmu_mapin_ram does not need to take init_mm.page_table_lock.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update comments (only) on page_table_lock and mmap_sem in arch/powerpc.
Removed the comment on page_table_lock from hash_huge_page: since it's no
longer taking page_table_lock itself, it's irrelevant whether others are; but
how it is safe (even against huge file truncation?) I can't say.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
asm-ppc64/imalloc.h is only included from files in arch/powerpc/mm.
We already have a header for mm local definitions,
arch/powerpc/mm/mmu_decl.h. Thus, this patch moves the contents of
imalloc.h into mmu_decl.h. The only exception are the definitions of
PHBS_IO_BASE, IMALLOC_BASE and IMALLOC_END. Those are moved into
pgtable.h, next to similar definitions of VMALLOC_START and
VMALLOC_SIZE.
Built for multiplatform 32bit and 64bit (ARCH=powerpc).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Somewhere we lost the include of udbg.h in lmb.c. While we're there, add a DBG
macro like every other file has and use it in lmb_dump_all().
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch should fix the crashes we have been seeing on 64-bit
powerpc systems with a memory hole when sparsemem is enabled.
I'd appreciate it if people who know more about NUMA and sparsemem
than me could look over it.
There were two bugs. The first was that if NUMA was enabled but there
was no NUMA information for the machine, the setup_nonnuma() function
was adding a single region, assuming memory was contiguous. The
second was that the loops in mem_init() and show_mem() assumed that
all pages within the span of a pgdat were valid (had a valid struct
page).
I also fixed the incorrect setting of num_physpages that Mike Kravetz
pointed out.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Changed jobs and the Freescale address is no longer valid.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch moves the vdso's to arch/powerpc, adds support for the 32
bits vdso to the 32 bits kernel, rename systemcfg (finally !), and adds
some new (still untested) routines to both vdso's: clock_gettime() with
support for CLOCK_REALTIME and CLOCK_MONOTONIC, clock_getres() (same
clocks) and get_tbfreq() for glibc to retreive the timebase frequency.
Tom,Steve: The implementation of get_tbfreq() I've done for 32 bits
returns a long long (r3, r4) not a long. This is such that if we ever
add support for >4Ghz timebases on ppc32, the userland interface won't
have to change.
I have tested gettimeofday() using some glibc patches in both ppc32 and
ppc64 kernels using 32 bits userland (I haven't had a chance to test a
64 bits userland yet, but the implementation didn't change and was
tested earlier). I haven't tested yet the new functions.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Convert to sparsemem and remove all the discontigmem code in the
process. This has a few advantages:
- The old numa_memory_lookup_table can go away
- All the arch specific discontigmem magic can go away
We also remove the triple pass of memory properties and instead create a
list of per node extents that we iterate through. A final cleanup would
be to change our lmb code to store extents per node, then we can reuse
that information in the numa code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Remove ppc64 specific version of nr_cpus_node and use the generic one
provided.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This also make klimit have the same type on 32-bit as on 64-bit,
namely unsigned long, and defines and initializes it in one place.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch makes the kernel use a different kmem cache for PMD pages
as they are smaller than PTE pages. Avoids waste of memory.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This patch merges platform codes. systemcfg->platform is no longer used,
systemcfg use in general is deprecated as much as possible (and renamed
_systemcfg before it gets completely moved elsewhere in a future patch),
_machine is now used on ppc64 along as ppc32. Platform codes aren't gone
yet but we are getting a step closer. A bunch of asm code in head[_64].S
is also turned into C code.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Early calls to __ioremap() will panic if the hash insertion fails. This
patch makes them return NULL instead. It happens with some pSeries users
who enabled CONFIG_BOOTX_TEXT. The later is getting an incorrect address
for the fame buffer and the hash insertion fails. With this patch, it
will display an error instead of crashing at boot.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
memmap_init_zone() sets page count to 1. Before 'freeing' the
page, we need to clear the count. This is the same that is done
on free_all_bootmem_core() for memory discovered at boot time.
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Add the create_section_mapping() routine to create hptes for memory
sections dynamically added after system boot.
Signed-off-by: Mike Kravetz <kravetz@us.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
For some stupid reason I can't explain (brown paper bag is at hand), I
removed the check pfn_valid() in the code that does the icache/dcache
coherency on POWER4 and later. That causes us to eventually try to
access non existing struct page when hashing in IO pages.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
On ppc64 we end up with a negative value for the data size in the memory
boot message:
Memory: 2035560k/2097152k available (5792k kernel code, 89564k reserved,
18014398509481632k data, 870k bss, 352k init)
It turns out the section ordering of the linker script is different on
ppc32 and ppc64, so just count data as _edata - _sdata which should work
on both.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Oops, some last minute changes caused the 64K pages patch to break ppc32
build, this fixes it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch, however, should be applied on top of the 64k-page-size patch to
fix some problems with hugepage (some pre-existing, another introduced by
this patch).
The patch fixes a bug in the SLB miss handler for hugepages on ppc64
introduced by the dynamic hugepage patch (commit id
c594adad56) due to a misunderstanding of the
srd instruction's behaviour (mea culpa). The problem arises when a 64-bit
process maps some hugepages in the low 4GB of the address space (unusual).
In this case, as well as the 256M segment in question being marked for
hugepages, other segments at 32G intervals will be incorrectly marked for
hugepages.
In the process, this patch tweaks the semantics of the hugepage bitmaps to
be more sensible. Previously, an address below 4G was marked for hugepages
if the appropriate segment bit in the "low areas" bitmask was set *or* if
the low bit in the "high areas" bitmap was set (which would mark all
addresses below 1TB for hugepage). With this patch, any given address is
governed by a single bitmap. Addresses below 4GB are marked for hugepage
if and only if their bit is set in the "low areas" bitmap (256M
granularity). Addresses between 4GB and 1TB are marked for hugepage iff
the low bit in the "high areas" bitmap is set. Higher addresses are marked
for hugepage iff their bit in the "high areas" bitmap is set (1TB
granularity).
To avoid conflicts, this patch must be applied on top of BenH's pending
patch for 64k base page size [0]. As such, this patch also addresses a
hugepage problem introduced by that patch. That patch allows hugepages of
1MB in size on hardware which supports it, however, that won't work when
using 4k pages (4 level pagetable), because in that case hugepage PTEs are
stored at the PMD level, and each PMD entry maps 2MB. This patch simply
disallows hugepages in that case (we can do something cleverer to re-enable
them some other day).
Built, booted, and a handful of hugepage related tests passed on POWER5
LPAR (both ARCH=powerpc and ARCH=ppc64).
[0] http://gate.crashing.org/~benh/ppc64-64k-pages.diff
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Mostly this involves adding #include <asm/smp.h>, since that defines
things like boot_cpuid[_phys] and [gs]et_hard_smp_processor_id, which
are SMP-related but still needed on UP. This incorporates fixes
posted by Olof Johansson and Heikki Lindholm.
Signed-off-by: Paul Mackerras <paulus@samba.org>
The ancient ppcdebug/PPCDBG mechanism is now only used in two places.
First, in the hash setup code, one of the bits allows the size of the
hash table to be reduced by a factor of 8 - which would be better
accomplished with a command line option for that purpose. The other
was a bunch of bus walking related messages in the iSeries code, which
would seem to be insufficient reason to keep the mechanism.
This patch removes the last traces of this mechanism.
Built and booted on iSeries and pSeries POWER5 LPAR (ARCH=powerpc).
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Add nicer printing of faulting address on unresolvable kernel faults.
Makes life a little easier for those who don't know how to decode our
register contents at oops time.
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel
base page size to 64K. The resulting kernel still boots on any
hardware. On current machines with 4K pages support only, the kernel
will maintain 16 "subpages" for each 64K page transparently.
Note that while real 64K capable HW has been tested, the current patch
will not enable it yet as such hardware is not released yet, and I'm
still verifying with the firmware architects the proper to get the
information from the newer hypervisors.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We had a static memory_limit in prom.c, and then another one defined
in setup_64.c and used in numa.c, which resulted in the kernel crashing
when mem=xxx was given on the command line. This puts the declaration
in system.h and the definition in mem.c. This also moves the
definition of tce_alloc_start/end out of setup_64.c.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Change the phys_mem_access_prot() function to take a pfn instead of an
address. This allows mmap64() to work on /dev/mem for addresses above 4G
on 32-bit architectures. We start with a pfn in mmap_mem(), so there's no
need to convert to an address; in fact, it's actively bad, since the
conversion can overflow when the address is above 4G.
Similarly fix the ppc32 page_is_ram() function to avoid a conversion to an
address by directly comparing to max_pfn. Working with max_pfn instead of
high_memory fixes page_is_ram() to give the right answer for highmem pages.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
do_dabr() is not relevant on 40x or Book-E processors so dont build it
Signed-off-by: Kumar K. Gala <kumar.gala@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
With ARCH=powerpc we assume the presence of a device tree, so we don't
require any support for the old bi_recs method of passing boot
parameters. Likewise, we've never needed it for ppc64, but we still
had an include/asm-ppc64/bootinfo.h from which nothing was used. This
patch removes that file, and all references to it in arch/ppc64 and
arch/powerpc. A related, unused variable 'boot_mem_size' is also
removed from setup_32.c. The bootinfo stuff remains in ARCH=ppc for
the time being.
Built and booted on Power5 (ARCH=ppc64 and ARCH=powerpc), built for
32-bit powermac (ARCH=powerpc and ARCH=ppc).
Signed-off-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some minor fixes that are needed if we are building for a book-e
processor.
Signed-off-by: Kumar K. Gala <kumar.gala@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We were initializing the btext stuff from prom_init(), thus breaking
the rule that all communication between prom_init() and the rest of
the kernel has to be via the flattened device tree. This removes
the btext initialization calls from prom_init() and initializes it
instead after the device tree is unflattened. It would be nice to
do it earlier, but that needs some more infrastructure to find the
properties we need in the flattened device tree.
Signed-off-by: Paul Mackerras <paulus@samba.org>
We weren't computing the size of the hash table correctly on iSeries
because the relevant code in prom.c was #ifdef CONFIG_PPC_PSERIES.
This moves the code to hash_utils_64.c, makes it unconditional, and
cleans it up a bit.
Signed-off-by: Paul Mackerras <paulus@samba.org>
On ARCH=ppc64 we were getting htab_hash_mask recalculated
to the correct value for our particular machine by accident.
In the merge tree, that code was commented out, so htab_hash_mask
was being corrupted.
We now set ppc64_pft_size instead which gets htab_has_mask
calculated correctly for us later. We should put an
ibm,pft-size property in the device tree at some point.
Also set -mno-minimal-toc in some makefiles.
Allow iSeries to configure PROC_DEVICETREE.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Now that the register names and bit definitions are all in reg.h,
use that instead of processor.h in assembly code in a few places.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This moves the remaining files in arch/ppc64/mm to arch/powerpc/mm,
and arranges that we use them when compiling with ARCH=ppc64.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This doesn't change any code, just renames things so we consistently
have foo_32.c and foo_64.c where we have separate 32- and 64-bit
versions.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This also creates merged versions of do_init_bootmem, paging_init
and mem_init and moves them to arch/powerpc/mm/mem.c. It gets rid
of the mem_pieces stuff.
I made memory_limit a parameter to lmb_enforce_memory_limit rather
than a global referenced by that function. This will require some
small changes to ppc64 if we want to continue building ARCH=ppc64
using the merged lmb.c.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This merges ppc_ksyms.c, puts back the actual do_execve call in
sys_execve, makes init_MMU call find_end_of_memory rather than
ppc_md.find_end_of_memory (every platform has a device tree
with a /memory node now, right?) and fixes some problems with the
mpic initialization on newworld powermacs.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This creates the directory structure under arch/powerpc and a bunch
of Kconfig files. It does a first-cut merge of arch/powerpc/mm,
arch/powerpc/lib and arch/powerpc/platforms/powermac. This is enough
to build a 32-bit powermac kernel with ARCH=powerpc.
For now we are getting some unmerged files from arch/ppc/kernel and
arch/ppc/syslib, or arch/ppc64/kernel. This makes some minor changes
to files in those directories and files outside arch/powerpc.
The boot directory is still not merged. That's going to be interesting.
Signed-off-by: Paul Mackerras <paulus@samba.org>