linux/arch/powerpc/mm
Dave Hansen 06eccea6c3 powerpc/mm: Fix numa reserve bootmem page selection
Fix the powerpc NUMA reserve bootmem page selection logic.

commit 8f64e1f2d1 (powerpc: Reserve
in bootmem lmb reserved regions that cross NUMA nodes) changed
the logic for how the powerpc LMB reserved regions were converted
to bootmen reserved regions.  As the folowing discussion reports,
the new logic was not correct.

mark_reserved_regions_for_nid() goes through each LMB on the
system that specifies a reserved area.  It searches for
active regions that intersect with that LMB and are on the
specified node.  It attempts to bootmem-reserve only the area
where the active region and the reserved LMB intersect.  We
can not reserve things on other nodes as they may not have
bootmem structures allocated, yet.

We base the size of the bootmem reservation on two possible
things.  Normally, we just make the reservation start and
stop exactly at the start and end of the LMB.

However, the LMB reservations are not aware of NUMA nodes and
on occasion a single LMB may cross into several adjacent
active regions.  Those may even be on different NUMA nodes
and will require separate calls to the bootmem reserve
functions.  So, the bootmem reservation must be trimmed to
fit inside the current active region.

That's all fine and dandy, but we trim the reservation
in a page-aligned fashion.  That's bad because we start the
reservation at a non-page-aligned address: physbase.

The reservation may only span 2 bytes, but that those bytes
may span two pfns and cause a reserve_size of 2*PAGE_SIZE.

Take the case where you reserve 0x2 bytes at 0x0fff and
where the active region ends at 0x1000.  You'll jump into
that if() statment, but node_ar.end_pfn=0x1 and
start_pfn=0x0.  You'll end up with a reserve_size=0x1000,
and then call

  reserve_bootmem_node(node, physbase=0xfff, size=0x1000);

0x1000 may not be on the same node as 0xfff.  Oops.

In almost all the vm code, end_<anything> is not inclusive.
If you have an end_pfn of 0x1234, page 0x1234 is not
included in the range.  Using PFN_UP instead of the
(>> >> PAGE_SHIFT) will make this consistent with the other VM
code.

We also need to do math for the reserved size with physbase
instead of start_pfn.  node_ar.end_pfn << PAGE_SHIFT is
*precisely* the end of the node.  However,
(start_pfn << PAGE_SHIFT) is *NOT* precisely the beginning
of the reserved area.  That is, of course, physbase.
If we don't use physbase here, the reserve_size can be
made too large.

From: Dave Hansen <dave@linux.vnet.ibm.com>
Tested-by: Geoff Levand <geoffrey.levand@am.sony.com>  Tested on PS3.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2009-02-13 16:37:45 +11:00
..
40x_mmu.c powerpc/40x: Limit allocable DRAM during early mapping 2008-11-13 10:10:56 -05:00
44x_mmu.c powerpc: rework 4xx PTE access and TLB miss 2008-07-09 13:36:17 -04:00
fault.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc 2008-12-28 16:54:33 -08:00
fsl_booke_mmu.c powerpc/fsl-booke: Fix mapping functions to use phys_addr_t 2009-02-09 21:11:55 -06:00
gup.c powerpc: Get USE_STRICT_MM_TYPECHECKS working again 2008-10-14 10:35:27 +11:00
hash_low_32.S powerpc/mm: Fix _PAGE_COHERENT support on classic ppc32 HW 2009-02-11 16:07:02 +11:00
hash_low_64.S powerpc: Free a PTE bit on ppc64 with 64K pages 2008-06-30 22:30:53 +10:00
hash_native_64.c [POWERPC] Use 1TB segments 2007-10-12 14:05:17 +10:00
hash_utils_64.c powerpc: Don't use a 16G page if beyond mem= limits 2008-10-22 15:01:21 +11:00
hugetlbpage.c mm: report the MMU pagesize in /proc/pid/smaps 2009-01-06 15:58:58 -08:00
init_32.c powerpc/32: Add the ability for a classic ppc kernel to be loaded at 32M 2008-12-23 15:13:29 +11:00
init_64.c powerpc: Get USE_STRICT_MM_TYPECHECKS working again 2008-10-14 10:35:27 +11:00
Makefile powerpc/mm: Split low level tlb invalidate for nohash processors 2008-12-21 14:21:16 +11:00
mem.c mm: show node to memory section relationship with symlinks in sysfs 2009-01-06 15:59:00 -08:00
mmap.c
mmu_context_hash32.c powerpc/mm: Split mmu_context handling 2008-12-21 14:21:15 +11:00
mmu_context_hash64.c powerpc/mm: Split mmu_context handling 2008-12-21 14:21:15 +11:00
mmu_context_nohash.c powerpc/mm: Runtime allocation of mmu context maps for nohash CPUs 2008-12-21 14:21:16 +11:00
mmu_decl.h Merge commit 'kumar/kumar-next' into next 2009-01-13 13:59:03 +11:00
numa.c powerpc/mm: Fix numa reserve bootmem page selection 2009-02-13 16:37:45 +11:00
pgtable_32.c powerpc/fsl-booke: Fix mapping functions to use phys_addr_t 2009-02-09 21:11:55 -06:00
pgtable_64.c powerpc ioremap_prot 2008-07-24 10:47:15 -07:00
pgtable.c powerpc: Use RCU based pte freeing mechanism for all powerpc 2008-12-03 20:46:35 +11:00
ppc_mmu_32.c powerpc/mm: Fix handling of _PAGE_COHERENT in BAT setup code 2009-01-28 17:15:52 +11:00
slb_low.S [POWERPC] vmemmap fixes to use smaller pages 2008-05-15 20:49:25 +10:00
slb.c [POWERPC] vmemmap fixes to use smaller pages 2008-05-15 20:49:25 +10:00
slice.c powerpc: is_hugepage_only_range() must account for both 4kB and 64kB slices 2009-01-16 16:15:16 +11:00
stab.c powerpc: Change u64/s64 to a long long integer type 2009-01-13 14:47:59 +11:00
subpage-prot.c [POWERPC] Provide a way to protect 4k subpages when using 64k pages 2008-01-24 10:06:01 +11:00
tlb_hash32.c powerpc/mm: Add SMP support to no-hash TLB handling 2008-12-21 14:21:16 +11:00
tlb_hash64.c powerpc/mm: Rename tlb_32.c and tlb_64.c to tlb_hash32.c and tlb_hash64.c 2008-12-16 15:53:30 +11:00
tlb_nohash_low.S powerpc/44x: No need to mask MSR:CE, ME or DE in _tlbil_va on 440 2008-12-21 14:21:16 +11:00
tlb_nohash.c powerpc: Remove the redundant _tlbil_pid at SMP case 2009-01-08 16:25:13 +11:00