linux/arch/x86/mm
Tang Chen 95cf82ecc1 mem-hotplug: handle node hole when initializing numa_meminfo.
When parsing SRAT, all memory ranges are added into numa_meminfo.  In
numa_init(), before entering numa_cleanup_meminfo(), all possible memory
ranges are in numa_meminfo.  And numa_cleanup_meminfo() removes all
ranges over max_pfn or empty.

But, this only works if the nodes are continuous.  Let's have a look at
the following example:

We have an SRAT like this:
SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug

On boot, only node 0,1,2,3 exist.

And the numa_meminfo will look like this:
numa_meminfo.nr_blks = 9
1. on node 0: [0, 60000000]
2. on node 0: [100000000, 20000000000]
3. on node 1: [20000000000, 40000000000]
4. on node 4: [40000000000, 60000000000]
5. on node 5: [60000000000, 80000000000]
6. on node 2: [80000000000, a0000000000]
7. on node 3: [a0000000000, a0800000000]
8. on node 6: [c0000000000, a0800000000]
9. on node 7: [e0000000000, a0800000000]

And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because the
end address is over max_pfn, which is a0800000000.  But 4 and 5 are not
removed because their end addresses are less then max_pfn.  But in fact,
node 4 and 5 don't exist.

In a word, numa_cleanup_meminfo() is not able to handle holes between nodes.

Since memory ranges in node 4 and 5 are in numa_meminfo, in
numa_register_memblks(), node 4 and 5 will be mistakenly set to online.

If you run lscpu, it will show:
NUMA node0 CPU(s):     0-14,128-142
NUMA node1 CPU(s):     15-29,143-157
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s):     62-76,190-204
NUMA node5 CPU(s):     78-92,206-220

In this patch, we use memblock_overlaps_region() to check if ranges in
numa_meminfo overlap with ranges in memory_block.  Since memory_block
contains all available memory at boot time, if they overlap, it means the
ranges exist.  If not, then remove them from numa_meminfo.

After this patch, lscpu will show:
NUMA node0 CPU(s):     0-14,128-142
NUMA node1 CPU(s):     15-29,143-157
NUMA node4 CPU(s):     62-76,190-204
NUMA node5 CPU(s):     78-92,206-220

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
..
kmemcheck x86: Replace __get_cpu_var uses 2014-08-26 13:45:49 -04:00
amdtopology.c x86/mm/numa: Simplify some bit mangling 2013-04-10 19:06:26 +02:00
dump_pagetables.c Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-10 13:59:34 -08:00
extable.c x86, extable: Switch to relative exception table entries 2012-04-20 17:22:34 -07:00
fault.c x86/vm86: Clean up vm86.h includes 2015-07-31 13:31:10 +02:00
gup.c mm: convert p[te|md]_numa users to p[te|md]_protnone_numa 2015-02-12 18:54:08 -08:00
highmem_32.c sched/preempt, mm/kmap: Explicitly disable/enable preemption in kmap_atomic_* 2015-05-19 08:39:14 +02:00
hugetlbpage.c mm/hugetlb: pmd_huge() returns true for non-present hugepage 2015-02-11 17:06:01 -08:00
init_32.c Linux 4.2-rc8 2015-08-25 09:59:19 +02:00
init_64.c x86/mm: Use early_param_on_off() for direct_gbpages 2015-03-05 08:02:12 +01:00
init.c x86/mm/pat: Add comments to cachemode translation tables 2015-08-20 21:26:12 +02:00
iomap_32.c Merge branch 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-06-22 17:59:09 -07:00
ioremap.c x86/mm: Fix newly introduced printk format warnings 2015-07-24 16:35:33 +02:00
kasan_init_64.c x86/kasan, mm: Introduce generic kasan_populate_zero_shadow() 2015-08-22 14:54:55 +02:00
kmmio.c x86: Delete non-required instances of include <linux/init.h> 2014-01-06 21:25:18 -08:00
Makefile mm: move memtest under mm 2015-04-14 16:49:06 -07:00
mm_internal.h x86: Enable PAT to use cache mode translation tables 2014-11-16 11:04:26 +01:00
mmap.c x86/mpx: Do not set ->vm_ops on MPX VMAs 2015-07-21 07:57:16 +02:00
mmio-mod.c x86: delete __cpuinit usage from all x86 files 2013-07-14 19:36:56 -04:00
mpx.c x86/mpx: Do not set ->vm_ops on MPX VMAs 2015-07-21 07:57:16 +02:00
numa_32.c x86: Fix the initialization of physnode_map 2014-02-01 22:15:51 -08:00
numa_64.c x86, mm: kill numa_free_all_bootmem() 2012-11-17 11:59:47 -08:00
numa_emulation.c x86: delete __cpuinit usage from all x86 files 2013-07-14 19:36:56 -04:00
numa_internal.h x86-32, mm: Rip out x86_32 NUMA remapping code 2013-01-31 14:12:30 -08:00
numa.c mem-hotplug: handle node hole when initializing numa_meminfo. 2015-09-08 15:35:28 -07:00
pageattr-test.c x86/mm/pat: Make mm/pageattr[-test].c explicitly non-modular 2015-08-25 09:48:38 +02:00
pageattr.c x86/mm/pat: Make mm/pageattr[-test].c explicitly non-modular 2015-08-25 09:48:38 +02:00
pat_internal.h x86/mm/pat: Convert to pr_*() usage 2015-05-27 14:40:59 +02:00
pat_rbtree.c x86/mm/pat: Convert to pr_*() usage 2015-05-27 14:40:59 +02:00
pat.c x86/mm/pat: Extend set_page_memtype() to support Write-Through type 2015-06-07 15:28:59 +02:00
pf_in.c x86: Eliminate various 'set but not used' warnings 2011-05-21 19:10:33 +02:00
pf_in.h
pgtable_32.c x86: Remove set_pmd_pfn 2014-09-01 10:15:31 +02:00
pgtable.c x86/mm/mtrr: Enhance MTRR checks in kernel mapping helpers 2015-05-27 14:40:58 +02:00
physaddr.c x86, mm: Make DEBUG_VIRTUAL work earlier in boot 2013-01-25 16:33:22 -08:00
physaddr.h
setup_nx.c x86: delete __cpuinit usage from all x86 files 2013-07-14 19:36:56 -04:00
srat.c x86/mm: Avoid duplicated pxm_to_node() calls 2014-02-09 15:32:31 +01:00
testmmiotrace.c
tlb.c x86, mm: trace when an IPI is about to be sent 2015-09-04 16:54:41 -07:00