linux/arch/x86/mm
Oscar Salvador 1ca75fa7f1 arch/x86/mm/numa: Do not initialize nodes twice
On x86, prior to ("mm: handle uninitialized numa nodes gracecully"), NUMA
nodes could be allocated at three different places.

 - numa_register_memblks
 - init_cpu_to_node
 - init_gi_nodes

All these calls happen at setup_arch, and have the following order:

setup_arch
  ...
  x86_numa_init
   numa_init
    numa_register_memblks
  ...
  init_cpu_to_node
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node
  init_gi_nodes
   init_memory_less_node
    alloc_node_data
    free_area_init_memoryless_node

numa_register_memblks() is only interested in those nodes which have
memory, so it skips over any memoryless node it founds.  Later on, when
we have read ACPI's SRAT table, we call init_cpu_to_node() and
init_gi_nodes(), which initialize any memoryless node we might have that
have either CPU or Initiator affinity, meaning we allocate pg_data_t
struct for them and we mark them as ONLINE.

So far so good, but the thing is that after ("mm: handle uninitialized
numa nodes gracefully"), we allocate all possible NUMA nodes in
free_area_init(), meaning we have a picture like the following:

setup_arch
  x86_numa_init
   numa_init
    numa_register_memblks  <-- allocate non-memoryless node
  x86_init.paging.pagetable_init
   ...
    free_area_init
     free_area_init_memoryless <-- allocate memoryless node
  init_cpu_to_node
   alloc_node_data             <-- allocate memoryless node with CPU
   free_area_init_memoryless_node
  init_gi_nodes
   alloc_node_data             <-- allocate memoryless node with Initiator
   free_area_init_memoryless_node

free_area_init() already allocates all possible NUMA nodes, but
init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
go ahead and allocate a new pg_data_t struct without checking anything,
meaning we end up allocating twice.

It should be mad clear that this only happens in the case where
memoryless NUMA node happens to have a CPU/Initiator affinity.

So get rid of init_memory_less_node() and just set the node online.

Note that setting the node online is needed, otherwise we choke down the
chain when bringup_nonboot_cpus() ends up calling
__try_online_node()->register_one_node()->...  and we blow up in
bus_add_device().  As can be seen here:

  BUG: kernel NULL pointer dereference, address: 0000000000000060
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.0-rc4-1-default+ #45
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/4
  RIP: 0010:bus_add_device+0x5a/0x140
  Code: 8b 74 24 20 48 89 df e8 84 96 ff ff 85 c0 89 c5 75 38 48 8b 53 50 48 85 d2 0f 84 bb 00 004
  RSP: 0000:ffffc9000022bd10 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888100987400 RCX: ffff8881003e4e19
  RDX: ffff8881009a5e00 RSI: ffff888100987400 RDI: ffff888100987400
  RBP: 0000000000000000 R08: ffff8881003e4e18 R09: ffff8881003e4c98
  R10: 0000000000000000 R11: ffff888100402bc0 R12: ffffffff822ceba0
  R13: 0000000000000000 R14: ffff888100987400 R15: 0000000000000000
  FS:  0000000000000000(0000) GS:ffff88853fc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000060 CR3: 000000000200a001 CR4: 00000000001706b0
  Call Trace:
   device_add+0x4c0/0x910
   __register_one_node+0x97/0x2d0
   __try_online_node+0x85/0xc0
   try_online_node+0x25/0x40
   cpu_up+0x4f/0x100
   bringup_nonboot_cpus+0x4f/0x60
   smp_init+0x26/0x79
   kernel_init_freeable+0x130/0x2f1
   kernel_init+0x17/0x150
   ret_from_fork+0x22/0x30

The reason is simple, by the time bringup_nonboot_cpus() gets called, we
did not register the node_subsys bus yet, so we crash when
bus_add_device() tries to dereference bus()->p.

The following shows the order of the calls:

kernel_init_freeable
 smp_init
  bringup_nonboot_cpus
   ...
     bus_add_device()      <- we did not register node_subsys yet
 do_basic_setup
  do_initcalls
   postcore_initcall(register_node_type);
    register_node_type
     subsys_system_register
      subsys_register
       bus_register         <- register node_subsys bus

Why setting the node online saves us then? Well, simply because
__try_online_node() backs off when the node is online, meaning we do not
end up calling register_one_node() in the first place.

This is subtle, broken and deserves a deep analysis and thought about
how to put this into shape, but for now let us have this easy fix for
the leaking memory issue.

[osalvador@suse.de: add comments]
  Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de

Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de
Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully")
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Rafael Aquini <raquini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Alexey Makhalov <amakhalov@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:06 -07:00
..
pat Merge branch 'kvm-guest-sev-migration' into kvm-master 2021-11-11 07:40:26 -05:00
amdtopology.c
cpu_entry_area.c x86/sev: Make the #VC exception stacks part of the default stacks storage 2021-10-06 21:48:27 +02:00
debug_pagetables.c mm: introduce include/linux/pgtable.h 2020-06-09 09:39:13 -07:00
dump_pagetables.c mm: don't include asm/pgtable.h if linux/mm.h is already included 2020-06-09 09:39:13 -07:00
extable.c x86/entry_32: Fix segment exceptions 2022-01-12 16:38:25 +01:00
fault.c mm: remove redundant check about FAULT_FLAG_ALLOW_RETRY bit 2022-01-15 16:30:27 +02:00
highmem_32.c x86/mm/highmem: Use generic kmap atomic implementation 2020-11-06 23:14:55 +01:00
hugetlbpage.c mm: remove unneeded includes of <asm/pgalloc.h> 2020-08-07 11:33:26 -07:00
ident_map.c x86/mm/ident_map: Check for errors from ident_pud_init() 2020-10-28 14:48:30 +01:00
init_32.c x86: mm: rename __is_kernel_text() to is_x86_32_kernel_text() 2021-11-09 10:02:51 -08:00
init_64.c bootmem: Use page->index instead of page->freelist 2022-01-06 12:27:03 +01:00
init.c x86/realmode: Add comment for Global bit usage in trampoline_pgd 2021-12-04 13:50:08 +01:00
iomap_32.c io-mapping: Cleanup atomic iomap 2020-11-06 23:14:58 +01:00
ioremap.c x86/boot: Add setup_indirect support in early_memremap_is_setup_data() 2022-03-09 12:49:46 +01:00
kasan_init_64.c memblock: use memblock_free for freeing virtual pointers 2021-11-06 13:30:41 -07:00
kaslr.c x86: Fix various typos in comments 2021-03-18 15:31:53 +01:00
kmmio.c x86: Fix various typos in comments 2021-03-18 15:31:53 +01:00
maccess.c maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault 2020-06-17 10:57:41 -07:00
Makefile x86/sev: Move common memory encryption code to mem_encrypt.c 2021-12-08 16:49:53 +01:00
mem_encrypt_amd.c x86/sev: Move common memory encryption code to mem_encrypt.c 2021-12-08 16:49:53 +01:00
mem_encrypt_boot.S x86: Prepare asm files for straight-line-speculation 2021-12-08 12:25:37 +01:00
mem_encrypt_identity.c - Export sev_es_ghcb_hv_call() so that HyperV Isolation VMs can use it too 2021-11-01 15:52:26 -07:00
mem_encrypt.c x86/sev: Move common memory encryption code to mem_encrypt.c 2021-12-08 16:49:53 +01:00
mm_internal.h x86/mm: thread pgprot_t through init_memory_mapping() 2020-04-10 15:36:21 -07:00
mmap.c x86/mm/mmap: Fix -Wmissing-prototypes warnings 2020-04-22 20:19:48 +02:00
mmio-mod.c x86/mmiotrace: Replace deprecated CPU-hotplug functions. 2021-08-10 14:46:27 +02:00
numa_32.c x86/mm: Drop deprecated DISCONTIGMEM support for 32-bit 2020-05-28 18:34:30 +02:00
numa_64.c
numa_emulation.c memblock: use memblock_free for freeing virtual pointers 2021-11-06 13:30:41 -07:00
numa_internal.h
numa.c arch/x86/mm/numa: Do not initialize nodes twice 2022-03-22 15:57:06 -07:00
pf_in.c
pf_in.h
pgtable_32.c mm: remove unneeded includes of <asm/pgalloc.h> 2020-08-07 11:33:26 -07:00
pgtable.c Revert "mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge" 2021-07-21 11:28:09 +01:00
physaddr.c mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions 2019-12-10 10:12:55 +01:00
physaddr.h
pkeys.c Fixes and improvements for FPU handling on x86: 2021-07-07 11:12:01 -07:00
pti.c x86/process/64: Move cpu_current_top_of_stack out of TSS 2021-03-28 22:40:10 +02:00
setup_nx.c mm: reorder includes after introduction of linux/pgtable.h 2020-06-09 09:39:13 -07:00
srat.c
testmmiotrace.c remove ioremap_nocache and devm_ioremap_nocache 2020-01-06 09:45:59 +01:00
tlb.c - Add a set of thread_info.flags accessors which snapshot it before 2022-01-10 11:34:10 -08:00