linux/arch/x86/kernel
Dave Hansen cebf15eb09 x86, sched: Add new topology for multi-NUMA-node CPUs
I'm getting the spew below when booting with Haswell (Xeon
E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
in the BIOS.  It seems similar to the issue that some folks from
AMD ran in to on their systems and addressed in this commit:

  161270fc1f ("x86/smp: Fix topology checks on AMD MCM CPUs")

Both these Intel and AMD systems break an assumption which is
being enforced by topology_sane(): a socket may not contain more
than one NUMA node.

AMD special-cased their system by looking for a cpuid flag.  The
Intel mode is dependent on BIOS options and I do not know of a
way which it is enumerated other than the tables being parsed
during the CPU bringup process.  In other words, we have to trust
the ACPI tables <shudder>.

This detects the situation where a NUMA node occurs at a place in
the middle of the "CPU" sched domains.  It replaces the default
topology with one that relies on the NUMA information from the
firmware (SRAT table) for all levels of sched domains above the
hyperthreads.

This also fixes a sysfs bug.  We used to freak out when we saw
the "mc" group cross a node boundary, so we stopped building the
MC group.  MC gets exported as the 'core_siblings_list' in
/sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with
the same 'physical_package_id' to not be listed together in
'core_siblings_list'.  This violates a statement from
Documentation/ABI/testing/sysfs-devices-system-cpu:

	core_siblings: internal kernel map of cpu#'s hardware threads
	within the same physical_package_id.

	core_siblings_list: human-readable list of the logical CPU
	numbers within the same physical_package_id as cpu#.

The sysfs effects here cause an issue with the hwloc tool where
it gets confused and thinks there are more sockets than are
physically present.

Before this patch, there are two packages:

# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1

But 4 _sets_ of core siblings:

# cat cpu*/topology/core_siblings_list | sort | uniq -c
      9 0-8
      9 18-26
      9 27-35
      9 9-17

After this set, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.

# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
     18 0-17
     18 18-35

Example spew:
...
	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
	 #2  #3  #4  #5  #6  #7  #8
	.... node  #1, CPUs:    #9
	------------[ cut here ]------------
	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
	Modules linked in:
	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
	Call Trace:
	[<ffffffff8172e485>] dump_stack+0x45/0x56
	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
	---[ end trace 3fe5f587a9fcde61 ]---
	#10 #11 #12 #13 #14 #15 #16 #17
	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
[ Added LLC domain and s/match_mc/match_die/ ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: brice.goglin@gmail.com
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-24 14:47:14 +02:00
..
acpi Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-08-13 18:23:32 -06:00
apic x86, irq, PCI: Keep IRQ assignment for runtime power management 2014-08-29 13:38:00 +02:00
cpu PCI changes for the v3.17 merge window (part 2): 2014-08-14 18:10:33 -06:00
kprobes kprobes/x86: Don't try to resolve kprobe faults from userspace 2014-07-16 14:16:32 +02:00
.gitignore
alternative.c kprobes, x86: Allow kprobes on text_poke/hw_breakpoint 2014-04-24 10:03:02 +02:00
amd_gart_64.c x86: enable DMA CMA with swiotlb 2014-06-04 16:53:57 -07:00
amd_nb.c A bunch of EDAC updates all over the place: 2014-04-01 13:54:00 -07:00
apb_timer.c intel_mid: Renamed *mrst* to *intel_mid* 2013-10-17 16:40:47 -07:00
aperture_64.c x86/gart: Tidy messages and add bridge device info 2014-05-23 10:47:19 -06:00
apm_32.c x86: Remove unused variable "polling" 2014-07-16 12:58:47 +02:00
asm-offsets_32.c x86: Get rid of ->hard_math and all the FPU asm fu 2013-06-06 14:32:04 -07:00
asm-offsets_64.c x86, gdt, hibernate: Store/load GDT for hibernate path. 2013-05-02 11:27:35 -07:00
asm-offsets.c sched, x86: Provide a per-cpu preempt_count implementation 2013-09-25 14:07:57 +02:00
audit_64.c
bootflag.c
check.c x86/mm: memblock: switch to use NUMA_NO_NODE 2014-01-21 16:19:47 -08:00
cpuid.c x86, cpuid: Fix CPU hotplug callback registration 2014-03-20 13:43:42 +01:00
crash_dump_32.c
crash_dump_64.c
crash.c kexec: create a new config option CONFIG_KEXEC_FILE for new syscall 2014-08-29 16:28:16 -07:00
devicetree.c x86, irq, devicetree: Release IOAPIC pin when PCI device is disabled 2014-06-21 23:05:44 +02:00
doublefault.c x86: Delete non-required instances of include <linux/init.h> 2014-01-06 21:25:18 -08:00
dumpstack_32.c Merge branch 'x86-threadinfo-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-04-01 10:17:18 -07:00
dumpstack_64.c x86: Fix dumpstack_64 irq stack handling 2014-04-02 11:46:50 -07:00
dumpstack.c kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation 2014-04-24 10:26:38 +02:00
e820.c x86/mm: memblock: switch to use NUMA_NO_NODE 2014-01-21 16:19:47 -08:00
early_printk.c Merge branch 'x86-intel-mid-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-11-12 11:12:22 +09:00
early-quirks.c Merge commit '9e9a928eed8796a0a1aaed7e0b676db86ba84594' into drm-next 2014-06-05 20:28:59 +10:00
entry_32.S x86_32, entry: Clean up sysenter_badsys declaration 2014-08-15 13:45:32 -07:00
entry_64.S Merge branches 'x86-build-for-linus', 'x86-cleanups-for-linus' and 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-08-04 16:56:16 -07:00
espfix_64.c x86/espfix/xen: Fix allocation of pages for paravirt page tables 2014-07-14 13:47:32 -07:00
ftrace.c ftrace/x86: Add call to ftrace_graph_is_dead() in function graph code 2014-07-17 09:45:08 -04:00
head32.c asmlinkage, x86: Add explicit __visible to arch/x86/* 2014-05-05 16:07:44 -07:00
head64.c kernel/printk: use symbolic defines for console loglevels 2014-06-04 16:54:17 -07:00
head_32.S x86: fix compile error due to X86_TRAP_NMI use in asm files 2014-03-07 18:58:40 -08:00
head_64.S x86: fix compile error due to X86_TRAP_NMI use in asm files 2014-03-07 18:58:40 -08:00
head.c x86: Make sure we can boot in the case the BDA contains pure garbage 2013-02-27 13:38:57 -08:00
hpet.c Merge branch 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next 2014-06-05 08:05:29 -07:00
hw_breakpoint.c kprobes, x86: Allow kprobes on text_poke/hw_breakpoint 2014-04-24 10:03:02 +02:00
i386_ksyms_32.c sched, x86: Optimize the preempt_schedule() call 2013-09-25 14:23:07 +02:00
i387.c x86/xsaves: Clear reserved bits in xsave header 2014-05-29 14:33:00 -07:00
i8237.c
i8253.c
i8259.c x86, irq, pic: Probe for legacy PIC and set legacy_pic appropriately 2014-04-14 11:49:55 -07:00
io_delay.c
ioport.c x86: get rid of pt_regs argument of iopl(2) 2013-02-03 18:16:24 -05:00
iosf_mbi.c PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use 2014-08-12 12:15:14 -06:00
irq_32.c x86, threadinfo: Redo "x86: Use inline assembler to get sp" 2014-03-10 17:32:01 -07:00
irq_64.c irq: Consolidate do_softirq() arch overriden implementations 2013-10-01 12:53:25 +02:00
irq_work.c x86, asmlinkage: Make all interrupt handlers asmlinkage / __visible 2013-08-06 14:18:23 -07:00
irq.c Merge branch 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-06-12 20:03:47 -07:00
irqinit.c x86: Fix non-PC platform kernel crash on boot due to NULL dereference 2014-08-25 22:36:57 +02:00
jump_label.c x86/jump_label: expect default_nop if static_key gets enabled on boot-up 2013-10-19 19:45:35 -04:00
kdebugfs.c
kexec-bzimage64.c kexec: verify the signature of signed PE bzImage 2014-08-08 15:57:33 -07:00
kgdb.c x86: Delete non-required instances of include <linux/init.h> 2014-01-06 21:25:18 -08:00
ksysfs.c x86: ksysfs.c build fix 2014-01-03 14:37:13 +00:00
kvm.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-06-12 19:18:49 -07:00
kvmclock.c kvm: remove redundant registration of BSP's hv_clock area 2014-02-22 15:53:32 +01:00
ldt.c Revert "x86-64, modify_ldt: Make support for 16-bit segments a runtime option" 2014-05-21 10:22:59 -07:00
machine_kexec_32.c x86: Delete non-required instances of include <linux/init.h> 2014-01-06 21:25:18 -08:00
machine_kexec_64.c kexec: create a new config option CONFIG_KEXEC_FILE for new syscall 2014-08-29 16:28:16 -07:00
Makefile kexec: create a new config option CONFIG_KEXEC_FILE for new syscall 2014-08-29 16:28:16 -07:00
mcount_64.S ftrace: x86: Remove check of obsolete variable function_trace_stop 2014-07-18 13:57:02 -04:00
mmconf-fam10h_64.c x86: delete __cpuinit usage from all x86 files 2013-07-14 19:36:56 -04:00
module.c x86, kaslr: fix module lock ordering problem 2014-03-24 10:18:26 -07:00
mpparse.c x86, apic: Remove mps_oem_check callback 2014-07-31 08:05:42 -07:00
msr.c x86, msr: Fix CPU hotplug callback registration 2014-03-20 13:43:42 +01:00
nmi_selftest.c
nmi.c kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation 2014-04-24 10:26:38 +02:00
paravirt_patch_32.c
paravirt_patch_64.c x86_64/entry/xen: Do not invoke espfix64 on Xen 2014-07-28 15:25:40 -07:00
paravirt-spinlocks.c x86, ticketlock: Add slowpath logic 2013-08-09 07:54:00 -07:00
paravirt.c kprobes, x86: Prohibit probing on native_set_debugreg()/load_idt() 2014-04-24 10:02:58 +02:00
pci-calgary_64.c x86, calgary: Use 8M TCE table size by default 2014-04-10 19:51:32 -07:00
pci-dma.c arch/x86/kernel/pci-dma.c: fix dma_generic_alloc_coherent() when CONFIG_DMA_CMA is enabled 2014-06-04 16:53:57 -07:00
pci-iommu_table.c
pci-nommu.c x86: Delete non-required instances of include <linux/init.h> 2014-01-06 21:25:18 -08:00
pci-swiotlb.c x86: enable DMA CMA with swiotlb 2014-06-04 16:53:57 -07:00
pcspeaker.c
perf_regs.c perf: Fix off by one test in perf_reg_value() 2012-09-19 17:08:40 +02:00
pmc_atom.c x86/pmc_atom: Silence shift wrapping warnings in pmc_sleep_tmr_show() 2014-08-02 16:52:17 -07:00
preempt.S sched, x86: Optimize the preempt_schedule() call 2013-09-25 14:23:07 +02:00
probe_roms.c x86/pci/probe_roms: Add missing __iomem annotation to pci_map_biosrom() 2012-09-05 10:52:25 +02:00
process_32.c x86: Keep thread_info on thread stack in x86_32 2014-03-06 16:56:55 -08:00
process_64.c Merge branch 'perf/urgent' into perf/core, to resolve conflict and to prepare for new patches 2014-06-06 07:55:06 +02:00
process.c Define kernel API to get address of each state in xsave area 2014-05-29 14:33:09 -07:00
ptrace.c x86: Keep thread_info on thread stack in x86_32 2014-03-06 16:56:55 -08:00
pvclock.c hung_task: add method to reset detector 2013-11-06 09:49:02 +02:00
quirks.c x86/amd/numa: Fix northbridge quirk to assign correct NUMA node 2014-03-14 11:05:36 +01:00
reboot_fixups_32.c
reboot.c x86/reboot: Add EFI reboot quirk for ACPI Hardware Reduced flag 2014-07-18 21:23:52 +01:00
relocate_kernel_32.S x86, asm, cleanup: Replace open-coded control register values with symbolic 2013-06-25 16:26:06 -07:00
relocate_kernel_64.S x86, reloc: Use xorl instead of xorq in relocate_kernel_64.S 2013-06-20 21:30:04 -07:00
resource.c x86: don't exclude low BIOS area when allocating address space for non-PCI cards 2014-07-16 12:29:36 -06:00
rtc.c Merge branch 'x86-intel-mid-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-11-12 11:12:22 +09:00
setup_percpu.c
setup.c arch/x86: Replace plain strings with constants 2014-07-18 21:23:59 +01:00
signal.c x86_32, signal: Fix vdso rt_sigreturn 2014-06-23 15:54:42 -07:00
smp.c asmlinkage, x86: Add explicit __visible to arch/x86/* 2014-05-05 16:07:44 -07:00
smpboot.c x86, sched: Add new topology for multi-NUMA-node CPUs 2014-09-24 14:47:14 +02:00
stacktrace.c
step.c ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL 2013-01-22 10:08:00 -08:00
sys_x86_64.c x86 get_unmapped_area: Access mmap_legacy_base through mm_struct member 2013-08-22 10:19:35 -07:00
syscall_32.c x86, asmlinkage: Make syscall tables visible 2013-08-06 14:20:18 -07:00
syscall_64.c x86, asmlinkage: Make syscall tables visible 2013-08-06 14:20:18 -07:00
sysfb_efi.c x86: sysfb: move EFI quirks from efifb to sysfb 2013-08-02 16:17:47 -07:00
sysfb_simplefb.c x86/simplefb: Mark framebuffer mem-resources as IORESOURCE_BUSY to avoid bootup warning 2013-10-03 07:51:11 +02:00
sysfb.c x86: sysfb: move EFI quirks from efifb to sysfb 2013-08-02 16:17:47 -07:00
tboot.c x86 / tboot / ACPI: Fail extended mode reduced hardware sleep 2013-07-31 14:25:51 +02:00
tce_64.c
test_nx.c
test_rodata.c
time.c x86: Fix non-PC platform kernel crash on boot due to NULL dereference 2014-08-25 22:36:57 +02:00
tls.c make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect 2013-03-03 22:58:33 -05:00
tls.h
topology.c hotplug, powerpc, x86: Remove cpu_hotplug_driver_lock() 2013-09-30 19:55:51 +02:00
trace_clock.c tracing,x86: Add a TSC trace_clock 2012-11-13 15:48:27 -05:00
tracepoint.c x86: Make sure IDT is page aligned 2013-07-16 15:14:48 -07:00
traps.c x86/kprobes: Fix build errors and blacklist context_track_user 2014-06-14 09:07:44 +02:00
tsc_msr.c x86: tsc: Add missing Baytrail frequency to the table 2014-02-19 17:12:24 +01:00
tsc_sync.c x86: Delete non-required instances of include <linux/init.h> 2014-01-06 21:25:18 -08:00
tsc.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-08-05 17:46:42 -07:00
uprobes.c uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates 2014-06-05 16:21:57 +02:00
verify_cpu.S
vm86_32.c x86, vm86: fix VM86 syscalls: use SYSCALL_DEFINEx(...) 2013-05-02 20:36:32 -04:00
vmlinux.lds.S x86, vdso: Zero-pad the VVAR page 2014-03-18 12:52:44 -07:00
vsmp_64.c x86/apic/vsmp: Make is_vsmp_box() static 2014-08-01 15:09:45 -07:00
vsyscall_64.c x86_64/vsyscall: Fix warn_bad_vsyscall log output 2014-07-25 16:34:15 -07:00
vsyscall_emu_64.S
vsyscall_gtod.c timekeeping: Create struct tk_read_base and use it in struct timekeeper 2014-07-23 15:01:53 -07:00
vsyscall_trace.h
x86_init.c PCI: Drop "irq" param from *_restore_msi_irqs() 2013-12-13 08:44:30 -07:00
x8664_ksyms_64.c sched, x86: Optimize the preempt_schedule() call 2013-09-25 14:23:07 +02:00
xsave.c x86/xsaves: Clean up code in xstate offsets computation in xsave area 2014-05-30 17:12:41 -07:00