linux/arch/powerpc
Nishanth Aravamudan 2fabf084b6 powerpc: reorder per-cpu NUMA information's initialization
There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

On NUMA powerpc, a non-boot CPU's cpu_to_node/cpu_to_mem values are only
accurate after start_secondary() has run, which (as on ia64) is invoked
via smp_init().

Commit 6ee0578b4d ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:

bce903809a ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).
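
The ordering problem described above can be illustrated with a small
userspace model. This is only a sketch, not kernel code: the four-CPU
topology, the arrays, and the sim_* helpers are all hypothetical
stand-ins for the device-tree data, the per-cpu node id, and
start_secondary().

```c
#define SIM_NR_CPUS 4

/* Hypothetical firmware topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1. */
static const int sim_dt_node[SIM_NR_CPUS] = { 0, 0, 1, 1 };

/* Stand-in for the per-cpu NUMA node id; zero-initialized, so every CPU
 * appears to be on node 0 until something writes the real value. */
static int sim_percpu_node[SIM_NR_CPUS];

/* Model of start_secondary(): a CPU records its own node as it comes up. */
static void sim_start_secondary(int cpu)
{
    sim_percpu_node[cpu] = sim_dt_node[cpu];
}

/* Model of an early initcall asking how many CPUs sit on a given node. */
static int sim_count_cpus_on_node(int node)
{
    int cpu, n = 0;

    for (cpu = 0; cpu < SIM_NR_CPUS; cpu++)
        if (sim_percpu_node[cpu] == node)
            n++;
    return n;
}
```

Calling sim_count_cpus_on_node(0) before any sim_start_secondary() call
reports all four CPUs on node 0, which mirrors what init_workqueues()
observes when it runs ahead of smp_init().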

Fix this by initializing the powerpc-specific CPU<->node/local memory
node mapping as early as possible, which on powerpc is
do_init_bootmem(). Currently that function initializes the mapping for
the boot CPU, but we extend it to set up the mapping for all possible
CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
per-cpu values for all possible CPUs. That ensures that before the
early_initcalls run (and really as early as possible), the per-cpu NUMA
mapping is accurate.
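
A minimal userspace sketch of the two-step fix follows. It is not the
patch itself. The demo_* names, the scaled-down four-CPU topology, and
the simplified "first node with memory" fallback are assumptions made
for illustration; the real kernel chooses the local memory node by
distance rather than by index.

```c
#define DEMO_CPUS  4
#define DEMO_NODES 2

/* Hypothetical scaled-down topology: node 0 has CPUs but no memory. */
static const int demo_dt_node[DEMO_CPUS] = { 0, 0, 1, 1 };
static const int demo_node_has_mem[DEMO_NODES] = { 0, 1 };

static int demo_lookup[DEMO_CPUS];   /* CPU -> node table, filled early */
static int demo_cpu_node[DEMO_CPUS]; /* per-cpu "node" value */
static int demo_cpu_mem[DEMO_CPUS];  /* per-cpu "local memory node" value */

/* Return the node itself if it has memory, otherwise the first node
 * that does (a simplification of the kernel's distance-based choice). */
static int demo_local_memory_node(int node)
{
    int n;

    if (demo_node_has_mem[node])
        return node;
    for (n = 0; n < DEMO_NODES; n++)
        if (demo_node_has_mem[n])
            return n;
    return node;
}

/* Step 1, modelled on the do_init_bootmem() change: record the node of
 * every possible CPU, not just the boot CPU. */
static void demo_early_numa_init(void)
{
    int cpu;

    for (cpu = 0; cpu < DEMO_CPUS; cpu++)
        demo_lookup[cpu] = demo_dt_node[cpu];
}

/* Step 2, modelled on the smp_prepare_cpus() change: publish the per-cpu
 * node and memory-node values before any early initcall can read them. */
static void demo_prepare_cpus(void)
{
    int cpu;

    for (cpu = 0; cpu < DEMO_CPUS; cpu++) {
        demo_cpu_node[cpu] = demo_lookup[cpu];
        demo_cpu_mem[cpu] = demo_local_memory_node(demo_lookup[cpu]);
    }
}
```

After demo_early_numa_init() and demo_prepare_cpus() have run, CPU 0
reports node 0 with memory node 1, and CPU 3 reports node 1 for both,
without any secondary CPU having been brought online.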

While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology of:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

the slab consumption decreases from

Slab:             932416 kB
SUnreclaim:       902336 kB

to

Slab:             395264 kB
SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       337 MB   11.28%  100.00%
task_struct                         288 MB    9.93%  100.00%

to

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                        37 MB  100.00%  100.00%
task_struct                          31 MB  100.00%  100.00%
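
To put the figures above in relative terms, the reduction can be
computed directly from the reported values. The helper below is a
hypothetical convenience, not part of the patch; its inputs are the
numbers quoted in this message.

```c
/* Percentage reduction going from 'before' to 'after' (same units). */
static double pct_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}
```

With the values reported here, pct_reduction(932416, 395264) is roughly
57.6 for Slab, pct_reduction(902336, 359424) is roughly 60.2 for
SUnreclaim, and pct_reduction(337, 37) is roughly 89.0 for the memory
used by kmalloc-16384.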

Powerpc didn't support memoryless nodes until recently (64bb80d87f
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c27226119
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption in these kinds of environments.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-08-13 15:14:05 +10:00
boot            powerpc/boot: Use correct zlib types for comparison  2014-08-13 15:13:45 +10:00
configs         Here are the PPC and ARM changes for KVM, which I separated because  2014-08-07 11:35:30 -07:00
crypto          powerpc: Fix compile of sha1-powerpc-asm.S on 32-bit  2013-03-05 16:56:26 +11:00
include         powerpc: remove duplicate definition of TEXASR_FS  2014-08-13 15:13:47 +10:00
kernel          powerpc: reorder per-cpu NUMA information's initialization  2014-08-13 15:14:05 +10:00
kvm             Here are the PPC and ARM changes for KVM, which I separated because  2014-08-07 11:35:30 -07:00
lib             powerpc: Add smp_mb()s to arch_spin_unlock_wait()  2014-08-13 15:13:27 +10:00
math-emu        powerpc: Correct emulated mtfsf instruction  2014-04-07 10:33:11 +10:00
mm              powerpc: reorder per-cpu NUMA information's initialization  2014-08-13 15:14:05 +10:00
net             net: filter: split 'struct sk_filter' into socket and bpf parts  2014-08-02 15:03:58 -07:00
oprofile        powerpc: Remove oprofile RS64 support  2014-07-28 14:10:25 +10:00
perf            powerpc/perf/hv-24x7: Use kmem_cache_free  2014-08-13 15:14:04 +10:00
platforms       powerpc/pseries/hvcserver: Fix endian issue in hvcs_get_partner_info  2014-08-13 15:14:04 +10:00
sysdev          Merge remote-tracking branch 'scott/next' into next  2014-08-05 14:13:41 +10:00
xmon            powerpc: Hard disable interrupts in xmon  2014-08-13 15:13:48 +10:00
Kconfig         kexec: load and relocate purgatory at kernel load time  2014-08-08 15:57:32 -07:00
Kconfig.debug   Patch queue for ppc - 2014-08-01  2014-08-05 09:58:11 +02:00
Makefile        Merge branch 'merge' into next  2014-05-28 13:30:12 +10:00
relocs_check.pl Fix warning typo "CONFIG_RELCOATABLE"  2013-05-29 15:11:30 +02:00