linux/Documentation
Nitin Gupta facdaa917c mm: proactive compaction
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented.  The Linux kernel currently does on-demand compaction
as we request more hugepages, but this style of compaction incurs very high
latency.  Experiments with one-time full memory compaction (followed by
hugepage allocations) show that the kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state within <1 sec
for a 32G system.  Such data suggests that more proactive compaction can
help us allocate a large fraction of memory as hugepages while keeping
allocation latencies low.

For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.

The tunable takes a value in range [0, 100], with a default of 20.
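
As a rough illustration of the tunable's range, the following sketch reads
and validates the value from userspace.  It assumes the usual procfs layout
in which a sysctl named vm.<name> appears as /proc/sys/vm/<name>; the helper
names parse_proactiveness() and read_proactiveness() are hypothetical and
are not part of this patch.

```python
from pathlib import Path
from typing import Optional

# Assumption: vm.compaction_proactiveness is exposed at this procfs path.
SYSCTL_PATH = Path("/proc/sys/vm/compaction_proactiveness")


def parse_proactiveness(text: str) -> int:
    """Parse a sysctl value and check it against the documented [0, 100] range."""
    value = int(text.strip())
    if not 0 <= value <= 100:
        raise ValueError(f"proactiveness out of range [0, 100]: {value}")
    return value


def read_proactiveness(path: Path = SYSCTL_PATH) -> Optional[int]:
    """Return the current value, or None if the kernel lacks the tunable."""
    try:
        return parse_proactiveness(path.read_text())
    except FileNotFoundError:
        return None
```

On a kernel without this patch the file does not exist, so the reader
returns None instead of raising.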

Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl.  Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate.  The internal interpretation of this opaque
value allows for future fine-tuning.

Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as the weighted mean of per-zone external
fragmentation, where a zone's present_pages determines its weight.
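
The translation and the weighted mean described above can be sketched as
follows.  This is an illustrative model, not the kernel code: the function
names are hypothetical, the cap of the high threshold at 100 is an
assumption, and the per-zone fragmentation values are taken as given
percentages rather than computed from free-list state.

```python
from typing import Iterable, Tuple


def score_thresholds(proactiveness: int) -> Tuple[int, int]:
    """Map the tunable to (low, high) fragmentation score thresholds."""
    low = 100 - proactiveness
    # Assumption: the high threshold is capped so it never exceeds 100.
    high = min(low + 10, 100)
    return low, high


def node_score(zones: Iterable[Tuple[int, float]]) -> float:
    """Weighted mean of per-zone external fragmentation.

    Each zone is (present_pages, external_fragmentation_percent);
    present_pages is the weight.
    """
    zones = list(zones)
    total_pages = sum(pages for pages, _ in zones)
    if total_pages == 0:
        return 0.0
    return sum(pages * frag for pages, frag in zones) / total_pages
```

For example, score_thresholds(20) yields (80, 90), matching the default
described in this changelog, and a node with a 1000-page zone at 90% and a
3000-page zone at 70% scores 75.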

To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to perform this check.  If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value.  By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
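
The start/stop rule above is a hysteresis: compaction starts only above the
high threshold and, once started, continues until the score falls to the
low threshold.  A minimal sketch of that rule, sampled once per periodic
check, might look like the following.  The function names are hypothetical,
and the real kcompactd runs compaction to completion rather than
re-evaluating per sample.

```python
from typing import List, Sequence


def proactive_step(score: float, low: int, high: int, compacting: bool) -> bool:
    """Return whether proactive compaction is active after this check."""
    if compacting:
        # Keep going until the score has come down to the low threshold.
        return score > low
    # Start only when the score exceeds the high threshold.
    return score > high


def simulate(scores: Sequence[float], low: int = 80, high: int = 90) -> List[bool]:
    """Apply the rule to a series of node scores, one per 500 ms check."""
    compacting = False
    trace = []
    for score in scores:
        compacting = proactive_step(score, low, high, compacting)
        trace.append(compacting)
    return trace
```

With the default thresholds, a score of 85 triggers nothing when idle but
keeps an already running compaction going, which is what prevents rapid
toggling around a single threshold.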

This patch is largely based on ideas from Michal Hocko [2].  See also the
LWN article [3].

Performance data
================

System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Before starting the driver, the system was fragmented by a userspace
program that allocates all memory and then, for each 2M-aligned section,
frees 3/4 of the base pages using munmap.  The workload is mainly anonymous
userspace pages, which are easy to move around.  I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
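
The fragmentation step can be pictured as computing, for each 2M-aligned
section of a mapping, the range to hand to munmap.  The sketch below only
computes those ranges; it does not map or unmap anything.  The test program
itself is not shown in this changelog, so the function name, the 4K base
page size, and the choice to free the leading 3/4 of each section as one
contiguous range are all assumptions.

```python
from typing import List, Tuple

MIB = 1024 * 1024


def ranges_to_unmap(base: int, length: int,
                    section: int = 2 * MIB, page: int = 4096,
                    freed_num: int = 3, freed_den: int = 4) -> List[Tuple[int, int]]:
    """Return (address, byte_length) ranges covering 3/4 of each full section."""
    # Round the start up to the first section-aligned address.
    first = -(-base // section) * section
    end = base + length
    pages_per_section = section // page
    freed_bytes = pages_per_section * freed_num // freed_den * page
    ranges = []
    addr = first
    # Only sections that lie entirely inside the mapping are touched.
    while addr + section <= end:
        ranges.append((addr, freed_bytes))
        addr += section
    return ranges
```

Each 2M section ends up with 1.5M unmapped and 0.5M still in use, so no
section can back a hugepage until its remaining pages are migrated.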

1. Kernel hugepage allocation latencies

With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:

(all latency values are in microseconds)

- With vanilla 5.6.0-rc3

  percentile latency
  ---------- -------
	   5    7894
	  10    9496
	  25   12561
	  30   15295
	  40   18244
	  50   21229
	  60   27556
	  75   30147
	  80   31047
	  90   32859
	  95   33799

Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)

- With 5.6.0-rc3 + this patch, with proactiveness=20

sysctl -w vm.compaction_proactiveness=20

  percentile latency
  ---------- -------
	   5       2
	  10       2
	  25       3
	  30       3
	  40       3
	  50       4
	  60       4
	  75       4
	  80       4
	  90       5
	  95     429

Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
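
The totals quoted for both runs follow from simple arithmetic on the
hugepage counts, and the two tables can be compared at the median.  The
short calculation below uses only numbers from this changelog; the variable
names are illustrative.

```python
HUGEPAGE_MIB = 2      # size of one hugepage, in MiB
FREE_GIB = 762        # total free memory reported for both runs


def hugepage_gib(count: int) -> float:
    """Convert a count of 2M hugepages to GiB."""
    return count * HUGEPAGE_MIB / 1024


vanilla_gib = hugepage_gib(383859)   # vanilla 5.6.0-rc3
patched_gib = hugepage_gib(384105)   # with this patch, proactiveness=20

vanilla_pct = 100 * vanilla_gib / FREE_GIB
patched_pct = 100 * patched_gib / FREE_GIB

# Median (50th percentile) latency in microseconds, from the two tables.
median_ratio = 21229 / 4
```

Both runs convert about 98% of free memory into hugepages; the difference
is in latency, where the median drops by a factor of roughly 5300.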

2. Java heap allocation

In this test, we first fragment memory using the same method as for (1).

Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages.  We also set THP to madvise to
allow hugepage backing of this heap.

/usr/bin/time
 java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch

The above command allocates 700G of Java heap using hugepages.

- With vanilla 5.6.0-rc3

17.39user 1666.48system 27:37.89elapsed

- With 5.6.0-rc3 + this patch, with proactiveness=20

8.35user 194.58system 3:19.62elapsed

Elapsed time remains around 3:15 as proactiveness is increased further.
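
To put the two /usr/bin/time results side by side, the elapsed fields can
be converted to seconds and compared.  The parser below is a small sketch
for the m:ss.ss form shown above; the function name is hypothetical.

```python
def elapsed_seconds(text: str) -> float:
    """Convert an elapsed field such as '27:37.89' (or 'h:mm:ss') to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


vanilla_elapsed = elapsed_seconds("27:37.89")   # vanilla 5.6.0-rc3
patched_elapsed = elapsed_seconds("3:19.62")    # with this patch

elapsed_speedup = vanilla_elapsed / patched_elapsed
system_speedup = 1666.48 / 194.58               # system CPU time, in seconds
```

The elapsed time improves by roughly 8.3x and the system time by roughly
8.6x.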

Note that proactive compaction happens throughout the runtime of these
workloads.  A single one-time compaction that is sufficient to supply
hugepages for the allocation stream that follows can probably happen for
more extreme proactiveness values, like 80 or 90.

In the above Java workload, proactiveness is set to 20.  The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction.  As the benchmark consumes
hugepages, the node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80).  Repeat.

bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark.  A kcompactd thread consumes 100% of one of
the CPUs while it tries to bring a node's score within thresholds.

Backoff behavior
================

The above workloads produce a memory state which is easy to compact.  However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off.  To test this aspect:

- Created a kernel driver that allocates almost all memory as hugepages
  followed by freeing the first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of times
  with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
  (=> ~30 seconds between retries).
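
The "~30 seconds" figure can be reproduced from the check interval and the
maximum deferral count.  In the sketch below, the 500 ms interval comes
from this changelog, while the value 6 for COMPACT_MAX_DEFER_SHIFT is
recalled from the kernel sources and should be treated as an assumption.
The next_defer() helper models one plausible policy (defer fully when a
compaction round did not lower the score) and is not taken from the patch.

```python
HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500   # kcompactd wakeup period, per the text
COMPACT_MAX_DEFER_SHIFT = 6            # assumption: value in the kernel sources


def max_backoff_seconds(shift: int = COMPACT_MAX_DEFER_SHIFT,
                        interval_msec: int = HPAGE_FRAG_CHECK_INTERVAL_MSEC) -> float:
    """Time between retries when deferred the maximum number of checks."""
    return (1 << shift) * interval_msec / 1000


def next_defer(prev_score: float, score: float,
               shift: int = COMPACT_MAX_DEFER_SHIFT) -> int:
    """Hypothetical policy: skip 2**shift checks if the score did not improve."""
    return 0 if score < prev_score else 1 << shift
```

With these values, 64 skipped checks at 500 ms each give 32 seconds between
retries, consistent with the "~30 seconds" noted above.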

[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
[3] https://lwn.net/Articles/817905/

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nitin Gupta <ngupta@nitingupta.dev>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:57:56 -07:00
ABI f2fs-for-5.9-rc1 2020-08-10 18:33:22 -07:00
accounting
admin-guide mm: proactive compaction 2020-08-12 10:57:56 -07:00
arm ARM development for 5.9-rc1: 2020-08-06 10:17:00 -07:00
arm64 It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
block for-5.9/drivers-20200803 2020-08-05 10:51:40 -07:00
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-08-05 20:13:21 -07:00
cdrom
core-api powerpc updates for 5.9 2020-08-07 10:33:50 -07:00
cpu-freq
crypto It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
dev-tools kasan: update documentation for generic kasan 2020-08-07 11:33:28 -07:00
devicetree Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2020-08-10 16:35:57 -07:00
doc-guide
driver-api Minor cleanups to the IPMI driver for 5.9 2020-08-08 09:32:18 -07:00
fault-injection
fb drm next for 5.9-rc1 2020-08-05 19:50:06 -07:00
features powerpc updates for 5.9 2020-08-07 10:33:50 -07:00
filesystems f2fs-for-5.9-rc1 2020-08-10 18:33:22 -07:00
firmware_class
firmware-guide ACPI: Replace HTTP links with HTTPS ones 2020-07-27 14:47:08 +02:00
fpga Char/Misc driver patches for 5.9-rc1 2020-08-05 11:43:47 -07:00
gpu drm next for 5.9-rc1 2020-08-05 19:50:06 -07:00
hid
hwmon hwmon updates for v5.9 2020-08-05 13:13:57 -07:00
i2c It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
ia64 docs: ia64: correct typo 2020-07-31 11:09:09 -06:00
ide
iio
infiniband
input Input: uinput - fix typo in function name documentation 2020-07-28 18:24:11 -07:00
isdn
kbuild Kbuild updates for v5.9 2020-08-09 14:10:26 -07:00
kernel-hacking
leds LEDs changes for 5.9-rc1. 2020-08-05 19:24:27 -07:00
litmus-tests
livepatch
locking A set of locking fixes and updates: 2020-08-10 19:07:44 -07:00
m68k
maintainer
mhi
mips It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
misc-devices
netlabel
networking wireless-drivers-next patches for v5.9 2020-08-04 12:57:02 -07:00
nios2
nvdimm
openrisc
parisc
PCI pci-v5.9-changes 2020-08-07 18:48:15 -07:00
pcmcia
power Merge branches 'pm-sleep', 'pm-domains', 'powercap' and 'pm-tools' 2020-08-03 13:12:44 +02:00
powerpc powerpc updates for 5.9 2020-08-07 10:33:50 -07:00
process It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
RCU These are the latest RCU bits for v5.9: 2020-08-03 14:31:33 -07:00
riscv
s390 It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
scheduler sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst 2020-08-01 09:19:43 +02:00
scsi
security
sh
sound ASoC: Updates for v5.9 2020-08-03 14:41:43 +02:00
sparc
sphinx
sphinx-static
spi
staging docs: staging/tee.rst: convert into definition list 2020-07-23 14:25:12 -06:00
target
timers docs: timers: drop documentation about LB_BIAS 2020-07-23 14:32:44 -06:00
trace It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
translations It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
usb USB: Replace HTTP links with HTTPS ones 2020-07-21 13:41:57 +02:00
userspace-api media updates for v5.9-rc1 2020-08-07 13:00:53 -07:00
virt powerpc updates for 5.9 2020-08-07 10:33:50 -07:00
vm mm/sparse: cleanup the code surrounding memory_present() 2020-08-07 11:33:27 -07:00
w1
watchdog
x86 It's been a busy cycle for documentation - hopefully the busiest for a 2020-08-04 22:47:54 -07:00
xtensa
.gitignore
asm-annotations.rst
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py
COPYING-logo
docutils.conf
dontdiff Documentation: dontdiff: Add zstd compressed files 2020-07-31 11:51:10 +02:00
index.rst docs: index.rst: Add watch_queue 2020-07-23 14:13:23 -06:00
Kconfig
logo.gif
Makefile
memory-barriers.txt powerpc updates for 5.9 2020-08-07 10:33:50 -07:00
SubmittingPatches
watch_queue.rst