linux/Documentation
Chris Down 9783aa9917 mm, memcg: proportional memory.{low,min} reclaim
cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection).  While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: that cliff behaviour often
manifests when they become eligible for reclaim.  This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.

This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information.  Imagine the following
timeline, with the numbers the lruvec size in this zone:

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
   scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even
  without this patch because of scan control priority, over-reclaim
  protection, etc.  However, as shown by the tests at the end, these
  techniques don't sufficiently throttle such an extreme change in input,
  so cliff-like behaviour isn't really averted by their existence alone.

Here's an example of how this plays out in practice.  At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study).  In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating.  This isn't an exact science -- memory usage deemed
"comfortable" will vary over time due to user behaviour, differences in
composition of work, etc, etc.  As such we need to ballpark memory.low,
but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have
   *any* effect (see discussion above).  The group will receive the full
   weight of reclaim and won't have any priority while competing with the
   less important system software, as if we had no memory.low configured
   at all.

2. Because of this behaviour, we end up erring on the side of setting
   it too high, such that the comfort range is reliably covered.  However,
   protected memory is completely unavailable to the rest of the system,
   so we might cause undue memory and IO pressure there when we *know* we
   have some elasticity in the workload.

3. Even if we get the value totally right, smack in the middle of the
   comfort zone, we get extreme jumps between no pressure and full
   pressure that cause unpredictable pressure spikes in the workload due
   to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too much
worry.  Any undesirable behaviour, such as too much or too little reclaim
pressure on the workload or system will be proportional to how far our
estimation is off.  This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.

As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance.  Having to set these thresholds so high wastes
resources and generally works against the principle of work conservation.
In addition, having proportional memory reclaim behaviour has other
benefits.  Most notably, before this patch it's basically mandatory to set
memory.low to a higher than desirable value because otherwise as soon as
you exceed memory.low, all protection is lost, and all pages are eligible
to scan again.  By contrast, having a gradual ramp in reclaim pressure
means that you now still get some protection when thresholds are exceeded,
which means that one can now be more comfortable setting memory.low to
lower values without worrying that all protection will be lost.  This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary.

   To test this, I experimented stepping further and further down
   memory.low protection on a workload that floats around 19G workingset
   when under memory.low protection, watching page scan rates for the
   workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 | N/A          |
   |        17G |             867 |               3799 | 23%          |
   |        12G |            1203 |               3543 | 34%          |
   |         8G |            2534 |               3979 | 64%          |
   |         4G |            3980 |               4147 | 96%          |
   |          0 |            3799 |               3980 | 95%          |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this
   patch) ramps up page scanning significantly more gradually than the
   control kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in
   premature OOMs.

   To test this, I wrote a script that slowly increments the number of
   pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention.  This script runs in a highly
   protected slice taking up the majority of available system memory.
   Watching vmstat revealed that page scanning continued essentially
   nominally between test and control, without causing forward reclaim
   progress to become arrested.

[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
  Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07 15:47:20 -07:00
..
ABI Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity 2019-09-27 19:37:27 -07:00
accounting docs: add some documentation dirs to the driver-api book 2019-07-15 11:03:02 -03:00
admin-guide mm, memcg: proportional memory.{low,min} reclaim 2019-10-07 15:47:20 -07:00
arm Documentation/arm/samsung-s3c24xx: Remove stray U+FEFF character to fix title 2019-08-12 15:25:32 -06:00
arm64 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-09-17 11:42:15 -07:00
block docs: block: null_blk: enhance document style 2019-09-11 16:04:22 -06:00
bpf bpf/flow_dissector: document flags 2019-07-25 18:00:41 -07:00
cdrom docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
core-api kernel-doc: core-api: include string.h into core-api 2019-09-25 17:51:39 -07:00
cpu-freq Documentation: cpufreq: Update policy notifier documentation 2019-09-02 22:44:05 +02:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2019-09-18 12:11:14 -07:00
dev-tools Merge branch 'pdf_fixes_v1' of https://git.linuxtv.org/mchehab/experimental into mauro 2019-07-22 13:51:20 -06:00
devicetree dt-bindings: phy: lantiq: Fix Property Name 2019-10-02 14:14:58 -05:00
doc-guide docs: remove extra conf.py files 2019-07-17 06:57:52 -03:00
driver-api This is the bulk of pin control changes for the v5.4 kernel 2019-09-19 14:19:33 -07:00
EDID docs: driver-api: add a series of orphaned documents 2019-07-15 11:03:02 -03:00
fault-injection docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
fb docs conversion for v5.3-rc1 2019-07-16 12:21:41 -07:00
features It's a somewhat calmer cycle for docs this time, as the churn of the mass 2019-09-17 16:22:26 -07:00
filesystems add virtio-fs 2019-09-27 15:54:24 -07:00
firmware_class
firmware-guide Documentation: ACPI: DSD: Convert LED documentation to ReST 2019-08-20 23:53:46 +02:00
fpga Documentation: fpga: dfl: add descriptions for virtualization and new interfaces. 2019-09-03 19:35:42 -07:00
gpu Merge drm/drm-next into drm-intel-next-queued 2019-08-22 00:10:36 -07:00
hid docs: add some documentation dirs to the driver-api book 2019-07-15 11:03:02 -03:00
hwmon It's a somewhat calmer cycle for docs this time, as the churn of the mass 2019-09-17 16:22:26 -07:00
i2c docs: i2c: convert to ReST and add to driver-api bookset 2019-07-31 13:25:27 -06:00
ia64 docs: add SPDX tags to new index files 2019-07-15 11:03:03 -03:00
ide docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
iio docs: add some documentation dirs to the driver-api book 2019-07-15 11:03:02 -03:00
infiniband Documentation/infiniband: update name of some functions 2019-09-13 16:55:55 -03:00
input Input: docs: fix spelling mistake "potocol" -> "protocol" 2019-08-06 11:24:49 -06:00
ioctl fs-verity: add UAPI header 2019-07-28 16:59:16 -07:00
isdn docs: isdn: convert to ReST and add to kAPI bookset 2019-07-31 13:30:25 -06:00
kbuild Modules updates for v5.4 2019-09-22 10:34:46 -07:00
kernel-hacking docs: Add documentation for Symbol Namespaces 2019-09-10 10:30:49 +02:00
leds leds: core: Add support for composing LED class device names 2019-07-25 20:07:52 +02:00
livepatch docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
locking doc🔒 remove reference to clever use of read-write lock 2019-09-14 01:53:27 -06:00
m68k docs: README.buddha: convert to ReST and add to m68k book 2019-07-31 13:30:10 -06:00
maintainer docs: Fix typo on pull requests guide 2019-08-12 15:14:14 -06:00
media drm main pull for 5.4-rc1 2019-09-19 16:24:24 -07:00
mic docs: driver-api: add remaining converted dirs to it 2019-07-15 11:03:03 -03:00
mips Main MIPS changes for v5.4: 2019-09-22 09:30:30 -07:00
misc-devices Docs: misc: xilinx_sdfec: Add documentation 2019-08-15 17:54:38 +02:00
netlabel docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
networking Documentation: Clarify trap's description 2019-09-27 20:33:19 +02:00
nios2 docs: nios2: add it to the main Documentation body 2019-07-31 13:31:51 -06:00
openrisc docs: openrisc: convert to ReST and add to documentation body 2019-07-31 13:30:20 -06:00
parisc docs: parisc: convert to ReST and add to documentation body 2019-07-31 13:30:15 -06:00
PCI Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2019-08-27 14:23:31 -07:00
pcmcia docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
power Merge branches 'pm-opp', 'pm-qos', 'acpi-pm', 'pm-domains' and 'pm-tools' 2019-09-17 09:49:19 +02:00
powerpc docs: powerpc: Add missing documentation reference 2019-09-17 23:59:34 +10:00
process Documentation/process update for 5.4-rc1 2019-09-29 19:52:52 -07:00
RCU Merge branches 'consolidate.2019.08.01b', 'fixes.2019.08.12a', 'lists.2019.08.13a' and 'torture.2019.08.01b' into HEAD 2019-08-13 14:30:30 -07:00
riscv It's a somewhat calmer cycle for docs this time, as the churn of the mass 2019-09-17 16:22:26 -07:00
s390 Documentation/s390: remove outdated debugging390 documentation 2019-08-21 12:41:43 +02:00
scheduler sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices 2019-08-08 09:09:30 +02:00
scsi scsi: ufs: Documentation: Announce ufs-tool v1.0 2019-06-26 22:47:51 -04:00
security Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity 2019-09-27 19:37:27 -07:00
sh docs: remove extra conf.py files 2019-07-17 06:57:52 -03:00
sound sound updates for 5.4 2019-09-17 17:43:33 -07:00
sparc docs: add arch doc directories to the index 2019-07-15 11:03:01 -03:00
sphinx Documentation: sphinx: Don't parse socket() as identifier reference 2019-08-12 14:55:30 -06:00
sphinx-static
spi spi: docs: convert to ReST and add it to the kABI bookset 2019-07-31 14:13:13 -06:00
target docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
timers docs: add some directories to the main documentation index 2019-07-15 11:03:03 -03:00
trace Tracing updates: 2019-09-20 11:19:48 -07:00
translations doc: arm64: fix grammar dtb placed in no attributes region 2019-09-06 08:44:34 -06:00
usb USB: Move wusbcore and UWB to staging as it is obsolete 2019-08-08 07:52:01 +02:00
userspace-api docs: remove extra conf.py files 2019-07-17 06:57:52 -03:00
virt KVM/Hyper-V: Add new KVM capability KVM_CAP_HYPERV_DIRECT_TLBFLUSH 2019-09-24 13:37:13 +02:00
virtual cpuidle: add haltpoll governor 2019-07-30 17:27:37 +02:00
vm mm: treewide: clarify pgtable_page_{ctor,dtor}() naming 2019-09-26 10:10:44 -07:00
w1 docs: w1: convert to ReST and add to the kAPI group of docs 2019-07-31 14:16:17 -06:00
watchdog linux-watchdog 5.4-rc1 tag 2019-09-27 11:17:38 -07:00
x86 dma-mapping: fix filename references 2019-09-03 08:36:30 +02:00
xtensa docs: add arch doc directories to the index 2019-07-15 11:03:01 -03:00
.gitignore
atomic_bitops.txt
atomic_t.txt Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-07-08 16:12:03 -07:00
bus-virt-phys-mapping.txt
Changes
CodingStyle
conf.py docs: conf.py: only use CJK if the font is available 2019-07-17 06:57:51 -03:00
COPYING-logo docs: logo.txt: rename it to COPYING-logo 2019-07-15 09:20:27 -03:00
crc32.txt
debugging-modules.txt
debugging-via-ohci1394.txt
digsig.txt
DMA-API-HOWTO.txt docs: DMA-API-HOWTO.txt: fix an unmarked code block 2019-07-15 09:20:24 -03:00
DMA-API.txt dma-mapping: remove dma_release_declared_memory 2019-09-04 11:13:19 +02:00
DMA-attributes.txt
DMA-ISA-LPC.txt
docutils.conf doc-rst: Add missing newline at end of file 2019-06-20 14:16:56 -06:00
dontdiff kbuild: create *.mod with full directory path and remove MODVERDIR 2019-07-18 02:19:31 +09:00
futex-requeue-pi.txt
hwspinlock.txt hwspinlock: add the 'in_atomic' API 2019-06-29 21:08:14 -07:00
index.rst Main MIPS changes for v5.4: 2019-09-22 09:30:30 -07:00
io_ordering.txt
io-mapping.txt
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
Kconfig docs: Kbuild/Makefile: allow check for missing docs at build time 2019-06-07 11:33:16 -06:00
kobject.txt
kprobes.txt Merge branch 'parisc-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2019-05-07 19:34:17 -07:00
kref.txt
logo.gif
lzo.txt
mailbox.txt
Makefile docs: Kbuild/Makefile: allow check for missing docs at build time 2019-06-07 11:33:16 -06:00
memory-barriers.txt docs: fix broken doc references due to renames 2019-07-17 06:57:51 -03:00
nommu-mmap.txt
padata.txt padata: allocate workqueue internally 2019-09-13 21:15:39 +10:00
percpu-rw-semaphore.txt
pi-futex.txt docs: locking: convert docs to ReST and rename to *.rst 2019-07-15 08:53:27 -03:00
preempt-locking.txt
rbtree.txt docs: rbtree.txt: fix Sphinx build warnings 2019-07-15 09:20:24 -03:00
remoteproc.txt remoteproc: add vendor resources handling 2019-06-29 12:02:17 -07:00
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
speculation.txt
static-keys.txt
SubmittingPatches
tee.txt Documentation: tee: Grammar s/the its/its/ 2019-06-07 11:23:38 -06:00
this_cpu_ops.txt
unaligned-memory-access.txt
xz.txt