linux

Chris Down 9783aa9917 mm, memcg: proportional memory.{low,min} reclaim

cgroup v2 introduces two memory protection thresholds: memory.low (best-effort) and memory.min (hard protection). While they generally do what they say on the tin, there is a limitation in their implementation that makes them difficult to use effectively: that cliff behaviour often manifests when they become eligible for reclaim. This patch implements more intuitive and usable behaviour, where we gradually mount more reclaim pressure as cgroups further and further exceed their protection thresholds.

This cliff edge behaviour happens because we only choose whether or not to reclaim based on whether the memcg is within its protection limits (see the use of mem_cgroup_protected in shrink_node), but we don't vary our reclaim behaviour based on this information.

Imagine the following timeline, with the numbers the lruvec size in this zone:

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even without this patch because of scan control priority, over-reclaim protection, etc. However, as shown by the tests at the end, these techniques don't sufficiently throttle such an extreme change in input, so cliff-like behaviour isn't really averted by their existence alone.

Here's an example of how this plays out in practice. At Facebook, we are trying to protect various workloads from "system" software, like configuration management tools, metric collectors, etc (see this[0] case study). In order to find a suitable memory.low value, we start by determining the expected memory range within which the workload will be comfortable operating. This isn't an exact science -- memory usage deemed "comfortable" will vary over time due to user behaviour, differences in composition of work, etc, etc. As such we need to ballpark memory.low, but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have any effect (see discussion above). The group will receive the full weight of reclaim and won't have any priority while competing with the less important system software, as if we had no memory.low configured at all.

2. Because of this behaviour, we end up erring on the side of setting it too high, such that the comfort range is reliably covered. However, protected memory is completely unavailable to the rest of the system, so we might cause undue memory and IO pressure there when we know we have some elasticity in the workload.

3. Even if we get the value totally right, smack in the middle of the comfort zone, we get extreme jumps between no pressure and full pressure that cause unpredictable pressure spikes in the workload due to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too much worry. Any undesirable behaviour, such as too much or too little reclaim pressure on the workload or system will be proportional to how far our estimation is off. This means we can set memory.low much more conservatively and thus waste less resources without the risk of the workload falling off a cliff if we overshoot.

As a more abstract technical description, this unintuitive behaviour results in having to give high-priority workloads a large protection buffer on top of their expected usage to function reliably, as otherwise we have abrupt periods of dramatically increased memory pressure which hamper performance. Having to set these thresholds so high wastes resources and generally works against the principle of work conservation.

In addition, having proportional memory reclaim behaviour has other benefits. Most notably, before this patch it's basically mandatory to set memory.low to a higher than desirable value because otherwise as soon as you exceed memory.low, all protection is lost, and all pages are eligible to scan again. By contrast, having a gradual ramp in reclaim pressure means that you now still get some protection when thresholds are exceeded, which means that one can now be more comfortable setting memory.low to lower values without worrying that all protection will be lost. This is important because workingset size is really hard to know exactly, especially with variable workloads, so at least getting some protection if your workingset size grows larger than you expect increases user confidence in setting memory.low without a huge buffer on top being needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of binary.

   To test this, I experimented stepping further and further down memory.low protection on a workload that floats around 19G workingset when under memory.low protection, watching page scan rates for the workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 |          N/A |
   |        17G |             867 |               3799 |          23% |
   |        12G |            1203 |               3543 |          34% |
   |         8G |            2534 |               3979 |          64% |
   |         4G |            3980 |               4147 |          96% |
   |          0 |            3799 |               3980 |          95% |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this patch) ramps up page scanning significantly more gradually than the control kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in premature OOMs.

   To test this, I wrote a script that slowly increments the number of pages held by stress(1)'s --vm-keep mode until a production system entered severe overall memory contention. This script runs in a highly protected slice taking up the majority of available system memory. Watching vmstat revealed that page scanning continued essentially nominally between test and control, without causing forward reclaim progress to become arrested.

[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2019-10-07 15:47:20 -07:00
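
The three-step timeline in the commit message above can be replayed with a small model. This is only an illustrative sketch, not the kernel's code: the function names are hypothetical, and the linear ramp (withholding the protected fraction of the lruvec) is an assumption chosen to show the idea of "pressure proportional to overage". It ignores scan priority, over-reclaim protection, and any minimum scan batch.

```python
def binary_scan_target(lruvec_size: int, usage: int, protection: int) -> int:
    """All-or-nothing model: nothing is scannable until usage exceeds protection."""
    return 0 if usage <= protection else lruvec_size


def proportional_scan_target(lruvec_size: int, usage: int, protection: int) -> int:
    """Proportional model: scannable pages grow with the overage beyond protection."""
    if usage <= protection:
        return 0
    # Assumed linear ramp (hypothetical): withhold the protected share of the lruvec.
    return lruvec_size - lruvec_size * protection // usage


if __name__ == "__main__":
    low = 1_000_000
    # As in the timeline, the lruvec size is taken to equal memory.current.
    for current in (999_999, 1_000_000, 1_000_001, 2_000_000):
        print(
            f"memory.low={low} memory.current={current} "
            f"binary={binary_scan_target(current, current, low)} "
            f"proportional={proportional_scan_target(current, current, low)}"
        )
```

Under these assumptions, one page over the threshold exposes the whole lruvec in the binary model but only a single page in the proportional one, which is the cliff the commit message describes.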
..
ABI	Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity	2019-09-27 19:37:27 -07:00
accounting	docs: add some documentation dirs to the driver-api book	2019-07-15 11:03:02 -03:00
admin-guide	mm, memcg: proportional memory.{low,min} reclaim	2019-10-07 15:47:20 -07:00
arm	Documentation/arm/samsung-s3c24xx: Remove stray U+FEFF character to fix title	2019-08-12 15:25:32 -06:00
arm64	Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2019-09-17 11:42:15 -07:00
block	docs: block: null_blk: enhance document style	2019-09-11 16:04:22 -06:00
bpf	bpf/flow_dissector: document flags	2019-07-25 18:00:41 -07:00
cdrom	docs: add some directories to the main documentation index	2019-07-15 11:03:03 -03:00
core-api	kernel-doc: core-api: include string.h into core-api	2019-09-25 17:51:39 -07:00
cpu-freq	Documentation: cpufreq: Update policy notifier documentation	2019-09-02 22:44:05 +02:00
crypto	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6	2019-09-18 12:11:14 -07:00
dev-tools	Merge branch 'pdf_fixes_v1' of https://git.linuxtv.org/mchehab/experimental into mauro	2019-07-22 13:51:20 -06:00
devicetree	dt-bindings: phy: lantiq: Fix Property Name	2019-10-02 14:14:58 -05:00
doc-guide	docs: remove extra conf.py files	2019-07-17 06:57:52 -03:00
driver-api	This is the bulk of pin control changes for the v5.4 kernel	2019-09-19 14:19:33 -07:00
EDID	docs: driver-api: add a series of orphaned documents	2019-07-15 11:03:02 -03:00
fault-injection	docs: add some directories to the main documentation index	2019-07-15 11:03:03 -03:00
fb	docs conversion for v5.3-rc1	2019-07-16 12:21:41 -07:00
features	It's a somewhat calmer cycle for docs this time, as the churn of the mass	2019-09-17 16:22:26 -07:00
filesystems	add virtio-fs	2019-09-27 15:54:24 -07:00
firmware_class
firmware-guide	Documentation: ACPI: DSD: Convert LED documentation to ReST	2019-08-20 23:53:46 +02:00
fpga	Documentation: fpga: dfl: add descriptions for virtualization and new interfaces.	2019-09-03 19:35:42 -07:00
gpu	Merge drm/drm-next into drm-intel-next-queued	2019-08-22 00:10:36 -07:00
hid	docs: add some documentation dirs to the driver-api book	2019-07-15 11:03:02 -03:00
hwmon	It's a somewhat calmer cycle for docs this time, as the churn of the mass	2019-09-17 16:22:26 -07:00
i2c	docs: i2c: convert to ReST and add to driver-api bookset	2019-07-31 13:25:27 -06:00
ia64	docs: add SPDX tags to new index files	2019-07-15 11:03:03 -03:00
ide	docs: add some directories to the main documentation index	2019-07-15 11:03:03 -03:00
iio	docs: add some documentation dirs to the driver-api book	2019-07-15 11:03:02 -03:00
infiniband	Documentation/infiniband: update name of some functions	2019-09-13 16:55:55 -03:00
input	Input: docs: fix spelling mistake "potocol" -> "protocol"	2019-08-06 11:24:49 -06:00
ioctl	fs-verity: add UAPI header	2019-07-28 16:59:16 -07:00
isdn	docs: isdn: convert to ReST and add to kAPI bookset	2019-07-31 13:30:25 -06:00
kbuild	Modules updates for v5.4	2019-09-22 10:34:46 -07:00
kernel-hacking	docs: Add documentation for Symbol Namespaces	2019-09-10 10:30:49 +02:00
leds	leds: core: Add support for composing LED class device names	2019-07-25 20:07:52 +02:00
livepatch	docs: add some directories to the main documentation index	2019-07-15 11:03:03 -03:00
locking	doc:locking: remove reference to clever use of read-write lock	2019-09-14 01:53:27 -06:00
m68k	docs: README.buddha: convert to ReST and add to m68k book	2019-07-31 13:30:10 -06:00
maintainer	docs: Fix typo on pull requests guide	2019-08-12 15:14:14 -06:00
media	drm main pull for 5.4-rc1	2019-09-19 16:24:24 -07:00
mic	docs: driver-api: add remaining converted dirs to it	2019-07-15 11:03:03 -03:00
mips	Main MIPS changes for v5.4:	2019-09-22 09:30:30 -07:00
misc-devices	Docs: misc: xilinx_sdfec: Add documentation	2019-08-15 17:54:38 +02:00
netlabel	docs: add some directories to the main documentation index	2019-07-15 11:03:03 -03:00
networking	Documentation: Clarify trap's description	2019-09-27 20:33:19 +02:00
nios2	docs: nios2: add it to the main Documentation body	2019-07-31 13:31:51 -06:00
openrisc	docs: openrisc: convert to ReST and add to documentation body	2019-07-31 13:30:20 -06:00
parisc	docs: parisc: convert to ReST and add to documentation body	2019-07-31 13:30:15 -06:00
PCI	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	2019-08-27 14:23:31 -07:00
pcmcia	docs: add some directories to the main documentation index	2019-07-15 11:03:03 -03:00
power	Merge branches 'pm-opp', 'pm-qos', 'acpi-pm', 'pm-domains' and 'pm-tools'	2019-09-17 09:49:19 +02:00
powerpc	docs: powerpc: Add missing documentation reference	2019-09-17 23:59:34 +10:00
process	Documentation/process update for 5.4-rc1	2019-09-29 19:52:52 -07:00
RCU	Merge branches 'consolidate.2019.08.01b', 'fixes.2019.08.12a', 'lists.2019.08.13a' and 'torture.2019.08.01b' into HEAD	2019-08-13 14:30:30 -07:00
riscv	It's a somewhat calmer cycle for docs this time, as the churn of the mass	2019-09-17 16:22:26 -07:00
s390	Documentation/s390: remove outdated debugging390 documentation	2019-08-21 12:41:43 +02:00
scheduler	sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices	2019-08-08 09:09:30 +02:00
scsi	scsi: ufs: Documentation: Announce ufs-tool v1.0	2019-06-26 22:47:51 -04:00
security	Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity	2019-09-27 19:37:27 -07:00
sh	docs: remove extra conf.py files	2019-07-17 06:57:52 -03:00
sound	sound updates for 5.4	2019-09-17 17:43:33 -07:00
sparc	docs: add arch doc directories to the index	2019-07-15 11:03:01 -03:00
sphinx	Documentation: sphinx: Don't parse socket() as identifier reference	2019-08-12 14:55:30 -06:00
sphinx-static
spi	spi: docs: convert to ReST and add it to the kABI bookset	2019-07-31 14:13:13 -06:00
target	docs: add some directories to the main documentation index	2019-07-15 11:03:03 -03:00
timers	docs: add some directories to the main documentation index	2019-07-15 11:03:03 -03:00
trace	Tracing updates:	2019-09-20 11:19:48 -07:00
translations	doc: arm64: fix grammar dtb placed in no attributes region	2019-09-06 08:44:34 -06:00
usb	USB: Move wusbcore and UWB to staging as it is obsolete	2019-08-08 07:52:01 +02:00
userspace-api	docs: remove extra conf.py files	2019-07-17 06:57:52 -03:00
virt	KVM/Hyper-V: Add new KVM capability KVM_CAP_HYPERV_DIRECT_TLBFLUSH	2019-09-24 13:37:13 +02:00
virtual	cpuidle: add haltpoll governor	2019-07-30 17:27:37 +02:00
vm	mm: treewide: clarify pgtable_page_{ctor,dtor}() naming	2019-09-26 10:10:44 -07:00
w1	docs: w1: convert to ReST and add to the kAPI group of docs	2019-07-31 14:16:17 -06:00
watchdog	linux-watchdog 5.4-rc1 tag	2019-09-27 11:17:38 -07:00
x86	dma-mapping: fix filename references	2019-09-03 08:36:30 +02:00
xtensa	docs: add arch doc directories to the index	2019-07-15 11:03:01 -03:00
.gitignore
atomic_bitops.txt
atomic_t.txt	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2019-07-08 16:12:03 -07:00
bus-virt-phys-mapping.txt
Changes
CodingStyle
conf.py	docs: conf.py: only use CJK if the font is available	2019-07-17 06:57:51 -03:00
COPYING-logo	docs: logo.txt: rename it to COPYING-logo	2019-07-15 09:20:27 -03:00
crc32.txt
debugging-modules.txt
debugging-via-ohci1394.txt
digsig.txt
DMA-API-HOWTO.txt	docs: DMA-API-HOWTO.txt: fix an unmarked code block	2019-07-15 09:20:24 -03:00
DMA-API.txt	dma-mapping: remove dma_release_declared_memory	2019-09-04 11:13:19 +02:00
DMA-attributes.txt
DMA-ISA-LPC.txt
docutils.conf	doc-rst: Add missing newline at end of file	2019-06-20 14:16:56 -06:00
dontdiff	kbuild: create *.mod with full directory path and remove MODVERDIR	2019-07-18 02:19:31 +09:00
futex-requeue-pi.txt
hwspinlock.txt	hwspinlock: add the 'in_atomic' API	2019-06-29 21:08:14 -07:00
index.rst	Main MIPS changes for v5.4:	2019-09-22 09:30:30 -07:00
io_ordering.txt
io-mapping.txt
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
Kconfig	docs: Kbuild/Makefile: allow check for missing docs at build time	2019-06-07 11:33:16 -06:00
kobject.txt
kprobes.txt	Merge branch 'parisc-5.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux	2019-05-07 19:34:17 -07:00
kref.txt
logo.gif
lzo.txt
mailbox.txt
Makefile	docs: Kbuild/Makefile: allow check for missing docs at build time	2019-06-07 11:33:16 -06:00
memory-barriers.txt	docs: fix broken doc references due to renames	2019-07-17 06:57:51 -03:00
nommu-mmap.txt
padata.txt	padata: allocate workqueue internally	2019-09-13 21:15:39 +10:00
percpu-rw-semaphore.txt
pi-futex.txt	docs: locking: convert docs to ReST and rename to *.rst	2019-07-15 08:53:27 -03:00
preempt-locking.txt
rbtree.txt	docs: rbtree.txt: fix Sphinx build warnings	2019-07-15 09:20:24 -03:00
remoteproc.txt	remoteproc: add vendor resources handling	2019-06-29 12:02:17 -07:00
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt
speculation.txt
static-keys.txt
SubmittingPatches
tee.txt	Documentation: tee: Grammar s/the its/its/	2019-06-07 11:23:38 -06:00
this_cpu_ops.txt
unaligned-memory-access.txt
xz.txt