linux/Documentation
Johannes Weiner 8a931f8013 mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says.  The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.

Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.

Consider memory in a data center host.  At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing.  Both branches are further subdivided into individual
services, job components etc.

We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workloads with respect to each other.  Their memory demand can
vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree.  Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general.  This
results in very inefficient resource distribution.

Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level.  Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads.  It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g.  rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.

It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.

To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subtree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.

That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now.  However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size.  Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree.  But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.

E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.

This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.

The floating protection composes naturally with fixed protection.
Consider the following example tree:

          A             A: low = 2G
         / \           A1: low = 1G
        A1 A2          A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
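
The arithmetic in this example can be reproduced with a small model.  The
following Python sketch is an illustration only, not the kernel code: the
function name, its parameters, and the usage figures (A1 using 2G, A2 using
1G, A fully protected at 2G) are assumptions made for the example.  It also
assumes that the unclaimed remainder is shared in proportion to each child's
usage above its own claimed protection.

```python
GIB = 1 << 30

def effective_protection(usage, parent_usage, setting,
                         parent_effective, siblings_protected,
                         recursive=True):
    """Hypothetical model of a child's effective memory.low protection."""
    # A child can only claim protection for memory it actually uses.
    protected = min(usage, setting)

    # Children claim more than the parent has: scale claims down.
    if siblings_protected > parent_effective:
        return protected * parent_effective // siblings_protected

    ep = protected
    if not recursive:
        # Behavior without the mount option: unclaimed protection is dropped.
        return ep

    # Distribute the unclaimed ("floating") remainder in proportion to
    # each child's usage that is not covered by its own claim (assumption).
    if parent_effective > siblings_protected and parent_usage > siblings_protected:
        unclaimed = parent_effective - siblings_protected
        ep += unclaimed * (usage - protected) // (parent_usage - siblings_protected)
    return ep

# Assumed usage: A1 = 2G, A2 = 1G, so A uses 3G and is protected up to 2G.
claimed = min(2 * GIB, 1 * GIB) + min(1 * GIB, 0)   # sum of children's claims
a1 = effective_protection(2 * GIB, 3 * GIB, 1 * GIB, 2 * GIB, claimed)
a2 = effective_protection(1 * GIB, 3 * GIB, 0, 2 * GIB, claimed)
print(a1 / GIB, a2 / GIB)
```

With these assumed usage figures, both children have 1G of usage beyond
their own claim, so the floating 1G splits evenly and the totals match the
1.5G and 0.5G stated above.  Passing recursive=False models the previous
behavior, where A1 keeps only its explicit 1G and A2 gets nothing.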

There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree.  Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior.  Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
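
Since the behavior is opt-in, it can be useful to check whether a host has
it enabled.  The Python sketch below is a hedged illustration: the helper
name is hypothetical, and it assumes the option is listed in the options
field of the cgroup2 entry in a /proc/mounts-style table.

```python
def has_recursive_protection(mounts_text):
    """Return True if a cgroup2 mount lists memory_recursiveprot.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "cgroup2":
            if "memory_recursiveprot" in fields[3].split(","):
                return True
    return False

# Sample entry made up for illustration; real entries vary per system.
sample = ("cgroup2 /sys/fs/cgroup cgroup2 "
          "rw,nosuid,nodev,noexec,relatime,memory_recursiveprot 0 0\n")
print(has_recursive_protection(sample))
```

On a live system, the same helper could be given the contents of
/proc/mounts instead of the sample string.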

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 09:35:28 -07:00
ABI TTY/Serial patches for 5.7-rc1 2020-03-31 16:18:55 -07:00
accounting doc: cgroup: improve formatting of references 2020-03-02 12:57:03 -07:00
admin-guide mm: memcontrol: recursive memory.low protection 2020-04-02 09:35:28 -07:00
arm docs: arm: tcm: Fix a few typos 2020-02-19 02:42:21 -07:00
arm64 arm64 updates for 5.7: 2020-03-31 10:05:01 -07:00
block block: Document genhd capability flags 2020-03-12 07:47:22 -06:00
bpf bpf: lsm: Add Documentation 2020-03-30 01:35:12 +02:00
cdrom
core-api mm: dump_page(): additional diagnostics for huge pinned pages 2020-04-02 09:35:27 -07:00
cpu-freq docs: cpu-freq: convert cpufreq-stats.txt to ReST 2020-03-06 00:01:02 +01:00
crypto
dev-tools This has been a busy cycle for documentation work. Highlights include: 2020-03-30 12:45:23 -07:00
devicetree Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-03-31 17:29:33 -07:00
doc-guide Documentation: build warnings related to missing blank lines after explicit markups has been fixed 2020-02-05 10:30:03 -07:00
driver-api Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2020-03-30 16:13:08 -07:00
fault-injection
fb fbdev: fbmem: allow overriding the number of bootup logos 2020-01-03 14:27:40 +01:00
features documentation: vm: Advertise support for pte_special in riscv 2020-02-19 02:41:01 -07:00
filesystems fscrypt updates for 5.7 2020-03-31 12:58:36 -07:00
firmware_class
firmware-guide
fpga
gpu docs: gpu: i915.rst: fix warnings due to file renames 2020-02-25 03:02:35 -07:00
hid
hwmon docs: hwmon: Update documentation for isl68137 pmbus driver 2020-03-22 16:42:54 -07:00
i2c docs: i2c: writing-clients: properly name the stop condition 2020-01-29 22:02:09 +01:00
ia64
ide
iio
infiniband
input
isdn
kbuild Kbuild updates for v5.7 2020-03-31 16:03:39 -07:00
kernel-hacking docs: locking: Drop :c:func: throughout 2020-03-20 17:16:24 -06:00
leds
livepatch
locking Documentation/locking/locktypes: Minor copy editor fixes 2020-03-28 12:47:34 +01:00
m68k
maintainer Add a maintainer entry profile for documentation 2020-01-24 09:48:39 -07:00
media media updates for v5.7-rc1 2020-03-30 13:42:05 -07:00
mips docs: mips: remove no longer needed au1xxx_ide.rst documentation 2020-03-24 15:53:48 +01:00
misc-devices docs: Move Intel Many Integrated Core documentation (mic) under misc-devices 2020-03-10 11:12:34 -06:00
netlabel
networking Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-03-31 17:29:33 -07:00
nios2
nvdimm docs: nvdimm: use ReST notation for subsection 2020-01-24 09:54:42 -07:00
openrisc
parisc
PCI docs: fix pointers to io-mapping.rst and io_ordering.rst files 2020-03-11 14:15:20 -06:00
pcmcia
power Merge branches 'pm-core', 'pm-sleep', 'pm-acpi' and 'pm-domains' 2020-03-30 14:46:58 +02:00
powerpc docs: prevent warnings due to autosectionlabel 2020-03-20 17:01:29 -06:00
process This has been a busy cycle for documentation work. Highlights include: 2020-03-30 12:45:23 -07:00
RCU doc: Add rcutorture scripting to torture.txt 2020-02-27 07:03:14 -08:00
riscv It has been a relatively quiet cycle for documentation, but there's still a 2020-01-29 15:27:31 -08:00
s390
scheduler
scsi scsi: simplify scsi_partsize 2020-03-24 07:57:07 -06:00
security docs: prevent warnings due to autosectionlabel 2020-03-20 17:01:29 -06:00
sh
sound sound updates for 5.6-rc1 2020-01-28 16:26:57 -08:00
sparc
sphinx docs: Fix empty parallelism argument 2020-02-25 03:11:04 -07:00
sphinx-static
spi
target docs: prevent warnings due to autosectionlabel 2020-03-20 17:01:29 -06:00
timers
trace Power management updates for 5.7-rc1 2020-03-30 15:05:01 -07:00
translations media updates for v5.7-rc1 2020-03-30 13:42:05 -07:00
usb usb: gadget: add raw-gadget interface 2020-03-15 11:34:48 +02:00
userspace-api docs: userspace: ioctl-number: remove mc146818rtc conflict 2020-02-13 11:42:02 -07:00
virt KVM: SVM: document KVM_MEM_ENCRYPT_OP, let userspace detect if SEV is available 2020-03-20 13:47:52 -04:00
vm mm/zswap.c: add allocation hysteresis if pool limit is hit 2020-01-31 10:30:39 -08:00
w1 docs: w1: Fix a typo in omap-hdq.rst 2019-12-30 11:58:02 -07:00
watchdog
x86 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2020-03-31 11:04:05 -07:00
xtensa
.gitignore
asm-annotations.rst Documentation: Call out example SYM_FUNC_* usage as x86-specific 2020-01-16 12:53:16 -07:00
atomic_bitops.txt
atomic_t.txt
bus-virt-phys-mapping.txt
Changes
CodingStyle
conf.py docs: conf.py: avoid thousands of duplicate label warning on Sphinx 2020-03-20 17:01:34 -06:00
COPYING-logo
crc32.txt
debugging-via-ohci1394.txt
digsig.txt
DMA-API-HOWTO.txt
DMA-API.txt
DMA-attributes.txt
DMA-ISA-LPC.txt
docutils.conf
dontdiff
futex-requeue-pi.txt
hwspinlock.txt
index.rst Power management updates for 5.7-rc1 2020-03-30 15:05:01 -07:00
IPMI.txt
IRQ-affinity.txt
IRQ-domain.txt
IRQ.txt
irqflags-tracing.txt
Kconfig
kprobes.txt
kref.txt docs: kref: Clarify the use of two kref_put() in example code 2020-02-25 03:39:10 -07:00
logo.gif
lzo.txt
mailbox.txt
Makefile Kbuild updates for v5.7 2020-03-31 16:03:39 -07:00
memory-barriers.txt Documentation/memory-barriers: Fix typos 2020-02-27 07:03:14 -08:00
nommu-mmap.txt
percpu-rw-semaphore.txt
pi-futex.txt
preempt-locking.txt
rbtree.txt
remoteproc.txt
robust-futex-ABI.txt threads: Update PID limit comment according to futex UAPI change 2020-03-21 17:48:13 +01:00
robust-futexes.txt
rpmsg.txt
speculation.txt
static-keys.txt
SubmittingPatches
tee.txt Documentation: tee: add AMD-TEE driver details 2020-01-04 13:49:51 +08:00
this_cpu_ops.txt
unaligned-memory-access.txt
xz.txt