Cgroup unified hierarchy

April, 2014 Tejun Heo <tj@kernel.org>

This document describes the changes made by unified hierarchy and
their rationales. It will eventually be merged into the main cgroup
documentation.

CONTENTS

1. Background
2. Basic Operation
  2-1. Mounting
  2-2. cgroup.subtree_control
  2-3. cgroup.controllers
3. Structural Constraints
  3-1. Top-down
  3-2. No internal tasks
4. Other Changes
  4-1. [Un]populated Notification
  4-2. Other Core Changes
  4-3. Per-Controller Changes
    4-3-1. blkio
    4-3-2. cpuset
    4-3-3. memory
5. Planned Changes
  5-1. CAP for resource control


1. Background

cgroup allows an arbitrary number of hierarchies and each hierarchy
can host any number of controllers. While this seems to provide a
high level of flexibility, it isn't very useful in practice.

For example, as there is only one instance of each controller,
utility-type controllers such as freezer, which can be useful in all
hierarchies, can only be used in one. The issue is exacerbated by the
fact that controllers can't be moved around once hierarchies are
populated. Another issue is that all controllers bound to a hierarchy
are forced to have exactly the same view of the hierarchy. It isn't
possible to vary the granularity depending on the specific controller.

In practice, these issues heavily limit which controllers can be put
on the same hierarchy and most configurations resort to putting each
controller on its own hierarchy. Only closely related ones, such as
the cpu and cpuacct controllers, make sense to put on the same
hierarchy. This often means that userland ends up managing multiple
similar hierarchies, repeating the same steps on each hierarchy
whenever a hierarchy management operation is necessary.

Unfortunately, support for multiple hierarchies comes at a steep cost.
The internal implementation in cgroup core proper is dazzlingly
complicated but, more importantly, the support for multiple
hierarchies restricts how cgroup is used in general and what
controllers can do.

There's no limit on how many hierarchies there may be, which means
that a task's cgroup membership can't be described in finite length.
The key may contain any number of entries and is unlimited in length,
which makes it highly awkward to handle and leads to the addition of
controllers which exist only to identify membership, which in turn
exacerbates the original problem.

Also, as a controller can't have any expectation regarding what shape
of hierarchies other controllers would be on, each controller has to
assume that all other controllers are operating on completely
orthogonal hierarchies. This makes it impossible, or at least very
cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are
completely orthogonal to each other isn't necessary. What usually is
called for is the ability to have differing levels of granularity
depending on the specific controller. In other words, the hierarchy
may be collapsed from leaf towards root when viewed from specific
controllers. For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.

Unified hierarchy is the next version of the cgroup interface. It
aims to address the aforementioned issues by having more structure
while retaining enough flexibility for most use cases. Various other
general and controller-specific interface issues are also addressed in
the process.


2. Basic Operation

2-1. Mounting

Currently, unified hierarchy can be mounted with the following mount
command. Note that this is still under development and scheduled to
change soon.

 mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT

All controllers which are not bound to other hierarchies are
automatically bound to unified hierarchy and show up at the root of
it. Controllers which are enabled only in the root of unified
hierarchy can be bound to other hierarchies at any time. This allows
mixing unified hierarchy with the traditional multiple hierarchies in
a fully backward compatible way.

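Before mounting, it can help to see which controllers are free to be
picked up. The following is a minimal sketch, not part of the cgroup
interface itself. It assumes that the table in /proc/cgroups reports a
hierarchy ID of 0 for controllers which are not attached to any
traditional hierarchy; the function name is hypothetical.

```shell
# List enabled controllers that are not attached to any traditional
# hierarchy.
# Assumption: in a /proc/cgroups style table the second column is the
# hierarchy ID and 0 means "not bound to a traditional hierarchy";
# the fourth column is the enabled flag.
unbound_controllers() {
    # $1: path to the table (defaults to /proc/cgroups)
    awk '!/^#/ && $2 == 0 && $4 == 1 { print $1 }' "${1:-/proc/cgroups}"
}
```

Calling unbound_controllers without an argument reads /proc/cgroups
and prints one controller name per line.
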

2-2. cgroup.subtree_control

All cgroups on unified hierarchy have a "cgroup.subtree_control" file
which governs which controllers are enabled on the children of the
cgroup. Let's assume a hierarchy like the following.

  root - A - B - C
               \ D

root's "cgroup.subtree_control" file determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in a "cgroup.subtree_control" file declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.

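The parent-child relationship in the example hierarchy can be sketched
as follows. This is an illustration only: the function names are
hypothetical and the first argument is assumed to be the mount point
of an already mounted unified hierarchy.

```shell
# Create the example tree root - A - B - {C, D}.
make_example_tree() {
    # $1: root of the hierarchy
    mkdir -p "$1/A/B/C" "$1/A/B/D"
}

# Print the file that decides which controllers are enabled on a
# cgroup: the "cgroup.subtree_control" file of its parent.
controlling_file() {
    # $1: path to a non-root cgroup directory
    printf '%s/cgroup.subtree_control\n' "$(dirname "$1")"
}
```

controlling_file prints the same path for C and for D, which is why
the enable state is shared among siblings.
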
When read, the file contains a space-separated list of currently
enabled controllers. A write to the file should contain a
space-separated list of controllers, each prefixed with '+' or '-'
(without the quotes). Controllers prefixed with '+' are enabled and
those prefixed with '-' are disabled. If a controller is listed
multiple times, the last entry wins. The specific operations are
executed atomically - either all succeed or all fail.

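As a rough sketch of the write format, the first function issues one
write and the second models how the tokens are interpreted. Both names
are hypothetical, and apply_tokens only mimics the semantics described
here; it is not the kernel's implementation.

```shell
# Issue a single write to a cgroup's "cgroup.subtree_control" file.
write_subtree_control() {
    # $1: cgroup directory, $2: token list such as "+memory -cpu"
    echo "$2" > "$1/cgroup.subtree_control"
}

# Model the result of a write: start from the currently enabled list
# and apply each '+name' / '-name' token in order, so that the last
# mention of a controller wins.
apply_tokens() (
    # $1: currently enabled controllers, $2: token list
    enabled=" $1 "
    for tok in $2; do
        name=${tok#[+-]}
        # drop the name if it is present, then re-add it for '+'
        enabled=$(printf '%s' "$enabled" | sed "s/ $name / /")
        case $tok in
            +*) enabled="$enabled$name " ;;
        esac
    done
    # unquoted on purpose: collapses the padding into single spaces
    echo $enabled
)
```

For example, apply_tokens "cpu" "+memory -cpu" prints "memory".
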

2-3. cgroup.controllers

The read-only "cgroup.controllers" file contains a space-separated
list of controllers which can be enabled in the cgroup's
"cgroup.subtree_control" file.

In the root cgroup, this lists controllers which are not bound to
other hierarchies and the content changes as controllers are bound to
and unbound from other hierarchies.

In non-root cgroups, the content of this file equals that of the
parent's "cgroup.subtree_control" file as only controllers enabled
from the parent can be used in its children.

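A script can consult "cgroup.controllers" before attempting to enable
a controller. The following is a minimal sketch with hypothetical
function names; it relies only on the two interface files described in
this section.

```shell
# Succeed if a controller is listed in the cgroup's
# "cgroup.controllers" file.
controller_available() {
    # $1: cgroup directory, $2: controller name
    tr ' ' '\n' < "$1/cgroup.controllers" | grep -qxF -- "$2"
}

# Enable a controller for the children of a cgroup, but only if the
# cgroup reports it as available.
enable_if_available() {
    # $1: cgroup directory, $2: controller name
    if controller_available "$1" "$2"; then
        echo "+$2" > "$1/cgroup.subtree_control"
    else
        echo "controller $2 is not available in $1" >&2
        return 1
    fi
}
```

The check is advisory only; the content of the file can change between
the read and the write, so the write may still fail.
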

3. Structural Constraints

3-1. Top-down

As it doesn't make sense to nest control of an uncontrolled resource,
all non-root "cgroup.subtree_control" files can only contain
controllers which are enabled in the parent's "cgroup.subtree_control"
file. A controller can be enabled only if the parent has the
controller enabled and a controller can't be disabled if one or more
children have it enabled.

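One consequence of the top-down rule is that enabling a controller on
a deep cgroup means enabling it at every level above it first. A
minimal sketch, assuming the intermediate cgroups are otherwise
allowed to enable controllers; the function name is hypothetical.

```shell
# Enable one controller on a target cgroup by walking from the root of
# the hierarchy down to the target's parent and enabling it in each
# "cgroup.subtree_control" file on the way.
enable_top_down() (
    # $1: hierarchy root, $2: target path relative to the root
    #     (e.g. "A/B/C"), $3: controller name
    dir=$1
    rest=$2
    while :; do
        echo "+$3" > "$dir/cgroup.subtree_control" || exit 1
        case $rest in
            # descend one level and keep going
            */*) dir="$dir/${rest%%/*}"; rest=${rest#*/} ;;
            # $dir was the target's parent, so we are done
            *)   break ;;
        esac
    done
)
```

For the example tree, enable_top_down $MOUNT_POINT A/B/C memory writes
"+memory" to the files of root, A and B, in that order. Disabling has
to proceed in the opposite direction, from the deepest level upwards.
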

3-2. No internal tasks

One long-standing issue that cgroup faces is the competition between
tasks belonging to the parent cgroup and its child cgroups. This is
inherently nasty as two different types of entities compete and there
is no agreed-upon obvious way to handle it. Different controllers are
doing different things.

The cpu controller considers tasks and cgroups as equivalents and maps
nice levels to cgroup weights. This works for some cases but falls
flat when children should be allocated specific ratios of CPU cycles
and the number of internal tasks fluctuates - the ratios constantly
change as the number of competing entities fluctuates. There also are
other issues. The mapping from nice level to weight isn't obvious or
universal, and there are various other knobs which simply aren't
available for tasks.

The blkio controller implicitly creates a hidden leaf node for each
cgroup to host the tasks. The hidden leaf has its own copies of all
the knobs with "leaf_" prefixed. While this allows equivalent control
over internal tasks, it comes with serious drawbacks. It always adds
an extra layer of nesting which may not be necessary, makes the
interface messy and significantly complicates the implementation.

The memory controller currently doesn't have a way to control what
happens between internal tasks and child cgroups and the behavior is
not clearly defined. There have been attempts to add ad-hoc behaviors
and knobs to tailor the behavior to specific workloads. Continuing in
this direction will lead to problems which will be extremely difficult
to resolve in the long term.

Multiple controllers struggle with internal tasks and have come up
with different ways to deal with them; unfortunately, all the
approaches in use now are severely flawed and, furthermore, the widely
different behaviors make cgroup as a whole highly inconsistent.

It is clear that this is something which needs to be addressed from
cgroup core proper in a uniform way so that controllers don't need to
worry about it and cgroup as a whole shows a consistent and logical
behavior. To achieve that, unified hierarchy enforces the following
structural constraint:

  Except for the root, only cgroups which don't contain any task may
  have controllers enabled in their "cgroup.subtree_control" files.

Combined with other properties, this guarantees that, when a
controller is looking at the part of the hierarchy which has it
enabled, tasks are always only on the leaves. This rules out
situations where child cgroups compete against internal tasks of the
parent.

There are two things to note. Firstly, the root cgroup is exempt from
the restriction. Root contains tasks and anonymous resource
consumption which can't be associated with any other cgroup and
requires special treatment from most controllers. How resource
consumption in the root cgroup is governed is up to each controller.

Secondly, the restriction doesn't take effect if there is no enabled
controller in the cgroup's "cgroup.subtree_control" file. This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup. To control resource distribution of a cgroup, the
cgroup must create children and transfer all its tasks to the children
before enabling controllers in its "cgroup.subtree_control" file.

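The sequence described in the last paragraph can be sketched as
follows. The function name and the child name passed to it are
hypothetical, and the sketch makes no attempt to handle processes
which fork while the migration is in progress.

```shell
# Prepare a populated cgroup for resource control: create a child,
# move every process into it, then enable the controllers.
push_tasks_down() (
    # $1: populated cgroup directory
    # $2: name for the new child (hypothetical, e.g. "leaf")
    # $3: token list such as "+memory"
    mkdir -p "$1/$2" || exit 1
    # Snapshot the pid list first; the file changes as processes move.
    pids=$(cat "$1/cgroup.procs")
    for pid in $pids; do
        # One pid per write. A process may have exited in the
        # meantime, so a failed write is not treated as fatal.
        echo "$pid" > "$1/$2/cgroup.procs" 2>/dev/null || true
    done
    echo "$3" > "$1/cgroup.subtree_control"
)
```

If a process was missed, the cgroup still contains a task and the
final write is expected to fail, in which case the caller can simply
retry.
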

4. Other Changes

4-1. [Un]populated Notification

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

- It delivers events by forking and execing a userland binary
  specified as the release_agent. This is a long-deprecated method of
  notification delivery. It's extremely heavy, slow and cumbersome to
  integrate with larger infrastructure.

- There is a single monitoring point at the root. There's no way to
  delegate management of a subtree.

- The event isn't recursive. It triggers when a cgroup doesn't have
  any tasks or child cgroups. Events for internal nodes trigger only
  after all children are removed. This again makes it impossible to
  delegate management of a subtree.

- Events are filtered from the kernel side. A "notify_on_release"
  file is used to subscribe to or suppress release events. This is
  unnecessarily complicated and probably done this way because event
  delivery itself was expensive.

Unified hierarchy implements an interface file "cgroup.populated"
which can be used to monitor whether the cgroup's subhierarchy has
tasks in it or not. Its value is 0 if there is no task in the cgroup
and its descendants; otherwise, 1. poll and [id]notify events are
triggered when the value changes.

This is significantly lighter and simpler and trivially allows
delegating management of a subhierarchy - a subhierarchy monitor can
block further propagation simply by putting itself or another process
in the subhierarchy and can monitor the events that it's interested in
from there without interfering with monitoring higher in the tree.

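Reading "cgroup.populated" from a script could look like the sketch
here. The function names are hypothetical. For simplicity the wait
loop sleeps and re-reads the file; an event-driven monitor would wait
for poll or inotify events on the same file instead.

```shell
# Succeed if the cgroup or any of its descendants contains a task.
is_populated() {
    # $1: cgroup directory
    [ "$(cat "$1/cgroup.populated")" = "1" ]
}

# Block until the subhierarchy is empty, checking at a fixed interval.
wait_until_empty() {
    # $1: cgroup directory, $2: seconds between checks (default 1)
    while is_populated "$1"; do
        sleep "${2:-1}"
    done
}
```

Once wait_until_empty returns, the subhierarchy held no task at the
time of the last check and can be considered for cleanup.
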
In unified hierarchy, the release_agent mechanism is no longer
supported and the interface files "release_agent" and
"notify_on_release" do not exist.


4-2. Other Core Changes

- None of the mount options is allowed.

- remount is disallowed.

- rename(2) is disallowed.

- The "tasks" file is removed. Everything should be at process
  granularity. Use the "cgroup.procs" file instead.

- The "cgroup.procs" file is not sorted. pids will be unique unless
  they got recycled in-between reads.

- The "cgroup.clone_children" file is removed.

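Tools which relied on the old sorted output can sort on their own. A
minimal sketch with a hypothetical function name:

```shell
# Print the pids in a cgroup in ascending numeric order, dropping any
# duplicate that a recycled pid may have produced.
list_procs() {
    # $1: cgroup directory
    sort -n -u "$1/cgroup.procs"
}
```
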

4-3. Per-Controller Changes

4-3-1. blkio

- blk-throttle becomes properly hierarchical.


4-3-2. cpuset

- Tasks are kept in empty cpusets after hotplug and take on the masks
  of the nearest non-empty ancestor, instead of being moved to it.

- A task can be moved into an empty cpuset, and again it takes on the
  masks of the nearest non-empty ancestor.

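The "nearest non-empty ancestor" lookup can be modelled from userland
as a walk up the directory tree. This sketch is an illustration of the
rule, not the kernel's implementation. It assumes the CPU mask is
exposed in a file named "cpuset.cpus" and only covers CPUs, not memory
nodes; the function name is hypothetical.

```shell
# Print the CPU list a cpuset would take on: its own if non-empty,
# otherwise that of the nearest ancestor with a non-empty list.
# Assumption: the mask lives in a file named "cpuset.cpus".
effective_cpus() (
    # $1: cpuset cgroup directory, $2: root of the hierarchy
    dir=$1
    while :; do
        cpus=$(cat "$dir/cpuset.cpus" 2>/dev/null)
        if [ -n "$cpus" ]; then
            echo "$cpus"
            exit 0
        fi
        # give up once the root (or the top of the filesystem) has
        # been checked without finding a non-empty list
        case $dir in
            "$2"|/|.) exit 1 ;;
        esac
        dir=$(dirname "$dir")
    done
)
```
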

4-3-3. memory

- use_hierarchy is on by default and the cgroup file for the flag is
  not created.


5. Planned Changes

5-1. CAP for resource control

Unified hierarchy will require one of the capabilities(7), which is
yet to be decided, for all resource control related knobs. Process
organization operations - creation of sub-cgroups and migration of
processes in sub-hierarchies - may be delegated by changing the
ownership and/or permissions on the cgroup directory and
"cgroup.procs" interface file; however, all operations which affect
resource control - writes to a "cgroup.subtree_control" file or any
controller-specific knobs - will require an explicit CAP privilege.

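Delegating process organization by ownership could be sketched as
follows. The function name is hypothetical, and since this section
describes a planned change, the exact set of files to hand over is an
assumption based on the paragraph above.

```shell
# Hand a cgroup over to another user for process organization only:
# the directory (to create sub-cgroups) and "cgroup.procs" (to
# migrate processes). Resource control files are left untouched.
delegate_cgroup() {
    # $1: user name or uid, $2: cgroup directory
    chown "$1" "$2" "$2/cgroup.procs"
}
```
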
This, in part, is to prevent the cgroup interface from being
inadvertently promoted to a programmable API used by non-privileged
binaries. cgroup exposes various aspects of the system in ways which
aren't properly abstracted for direct consumption by regular programs.
This is an administration interface much closer to sysctl knobs than
system calls. Even the basic access model, being filesystem path
based, isn't suitable for direct consumption. There's no way to
access "my cgroup" in a race-free way or make multiple operations
atomic against migration to another cgroup.

Another aspect is that, for better or for worse, the cgroup interface
goes through far less scrutiny than regular interfaces for
unprivileged userland. The upside is that cgroup is able to expose
useful features which may not be suitable for general consumption in a
reasonable time frame. It provides a relatively short path between
internal details and userland-visible interface. Of course, this
shortcut comes with high risk. We go through what we go through for
general kernel APIs for good reasons. It may end up leaking internal
details in a way which can exert significant pain by locking the
kernel into a contract that can't be maintained in a reasonable
manner.

Also, due to its specific nature, cgroup and its controllers don't
tend to attract attention from a wide scope of developers. cgroup's
short history is already fraught with severely mis-designed
interfaces, unnecessary commitments to and exposing of internal
details, and broken and dangerous implementations of various features.

Keeping cgroup as an administration interface is both advantageous for
its role and imperative given its nature. Some of the cgroup features
may make sense for unprivileged access. If deemed justified, those
must be further abstracted and implemented as a different interface,
be it a system call or process-private filesystem, and survive through
the scrutiny that any interface for general consumption is required to
go through.

Requiring CAP is not a complete solution but should serve as a
significant deterrent against spraying cgroup usages in non-privileged
programs.