Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
/*
|
|
|
|
* Generic process-grouping system.
|
|
|
|
*
|
|
|
|
* Based originally on the cpuset system, extracted by Paul Menage
|
|
|
|
* Copyright (C) 2006 Google, Inc
|
|
|
|
*
|
2010-03-10 23:22:20 +00:00
|
|
|
* Notifications support
|
|
|
|
* Copyright (C) 2009 Nokia Corporation
|
|
|
|
* Author: Kirill A. Shutemov
|
|
|
|
*
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
* Copyright notices from the original cpuset code:
|
|
|
|
* --------------------------------------------------
|
|
|
|
* Copyright (C) 2003 BULL SA.
|
|
|
|
* Copyright (C) 2004-2006 Silicon Graphics, Inc.
|
|
|
|
*
|
|
|
|
* Portions derived from Patrick Mochel's sysfs code.
|
|
|
|
* sysfs is Copyright (c) 2001-3 Patrick Mochel
|
|
|
|
*
|
|
|
|
* 2003-10-10 Written by Simon Derr.
|
|
|
|
* 2003-10-22 Updates by Stephen Hemminger.
|
|
|
|
* 2004 May-July Rework by Paul Jackson.
|
|
|
|
* ---------------------------------------------------
|
|
|
|
*
|
|
|
|
* This file is subject to the terms and conditions of the GNU General Public
|
|
|
|
* License. See the file COPYING in the main directory of the Linux
|
|
|
|
* distribution for more details.
|
|
|
|
*/
|
|
|
|
|
2014-04-25 22:28:03 +00:00
|
|
|
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
|
|
|
|
|
2016-12-27 19:49:06 +00:00
|
|
|
#include "cgroup-internal.h"
|
|
|
|
|
2021-12-16 02:55:37 +00:00
|
|
|
#include <linux/bpf-cgroup.h>
|
2011-06-02 11:20:51 +00:00
|
|
|
#include <linux/cred.h>
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
#include <linux/errno.h>
|
2011-06-02 11:20:51 +00:00
|
|
|
#include <linux/init_task.h>
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
#include <linux/kernel.h>
|
2014-04-26 07:40:28 +00:00
|
|
|
#include <linux/magic.h>
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
#include <linux/mutex.h>
|
|
|
|
#include <linux/mount.h>
|
|
|
|
#include <linux/pagemap.h>
|
2007-10-19 06:39:35 +00:00
|
|
|
#include <linux/proc_fs.h>
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
#include <linux/rcupdate.h>
|
|
|
|
#include <linux/sched.h>
|
2017-02-08 17:51:36 +00:00
|
|
|
#include <linux/sched/task.h>
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/spinlock.h>
|
2015-09-16 16:53:17 +00:00
|
|
|
#include <linux/percpu-rwsem.h>
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
#include <linux/string.h>
|
2013-01-10 03:49:27 +00:00
|
|
|
#include <linux/hashtable.h>
|
2009-09-23 22:56:23 +00:00
|
|
|
#include <linux/idr.h>
|
2012-04-21 07:13:46 +00:00
|
|
|
#include <linux/kthread.h>
|
2011-07-26 23:09:06 +00:00
|
|
|
#include <linux/atomic.h>
|
2016-01-19 17:18:41 +00:00
|
|
|
#include <linux/cpuset.h>
|
2016-01-29 08:54:06 +00:00
|
|
|
#include <linux/proc_ns.h>
|
|
|
|
#include <linux/nsproxy.h>
|
2016-06-30 17:28:42 +00:00
|
|
|
#include <linux/file.h>
|
2019-01-17 05:22:58 +00:00
|
|
|
#include <linux/fs_parser.h>
|
2018-04-26 21:29:04 +00:00
|
|
|
#include <linux/sched/cputime.h>
|
2023-05-08 07:58:51 +00:00
|
|
|
#include <linux/sched/deadline.h>
|
2018-10-26 22:06:31 +00:00
|
|
|
#include <linux/psi.h>
|
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 22:38:53 +00:00
|
|
|
#include <net/sock.h>
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2016-08-10 15:23:44 +00:00
|
|
|
#define CREATE_TRACE_POINTS
|
|
|
|
#include <trace/events/cgroup.h>
|
|
|
|
|
2014-02-11 16:52:48 +00:00
|
|
|
#define CGROUP_FILE_NAME_MAX (MAX_CGROUP_TYPE_NAMELEN + \
|
|
|
|
MAX_CFTYPE_NAME + 2)
|
2018-04-26 21:29:04 +00:00
|
|
|
/* let's not notify more than 100 times per second */
|
|
|
|
#define CGROUP_FILE_NOTIFY_MIN_INTV DIV_ROUND_UP(HZ, 100)
|
2014-02-11 16:52:48 +00:00
|
|
|
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
/*
|
|
|
|
* To avoid confusing the compiler (and generating warnings) with code
|
|
|
|
* that attempts to access what would be a 0-element array (i.e. sized
|
|
|
|
* to a potentially empty array when CGROUP_SUBSYS_COUNT == 0), this
|
|
|
|
* constant expression can be added.
|
|
|
|
*/
|
|
|
|
#define CGROUP_HAS_SUBSYS_CONFIG (CGROUP_SUBSYS_COUNT > 0)
|
|
|
|
|
2011-12-13 02:12:21 +00:00
|
|
|
/*
|
|
|
|
* cgroup_mutex is the master lock. Any modification to cgroup or its
|
|
|
|
* hierarchy must be performed while holding it.
|
|
|
|
*
|
2015-10-15 20:41:53 +00:00
|
|
|
* css_set_lock protects task->cgroups pointer, the list of css_set
|
2014-02-25 15:04:03 +00:00
|
|
|
* objects, and the chain of tasks off each css_set.
|
2011-12-13 02:12:21 +00:00
|
|
|
*
|
2014-02-25 15:04:03 +00:00
|
|
|
* These locks are exported if CONFIG_PROVE_RCU so that accessors in
|
|
|
|
* cgroup.h can use them for lockdep annotations.
|
2011-12-13 02:12:21 +00:00
|
|
|
*/
|
2013-04-07 16:29:51 +00:00
|
|
|
DEFINE_MUTEX(cgroup_mutex);
|
2015-10-15 20:41:53 +00:00
|
|
|
DEFINE_SPINLOCK(css_set_lock);
|
2016-12-27 19:49:06 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_PROVE_RCU
|
2014-02-25 15:04:03 +00:00
|
|
|
EXPORT_SYMBOL_GPL(cgroup_mutex);
|
2015-10-15 20:41:53 +00:00
|
|
|
EXPORT_SYMBOL_GPL(css_set_lock);
|
2013-04-07 16:29:51 +00:00
|
|
|
#endif
|
|
|
|
|
cgroup/tracing: Move taking of spin lock out of trace event handlers
It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-07-09 21:48:54 +00:00
|
|
|
DEFINE_SPINLOCK(trace_cgroup_path_lock);
|
|
|
|
char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
|
2022-05-17 11:25:23 +00:00
|
|
|
static bool cgroup_debug __read_mostly;
|
cgroup/tracing: Move taking of spin lock out of trace event handlers
It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-07-09 21:48:54 +00:00
|
|
|
|
2014-05-04 19:09:13 +00:00
|
|
|
/*
|
2014-05-04 19:09:14 +00:00
|
|
|
* Protects cgroup_idr and css_idr so that IDs can be released without
|
|
|
|
* grabbing cgroup_mutex.
|
2014-05-04 19:09:13 +00:00
|
|
|
*/
|
|
|
|
static DEFINE_SPINLOCK(cgroup_idr_lock);
|
|
|
|
|
2015-11-05 05:12:24 +00:00
|
|
|
/*
|
|
|
|
* Protects cgroup_file->kn for !self csses. It synchronizes notifications
|
|
|
|
* against file removal/re-creation across css hiding.
|
|
|
|
*/
|
|
|
|
static DEFINE_SPINLOCK(cgroup_file_kn_lock);
|
|
|
|
|
2019-04-23 16:32:41 +00:00
|
|
|
DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
|
2015-09-16 16:53:17 +00:00
|
|
|
|
2014-05-13 16:19:23 +00:00
|
|
|
#define cgroup_assert_mutex_or_rcu_locked() \
|
2015-06-18 22:50:02 +00:00
|
|
|
RCU_LOCKDEP_WARN(!rcu_read_lock_held() && \
|
|
|
|
!lockdep_is_held(&cgroup_mutex), \
|
2014-05-13 16:19:23 +00:00
|
|
|
"cgroup_mutex or RCU read lock required");
|
2013-12-06 20:11:56 +00:00
|
|
|
|
cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue. css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex. As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu(). If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue. This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose. As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
separate core_initcall(). In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
2013-11-22 22:14:39 +00:00
|
|
|
/*
|
|
|
|
* cgroup destruction makes heavy use of work items and there can be a lot
|
|
|
|
* of concurrent destructions. Use a separate workqueue so that cgroup
|
|
|
|
* destruction work items don't end up filling up max_active of system_wq
|
|
|
|
* which may lead to deadlock.
|
|
|
|
*/
|
|
|
|
static struct workqueue_struct *cgroup_destroy_wq;
|
|
|
|
|
2014-02-08 15:36:58 +00:00
|
|
|
/* generate an array of cgroup subsystem pointers */
|
2014-02-08 15:36:58 +00:00
|
|
|
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
|
2016-12-27 19:49:06 +00:00
|
|
|
struct cgroup_subsys *cgroup_subsys[] = {
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
#include <linux/cgroup_subsys.h>
|
|
|
|
};
|
2014-02-08 15:36:58 +00:00
|
|
|
#undef SUBSYS
|
|
|
|
|
|
|
|
/* array of cgroup subsystem names */
|
|
|
|
#define SUBSYS(_x) [_x ## _cgrp_id] = #_x,
|
|
|
|
static const char *cgroup_subsys_name[] = {
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
#include <linux/cgroup_subsys.h>
|
|
|
|
};
|
2014-02-08 15:36:58 +00:00
|
|
|
#undef SUBSYS
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2015-09-18 15:56:28 +00:00
|
|
|
/* array of static_keys for cgroup_subsys_enabled() and cgroup_subsys_on_dfl() */
|
|
|
|
#define SUBSYS(_x) \
|
|
|
|
DEFINE_STATIC_KEY_TRUE(_x ## _cgrp_subsys_enabled_key); \
|
|
|
|
DEFINE_STATIC_KEY_TRUE(_x ## _cgrp_subsys_on_dfl_key); \
|
|
|
|
EXPORT_SYMBOL_GPL(_x ## _cgrp_subsys_enabled_key); \
|
|
|
|
EXPORT_SYMBOL_GPL(_x ## _cgrp_subsys_on_dfl_key);
|
|
|
|
#include <linux/cgroup_subsys.h>
|
|
|
|
#undef SUBSYS
|
|
|
|
|
|
|
|
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys_enabled_key,
|
|
|
|
static struct static_key_true *cgroup_subsys_enabled_key[] = {
|
|
|
|
#include <linux/cgroup_subsys.h>
|
|
|
|
};
|
|
|
|
#undef SUBSYS
|
|
|
|
|
|
|
|
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys_on_dfl_key,
|
|
|
|
static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
|
|
|
|
#include <linux/cgroup_subsys.h>
|
|
|
|
};
|
|
|
|
#undef SUBSYS
|
|
|
|
|
2018-04-26 21:29:04 +00:00
|
|
|
static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);
|
2017-09-25 15:12:05 +00:00
|
|
|
|
2020-05-13 02:13:11 +00:00
|
|
|
/* the default hierarchy */
|
2018-04-26 21:29:04 +00:00
|
|
|
struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
|
2015-08-05 20:03:19 +00:00
|
|
|
EXPORT_SYMBOL_GPL(cgrp_dfl_root);
|
2013-06-24 22:21:47 +00:00
|
|
|
|
2014-03-19 14:23:55 +00:00
|
|
|
/*
|
|
|
|
* The default hierarchy always exists but is hidden until mounted for the
|
|
|
|
* first time. This is for backward compatibility.
|
|
|
|
*/
|
2016-02-23 15:00:50 +00:00
|
|
|
static bool cgrp_dfl_visible;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2014-05-14 23:33:07 +00:00
|
|
|
/* some controllers are not supported in the default hierarchy */
|
2016-02-23 15:00:50 +00:00
|
|
|
static u16 cgrp_dfl_inhibit_ss_mask;
|
2014-05-14 23:33:07 +00:00
|
|
|
|
2016-03-08 16:51:26 +00:00
|
|
|
/* some controllers are implicitly enabled on the default hierarchy */
|
2017-01-20 17:06:08 +00:00
|
|
|
static u16 cgrp_dfl_implicit_ss_mask;
|
2016-03-08 16:51:26 +00:00
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
/* some controllers can be threaded on the default hierarchy */
|
|
|
|
static u16 cgrp_dfl_threaded_ss_mask;
|
|
|
|
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
/* The list of hierarchy roots */
|
2016-12-27 19:49:06 +00:00
|
|
|
LIST_HEAD(cgroup_roots);
|
2013-06-24 22:21:47 +00:00
|
|
|
static int cgroup_root_count;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2014-02-08 15:37:01 +00:00
|
|
|
/* hierarchy ID allocation and mapping, protected by cgroup_mutex */
|
2013-04-14 18:36:58 +00:00
|
|
|
static DEFINE_IDR(cgroup_hierarchy_idr);
|
2009-09-23 22:56:23 +00:00
|
|
|
|
2013-06-18 10:53:53 +00:00
|
|
|
/*
|
2014-05-16 17:22:49 +00:00
|
|
|
* Assign a monotonically increasing serial number to csses. It guarantees
|
|
|
|
* cgroups with bigger numbers are newer than those with smaller numbers.
|
|
|
|
* Also, as csses are always appended to the parent's ->children list, it
|
|
|
|
* guarantees that sibling csses are always sorted in the ascending serial
|
|
|
|
* number order on the list. Protected by cgroup_mutex.
|
2013-06-18 10:53:53 +00:00
|
|
|
*/
|
2014-05-16 17:22:49 +00:00
|
|
|
static u64 css_serial_nr_next = 1;
|
2013-06-18 10:53:53 +00:00
|
|
|
|
2015-06-06 00:02:14 +00:00
|
|
|
/*
|
2017-01-20 17:06:08 +00:00
|
|
|
* These bitmasks identify subsystems with specific features to avoid
|
|
|
|
* having to do iterative checks repeatedly.
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
*/
|
2016-02-23 03:25:47 +00:00
|
|
|
static u16 have_fork_callback __read_mostly;
|
|
|
|
static u16 have_exit_callback __read_mostly;
|
2019-01-28 16:00:13 +00:00
|
|
|
static u16 have_release_callback __read_mostly;
|
2017-01-20 17:06:08 +00:00
|
|
|
static u16 have_canfork_callback __read_mostly;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2023-09-27 14:25:40 +00:00
|
|
|
static bool have_favordynmods __ro_after_init = IS_ENABLED(CONFIG_CGROUP_FAVOR_DYNMODS);
|
|
|
|
|
2016-01-29 08:54:06 +00:00
|
|
|
/* cgroup namespace for init task */
|
|
|
|
struct cgroup_namespace init_cgroup_ns = {
|
2020-08-03 10:16:50 +00:00
|
|
|
.ns.count = REFCOUNT_INIT(2),
|
2016-01-29 08:54:06 +00:00
|
|
|
.user_ns = &init_user_ns,
|
|
|
|
.ns.ops = &cgroupns_operations,
|
|
|
|
.ns.inum = PROC_CGROUP_INIT_INO,
|
|
|
|
.root_cset = &init_css_set,
|
|
|
|
};
|
|
|
|
|
2015-11-16 16:13:34 +00:00
|
|
|
static struct file_system_type cgroup2_fs_type;
|
2016-12-27 19:49:08 +00:00
|
|
|
static struct cftype cgroup_base_files[];
|
2022-09-06 19:38:55 +00:00
|
|
|
static struct cftype cgroup_psi_files[];
|
2013-06-28 23:24:11 +00:00
|
|
|
|
2021-05-24 19:53:39 +00:00
|
|
|
/* cgroup optional features */
|
|
|
|
enum cgroup_opt_features {
|
|
|
|
#ifdef CONFIG_PSI
|
|
|
|
OPT_FEATURE_PRESSURE,
|
|
|
|
#endif
|
|
|
|
OPT_FEATURE_COUNT
|
|
|
|
};
|
|
|
|
|
|
|
|
static const char *cgroup_opt_feature_names[OPT_FEATURE_COUNT] = {
|
|
|
|
#ifdef CONFIG_PSI
|
|
|
|
"pressure",
|
|
|
|
#endif
|
|
|
|
};
|
|
|
|
|
|
|
|
static u16 cgroup_feature_disable_mask __read_mostly;
|
|
|
|
|
2016-03-03 14:58:01 +00:00
|
|
|
static int cgroup_apply_control(struct cgroup *cgrp);
|
|
|
|
static void cgroup_finalize_control(struct cgroup *cgrp, int ret);
|
2019-05-31 17:38:58 +00:00
|
|
|
static void css_task_iter_skip(struct css_task_iter *it,
|
|
|
|
struct task_struct *task);
|
2012-11-19 16:13:37 +00:00
|
|
|
static int cgroup_destroy_locked(struct cgroup *cgrp);
|
2016-03-03 14:57:58 +00:00
|
|
|
static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
|
|
|
|
struct cgroup_subsys *ss);
|
2014-05-14 13:15:02 +00:00
|
|
|
static void css_release(struct percpu_ref *ref);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
static void kill_css(struct cgroup_subsys_state *css);
|
2015-09-18 21:54:23 +00:00
|
|
|
static int cgroup_addrm_files(struct cgroup_subsys_state *css,
|
|
|
|
struct cgroup *cgrp, struct cftype cfts[],
|
2013-08-09 00:11:23 +00:00
|
|
|
bool is_add);
|
2012-11-19 16:13:37 +00:00
|
|
|
|
2022-10-28 20:45:44 +00:00
|
|
|
#ifdef CONFIG_DEBUG_CGROUP_REF
|
|
|
|
#define CGROUP_REF_FN_ATTRS noinline
|
2022-10-31 17:12:13 +00:00
|
|
|
#define CGROUP_REF_EXPORT(fn) EXPORT_SYMBOL_GPL(fn);
|
2022-10-28 20:45:44 +00:00
|
|
|
#include <linux/cgroup_refcnt.h>
|
|
|
|
#endif
|
|
|
|
|
2015-09-18 15:56:28 +00:00
|
|
|
/**
|
|
|
|
* cgroup_ssid_enabled - cgroup subsys enabled test by subsys ID
|
|
|
|
* @ssid: subsys ID of interest
|
|
|
|
*
|
|
|
|
* cgroup_subsys_enabled() can only be used with literal subsys names which
|
|
|
|
* is fine for individual subsystems but unsuitable for cgroup core. This
|
|
|
|
* is slower static_key_enabled() based test indexed by @ssid.
|
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
bool cgroup_ssid_enabled(int ssid)
|
2015-09-18 15:56:28 +00:00
|
|
|
{
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
if (!CGROUP_HAS_SUBSYS_CONFIG)
|
2016-03-14 23:21:06 +00:00
|
|
|
return false;
|
|
|
|
|
2015-09-18 15:56:28 +00:00
|
|
|
return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
|
|
|
|
}
|
|
|
|
|
2015-09-18 15:56:28 +00:00
|
|
|
/**
|
|
|
|
* cgroup_on_dfl - test whether a cgroup is on the default hierarchy
|
|
|
|
* @cgrp: the cgroup of interest
|
|
|
|
*
|
|
|
|
* The default hierarchy is the v2 interface of cgroup and this function
|
|
|
|
* can be used to test whether a cgroup is on the default hierarchy for
|
2020-11-09 10:31:11 +00:00
|
|
|
* cases where a subsystem should behave differently depending on the
|
2015-09-18 15:56:28 +00:00
|
|
|
* interface version.
|
|
|
|
*
|
|
|
|
* List of changed behaviors:
|
|
|
|
*
|
|
|
|
* - Mount options "noprefix", "xattr", "clone_children", "release_agent"
|
|
|
|
* and "name" are disallowed.
|
|
|
|
*
|
|
|
|
* - When mounting an existing superblock, mount options should match.
|
|
|
|
*
|
|
|
|
* - rename(2) is disallowed.
|
|
|
|
*
|
|
|
|
* - "tasks" is removed. Everything should be at process granularity. Use
|
|
|
|
* "cgroup.procs" instead.
|
|
|
|
*
|
|
|
|
* - "cgroup.procs" is not sorted. pids will be unique unless they got
|
2020-11-09 10:31:11 +00:00
|
|
|
* recycled in-between reads.
|
2015-09-18 15:56:28 +00:00
|
|
|
*
|
|
|
|
* - "release_agent" and "notify_on_release" are removed. Replacement
|
|
|
|
* notification mechanism will be implemented.
|
|
|
|
*
|
|
|
|
* - "cgroup.clone_children" is removed.
|
|
|
|
*
|
|
|
|
* - "cgroup.subtree_populated" is available. Its value is 0 if the cgroup
|
|
|
|
* and its descendants contain no task; otherwise, 1. The file also
|
|
|
|
* generates kernfs notification which can be monitored through poll and
|
|
|
|
* [di]notify when the value of the file changes.
|
|
|
|
*
|
|
|
|
* - cpuset: tasks will be kept in empty cpusets when hotplug happens and
|
|
|
|
* take masks of ancestors with non-empty cpus/mems, instead of being
|
|
|
|
* moved to an ancestor.
|
|
|
|
*
|
|
|
|
* - cpuset: a task can be moved into an empty cpuset, and again it takes
|
|
|
|
* masks of ancestors.
|
|
|
|
*
|
|
|
|
* - blkcg: blk-throttle becomes properly hierarchical.
|
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
bool cgroup_on_dfl(const struct cgroup *cgrp)
|
2015-09-18 15:56:28 +00:00
|
|
|
{
|
|
|
|
return cgrp->root == &cgrp_dfl_root;
|
|
|
|
}
|
|
|
|
|
2014-05-04 19:09:13 +00:00
|
|
|
/* IDR wrappers which synchronize using cgroup_idr_lock */
|
|
|
|
static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
|
|
|
|
gfp_t gfp_mask)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
idr_preload(gfp_mask);
|
2014-05-13 16:10:59 +00:00
|
|
|
spin_lock_bh(&cgroup_idr_lock);
|
2015-11-07 00:28:21 +00:00
|
|
|
ret = idr_alloc(idr, ptr, start, end, gfp_mask & ~__GFP_DIRECT_RECLAIM);
|
2014-05-13 16:10:59 +00:00
|
|
|
spin_unlock_bh(&cgroup_idr_lock);
|
2014-05-04 19:09:13 +00:00
|
|
|
idr_preload_end();
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void *cgroup_idr_replace(struct idr *idr, void *ptr, int id)
|
|
|
|
{
|
|
|
|
void *ret;
|
|
|
|
|
2014-05-13 16:10:59 +00:00
|
|
|
spin_lock_bh(&cgroup_idr_lock);
|
2014-05-04 19:09:13 +00:00
|
|
|
ret = idr_replace(idr, ptr, id);
|
2014-05-13 16:10:59 +00:00
|
|
|
spin_unlock_bh(&cgroup_idr_lock);
|
2014-05-04 19:09:13 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void cgroup_idr_remove(struct idr *idr, int id)
|
|
|
|
{
|
2014-05-13 16:10:59 +00:00
|
|
|
spin_lock_bh(&cgroup_idr_lock);
|
2014-05-04 19:09:13 +00:00
|
|
|
idr_remove(idr, id);
|
2014-05-13 16:10:59 +00:00
|
|
|
spin_unlock_bh(&cgroup_idr_lock);
|
2014-05-04 19:09:13 +00:00
|
|
|
}
|
|
|
|
|
2017-07-17 01:44:18 +00:00
|
|
|
static bool cgroup_has_tasks(struct cgroup *cgrp)
|
2014-05-16 17:22:48 +00:00
|
|
|
{
|
2017-07-17 01:44:18 +00:00
|
|
|
return cgrp->nr_populated_csets;
|
|
|
|
}
|
2014-05-16 17:22:48 +00:00
|
|
|
|
2023-06-03 07:13:04 +00:00
|
|
|
static bool cgroup_is_threaded(struct cgroup *cgrp)
|
2017-05-15 13:34:02 +00:00
|
|
|
{
|
|
|
|
return cgrp->dom_cgrp != cgrp;
|
|
|
|
}
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
/* can @cgrp host both domain and threaded children? */
|
|
|
|
static bool cgroup_is_mixable(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Root isn't under domain level resource control exempting it from
|
|
|
|
* the no-internal-process constraint, so it can serve as a thread
|
|
|
|
* root and a parent of resource domains at the same time.
|
|
|
|
*/
|
|
|
|
return !cgroup_parent(cgrp);
|
|
|
|
}
|
|
|
|
|
2020-11-09 10:31:11 +00:00
|
|
|
/* can @cgrp become a thread root? Should always be true for a thread root */
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
static bool cgroup_can_be_thread_root(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
/* mixables don't care */
|
|
|
|
if (cgroup_is_mixable(cgrp))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/* domain roots can't be nested under threaded */
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* can only have either domain or threaded children */
|
|
|
|
if (cgrp->nr_populated_domain_children)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* and no domain controllers can be enabled */
|
|
|
|
if (cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask)
|
|
|
|
return false;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* is @cgrp root of a threaded subtree? */
|
2023-06-03 07:13:04 +00:00
|
|
|
static bool cgroup_is_thread_root(struct cgroup *cgrp)
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
{
|
|
|
|
/* thread root should be a domain */
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* a domain w/ threaded children is a thread root */
|
|
|
|
if (cgrp->nr_threaded_children)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* A domain which has tasks and explicit threaded controllers
|
|
|
|
* enabled is a thread root.
|
|
|
|
*/
|
|
|
|
if (cgroup_has_tasks(cgrp) &&
|
|
|
|
(cgrp->subtree_control & cgrp_dfl_threaded_ss_mask))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* a domain which isn't connected to the root w/o brekage can't be used */
|
|
|
|
static bool cgroup_is_valid_domain(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
/* the cgroup itself can be a thread root */
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
/* but the ancestors can't be unless mixable */
|
|
|
|
while ((cgrp = cgroup_parent(cgrp))) {
|
|
|
|
if (!cgroup_is_mixable(cgrp) && cgroup_is_thread_root(cgrp))
|
|
|
|
return false;
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
return true;
|
2014-05-16 17:22:48 +00:00
|
|
|
}
|
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
/* subsystems visibly enabled on a cgroup */
|
|
|
|
static u16 cgroup_control(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup *parent = cgroup_parent(cgrp);
|
|
|
|
u16 root_ss_mask = cgrp->root->subsys_mask;
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
if (parent) {
|
|
|
|
u16 ss_mask = parent->subtree_control;
|
|
|
|
|
|
|
|
/* threaded cgroups can only have threaded controllers */
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
ss_mask &= cgrp_dfl_threaded_ss_mask;
|
|
|
|
return ss_mask;
|
|
|
|
}
|
2016-03-03 14:57:58 +00:00
|
|
|
|
|
|
|
if (cgroup_on_dfl(cgrp))
|
2016-03-08 16:51:26 +00:00
|
|
|
root_ss_mask &= ~(cgrp_dfl_inhibit_ss_mask |
|
|
|
|
cgrp_dfl_implicit_ss_mask);
|
2016-03-03 14:57:58 +00:00
|
|
|
return root_ss_mask;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* subsystems enabled on a cgroup */
|
|
|
|
static u16 cgroup_ss_mask(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup *parent = cgroup_parent(cgrp);
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
if (parent) {
|
|
|
|
u16 ss_mask = parent->subtree_ss_mask;
|
|
|
|
|
|
|
|
/* threaded cgroups can only have threaded controllers */
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
ss_mask &= cgrp_dfl_threaded_ss_mask;
|
|
|
|
return ss_mask;
|
|
|
|
}
|
2016-03-03 14:57:58 +00:00
|
|
|
|
|
|
|
return cgrp->root->subsys_mask;
|
|
|
|
}
|
|
|
|
|
2013-08-09 00:11:27 +00:00
|
|
|
/**
|
|
|
|
* cgroup_css - obtain a cgroup's css for the specified subsystem
|
|
|
|
* @cgrp: the cgroup of interest
|
2014-05-14 13:15:00 +00:00
|
|
|
* @ss: the subsystem of interest (%NULL returns @cgrp->self)
|
2013-08-09 00:11:27 +00:00
|
|
|
*
|
2013-08-26 22:40:56 +00:00
|
|
|
* Return @cgrp's css (cgroup_subsys_state) associated with @ss. This
|
|
|
|
* function must be called either under cgroup_mutex or rcu_read_lock() and
|
|
|
|
* the caller is responsible for pinning the returned css if it wants to
|
|
|
|
* keep accessing it outside the said locks. This function may return
|
|
|
|
* %NULL if @cgrp doesn't have @subsys_id enabled.
|
2013-08-09 00:11:27 +00:00
|
|
|
*/
|
|
|
|
static struct cgroup_subsys_state *cgroup_css(struct cgroup *cgrp,
|
2013-08-26 22:40:56 +00:00
|
|
|
struct cgroup_subsys *ss)
|
2013-08-09 00:11:27 +00:00
|
|
|
{
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
if (CGROUP_HAS_SUBSYS_CONFIG && ss)
|
2014-02-08 15:36:58 +00:00
|
|
|
return rcu_dereference_check(cgrp->subsys[ss->id],
|
cgroup: introduce cgroup_tree_mutex
Currently cgroup uses combination of inode->i_mutex'es and
cgroup_mutex for synchronization. With the scheduled kernfs
conversion, i_mutex'es will be removed. Unfortunately, just using
cgroup_mutex isn't possible. All kernfs file and syscall operations,
most of which require grabbing cgroup_mutex, will be called with
kernfs active ref held and, if we try to perform kernfs removals under
cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the
target node.
Let's introduce a new outer mutex, cgroup_tree_mutex, which protects
stuff used during hierarchy changing operations - cftypes and all the
operations which may affect the cgroupfs. It also covers css
association and iteration. This allows cgroup_css(), for_each_css()
and other css iterators to be called under cgroup_tree_mutex. The new
mutex will nest above both kernfs's active ref protection and
cgroup_mutex. By protecting tree modifications with a separate outer
mutex, we can get rid of the forementioned deadlock condition.
Actual file additions and removals now require cgroup_tree_mutex
instead of cgroup_mutex. Currently, cgroup_tree_mutex is never used
without cgroup_mutex; however, we'll soon add hierarchy modification
sections which are only protected by cgroup_tree_mutex. In the
future, we might want to make the locking more granular by better
splitting the coverages of the two mutexes. For now, this should do.
v2: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11 16:52:47 +00:00
|
|
|
lockdep_is_held(&cgroup_mutex));
|
2013-08-26 22:40:56 +00:00
|
|
|
else
|
2014-05-14 13:15:00 +00:00
|
|
|
return &cgrp->self;
|
2013-08-09 00:11:27 +00:00
|
|
|
}
|
2012-11-19 16:13:37 +00:00
|
|
|
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
/**
|
2018-12-05 17:10:36 +00:00
|
|
|
* cgroup_e_css_by_mask - obtain a cgroup's effective css for the specified ss
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
* @cgrp: the cgroup of interest
|
2014-05-14 13:15:00 +00:00
|
|
|
* @ss: the subsystem of interest (%NULL returns @cgrp->self)
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
*
|
2015-04-23 11:57:33 +00:00
|
|
|
* Similar to cgroup_css() but returns the effective css, which is defined
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
* as the matching css of the nearest ancestor including self which has @ss
|
|
|
|
* enabled. If @ss is associated with the hierarchy @cgrp is on, this
|
|
|
|
* function is guaranteed to return non-NULL css.
|
|
|
|
*/
|
2018-12-05 17:10:36 +00:00
|
|
|
static struct cgroup_subsys_state *cgroup_e_css_by_mask(struct cgroup *cgrp,
|
|
|
|
struct cgroup_subsys *ss)
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
{
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
if (!ss)
|
2014-05-14 13:15:00 +00:00
|
|
|
return &cgrp->self;
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
|
2014-11-18 07:49:52 +00:00
|
|
|
/*
|
|
|
|
* This function is used while updating css associations and thus
|
2016-03-03 14:57:58 +00:00
|
|
|
* can't test the csses directly. Test ss_mask.
|
2014-11-18 07:49:52 +00:00
|
|
|
*/
|
2016-03-03 14:57:58 +00:00
|
|
|
while (!(cgroup_ss_mask(cgrp) & (1 << ss->id))) {
|
2014-05-16 17:22:48 +00:00
|
|
|
cgrp = cgroup_parent(cgrp);
|
2016-03-03 14:57:58 +00:00
|
|
|
if (!cgrp)
|
|
|
|
return NULL;
|
|
|
|
}
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
|
|
|
|
return cgroup_css(cgrp, ss);
|
2013-08-09 00:11:27 +00:00
|
|
|
}
|
2012-11-19 16:13:37 +00:00
|
|
|
|
2018-12-05 17:10:36 +00:00
|
|
|
/**
|
|
|
|
* cgroup_e_css - obtain a cgroup's effective css for the specified subsystem
|
|
|
|
* @cgrp: the cgroup of interest
|
|
|
|
* @ss: the subsystem of interest
|
|
|
|
*
|
|
|
|
* Find and get the effective css of @cgrp for @ss. The effective css is
|
|
|
|
* defined as the matching css of the nearest ancestor including self which
|
|
|
|
* has @ss enabled. If @ss is not mounted on the hierarchy @cgrp is on,
|
|
|
|
* the root css is returned, so this function always returns a valid css.
|
|
|
|
*
|
|
|
|
* The returned css is not guaranteed to be online, and therefore it is the
|
2020-11-09 10:31:11 +00:00
|
|
|
* callers responsibility to try get a reference for it.
|
2018-12-05 17:10:36 +00:00
|
|
|
*/
|
|
|
|
struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
|
|
|
|
struct cgroup_subsys *ss)
|
|
|
|
{
|
|
|
|
struct cgroup_subsys_state *css;
|
|
|
|
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
if (!CGROUP_HAS_SUBSYS_CONFIG)
|
|
|
|
return NULL;
|
|
|
|
|
2018-12-05 17:10:36 +00:00
|
|
|
do {
|
|
|
|
css = cgroup_css(cgrp, ss);
|
|
|
|
|
|
|
|
if (css)
|
|
|
|
return css;
|
|
|
|
cgrp = cgroup_parent(cgrp);
|
|
|
|
} while (cgrp);
|
|
|
|
|
|
|
|
return init_css_set.subsys[ss->id];
|
|
|
|
}
|
|
|
|
|
2014-11-18 07:49:52 +00:00
|
|
|
/**
|
|
|
|
* cgroup_get_e_css - get a cgroup's effective css for the specified subsystem
|
|
|
|
* @cgrp: the cgroup of interest
|
|
|
|
* @ss: the subsystem of interest
|
|
|
|
*
|
|
|
|
* Find and get the effective css of @cgrp for @ss. The effective css is
|
|
|
|
* defined as the matching css of the nearest ancestor including self which
|
|
|
|
* has @ss enabled. If @ss is not mounted on the hierarchy @cgrp is on,
|
|
|
|
* the root css is returned, so this function always returns a valid css.
|
|
|
|
* The returned css must be put using css_put().
|
|
|
|
*/
|
|
|
|
struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
|
|
|
|
struct cgroup_subsys *ss)
|
|
|
|
{
|
|
|
|
struct cgroup_subsys_state *css;
|
|
|
|
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
if (!CGROUP_HAS_SUBSYS_CONFIG)
|
|
|
|
return NULL;
|
|
|
|
|
2014-11-18 07:49:52 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
do {
|
|
|
|
css = cgroup_css(cgrp, ss);
|
|
|
|
|
|
|
|
if (css && css_tryget_online(css))
|
|
|
|
goto out_unlock;
|
|
|
|
cgrp = cgroup_parent(cgrp);
|
|
|
|
} while (cgrp);
|
|
|
|
|
|
|
|
css = init_css_set.subsys[ss->id];
|
|
|
|
css_get(css);
|
|
|
|
out_unlock:
|
|
|
|
rcu_read_unlock();
|
|
|
|
return css;
|
|
|
|
}
|
2021-06-29 02:38:21 +00:00
|
|
|
EXPORT_SYMBOL_GPL(cgroup_get_e_css);
|
2014-11-18 07:49:52 +00:00
|
|
|
|
2017-04-28 19:14:55 +00:00
|
|
|
static void cgroup_get_live(struct cgroup *cgrp)
|
2015-10-15 20:41:50 +00:00
|
|
|
{
|
|
|
|
WARN_ON_ONCE(cgroup_is_dead(cgrp));
|
2023-06-02 07:43:46 +00:00
|
|
|
cgroup_get(cgrp);
|
2015-10-15 20:41:50 +00:00
|
|
|
}
|
|
|
|
|
2019-04-19 17:03:02 +00:00
|
|
|
/**
|
|
|
|
* __cgroup_task_count - count the number of tasks in a cgroup. The caller
|
|
|
|
* is responsible for taking the css_set_lock.
|
|
|
|
* @cgrp: the cgroup in question
|
|
|
|
*/
|
|
|
|
int __cgroup_task_count(const struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
int count = 0;
|
|
|
|
struct cgrp_cset_link *link;
|
|
|
|
|
|
|
|
lockdep_assert_held(&css_set_lock);
|
|
|
|
|
|
|
|
list_for_each_entry(link, &cgrp->cset_links, cset_link)
|
|
|
|
count += link->cset->nr_tasks;
|
|
|
|
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_task_count - count the number of tasks in a cgroup.
|
|
|
|
* @cgrp: the cgroup in question
|
|
|
|
*/
|
|
|
|
int cgroup_task_count(const struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
int count;
|
|
|
|
|
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
count = __cgroup_task_count(cgrp);
|
|
|
|
spin_unlock_irq(&css_set_lock);
|
|
|
|
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
2014-05-13 16:16:21 +00:00
|
|
|
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
|
2014-02-11 16:52:49 +00:00
|
|
|
{
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
struct cgroup *cgrp = of->kn->parent->priv;
|
2014-05-13 16:16:21 +00:00
|
|
|
struct cftype *cft = of_cft(of);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This is open and unprotected implementation of cgroup_css().
|
|
|
|
* seq_css() is only called from a kernfs file operation which has
|
|
|
|
* an active reference on the file. Because all the subsystem
|
|
|
|
* files are drained before a css is disassociated with a cgroup,
|
|
|
|
* the matching css from the cgroup's subsys table is guaranteed to
|
|
|
|
* be and stay valid until the enclosing operation is complete.
|
|
|
|
*/
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
if (CGROUP_HAS_SUBSYS_CONFIG && cft->ss)
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
|
|
|
|
else
|
2014-05-14 13:15:00 +00:00
|
|
|
return &cgrp->self;
|
2014-02-11 16:52:49 +00:00
|
|
|
}
|
2014-05-13 16:16:21 +00:00
|
|
|
EXPORT_SYMBOL_GPL(of_css);
|
2014-02-11 16:52:49 +00:00
|
|
|
|
2013-12-06 20:11:56 +00:00
|
|
|
/**
|
|
|
|
* for_each_css - iterate all css's of a cgroup
|
|
|
|
* @css: the iteration cursor
|
|
|
|
* @ssid: the index of the subsystem, CGROUP_SUBSYS_COUNT after reaching the end
|
|
|
|
* @cgrp: the target cgroup to iterate css's of
|
|
|
|
*
|
2023-06-27 11:40:59 +00:00
|
|
|
* Should be called under cgroup_mutex.
|
2013-12-06 20:11:56 +00:00
|
|
|
*/
|
|
|
|
#define for_each_css(css, ssid, cgrp) \
|
|
|
|
for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT; (ssid)++) \
|
|
|
|
if (!((css) = rcu_dereference_check( \
|
|
|
|
(cgrp)->subsys[(ssid)], \
|
|
|
|
lockdep_is_held(&cgroup_mutex)))) { } \
|
|
|
|
else
|
|
|
|
|
2015-06-06 00:02:14 +00:00
|
|
|
/**
|
2016-02-23 03:25:46 +00:00
|
|
|
* do_each_subsys_mask - filter for_each_subsys with a bitmask
|
2015-06-06 00:02:14 +00:00
|
|
|
* @ss: the iteration cursor
|
|
|
|
* @ssid: the index of @ss, CGROUP_SUBSYS_COUNT after reaching the end
|
2016-02-23 03:25:46 +00:00
|
|
|
* @ss_mask: the bitmask
|
2015-06-06 00:02:14 +00:00
|
|
|
*
|
|
|
|
* The block will only run for cases where the ssid-th bit (1 << ssid) of
|
2016-02-23 03:25:46 +00:00
|
|
|
* @ss_mask is set.
|
2015-06-06 00:02:14 +00:00
|
|
|
*/
|
2016-02-23 03:25:46 +00:00
|
|
|
#define do_each_subsys_mask(ss, ssid, ss_mask) do { \
|
|
|
|
unsigned long __ss_mask = (ss_mask); \
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
if (!CGROUP_HAS_SUBSYS_CONFIG) { \
|
2015-06-09 11:32:07 +00:00
|
|
|
(ssid) = 0; \
|
2016-02-23 03:25:46 +00:00
|
|
|
break; \
|
|
|
|
} \
|
|
|
|
for_each_set_bit(ssid, &__ss_mask, CGROUP_SUBSYS_COUNT) { \
|
|
|
|
(ss) = cgroup_subsys[ssid]; \
|
|
|
|
{
|
|
|
|
|
|
|
|
#define while_each_subsys_mask() \
|
|
|
|
} \
|
|
|
|
} \
|
|
|
|
} while (false)
|
2015-06-06 00:02:14 +00:00
|
|
|
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
/* iterate over child cgrps, lock should be held throughout iteration */
|
|
|
|
#define cgroup_for_each_live_child(child, cgrp) \
|
2014-05-16 17:22:48 +00:00
|
|
|
list_for_each_entry((child), &(cgrp)->self.children, self.sibling) \
|
2014-05-13 16:19:23 +00:00
|
|
|
if (({ lockdep_assert_held(&cgroup_mutex); \
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
cgroup_is_dead(child); })) \
|
|
|
|
; \
|
|
|
|
else
|
2013-04-07 16:29:51 +00:00
|
|
|
|
2020-11-09 10:31:11 +00:00
|
|
|
/* walk live descendants in pre order */
|
2016-03-03 14:57:59 +00:00
|
|
|
#define cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) \
|
|
|
|
css_for_each_descendant_pre((d_css), cgroup_css((cgrp), NULL)) \
|
|
|
|
if (({ lockdep_assert_held(&cgroup_mutex); \
|
|
|
|
(dsct) = (d_css)->cgroup; \
|
|
|
|
cgroup_is_dead(dsct); })) \
|
|
|
|
; \
|
|
|
|
else
|
|
|
|
|
|
|
|
/* walk live descendants in postorder */
|
|
|
|
#define cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) \
|
|
|
|
css_for_each_descendant_post((d_css), cgroup_css((cgrp), NULL)) \
|
|
|
|
if (({ lockdep_assert_held(&cgroup_mutex); \
|
|
|
|
(dsct) = (d_css)->cgroup; \
|
|
|
|
cgroup_is_dead(dsct); })) \
|
|
|
|
; \
|
|
|
|
else
|
|
|
|
|
2014-03-19 14:23:53 +00:00
|
|
|
/*
|
|
|
|
* The default css_set - used by init and its children prior to any
|
2007-10-19 06:39:36 +00:00
|
|
|
* hierarchies being mounted. It contains a pointer to the root state
|
|
|
|
* for each subsystem. Also used to anchor the list of css_sets. Not
|
|
|
|
* reference-counted, to improve performance when child cgroups
|
|
|
|
* haven't been created.
|
|
|
|
*/
|
2014-05-08 01:31:17 +00:00
|
|
|
struct css_set init_css_set = {
|
2017-03-08 08:00:40 +00:00
|
|
|
.refcount = REFCOUNT_INIT(1),
|
2017-05-15 13:34:02 +00:00
|
|
|
.dom_cset = &init_css_set,
|
2014-03-19 14:23:53 +00:00
|
|
|
.tasks = LIST_HEAD_INIT(init_css_set.tasks),
|
|
|
|
.mg_tasks = LIST_HEAD_INIT(init_css_set.mg_tasks),
|
2019-05-31 17:38:58 +00:00
|
|
|
.dying_tasks = LIST_HEAD_INIT(init_css_set.dying_tasks),
|
2016-12-27 19:49:05 +00:00
|
|
|
.task_iters = LIST_HEAD_INIT(init_css_set.task_iters),
|
2017-05-15 13:34:02 +00:00
|
|
|
.threaded_csets = LIST_HEAD_INIT(init_css_set.threaded_csets),
|
2016-12-27 19:49:05 +00:00
|
|
|
.cgrp_links = LIST_HEAD_INIT(init_css_set.cgrp_links),
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
.mg_src_preload_node = LIST_HEAD_INIT(init_css_set.mg_src_preload_node),
|
|
|
|
.mg_dst_preload_node = LIST_HEAD_INIT(init_css_set.mg_dst_preload_node),
|
2014-03-19 14:23:53 +00:00
|
|
|
.mg_node = LIST_HEAD_INIT(init_css_set.mg_node),
|
2017-09-25 20:50:20 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The following field is re-initialized when this cset gets linked
|
|
|
|
* in cgroup_init(). However, let's initialize the field
|
|
|
|
* statically too so that the default cgroup can be accessed safely
|
|
|
|
* early during boot.
|
|
|
|
*/
|
|
|
|
.dfl_cgrp = &cgrp_dfl_root.cgrp,
|
2014-03-19 14:23:53 +00:00
|
|
|
};
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2014-03-19 14:23:53 +00:00
|
|
|
static int css_set_count = 1; /* 1 for init_css_set */
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2017-05-15 13:34:02 +00:00
|
|
|
static bool css_set_threaded(struct css_set *cset)
|
|
|
|
{
|
|
|
|
return cset->dom_cset != cset;
|
|
|
|
}
|
|
|
|
|
2015-10-15 20:41:49 +00:00
|
|
|
/**
|
|
|
|
* css_set_populated - does a css_set contain any tasks?
|
|
|
|
* @cset: target css_set
|
2017-06-13 21:18:01 +00:00
|
|
|
*
|
|
|
|
* css_set_populated() should be the same as !!cset->nr_tasks at steady
|
|
|
|
* state. However, css_set_populated() can be called while a task is being
|
|
|
|
* added to or removed from the linked list before the nr_tasks is
|
|
|
|
* properly updated. Hence, we can't just look at ->nr_tasks here.
|
2015-10-15 20:41:49 +00:00
|
|
|
*/
|
|
|
|
static bool css_set_populated(struct css_set *cset)
|
|
|
|
{
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2015-10-15 20:41:49 +00:00
|
|
|
|
|
|
|
return !list_empty(&cset->tasks) || !list_empty(&cset->mg_tasks);
|
|
|
|
}
|
|
|
|
|
2014-04-25 22:28:02 +00:00
|
|
|
/**
|
2017-07-17 01:43:33 +00:00
|
|
|
* cgroup_update_populated - update the populated count of a cgroup
|
2014-04-25 22:28:02 +00:00
|
|
|
* @cgrp: the target cgroup
|
|
|
|
* @populated: inc or dec populated count
|
|
|
|
*
|
2015-10-15 20:41:49 +00:00
|
|
|
* One of the css_sets associated with @cgrp is either getting its first
|
2017-07-17 01:43:33 +00:00
|
|
|
* task or losing the last. Update @cgrp->nr_populated_* accordingly. The
|
|
|
|
* count is propagated towards root so that a given cgroup's
|
|
|
|
* nr_populated_children is zero iff none of its descendants contain any
|
|
|
|
* tasks.
|
2014-04-25 22:28:02 +00:00
|
|
|
*
|
2017-07-17 01:43:33 +00:00
|
|
|
* @cgrp's interface file "cgroup.populated" is zero if both
|
|
|
|
* @cgrp->nr_populated_csets and @cgrp->nr_populated_children are zero and
|
|
|
|
* 1 otherwise. When the sum changes from or to zero, userland is notified
|
|
|
|
* that the content of the interface file has changed. This can be used to
|
|
|
|
* detect when @cgrp and its descendants become populated or empty.
|
2014-04-25 22:28:02 +00:00
|
|
|
*/
|
|
|
|
static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
|
|
|
|
{
|
2017-07-17 01:43:33 +00:00
|
|
|
struct cgroup *child = NULL;
|
|
|
|
int adj = populated ? 1 : -1;
|
|
|
|
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2014-04-25 22:28:02 +00:00
|
|
|
|
|
|
|
do {
|
2017-07-17 01:43:33 +00:00
|
|
|
bool was_populated = cgroup_is_populated(cgrp);
|
2014-04-25 22:28:02 +00:00
|
|
|
|
2017-05-15 13:34:02 +00:00
|
|
|
if (!child) {
|
2017-07-17 01:43:33 +00:00
|
|
|
cgrp->nr_populated_csets += adj;
|
2017-05-15 13:34:02 +00:00
|
|
|
} else {
|
|
|
|
if (cgroup_is_threaded(child))
|
|
|
|
cgrp->nr_populated_threaded_children += adj;
|
|
|
|
else
|
|
|
|
cgrp->nr_populated_domain_children += adj;
|
|
|
|
}
|
2014-04-25 22:28:02 +00:00
|
|
|
|
2017-07-17 01:43:33 +00:00
|
|
|
if (was_populated == cgroup_is_populated(cgrp))
|
2014-04-25 22:28:02 +00:00
|
|
|
break;
|
|
|
|
|
2016-12-27 19:49:08 +00:00
|
|
|
cgroup1_check_for_release(cgrp);
|
2019-04-19 17:03:08 +00:00
|
|
|
TRACE_CGROUP_PATH(notify_populated, cgrp,
|
|
|
|
cgroup_is_populated(cgrp));
|
2015-09-18 21:54:23 +00:00
|
|
|
cgroup_file_notify(&cgrp->events_file);
|
|
|
|
|
2017-07-17 01:43:33 +00:00
|
|
|
child = cgrp;
|
2014-05-16 17:22:48 +00:00
|
|
|
cgrp = cgroup_parent(cgrp);
|
2014-04-25 22:28:02 +00:00
|
|
|
} while (cgrp);
|
|
|
|
}
|
|
|
|
|
2015-10-15 20:41:49 +00:00
|
|
|
/**
|
|
|
|
* css_set_update_populated - update populated state of a css_set
|
|
|
|
* @cset: target css_set
|
|
|
|
* @populated: whether @cset is populated or depopulated
|
|
|
|
*
|
|
|
|
* @cset is either getting the first task or losing the last. Update the
|
2017-07-17 01:43:33 +00:00
|
|
|
* populated counters of all associated cgroups accordingly.
|
2015-10-15 20:41:49 +00:00
|
|
|
*/
|
|
|
|
static void css_set_update_populated(struct css_set *cset, bool populated)
|
|
|
|
{
|
|
|
|
struct cgrp_cset_link *link;
|
|
|
|
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2015-10-15 20:41:49 +00:00
|
|
|
|
|
|
|
list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
|
|
|
|
cgroup_update_populated(link->cgrp, populated);
|
|
|
|
}
|
|
|
|
|
2019-05-31 17:38:58 +00:00
|
|
|
/*
|
|
|
|
* @task is leaving, advance task iterators which are pointing to it so
|
|
|
|
* that they can resume at the next position. Advancing an iterator might
|
|
|
|
* remove it from the list, use safe walk. See css_task_iter_skip() for
|
|
|
|
* details.
|
|
|
|
*/
|
|
|
|
static void css_set_skip_task_iters(struct css_set *cset,
|
|
|
|
struct task_struct *task)
|
|
|
|
{
|
|
|
|
struct css_task_iter *it, *pos;
|
|
|
|
|
|
|
|
list_for_each_entry_safe(it, pos, &cset->task_iters, iters_node)
|
|
|
|
css_task_iter_skip(it, task);
|
|
|
|
}
|
|
|
|
|
2015-10-15 20:41:52 +00:00
|
|
|
/**
|
|
|
|
* css_set_move_task - move a task from one css_set to another
|
|
|
|
* @task: task being moved
|
|
|
|
* @from_cset: css_set @task currently belongs to (may be NULL)
|
|
|
|
* @to_cset: new css_set @task is being moved to (may be NULL)
|
|
|
|
* @use_mg_tasks: move to @to_cset->mg_tasks instead of ->tasks
|
|
|
|
*
|
|
|
|
* Move @task from @from_cset to @to_cset. If @task didn't belong to any
|
|
|
|
* css_set, @from_cset can be NULL. If @task is being disassociated
|
|
|
|
* instead of moved, @to_cset can be NULL.
|
|
|
|
*
|
2017-07-17 01:43:33 +00:00
|
|
|
* This function automatically handles populated counter updates and
|
2015-10-15 20:41:52 +00:00
|
|
|
* css_task_iter adjustments but the caller is responsible for managing
|
|
|
|
* @from_cset and @to_cset's reference counts.
|
2015-10-15 20:41:52 +00:00
|
|
|
*/
|
|
|
|
static void css_set_move_task(struct task_struct *task,
|
|
|
|
struct css_set *from_cset, struct css_set *to_cset,
|
|
|
|
bool use_mg_tasks)
|
|
|
|
{
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2016-03-03 14:57:57 +00:00
|
|
|
if (to_cset && !css_set_populated(to_cset))
|
|
|
|
css_set_update_populated(to_cset, true);
|
|
|
|
|
2015-10-15 20:41:52 +00:00
|
|
|
if (from_cset) {
|
|
|
|
WARN_ON_ONCE(list_empty(&task->cg_list));
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2019-05-31 17:38:58 +00:00
|
|
|
css_set_skip_task_iters(from_cset, task);
|
2015-10-15 20:41:52 +00:00
|
|
|
list_del_init(&task->cg_list);
|
|
|
|
if (!css_set_populated(from_cset))
|
|
|
|
css_set_update_populated(from_cset, false);
|
|
|
|
} else {
|
|
|
|
WARN_ON_ONCE(!list_empty(&task->cg_list));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (to_cset) {
|
|
|
|
/*
|
|
|
|
* We are synchronized through cgroup_threadgroup_rwsem
|
|
|
|
* against PF_EXITING setting such that we can't race
|
2019-10-04 10:57:39 +00:00
|
|
|
* against cgroup_exit()/cgroup_free() dropping the css_set.
|
2015-10-15 20:41:52 +00:00
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(task->flags & PF_EXITING);
|
|
|
|
|
2018-10-26 22:06:31 +00:00
|
|
|
cgroup_move_task(task, to_cset);
|
2015-10-15 20:41:52 +00:00
|
|
|
list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
|
|
|
|
&to_cset->tasks);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-09-23 22:56:22 +00:00
|
|
|
/*
|
|
|
|
* hash table for cgroup groups. This improves the performance to find
|
|
|
|
* an existing css_set. This hash doesn't (currently) take into
|
|
|
|
* account cgroups in empty hierarchies.
|
|
|
|
*/
|
2008-04-29 08:00:11 +00:00
|
|
|
#define CSS_SET_HASH_BITS 7
|
2013-01-10 03:49:27 +00:00
|
|
|
static DEFINE_HASHTABLE(css_set_table, CSS_SET_HASH_BITS);
|
2008-04-29 08:00:11 +00:00
|
|
|
|
2023-08-17 17:19:13 +00:00
|
|
|
static unsigned long css_set_hash(struct cgroup_subsys_state **css)
|
2008-04-29 08:00:11 +00:00
|
|
|
{
|
2013-01-10 03:49:27 +00:00
|
|
|
unsigned long key = 0UL;
|
2013-06-25 18:53:37 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int i;
|
2008-04-29 08:00:11 +00:00
|
|
|
|
2013-06-25 18:53:37 +00:00
|
|
|
for_each_subsys(ss, i)
|
2013-01-10 03:49:27 +00:00
|
|
|
key += (unsigned long)css[i];
|
|
|
|
key = (key >> 16) ^ key;
|
2008-04-29 08:00:11 +00:00
|
|
|
|
2013-01-10 03:49:27 +00:00
|
|
|
return key;
|
2008-04-29 08:00:11 +00:00
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:09 +00:00
|
|
|
void put_css_set_locked(struct css_set *cset)
|
2007-10-19 06:39:33 +00:00
|
|
|
{
|
2013-06-13 04:04:50 +00:00
|
|
|
struct cgrp_cset_link *link, *tmp_link;
|
2014-04-23 15:13:15 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
2013-06-13 04:04:49 +00:00
|
|
|
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2014-02-13 11:58:40 +00:00
|
|
|
|
2017-03-08 08:00:40 +00:00
|
|
|
if (!refcount_dec_and_test(&cset->refcount))
|
2008-10-19 03:28:03 +00:00
|
|
|
return;
|
2007-10-19 06:39:38 +00:00
|
|
|
|
2017-05-15 13:34:02 +00:00
|
|
|
WARN_ON_ONCE(!list_empty(&cset->threaded_csets));
|
|
|
|
|
2020-11-09 10:31:11 +00:00
|
|
|
/* This css_set is dead. Unlink it and release cgroup and css refs */
|
2015-11-23 19:55:41 +00:00
|
|
|
for_each_subsys(ss, ssid) {
|
2014-04-23 15:13:15 +00:00
|
|
|
list_del(&cset->e_cset_node[ssid]);
|
2015-11-23 19:55:41 +00:00
|
|
|
css_put(cset->subsys[ssid]);
|
|
|
|
}
|
2013-06-13 04:04:49 +00:00
|
|
|
hash_del(&cset->hlist);
|
2009-09-23 22:56:23 +00:00
|
|
|
css_set_count--;
|
|
|
|
|
2013-06-13 04:04:50 +00:00
|
|
|
list_for_each_entry_safe(link, tmp_link, &cset->cgrp_links, cgrp_link) {
|
|
|
|
list_del(&link->cset_link);
|
|
|
|
list_del(&link->cgrp_link);
|
2015-10-15 20:41:51 +00:00
|
|
|
if (cgroup_parent(link->cgrp))
|
|
|
|
cgroup_put(link->cgrp);
|
2009-09-23 22:56:23 +00:00
|
|
|
kfree(link);
|
2007-10-19 06:39:38 +00:00
|
|
|
}
|
2009-09-23 22:56:23 +00:00
|
|
|
|
2017-05-15 13:34:02 +00:00
|
|
|
if (css_set_threaded(cset)) {
|
|
|
|
list_del(&cset->threaded_csets_node);
|
|
|
|
put_css_set_locked(cset->dom_cset);
|
|
|
|
}
|
|
|
|
|
2013-06-13 04:04:49 +00:00
|
|
|
kfree_rcu(cset, rcu_head);
|
2007-10-19 06:39:33 +00:00
|
|
|
}
|
|
|
|
|
2013-06-24 22:21:48 +00:00
|
|
|
/**
|
2009-09-23 22:56:22 +00:00
|
|
|
* compare_css_sets - helper function for find_existing_css_set().
|
2013-06-13 04:04:49 +00:00
|
|
|
* @cset: candidate css_set being tested
|
|
|
|
* @old_cset: existing css_set for a task
|
2009-09-23 22:56:22 +00:00
|
|
|
* @new_cgrp: cgroup that's being entered by the task
|
|
|
|
* @template: desired set of css pointers in css_set (pre-calculated)
|
|
|
|
*
|
2013-07-31 08:18:36 +00:00
|
|
|
* Returns true if "cset" matches "old_cset" except for the hierarchy
|
2009-09-23 22:56:22 +00:00
|
|
|
* which "new_cgrp" belongs to, for which it should match "new_cgrp".
|
|
|
|
*/
|
2013-06-13 04:04:49 +00:00
|
|
|
static bool compare_css_sets(struct css_set *cset,
|
|
|
|
struct css_set *old_cset,
|
2009-09-23 22:56:22 +00:00
|
|
|
struct cgroup *new_cgrp,
|
|
|
|
struct cgroup_subsys_state *template[])
|
|
|
|
{
|
2017-05-15 13:34:02 +00:00
|
|
|
struct cgroup *new_dfl_cgrp;
|
2009-09-23 22:56:22 +00:00
|
|
|
struct list_head *l1, *l2;
|
|
|
|
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
/*
|
|
|
|
* On the default hierarchy, there can be csets which are
|
|
|
|
* associated with the same set of cgroups but different csses.
|
|
|
|
* Let's first ensure that csses match.
|
|
|
|
*/
|
|
|
|
if (memcmp(template, cset->subsys, sizeof(cset->subsys)))
|
2009-09-23 22:56:22 +00:00
|
|
|
return false;
|
|
|
|
|
2017-05-15 13:34:02 +00:00
|
|
|
|
|
|
|
/* @cset's domain should match the default cgroup's */
|
|
|
|
if (cgroup_on_dfl(new_cgrp))
|
|
|
|
new_dfl_cgrp = new_cgrp;
|
|
|
|
else
|
|
|
|
new_dfl_cgrp = old_cset->dfl_cgrp;
|
|
|
|
|
|
|
|
if (new_dfl_cgrp->dom_cgrp != cset->dom_cset->dfl_cgrp)
|
|
|
|
return false;
|
|
|
|
|
2009-09-23 22:56:22 +00:00
|
|
|
/*
|
|
|
|
* Compare cgroup pointers in order to distinguish between
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
* different cgroups in hierarchies. As different cgroups may
|
|
|
|
* share the same effective css, this comparison is always
|
|
|
|
* necessary.
|
2009-09-23 22:56:22 +00:00
|
|
|
*/
|
2013-06-13 04:04:50 +00:00
|
|
|
l1 = &cset->cgrp_links;
|
|
|
|
l2 = &old_cset->cgrp_links;
|
2009-09-23 22:56:22 +00:00
|
|
|
while (1) {
|
2013-06-13 04:04:50 +00:00
|
|
|
struct cgrp_cset_link *link1, *link2;
|
2013-06-13 04:04:49 +00:00
|
|
|
struct cgroup *cgrp1, *cgrp2;
|
2009-09-23 22:56:22 +00:00
|
|
|
|
|
|
|
l1 = l1->next;
|
|
|
|
l2 = l2->next;
|
|
|
|
/* See if we reached the end - both lists are equal length. */
|
2013-06-13 04:04:50 +00:00
|
|
|
if (l1 == &cset->cgrp_links) {
|
|
|
|
BUG_ON(l2 != &old_cset->cgrp_links);
|
2009-09-23 22:56:22 +00:00
|
|
|
break;
|
|
|
|
} else {
|
2013-06-13 04:04:50 +00:00
|
|
|
BUG_ON(l2 == &old_cset->cgrp_links);
|
2009-09-23 22:56:22 +00:00
|
|
|
}
|
|
|
|
/* Locate the cgroups associated with these links. */
|
2013-06-13 04:04:50 +00:00
|
|
|
link1 = list_entry(l1, struct cgrp_cset_link, cgrp_link);
|
|
|
|
link2 = list_entry(l2, struct cgrp_cset_link, cgrp_link);
|
|
|
|
cgrp1 = link1->cgrp;
|
|
|
|
cgrp2 = link2->cgrp;
|
2009-09-23 22:56:22 +00:00
|
|
|
/* Hierarchies should be linked in the same order. */
|
2013-06-13 04:04:49 +00:00
|
|
|
BUG_ON(cgrp1->root != cgrp2->root);
|
2009-09-23 22:56:22 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If this hierarchy is the hierarchy of the cgroup
|
|
|
|
* that's changing, then we need to check that this
|
|
|
|
* css_set points to the new cgroup; if it's any other
|
|
|
|
* hierarchy, then this css_set should point to the
|
|
|
|
* same cgroup as the old css_set.
|
|
|
|
*/
|
2013-06-13 04:04:49 +00:00
|
|
|
if (cgrp1->root == new_cgrp->root) {
|
|
|
|
if (cgrp1 != new_cgrp)
|
2009-09-23 22:56:22 +00:00
|
|
|
return false;
|
|
|
|
} else {
|
2013-06-13 04:04:49 +00:00
|
|
|
if (cgrp1 != cgrp2)
|
2009-09-23 22:56:22 +00:00
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2013-06-24 22:21:48 +00:00
|
|
|
/**
|
|
|
|
* find_existing_css_set - init css array and find the matching css_set
|
|
|
|
* @old_cset: the css_set that we're using before the cgroup transition
|
|
|
|
* @cgrp: the cgroup that we're moving into
|
|
|
|
* @template: out param for the new set of csses, should be clear on entry
|
2007-10-19 06:39:36 +00:00
|
|
|
*/
|
2013-06-13 04:04:49 +00:00
|
|
|
static struct css_set *find_existing_css_set(struct css_set *old_cset,
|
|
|
|
struct cgroup *cgrp,
|
2023-08-17 17:19:13 +00:00
|
|
|
struct cgroup_subsys_state **template)
|
2007-10-19 06:39:33 +00:00
|
|
|
{
|
2014-03-19 14:23:54 +00:00
|
|
|
struct cgroup_root *root = cgrp->root;
|
2013-06-25 18:53:37 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2013-06-13 04:04:49 +00:00
|
|
|
struct css_set *cset;
|
2013-01-10 03:49:27 +00:00
|
|
|
unsigned long key;
|
2013-06-24 22:21:48 +00:00
|
|
|
int i;
|
2007-10-19 06:39:36 +00:00
|
|
|
|
cgroups: revamp subsys array
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree. This is
mainly useful for classifiers and subsystems that hook into components
that are already modules. cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.
It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.
Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear. Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.
Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.
Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.
Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.
Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves. In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy). The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored. Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.
This patch:
Make subsys[] able to be dynamically populated to support modular
subsystems
This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-10 23:22:07 +00:00
|
|
|
/*
|
|
|
|
* Build the set of subsystem state objects that we want to see in the
|
2020-11-09 10:31:11 +00:00
|
|
|
* new css_set. While subsystems can change globally, the entries here
|
cgroups: revamp subsys array
This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree. This is
mainly useful for classifiers and subsystems that hook into components
that are already modules. cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.
It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.
Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear. Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.
Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.
Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.
Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.
Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves. In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy). The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored. Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.
This patch:
Make subsys[] able to be dynamically populated to support modular
subsystems
This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-10 23:22:07 +00:00
|
|
|
* won't change, so no need for locking.
|
|
|
|
*/
|
2013-06-25 18:53:37 +00:00
|
|
|
for_each_subsys(ss, i) {
|
2014-04-23 15:13:14 +00:00
|
|
|
if (root->subsys_mask & (1UL << i)) {
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
/*
|
|
|
|
* @ss is in this hierarchy, so we want the
|
|
|
|
* effective css from @cgrp.
|
|
|
|
*/
|
2018-12-05 17:10:36 +00:00
|
|
|
template[i] = cgroup_e_css_by_mask(cgrp, ss);
|
2007-10-19 06:39:36 +00:00
|
|
|
} else {
|
cgroup: introduce effective cgroup_subsys_state
In the planned default unified hierarchy, controllers may get
dynamically attached to and detached from a cgroup and a cgroup may
not have csses for all the controllers associated with the hierarchy.
When a cgroup doesn't have its own css for a given controller, the css
of the nearest ancestor with the controller enabled will be used,
which is called the effective css. This patch introduces
cgroup_e_css() and for_each_e_css() to access the effective csses and
convert compare_css_sets(), find_existing_css_set() and
cgroup_migrate() to use the effective csses so that they can handle
cgroups with partial csses correctly.
This means that for two css_sets to be considered identical, they
should have both matching csses and cgroups. compare_css_sets()
already compares both, not for correctness but for optimization. As
this now becomes a matter of correctness, update the comments
accordingly.
For all !default hierarchies, cgroup_e_css() always equals
cgroup_css(), so this patch doesn't change behavior.
While at it, fix incorrect locking comment for for_each_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:14 +00:00
|
|
|
/*
|
|
|
|
* @ss is not in this hierarchy, so we don't want
|
|
|
|
* to change the css.
|
|
|
|
*/
|
2013-06-13 04:04:49 +00:00
|
|
|
template[i] = old_cset->subsys[i];
|
2007-10-19 06:39:36 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-01-10 03:49:27 +00:00
|
|
|
key = css_set_hash(template);
|
2013-06-13 04:04:49 +00:00
|
|
|
hash_for_each_possible(css_set_table, cset, hlist, key) {
|
|
|
|
if (!compare_css_sets(cset, old_cset, cgrp, template))
|
2009-09-23 22:56:22 +00:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/* This css_set matches what we need */
|
2013-06-13 04:04:49 +00:00
|
|
|
return cset;
|
2008-04-29 08:00:11 +00:00
|
|
|
}
|
2007-10-19 06:39:36 +00:00
|
|
|
|
|
|
|
/* No existing cgroup group matched */
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2013-06-13 04:04:50 +00:00
|
|
|
static void free_cgrp_cset_links(struct list_head *links_to_free)
|
2008-07-30 05:33:19 +00:00
|
|
|
{
|
2013-06-13 04:04:50 +00:00
|
|
|
struct cgrp_cset_link *link, *tmp_link;
|
2008-07-30 05:33:19 +00:00
|
|
|
|
2013-06-13 04:04:50 +00:00
|
|
|
list_for_each_entry_safe(link, tmp_link, links_to_free, cset_link) {
|
|
|
|
list_del(&link->cset_link);
|
2008-07-30 05:33:19 +00:00
|
|
|
kfree(link);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-06-13 04:04:50 +00:00
|
|
|
/**
|
|
|
|
* allocate_cgrp_cset_links - allocate cgrp_cset_links
|
|
|
|
* @count: the number of links to allocate
|
|
|
|
* @tmp_links: list_head the allocated links are put on
|
|
|
|
*
|
|
|
|
* Allocate @count cgrp_cset_link structures and chain them on @tmp_links
|
|
|
|
* through ->cset_link. Returns 0 on success or -errno.
|
2007-10-19 06:39:36 +00:00
|
|
|
*/
|
2013-06-13 04:04:50 +00:00
|
|
|
static int allocate_cgrp_cset_links(int count, struct list_head *tmp_links)
|
2007-10-19 06:39:36 +00:00
|
|
|
{
|
2013-06-13 04:04:50 +00:00
|
|
|
struct cgrp_cset_link *link;
|
2007-10-19 06:39:36 +00:00
|
|
|
int i;
|
2013-06-13 04:04:50 +00:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(tmp_links);
|
|
|
|
|
2007-10-19 06:39:36 +00:00
|
|
|
for (i = 0; i < count; i++) {
|
2013-06-13 04:04:51 +00:00
|
|
|
link = kzalloc(sizeof(*link), GFP_KERNEL);
|
2007-10-19 06:39:36 +00:00
|
|
|
if (!link) {
|
2013-06-13 04:04:50 +00:00
|
|
|
free_cgrp_cset_links(tmp_links);
|
2007-10-19 06:39:36 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
2013-06-13 04:04:50 +00:00
|
|
|
list_add(&link->cset_link, tmp_links);
|
2007-10-19 06:39:36 +00:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2009-01-08 02:07:42 +00:00
|
|
|
/**
|
|
|
|
* link_css_set - a helper function to link a css_set to a cgroup
|
2013-06-13 04:04:50 +00:00
|
|
|
* @tmp_links: cgrp_cset_link objects allocated by allocate_cgrp_cset_links()
|
2013-06-13 04:04:49 +00:00
|
|
|
* @cset: the css_set to be linked
|
2009-01-08 02:07:42 +00:00
|
|
|
* @cgrp: the destination cgroup
|
|
|
|
*/
|
2013-06-13 04:04:50 +00:00
|
|
|
static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
|
|
|
|
struct cgroup *cgrp)
|
2009-01-08 02:07:42 +00:00
|
|
|
{
|
2013-06-13 04:04:50 +00:00
|
|
|
struct cgrp_cset_link *link;
|
2009-01-08 02:07:42 +00:00
|
|
|
|
2013-06-13 04:04:50 +00:00
|
|
|
BUG_ON(list_empty(tmp_links));
|
2014-04-23 15:13:16 +00:00
|
|
|
|
|
|
|
if (cgroup_on_dfl(cgrp))
|
|
|
|
cset->dfl_cgrp = cgrp;
|
|
|
|
|
2013-06-13 04:04:50 +00:00
|
|
|
link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
|
|
|
|
link->cset = cset;
|
2009-09-23 22:56:22 +00:00
|
|
|
link->cgrp = cgrp;
|
2014-04-25 22:28:02 +00:00
|
|
|
|
2009-09-23 22:56:22 +00:00
|
|
|
/*
|
2015-10-15 20:41:51 +00:00
|
|
|
* Always add links to the tail of the lists so that the lists are
|
2020-11-09 10:31:11 +00:00
|
|
|
* in chronological order.
|
2009-09-23 22:56:22 +00:00
|
|
|
*/
|
2015-10-15 20:41:51 +00:00
|
|
|
list_move_tail(&link->cset_link, &cgrp->cset_links);
|
2013-06-13 04:04:50 +00:00
|
|
|
list_add_tail(&link->cgrp_link, &cset->cgrp_links);
|
2015-10-15 20:41:51 +00:00
|
|
|
|
|
|
|
if (cgroup_parent(cgrp))
|
2017-04-28 19:14:55 +00:00
|
|
|
cgroup_get_live(cgrp);
|
2009-01-08 02:07:42 +00:00
|
|
|
}
|
|
|
|
|
2013-06-24 22:21:48 +00:00
|
|
|
/**
|
|
|
|
* find_css_set - return a new css_set with one cgroup updated
|
|
|
|
* @old_cset: the baseline css_set
|
|
|
|
* @cgrp: the cgroup to be updated
|
|
|
|
*
|
|
|
|
* Return a new css_set that's equivalent to @old_cset, but with @cgrp
|
|
|
|
* substituted into the appropriate hierarchy.
|
2007-10-19 06:39:36 +00:00
|
|
|
*/
|
2013-06-13 04:04:49 +00:00
|
|
|
static struct css_set *find_css_set(struct css_set *old_cset,
|
|
|
|
struct cgroup *cgrp)
|
2007-10-19 06:39:36 +00:00
|
|
|
{
|
2013-06-24 22:21:48 +00:00
|
|
|
struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT] = { };
|
2013-06-13 04:04:49 +00:00
|
|
|
struct css_set *cset;
|
2013-06-13 04:04:50 +00:00
|
|
|
struct list_head tmp_links;
|
|
|
|
struct cgrp_cset_link *link;
|
2014-04-23 15:13:15 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2013-01-10 03:49:27 +00:00
|
|
|
unsigned long key;
|
2014-04-23 15:13:15 +00:00
|
|
|
int ssid;
|
2008-04-29 08:00:11 +00:00
|
|
|
|
2013-06-24 22:21:48 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2007-10-19 06:39:36 +00:00
|
|
|
/* First see if we already have a cgroup group that matches
|
|
|
|
* the desired set */
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2013-06-13 04:04:49 +00:00
|
|
|
cset = find_existing_css_set(old_cset, cgrp, template);
|
|
|
|
if (cset)
|
|
|
|
get_css_set(cset);
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2013-06-13 04:04:49 +00:00
|
|
|
if (cset)
|
|
|
|
return cset;
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2013-06-13 04:04:51 +00:00
|
|
|
cset = kzalloc(sizeof(*cset), GFP_KERNEL);
|
2013-06-13 04:04:49 +00:00
|
|
|
if (!cset)
|
2007-10-19 06:39:36 +00:00
|
|
|
return NULL;
|
|
|
|
|
2013-06-13 04:04:50 +00:00
|
|
|
/* Allocate all the cgrp_cset_link objects that we'll need */
|
2013-06-24 22:21:47 +00:00
|
|
|
if (allocate_cgrp_cset_links(cgroup_root_count, &tmp_links) < 0) {
|
2013-06-13 04:04:49 +00:00
|
|
|
kfree(cset);
|
2007-10-19 06:39:36 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2017-03-08 08:00:40 +00:00
|
|
|
refcount_set(&cset->refcount, 1);
|
2017-05-15 13:34:02 +00:00
|
|
|
cset->dom_cset = cset;
|
2013-06-13 04:04:49 +00:00
|
|
|
INIT_LIST_HEAD(&cset->tasks);
|
cgroup: add css_set->mg_tasks
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, we're going to use task->cg_list during
migration too. Instead of building a separate array, target tasks
will be linked into a dedicated migration list_head on the owning
css_set. Tasks on the migration list are treated the same as tasks on
the usual tasks list; however, being on a separate list allows cgroup
migration code path to keep track of the target tasks by simply
keeping the list of css_sets with tasks being migrated, making
unpredictable dynamic allocation unnecessary.
In prepartion of such migration path update, this patch introduces
css_set->mg_tasks list and updates css_set task iterations so that
they walk both css_set->tasks and ->mg_tasks. Note that ->mg_tasks
isn't used yet.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:01 +00:00
|
|
|
INIT_LIST_HEAD(&cset->mg_tasks);
|
2019-05-31 17:38:58 +00:00
|
|
|
INIT_LIST_HEAD(&cset->dying_tasks);
|
2015-10-15 20:41:52 +00:00
|
|
|
INIT_LIST_HEAD(&cset->task_iters);
|
2017-05-15 13:34:02 +00:00
|
|
|
INIT_LIST_HEAD(&cset->threaded_csets);
|
2013-06-13 04:04:49 +00:00
|
|
|
INIT_HLIST_NODE(&cset->hlist);
|
2016-12-27 19:49:05 +00:00
|
|
|
INIT_LIST_HEAD(&cset->cgrp_links);
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
INIT_LIST_HEAD(&cset->mg_src_preload_node);
|
|
|
|
INIT_LIST_HEAD(&cset->mg_dst_preload_node);
|
2016-12-27 19:49:05 +00:00
|
|
|
INIT_LIST_HEAD(&cset->mg_node);
|
2007-10-19 06:39:36 +00:00
|
|
|
|
|
|
|
/* Copy the set of subsystem state objects generated in
|
|
|
|
* find_existing_css_set() */
|
2013-06-13 04:04:49 +00:00
|
|
|
memcpy(cset->subsys, template, sizeof(cset->subsys));
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2007-10-19 06:39:36 +00:00
|
|
|
/* Add reference counts and links from the new css_set. */
|
2013-06-13 04:04:50 +00:00
|
|
|
list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) {
|
2009-09-23 22:56:22 +00:00
|
|
|
struct cgroup *c = link->cgrp;
|
2013-06-13 04:04:50 +00:00
|
|
|
|
2009-09-23 22:56:22 +00:00
|
|
|
if (c->root == cgrp->root)
|
|
|
|
c = cgrp;
|
2013-06-13 04:04:50 +00:00
|
|
|
link_css_set(&tmp_links, cset, c);
|
2009-09-23 22:56:22 +00:00
|
|
|
}
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2013-06-13 04:04:50 +00:00
|
|
|
BUG_ON(!list_empty(&tmp_links));
|
2007-10-19 06:39:36 +00:00
|
|
|
|
|
|
|
css_set_count++;
|
2008-04-29 08:00:11 +00:00
|
|
|
|
2014-04-23 15:13:15 +00:00
|
|
|
/* Add @cset to the hash table */
|
2013-06-13 04:04:49 +00:00
|
|
|
key = css_set_hash(cset->subsys);
|
|
|
|
hash_add(css_set_table, &cset->hlist, key);
|
2008-04-29 08:00:11 +00:00
|
|
|
|
2015-11-23 19:55:41 +00:00
|
|
|
for_each_subsys(ss, ssid) {
|
|
|
|
struct cgroup_subsys_state *css = cset->subsys[ssid];
|
|
|
|
|
2014-04-23 15:13:15 +00:00
|
|
|
list_add_tail(&cset->e_cset_node[ssid],
|
2015-11-23 19:55:41 +00:00
|
|
|
&css->cgroup->e_csets[ssid]);
|
|
|
|
css_get(css);
|
|
|
|
}
|
2014-04-23 15:13:15 +00:00
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2017-05-15 13:34:02 +00:00
|
|
|
/*
|
|
|
|
* If @cset should be threaded, look up the matching dom_cset and
|
|
|
|
* link them up. We first fully initialize @cset then look for the
|
|
|
|
* dom_cset. It's simpler this way and safe as @cset is guaranteed
|
|
|
|
* to stay empty until we return.
|
|
|
|
*/
|
|
|
|
if (cgroup_is_threaded(cset->dfl_cgrp)) {
|
|
|
|
struct css_set *dcset;
|
|
|
|
|
|
|
|
dcset = find_css_set(cset, cset->dfl_cgrp->dom_cgrp);
|
|
|
|
if (!dcset) {
|
|
|
|
put_css_set(cset);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
cset->dom_cset = dcset;
|
|
|
|
list_add_tail(&cset->threaded_csets_node,
|
|
|
|
&dcset->threaded_csets);
|
|
|
|
spin_unlock_irq(&css_set_lock);
|
|
|
|
}
|
|
|
|
|
2013-06-13 04:04:49 +00:00
|
|
|
return cset;
|
2007-10-19 06:39:33 +00:00
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:06 +00:00
|
|
|
struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root)
|
2009-09-23 22:56:22 +00:00
|
|
|
{
|
2022-02-22 07:07:13 +00:00
|
|
|
struct cgroup *root_cgrp = kernfs_root_to_node(kf_root)->priv;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
|
2014-03-19 14:23:54 +00:00
|
|
|
return root_cgrp->root;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
}
|
|
|
|
|
2022-07-23 14:28:28 +00:00
|
|
|
void cgroup_favor_dynmods(struct cgroup_root *root, bool favor)
|
|
|
|
{
|
|
|
|
bool favoring = root->flags & CGRP_ROOT_FAVOR_DYNMODS;
|
|
|
|
|
|
|
|
/* see the comment above CGRP_ROOT_FAVOR_DYNMODS definition */
|
|
|
|
if (favor && !favoring) {
|
|
|
|
rcu_sync_enter(&cgroup_threadgroup_rwsem.rss);
|
|
|
|
root->flags |= CGRP_ROOT_FAVOR_DYNMODS;
|
|
|
|
} else if (!favor && favoring) {
|
|
|
|
rcu_sync_exit(&cgroup_threadgroup_rwsem.rss);
|
|
|
|
root->flags &= ~CGRP_ROOT_FAVOR_DYNMODS;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-03-19 14:23:54 +00:00
|
|
|
static int cgroup_init_root_id(struct cgroup_root *root)
|
2014-02-11 16:52:49 +00:00
|
|
|
{
|
|
|
|
int id;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2014-03-19 14:23:53 +00:00
|
|
|
id = idr_alloc_cyclic(&cgroup_hierarchy_idr, root, 0, 0, GFP_KERNEL);
|
2014-02-11 16:52:49 +00:00
|
|
|
if (id < 0)
|
|
|
|
return id;
|
|
|
|
|
|
|
|
root->hierarchy_id = id;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-03-19 14:23:54 +00:00
|
|
|
static void cgroup_exit_root_id(struct cgroup_root *root)
|
2014-02-11 16:52:49 +00:00
|
|
|
{
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2016-06-17 16:23:59 +00:00
|
|
|
idr_remove(&cgroup_hierarchy_idr, root->hierarchy_id);
|
2014-02-11 16:52:49 +00:00
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:08 +00:00
|
|
|
void cgroup_free_root(struct cgroup_root *root)
|
2014-02-11 16:52:49 +00:00
|
|
|
{
|
2023-10-29 06:14:29 +00:00
|
|
|
kfree_rcu(root, rcu);
|
2014-02-11 16:52:49 +00:00
|
|
|
}
|
|
|
|
|
2014-03-19 14:23:54 +00:00
|
|
|
static void cgroup_destroy_root(struct cgroup_root *root)
|
2014-02-11 16:52:49 +00:00
|
|
|
{
|
2014-03-19 14:23:54 +00:00
|
|
|
struct cgroup *cgrp = &root->cgrp;
|
2014-02-11 16:52:49 +00:00
|
|
|
struct cgrp_cset_link *link, *tmp_link;
|
|
|
|
|
2016-08-10 15:23:44 +00:00
|
|
|
trace_cgroup_destroy_root(root);
|
|
|
|
|
2016-03-03 14:58:01 +00:00
|
|
|
cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
|
2014-02-11 16:52:49 +00:00
|
|
|
|
2014-02-12 14:29:50 +00:00
|
|
|
BUG_ON(atomic_read(&root->nr_cgrps));
|
2014-05-16 17:22:48 +00:00
|
|
|
BUG_ON(!list_empty(&cgrp->self.children));
|
2014-02-11 16:52:49 +00:00
|
|
|
|
|
|
|
/* Rebind all subsystems back to the default hierarchy */
|
2016-03-03 14:58:01 +00:00
|
|
|
WARN_ON(rebind_subsystems(&cgrp_dfl_root, root->subsys_mask));
|
2009-09-23 22:56:22 +00:00
|
|
|
|
|
|
|
/*
|
2014-02-11 16:52:49 +00:00
|
|
|
* Release all the links from cset_links to this hierarchy's
|
|
|
|
* root cgroup
|
2009-09-23 22:56:22 +00:00
|
|
|
*/
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2014-02-11 16:52:49 +00:00
|
|
|
|
|
|
|
list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) {
|
|
|
|
list_del(&link->cset_link);
|
|
|
|
list_del(&link->cgrp_link);
|
|
|
|
kfree(link);
|
|
|
|
}
|
2015-10-15 20:41:53 +00:00
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2014-02-11 16:52:49 +00:00
|
|
|
|
2023-10-29 06:14:28 +00:00
|
|
|
WARN_ON_ONCE(list_empty(&root->root_list));
|
2023-10-29 06:14:29 +00:00
|
|
|
list_del_rcu(&root->root_list);
|
2023-10-29 06:14:28 +00:00
|
|
|
cgroup_root_count--;
|
2014-02-11 16:52:49 +00:00
|
|
|
|
2023-09-27 14:25:40 +00:00
|
|
|
if (!have_favordynmods)
|
|
|
|
cgroup_favor_dynmods(root, false);
|
|
|
|
|
2014-02-11 16:52:49 +00:00
|
|
|
cgroup_exit_root_id(root);
|
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2014-02-11 16:52:49 +00:00
|
|
|
|
2021-04-30 05:56:20 +00:00
|
|
|
cgroup_rstat_exit(cgrp);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
kernfs_destroy_root(root->kf_root);
|
2014-02-11 16:52:49 +00:00
|
|
|
cgroup_free_root(root);
|
|
|
|
}
|
|
|
|
|
2022-10-10 08:29:18 +00:00
|
|
|
/*
|
|
|
|
* Returned cgroup is without refcount but it's valid as long as cset pins it.
|
|
|
|
*/
|
2022-06-16 10:38:30 +00:00
|
|
|
static inline struct cgroup *__cset_cgroup_from_root(struct css_set *cset,
|
|
|
|
struct cgroup_root *root)
|
|
|
|
{
|
|
|
|
struct cgroup *res_cgroup = NULL;
|
|
|
|
|
|
|
|
if (cset == &init_css_set) {
|
|
|
|
res_cgroup = &root->cgrp;
|
|
|
|
} else if (root == &cgrp_dfl_root) {
|
|
|
|
res_cgroup = cset->dfl_cgrp;
|
|
|
|
} else {
|
|
|
|
struct cgrp_cset_link *link;
|
2022-10-10 08:29:18 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2022-06-16 10:38:30 +00:00
|
|
|
|
|
|
|
list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
|
|
|
|
struct cgroup *c = link->cgrp;
|
|
|
|
|
|
|
|
if (c->root == root) {
|
|
|
|
res_cgroup = c;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-10-29 06:14:29 +00:00
|
|
|
/*
|
|
|
|
* If cgroup_mutex is not held, the cgrp_cset_link will be freed
|
|
|
|
* before we remove the cgroup root from the root_list. Consequently,
|
|
|
|
* when accessing a cgroup root, the cset_link may have already been
|
|
|
|
* freed, resulting in a NULL res_cgroup. However, by holding the
|
|
|
|
* cgroup_mutex, we ensure that res_cgroup can't be NULL.
|
|
|
|
* If we don't hold cgroup_mutex in the caller, we must do the NULL
|
|
|
|
* check.
|
|
|
|
*/
|
2022-06-16 10:38:30 +00:00
|
|
|
return res_cgroup;
|
|
|
|
}
|
|
|
|
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
/*
|
|
|
|
* look up cgroup associated with current task's cgroup namespace on the
|
|
|
|
* specified hierarchy
|
|
|
|
*/
|
|
|
|
static struct cgroup *
|
|
|
|
current_cgns_cgroup_from_root(struct cgroup_root *root)
|
|
|
|
{
|
|
|
|
struct cgroup *res = NULL;
|
|
|
|
struct css_set *cset;
|
|
|
|
|
|
|
|
lockdep_assert_held(&css_set_lock);
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
cset = current->nsproxy->cgroup_ns->root_cset;
|
2022-06-16 10:38:30 +00:00
|
|
|
res = __cset_cgroup_from_root(cset, root);
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2023-10-29 06:14:31 +00:00
|
|
|
/*
|
|
|
|
* The namespace_sem is held by current, so the root cgroup can't
|
|
|
|
* be umounted. Therefore, we can ensure that the res is non-NULL.
|
|
|
|
*/
|
|
|
|
WARN_ON_ONCE(!res);
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
2022-10-10 08:29:18 +00:00
|
|
|
/*
|
|
|
|
* Look up cgroup associated with current task's cgroup namespace on the default
|
|
|
|
* hierarchy.
|
|
|
|
*
|
|
|
|
* Unlike current_cgns_cgroup_from_root(), this doesn't need locks:
|
|
|
|
* - Internal rcu_read_lock is unnecessary because we don't dereference any rcu
|
|
|
|
* pointers.
|
|
|
|
* - css_set_lock is not needed because we just read cset->dfl_cgrp.
|
|
|
|
* - As a bonus returned cgrp is pinned with the current because it cannot
|
|
|
|
* switch cgroup_ns asynchronously.
|
|
|
|
*/
|
|
|
|
static struct cgroup *current_cgns_cgroup_dfl(void)
|
|
|
|
{
|
|
|
|
struct css_set *cset;
|
|
|
|
|
2023-03-14 21:59:49 +00:00
|
|
|
if (current->nsproxy) {
|
|
|
|
cset = current->nsproxy->cgroup_ns->root_cset;
|
|
|
|
return __cset_cgroup_from_root(cset, &cgrp_dfl_root);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* NOTE: This function may be called from bpf_cgroup_from_id()
|
|
|
|
* on a task which has already passed exit_task_namespaces() and
|
|
|
|
* nsproxy == NULL. Fall back to cgrp_dfl_root which will make all
|
|
|
|
* cgroups visible for lookups.
|
|
|
|
*/
|
|
|
|
return &cgrp_dfl_root.cgrp;
|
|
|
|
}
|
2022-10-10 08:29:18 +00:00
|
|
|
}
|
|
|
|
|
2014-02-25 15:04:02 +00:00
|
|
|
/* look up cgroup associated with given css_set on the specified hierarchy */
|
|
|
|
static struct cgroup *cset_cgroup_from_root(struct css_set *cset,
|
2014-03-19 14:23:54 +00:00
|
|
|
struct cgroup_root *root)
|
2009-09-23 22:56:22 +00:00
|
|
|
{
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem
Currently there are two ways to walk tasks of a cgroup -
css_task_iter_start/next/end() and css_scan_tasks(). The latter
builds on the former but allows blocking while iterating.
Unfortunately, the way css_scan_tasks() is implemented is rather
nasty, it uses a priority heap of pointers to extract some number of
tasks in task creation order and loops over them invoking the callback
and repeats that until it reaches the end. It requires either
preallocated heap or may fail under memory pressure, while unlikely to
be problematic, the complexity is O(N^2), and in general just nasty.
We're gonna convert all css_scan_users() to
css_task_iter_start/next/end() and remove css_scan_users(). As
css_scan_tasks() users may block, let's convert css_set_lock to a
rwsem so that tasks can block during css_task_iter_*() is in progress.
While this does increase the chance of possible deadlock scenarios,
given the current usage, the probability is relatively low, and even
if that happens, the right thing to do is updating the iteration in
the similar way to css iterators so that it can handle blocking.
Most conversions are trivial; however, task_cgroup_path() now expects
to be called with css_set_rwsem locked instead of locking itself.
This is because the function is called with RCU read lock held and
rwsem locking should nest outside RCU read lock.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-13 11:58:40 +00:00
|
|
|
|
2022-10-10 08:29:18 +00:00
|
|
|
return __cset_cgroup_from_root(cset, root);
|
2009-09-23 22:56:22 +00:00
|
|
|
}
|
|
|
|
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
/*
|
2014-02-25 15:04:02 +00:00
|
|
|
* Return the cgroup for "task" from the given hierarchy. Must be
|
2023-10-29 06:14:29 +00:00
|
|
|
* called with css_set_lock held to prevent task's groups from being modified.
|
|
|
|
* Must be called with either cgroup_mutex or rcu read lock to prevent the
|
|
|
|
* cgroup root from being destroyed.
|
2014-02-25 15:04:02 +00:00
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
struct cgroup *task_cgroup_from_root(struct task_struct *task,
|
|
|
|
struct cgroup_root *root)
|
2014-02-25 15:04:02 +00:00
|
|
|
{
|
|
|
|
/*
|
2019-10-04 10:57:39 +00:00
|
|
|
* No need to lock the task - since we hold css_set_lock the
|
|
|
|
* task can't change groups.
|
2014-02-25 15:04:02 +00:00
|
|
|
*/
|
|
|
|
return cset_cgroup_from_root(task_css_set(task), root);
|
|
|
|
}
|
|
|
|
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
/*
|
|
|
|
* A task must hold cgroup_mutex to modify cgroups.
|
|
|
|
*
|
|
|
|
* Any task can increment and decrement the count field without lock.
|
|
|
|
* So in general, code holding cgroup_mutex can't rely on the count
|
|
|
|
* field not changing. However, if the count goes to zero, then only
|
2008-02-07 08:14:43 +00:00
|
|
|
* cgroup_attach_task() can increment it again. Because a count of zero
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
* means that no tasks are currently attached, therefore there is no
|
|
|
|
* way a task attached to that cgroup can fork (the other way to
|
|
|
|
* increment the count). So code holding cgroup_mutex can safely
|
|
|
|
* assume that if the count is zero, it will stay zero. Similarly, if
|
|
|
|
* a task holds cgroup_mutex on a cgroup with zero count, it
|
|
|
|
* knows that the cgroup won't be removed, as cgroup_rmdir()
|
|
|
|
* needs that mutex.
|
|
|
|
*
|
|
|
|
* A cgroup can only be deleted if both its 'count' of using tasks
|
|
|
|
* is zero, and its list of 'children' cgroups is empty. Since all
|
|
|
|
* tasks in the system use _some_ cgroup, and since there is always at
|
2014-03-19 14:23:54 +00:00
|
|
|
* least one task in the system (init, pid == 1), therefore, root cgroup
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
* always has either children cgroups and/or using tasks. So we don't
|
2014-03-19 14:23:54 +00:00
|
|
|
* need a special hack to ensure that root cgroup cannot be deleted.
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
*
|
|
|
|
* P.S. One more locking exception. RCU is used to guard the
|
2008-02-07 08:14:43 +00:00
|
|
|
* update of a tasks cgroup pointer by cgroup_attach_task()
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
*/
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
static struct kernfs_syscall_ops cgroup_kf_syscall_ops;
|
2007-10-19 06:39:35 +00:00
|
|
|
|
2019-06-10 09:35:41 +00:00
|
|
|
static char *cgroup_file_name(struct cgroup *cgrp, const struct cftype *cft,
|
|
|
|
char *buf)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2015-08-18 20:58:16 +00:00
|
|
|
struct cgroup_subsys *ss = cft->ss;
|
|
|
|
|
2014-02-11 16:52:48 +00:00
|
|
|
if (cft->ss && !(cft->flags & CFTYPE_NO_PREFIX) &&
|
2018-11-13 20:06:41 +00:00
|
|
|
!(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) {
|
|
|
|
const char *dbg = (cft->flags & CFTYPE_DEBUG) ? ".__DEBUG__." : "";
|
|
|
|
|
|
|
|
snprintf(buf, CGROUP_FILE_NAME_MAX, "%s%s.%s",
|
|
|
|
dbg, cgroup_on_dfl(cgrp) ? ss->name : ss->legacy_name,
|
2019-06-10 09:35:41 +00:00
|
|
|
cft->name);
|
2018-11-13 20:06:41 +00:00
|
|
|
} else {
|
2019-06-10 09:35:41 +00:00
|
|
|
strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
|
2018-11-13 20:06:41 +00:00
|
|
|
}
|
2014-02-11 16:52:48 +00:00
|
|
|
return buf;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2014-02-11 16:52:49 +00:00
|
|
|
/**
|
|
|
|
* cgroup_file_mode - deduce file mode of a control file
|
|
|
|
* @cft: the control file in question
|
|
|
|
*
|
2015-09-18 21:54:23 +00:00
|
|
|
* S_IRUGO for read, S_IWUSR for write.
|
2014-02-11 16:52:49 +00:00
|
|
|
*/
|
|
|
|
static umode_t cgroup_file_mode(const struct cftype *cft)
|
2013-03-01 07:01:56 +00:00
|
|
|
{
|
2014-02-11 16:52:49 +00:00
|
|
|
umode_t mode = 0;
|
2013-03-01 07:01:56 +00:00
|
|
|
|
2014-02-11 16:52:49 +00:00
|
|
|
if (cft->read_u64 || cft->read_s64 || cft->seq_show)
|
|
|
|
mode |= S_IRUGO;
|
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
if (cft->write_u64 || cft->write_s64 || cft->write) {
|
|
|
|
if (cft->flags & CFTYPE_WORLD_WRITABLE)
|
|
|
|
mode |= S_IWUGO;
|
|
|
|
else
|
|
|
|
mode |= S_IWUSR;
|
|
|
|
}
|
2014-02-11 16:52:49 +00:00
|
|
|
|
|
|
|
return mode;
|
2013-03-01 07:01:56 +00:00
|
|
|
}
|
|
|
|
|
2014-07-08 22:02:57 +00:00
|
|
|
/**
|
2016-02-23 03:25:46 +00:00
|
|
|
* cgroup_calc_subtree_ss_mask - calculate subtree_ss_mask
|
2014-11-18 07:49:50 +00:00
|
|
|
* @subtree_control: the new subtree_control mask to consider
|
2016-03-03 14:58:01 +00:00
|
|
|
* @this_ss_mask: available subsystems
|
2014-07-08 22:02:57 +00:00
|
|
|
*
|
|
|
|
* On the default hierarchy, a subsystem may request other subsystems to be
|
|
|
|
* enabled together through its ->depends_on mask. In such cases, more
|
|
|
|
* subsystems than specified in "cgroup.subtree_control" may be enabled.
|
|
|
|
*
|
2014-11-18 07:49:50 +00:00
|
|
|
* This function calculates which subsystems need to be enabled if
|
2016-03-03 14:58:01 +00:00
|
|
|
* @subtree_control is to be applied while restricted to @this_ss_mask.
|
2014-07-08 22:02:57 +00:00
|
|
|
*/
|
2016-03-03 14:58:01 +00:00
|
|
|
static u16 cgroup_calc_subtree_ss_mask(u16 subtree_control, u16 this_ss_mask)
|
2014-07-08 22:02:56 +00:00
|
|
|
{
|
2016-02-23 03:25:47 +00:00
|
|
|
u16 cur_ss_mask = subtree_control;
|
2014-07-08 22:02:57 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2016-03-08 16:51:26 +00:00
|
|
|
cur_ss_mask |= cgrp_dfl_implicit_ss_mask;
|
|
|
|
|
2014-07-08 22:02:57 +00:00
|
|
|
while (true) {
|
2016-02-23 03:25:47 +00:00
|
|
|
u16 new_ss_mask = cur_ss_mask;
|
2014-07-08 22:02:57 +00:00
|
|
|
|
2016-02-23 03:25:46 +00:00
|
|
|
do_each_subsys_mask(ss, ssid, cur_ss_mask) {
|
2015-06-06 00:02:15 +00:00
|
|
|
new_ss_mask |= ss->depends_on;
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
2014-07-08 22:02:57 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Mask out subsystems which aren't available. This can
|
|
|
|
* happen only if some depended-upon subsystems were bound
|
|
|
|
* to non-default hierarchies.
|
|
|
|
*/
|
2016-03-03 14:58:01 +00:00
|
|
|
new_ss_mask &= this_ss_mask;
|
2014-07-08 22:02:57 +00:00
|
|
|
|
|
|
|
if (new_ss_mask == cur_ss_mask)
|
|
|
|
break;
|
|
|
|
cur_ss_mask = new_ss_mask;
|
|
|
|
}
|
|
|
|
|
2014-11-18 07:49:50 +00:00
|
|
|
return cur_ss_mask;
|
|
|
|
}
|
|
|
|
|
2014-05-13 16:19:22 +00:00
|
|
|
/**
|
|
|
|
* cgroup_kn_unlock - unlocking helper for cgroup kernfs methods
|
|
|
|
* @kn: the kernfs_node being serviced
|
|
|
|
*
|
|
|
|
* This helper undoes cgroup_kn_lock_live() and should be invoked before
|
|
|
|
* the method finishes if locking succeeded. Note that once this function
|
|
|
|
* returns the cgroup returned by cgroup_kn_lock_live() may become
|
|
|
|
* inaccessible any time. If the caller intends to continue to access the
|
|
|
|
* cgroup, it should pin it before invoking this function.
|
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
void cgroup_kn_unlock(struct kernfs_node *kn)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2014-05-13 16:19:22 +00:00
|
|
|
struct cgroup *cgrp;
|
|
|
|
|
|
|
|
if (kernfs_type(kn) == KERNFS_DIR)
|
|
|
|
cgrp = kn->priv;
|
|
|
|
else
|
|
|
|
cgrp = kn->parent->priv;
|
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2014-05-13 16:19:22 +00:00
|
|
|
|
|
|
|
kernfs_unbreak_active_protection(kn);
|
|
|
|
cgroup_put(cgrp);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2014-05-13 16:19:22 +00:00
|
|
|
/**
|
|
|
|
* cgroup_kn_lock_live - locking helper for cgroup kernfs methods
|
|
|
|
* @kn: the kernfs_node being serviced
|
2016-03-03 14:58:00 +00:00
|
|
|
* @drain_offline: perform offline draining on the cgroup
|
2014-05-13 16:19:22 +00:00
|
|
|
*
|
|
|
|
* This helper is to be used by a cgroup kernfs method currently servicing
|
|
|
|
* @kn. It breaks the active protection, performs cgroup locking and
|
|
|
|
* verifies that the associated cgroup is alive. Returns the cgroup if
|
|
|
|
* alive; otherwise, %NULL. A successful return should be undone by a
|
2016-03-03 14:58:00 +00:00
|
|
|
* matching cgroup_kn_unlock() invocation. If @drain_offline is %true, the
|
|
|
|
* cgroup is drained of offlining csses before return.
|
2014-05-13 16:19:22 +00:00
|
|
|
*
|
|
|
|
* Any cgroup kernfs method implementation which requires locking the
|
|
|
|
* associated cgroup should use this helper. It avoids nesting cgroup
|
|
|
|
* locking under kernfs active protection and allows all kernfs operations
|
|
|
|
* including self-removal.
|
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline)
|
2012-04-01 19:09:56 +00:00
|
|
|
{
|
2014-05-13 16:19:22 +00:00
|
|
|
struct cgroup *cgrp;
|
|
|
|
|
|
|
|
if (kernfs_type(kn) == KERNFS_DIR)
|
|
|
|
cgrp = kn->priv;
|
|
|
|
else
|
|
|
|
cgrp = kn->parent->priv;
|
2012-04-01 19:09:56 +00:00
|
|
|
|
2013-01-21 10:18:33 +00:00
|
|
|
/*
|
2014-05-13 16:19:23 +00:00
|
|
|
* We're gonna grab cgroup_mutex which nests outside kernfs
|
2014-05-13 16:19:22 +00:00
|
|
|
* active_ref. cgroup liveliness check alone provides enough
|
|
|
|
* protection against removal. Ensure @cgrp stays accessible and
|
|
|
|
* break the active_ref protection.
|
2013-01-21 10:18:33 +00:00
|
|
|
*/
|
2014-09-04 06:43:38 +00:00
|
|
|
if (!cgroup_tryget(cgrp))
|
|
|
|
return NULL;
|
2014-05-13 16:19:22 +00:00
|
|
|
kernfs_break_active_protection(kn);
|
|
|
|
|
2016-03-03 14:58:00 +00:00
|
|
|
if (drain_offline)
|
|
|
|
cgroup_lock_and_drain_offline(cgrp);
|
|
|
|
else
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2012-04-01 19:09:56 +00:00
|
|
|
|
2014-05-13 16:19:22 +00:00
|
|
|
if (!cgroup_is_dead(cgrp))
|
|
|
|
return cgrp;
|
|
|
|
|
|
|
|
cgroup_kn_unlock(kn);
|
|
|
|
return NULL;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
2012-04-01 19:09:56 +00:00
|
|
|
|
2013-01-21 10:18:33 +00:00
|
|
|
static void cgroup_rm_file(struct cgroup *cgrp, const struct cftype *cft)
|
2012-04-01 19:09:56 +00:00
|
|
|
{
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
char name[CGROUP_FILE_NAME_MAX];
|
2012-04-01 19:09:56 +00:00
|
|
|
|
2014-05-13 16:19:23 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
2015-11-05 05:12:24 +00:00
|
|
|
|
|
|
|
if (cft->file_offset) {
|
|
|
|
struct cgroup_subsys_state *css = cgroup_css(cgrp, cft->ss);
|
|
|
|
struct cgroup_file *cfile = (void *)css + cft->file_offset;
|
|
|
|
|
|
|
|
spin_lock_irq(&cgroup_file_kn_lock);
|
|
|
|
cfile->kn = NULL;
|
|
|
|
spin_unlock_irq(&cgroup_file_kn_lock);
|
2018-04-26 21:29:04 +00:00
|
|
|
|
|
|
|
del_timer_sync(&cfile->notify_timer);
|
2015-11-05 05:12:24 +00:00
|
|
|
}
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
kernfs_remove_by_name(cgrp->kn, cgroup_file_name(cgrp, cft, name));
|
2012-04-01 19:09:56 +00:00
|
|
|
}
|
|
|
|
|
2012-08-23 20:53:29 +00:00
|
|
|
/**
|
2015-09-18 21:54:23 +00:00
|
|
|
* css_clear_dir - remove subsys files in a cgroup directory
|
2021-05-24 08:29:43 +00:00
|
|
|
* @css: target css
|
2012-08-23 20:53:29 +00:00
|
|
|
*/
|
2016-03-03 14:58:01 +00:00
|
|
|
static void css_clear_dir(struct cgroup_subsys_state *css)
|
2012-04-01 19:09:56 +00:00
|
|
|
{
|
2016-03-03 14:58:01 +00:00
|
|
|
struct cgroup *cgrp = css->cgroup;
|
2015-09-18 21:54:23 +00:00
|
|
|
struct cftype *cfts;
|
2012-04-01 19:09:56 +00:00
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
if (!(css->flags & CSS_VISIBLE))
|
|
|
|
return;
|
|
|
|
|
|
|
|
css->flags &= ~CSS_VISIBLE;
|
|
|
|
|
2018-04-26 21:29:04 +00:00
|
|
|
if (!css->ss) {
|
2022-09-06 19:38:55 +00:00
|
|
|
if (cgroup_on_dfl(cgrp)) {
|
|
|
|
cgroup_addrm_files(css, cgrp,
|
|
|
|
cgroup_base_files, false);
|
|
|
|
if (cgroup_psi_enabled())
|
|
|
|
cgroup_addrm_files(css, cgrp,
|
|
|
|
cgroup_psi_files, false);
|
|
|
|
} else {
|
|
|
|
cgroup_addrm_files(css, cgrp,
|
|
|
|
cgroup1_base_files, false);
|
|
|
|
}
|
2018-04-26 21:29:04 +00:00
|
|
|
} else {
|
|
|
|
list_for_each_entry(cfts, &css->ss->cfts, node)
|
|
|
|
cgroup_addrm_files(css, cgrp, cfts, false);
|
|
|
|
}
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
/**
|
2015-09-18 21:54:23 +00:00
|
|
|
* css_populate_dir - create subsys files in a cgroup directory
|
|
|
|
* @css: target css
|
2015-09-18 21:54:23 +00:00
|
|
|
*
|
|
|
|
* On failure, no file is added.
|
|
|
|
*/
|
2016-03-03 14:58:01 +00:00
|
|
|
static int css_populate_dir(struct cgroup_subsys_state *css)
|
2015-09-18 21:54:23 +00:00
|
|
|
{
|
2016-03-03 14:58:01 +00:00
|
|
|
struct cgroup *cgrp = css->cgroup;
|
2015-09-18 21:54:23 +00:00
|
|
|
struct cftype *cfts, *failed_cfts;
|
|
|
|
int ret;
|
2015-09-18 21:54:23 +00:00
|
|
|
|
2023-07-17 14:39:23 +00:00
|
|
|
if (css->flags & CSS_VISIBLE)
|
2016-03-03 14:57:58 +00:00
|
|
|
return 0;
|
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
if (!css->ss) {
|
2022-09-06 19:38:55 +00:00
|
|
|
if (cgroup_on_dfl(cgrp)) {
|
2023-09-12 07:04:35 +00:00
|
|
|
ret = cgroup_addrm_files(css, cgrp,
|
2022-09-06 19:38:55 +00:00
|
|
|
cgroup_base_files, true);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (cgroup_psi_enabled()) {
|
2023-09-12 07:04:35 +00:00
|
|
|
ret = cgroup_addrm_files(css, cgrp,
|
2022-09-06 19:38:55 +00:00
|
|
|
cgroup_psi_files, true);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
} else {
|
2023-09-12 07:04:34 +00:00
|
|
|
ret = cgroup_addrm_files(css, cgrp,
|
|
|
|
cgroup1_base_files, true);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
2022-09-06 19:38:55 +00:00
|
|
|
}
|
2018-04-26 21:29:04 +00:00
|
|
|
} else {
|
|
|
|
list_for_each_entry(cfts, &css->ss->cfts, node) {
|
|
|
|
ret = cgroup_addrm_files(css, cgrp, cfts, true);
|
|
|
|
if (ret < 0) {
|
|
|
|
failed_cfts = cfts;
|
|
|
|
goto err;
|
|
|
|
}
|
2015-09-18 21:54:23 +00:00
|
|
|
}
|
|
|
|
}
|
2016-03-03 14:57:58 +00:00
|
|
|
|
|
|
|
css->flags |= CSS_VISIBLE;
|
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
return 0;
|
|
|
|
err:
|
2015-09-18 21:54:23 +00:00
|
|
|
list_for_each_entry(cfts, &css->ss->cfts, node) {
|
|
|
|
if (cfts == failed_cfts)
|
|
|
|
break;
|
|
|
|
cgroup_addrm_files(css, cgrp, cfts, false);
|
|
|
|
}
|
2015-09-18 21:54:23 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:06 +00:00
|
|
|
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2015-09-18 21:54:23 +00:00
|
|
|
struct cgroup *dcgrp = &dst_root->cgrp;
|
2013-06-25 18:53:37 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2023-06-10 09:26:43 +00:00
|
|
|
int ssid, ret;
|
2021-09-18 22:53:08 +00:00
|
|
|
u16 dfl_disable_ss_mask = 0;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
cgroup: introduce cgroup_tree_mutex
Currently cgroup uses combination of inode->i_mutex'es and
cgroup_mutex for synchronization. With the scheduled kernfs
conversion, i_mutex'es will be removed. Unfortunately, just using
cgroup_mutex isn't possible. All kernfs file and syscall operations,
most of which require grabbing cgroup_mutex, will be called with
kernfs active ref held and, if we try to perform kernfs removals under
cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the
target node.
Let's introduce a new outer mutex, cgroup_tree_mutex, which protects
stuff used during hierarchy changing operations - cftypes and all the
operations which may affect the cgroupfs. It also covers css
association and iteration. This allows cgroup_css(), for_each_css()
and other css iterators to be called under cgroup_tree_mutex. The new
mutex will nest above both kernfs's active ref protection and
cgroup_mutex. By protecting tree modifications with a separate outer
mutex, we can get rid of the forementioned deadlock condition.
Actual file additions and removals now require cgroup_tree_mutex
instead of cgroup_mutex. Currently, cgroup_tree_mutex is never used
without cgroup_mutex; however, we'll soon add hierarchy modification
sections which are only protected by cgroup_tree_mutex. In the
future, we might want to make the locking more granular by better
splitting the coverages of the two mutexes. For now, this should do.
v2: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-11 16:52:47 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2016-02-23 03:25:46 +00:00
|
|
|
do_each_subsys_mask(ss, ssid, ss_mask) {
|
2016-03-08 16:51:26 +00:00
|
|
|
/*
|
|
|
|
* If @ss has non-root csses attached to it, can't move.
|
|
|
|
* If @ss is an implicit controller, it is exempt from this
|
|
|
|
* rule and can be stolen.
|
|
|
|
*/
|
|
|
|
if (css_next_child(NULL, cgroup_css(&ss->root->cgrp, ss)) &&
|
|
|
|
!ss->implicit_on_dfl)
|
2014-02-08 15:36:58 +00:00
|
|
|
return -EBUSY;
|
2013-07-12 20:38:17 +00:00
|
|
|
|
2014-03-19 14:23:54 +00:00
|
|
|
/* can't move between two non-dummy roots either */
|
2014-04-23 15:13:16 +00:00
|
|
|
if (ss->root != &cgrp_dfl_root && dst_root != &cgrp_dfl_root)
|
2014-03-19 14:23:54 +00:00
|
|
|
return -EBUSY;
|
2021-09-18 22:53:08 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Collect ssid's that need to be disabled from default
|
|
|
|
* hierarchy.
|
|
|
|
*/
|
|
|
|
if (ss->root == &cgrp_dfl_root)
|
|
|
|
dfl_disable_ss_mask |= 1 << ssid;
|
|
|
|
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2021-09-18 22:53:08 +00:00
|
|
|
if (dfl_disable_ss_mask) {
|
|
|
|
struct cgroup *scgrp = &cgrp_dfl_root.cgrp;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Controllers from default hierarchy that need to be rebound
|
|
|
|
* are all disabled together in one go.
|
|
|
|
*/
|
|
|
|
cgrp_dfl_root.subsys_mask &= ~dfl_disable_ss_mask;
|
|
|
|
WARN_ON(cgroup_apply_control(scgrp));
|
|
|
|
cgroup_finalize_control(scgrp, 0);
|
|
|
|
}
|
|
|
|
|
2016-02-23 03:25:46 +00:00
|
|
|
do_each_subsys_mask(ss, ssid, ss_mask) {
|
2015-09-18 21:54:23 +00:00
|
|
|
struct cgroup_root *src_root = ss->root;
|
|
|
|
struct cgroup *scgrp = &src_root->cgrp;
|
|
|
|
struct cgroup_subsys_state *css = cgroup_css(scgrp, ss);
|
2023-06-10 09:26:43 +00:00
|
|
|
struct css_set *cset, *cset_pos;
|
|
|
|
struct css_task_iter *it;
|
2013-06-24 22:21:47 +00:00
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
WARN_ON(!css || cgroup_css(dcgrp, ss));
|
2013-06-24 22:21:47 +00:00
|
|
|
|
2021-09-18 22:53:08 +00:00
|
|
|
if (src_root != &cgrp_dfl_root) {
|
|
|
|
/* disable from the source */
|
|
|
|
src_root->subsys_mask &= ~(1 << ssid);
|
|
|
|
WARN_ON(cgroup_apply_control(scgrp));
|
|
|
|
cgroup_finalize_control(scgrp, 0);
|
|
|
|
}
|
2015-09-18 21:54:23 +00:00
|
|
|
|
2016-03-03 14:58:01 +00:00
|
|
|
/* rebind */
|
2015-09-18 21:54:23 +00:00
|
|
|
RCU_INIT_POINTER(scgrp->subsys[ssid], NULL);
|
|
|
|
rcu_assign_pointer(dcgrp->subsys[ssid], css);
|
2014-03-19 14:23:54 +00:00
|
|
|
ss->root = dst_root;
|
2015-09-18 21:54:23 +00:00
|
|
|
css->cgroup = dcgrp;
|
2013-08-13 15:01:55 +00:00
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2023-06-10 09:26:43 +00:00
|
|
|
WARN_ON(!list_empty(&dcgrp->e_csets[ss->id]));
|
|
|
|
list_for_each_entry_safe(cset, cset_pos, &scgrp->e_csets[ss->id],
|
|
|
|
e_cset_node[ss->id]) {
|
2014-04-23 15:13:15 +00:00
|
|
|
list_move_tail(&cset->e_cset_node[ss->id],
|
2015-09-18 21:54:23 +00:00
|
|
|
&dcgrp->e_csets[ss->id]);
|
2023-06-10 09:26:43 +00:00
|
|
|
/*
|
|
|
|
* all css_sets of scgrp together in same order to dcgrp,
|
|
|
|
* patch in-flight iterators to preserve correct iteration.
|
|
|
|
* since the iterator is always advanced right away and
|
|
|
|
* finished when it->cset_pos meets it->cset_head, so only
|
|
|
|
* update it->cset_head is enough here.
|
|
|
|
*/
|
|
|
|
list_for_each_entry(it, &cset->task_iters, iters_node)
|
|
|
|
if (it->cset_head == &scgrp->e_csets[ss->id])
|
|
|
|
it->cset_head = &dcgrp->e_csets[ss->id];
|
|
|
|
}
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2014-04-23 15:13:15 +00:00
|
|
|
|
2021-04-30 05:56:20 +00:00
|
|
|
if (ss->css_rstat_flush) {
|
|
|
|
list_del_rcu(&css->rstat_css_node);
|
2022-08-23 05:41:46 +00:00
|
|
|
synchronize_rcu();
|
2021-04-30 05:56:20 +00:00
|
|
|
list_add_rcu(&css->rstat_css_node,
|
|
|
|
&dcgrp->rstat_css_list);
|
|
|
|
}
|
|
|
|
|
2014-04-23 15:13:16 +00:00
|
|
|
/* default hierarchy doesn't enable controllers by default */
|
2014-04-23 15:13:14 +00:00
|
|
|
dst_root->subsys_mask |= 1 << ssid;
|
2015-09-18 15:56:28 +00:00
|
|
|
if (dst_root == &cgrp_dfl_root) {
|
|
|
|
static_branch_enable(cgroup_subsys_on_dfl_key[ssid]);
|
|
|
|
} else {
|
2015-09-18 21:54:23 +00:00
|
|
|
dcgrp->subtree_control |= 1 << ssid;
|
2015-09-18 15:56:28 +00:00
|
|
|
static_branch_disable(cgroup_subsys_on_dfl_key[ssid]);
|
2014-07-08 22:02:56 +00:00
|
|
|
}
|
2013-06-24 22:21:47 +00:00
|
|
|
|
2016-03-03 14:58:01 +00:00
|
|
|
ret = cgroup_apply_control(dcgrp);
|
|
|
|
if (ret)
|
|
|
|
pr_warn("partial failure to rebind %s controller (err=%d)\n",
|
|
|
|
ss->name, ret);
|
|
|
|
|
2014-03-19 14:23:54 +00:00
|
|
|
if (ss->bind)
|
|
|
|
ss->bind(css);
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
kernfs_activate(dcgrp->kn);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:08 +00:00
|
|
|
int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
|
|
|
|
struct kernfs_root *kf_root)
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
{
|
2016-05-12 09:34:38 +00:00
|
|
|
int len = 0;
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
char *buf = NULL;
|
|
|
|
struct cgroup_root *kf_cgroot = cgroup_root_from_kf(kf_root);
|
|
|
|
struct cgroup *ns_cgroup;
|
|
|
|
|
|
|
|
buf = kmalloc(PATH_MAX, GFP_KERNEL);
|
|
|
|
if (!buf)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
ns_cgroup = current_cgns_cgroup_from_root(kf_cgroot);
|
|
|
|
len = kernfs_path_from_node(kf_node, ns_cgroup->kn, buf, PATH_MAX);
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
|
2023-12-12 21:17:40 +00:00
|
|
|
if (len == -E2BIG)
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
len = -ERANGE;
|
|
|
|
else if (len > 0) {
|
|
|
|
seq_escape(sf, buf, " \t\n\\");
|
|
|
|
len = 0;
|
|
|
|
}
|
|
|
|
kfree(buf);
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
2019-01-17 05:22:58 +00:00
|
|
|
enum cgroup2_param {
|
|
|
|
Opt_nsdelegate,
|
cgroup: remove "no" prefixed mount options
30312730bd02 ("cgroup: Add "no" prefixed mount options") added "no" prefixed
mount options to allow turning them off and 6a010a49b63a ("cgroup: Make
!percpu threadgroup_rwsem operations optional") added one more "no" prefixed
mount option. However, Michal pointed out that the "no" prefixed options
aren't necessary in allowing mount options to be turned off:
# grep group /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0
# mount -o remount,nsdelegate,memory_recursiveprot none /sys/fs/cgroup
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,nsdelegate,memory_recursiveprot 0 0
Note that this is different from the remount behavior when the mount(1) is
invoked without the device argument - "none":
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
# mount -o remount,nsdelegate,memory_recursiveprot /sys/fs/cgroup
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
While a bit confusing, given that there is a way to turn off the options,
there's no reason to have the explicit "no" prefixed options. Let's remove
them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-07-27 17:54:55 +00:00
|
|
|
Opt_favordynmods,
|
mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour. The same confusion applies to other counters in this file when
debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
After this patch, events are propagated up the hierarchy:
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
[root@ktst ~]# systemd-run -p MemoryMax=1 true
Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 7
oom 1
oom_kill 1
As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set. However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.
akpm: this is a behaviour change, so Cc:stable. THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.
Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 05:30:22 +00:00
|
|
|
Opt_memory_localevents,
|
mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:07:07 +00:00
|
|
|
Opt_memory_recursiveprot,
|
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 18:46:28 +00:00
|
|
|
Opt_memory_hugetlb_accounting,
|
2019-01-17 05:22:58 +00:00
|
|
|
nr__cgroup2_params
|
|
|
|
};
|
2017-06-27 18:30:28 +00:00
|
|
|
|
2019-09-07 11:23:15 +00:00
|
|
|
static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
|
mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour. The same confusion applies to other counters in this file when
debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
After this patch, events are propagated up the hierarchy:
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
[root@ktst ~]# systemd-run -p MemoryMax=1 true
Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 7
oom 1
oom_kill 1
As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set. However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.
akpm: this is a behaviour change, so Cc:stable. THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.
Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 05:30:22 +00:00
|
|
|
fsparam_flag("nsdelegate", Opt_nsdelegate),
|
2022-07-23 14:28:28 +00:00
|
|
|
fsparam_flag("favordynmods", Opt_favordynmods),
|
mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour. The same confusion applies to other counters in this file when
debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
After this patch, events are propagated up the hierarchy:
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
[root@ktst ~]# systemd-run -p MemoryMax=1 true
Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 7
oom 1
oom_kill 1
As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set. However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.
akpm: this is a behaviour change, so Cc:stable. THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.
Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 05:30:22 +00:00
|
|
|
fsparam_flag("memory_localevents", Opt_memory_localevents),
|
mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:07:07 +00:00
|
|
|
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
|
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 18:46:28 +00:00
|
|
|
fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
|
2019-01-17 05:22:58 +00:00
|
|
|
{}
|
|
|
|
};
|
2017-06-27 18:30:28 +00:00
|
|
|
|
2019-01-17 05:22:58 +00:00
|
|
|
static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param)
|
|
|
|
{
|
|
|
|
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
|
|
|
|
struct fs_parse_result result;
|
|
|
|
int opt;
|
2017-06-27 18:30:28 +00:00
|
|
|
|
2019-09-07 11:23:15 +00:00
|
|
|
opt = fs_parse(fc, cgroup2_fs_parameters, param, &result);
|
2019-01-17 05:22:58 +00:00
|
|
|
if (opt < 0)
|
|
|
|
return opt;
|
2017-06-27 18:30:28 +00:00
|
|
|
|
2019-01-17 05:22:58 +00:00
|
|
|
switch (opt) {
|
|
|
|
case Opt_nsdelegate:
|
|
|
|
ctx->flags |= CGRP_ROOT_NS_DELEGATE;
|
|
|
|
return 0;
|
2022-07-23 14:28:28 +00:00
|
|
|
case Opt_favordynmods:
|
|
|
|
ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
|
|
|
|
return 0;
|
mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour. The same confusion applies to other counters in this file when
debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
After this patch, events are propagated up the hierarchy:
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
[root@ktst ~]# systemd-run -p MemoryMax=1 true
Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 7
oom 1
oom_kill 1
As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set. However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.
akpm: this is a behaviour change, so Cc:stable. THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.
Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 05:30:22 +00:00
|
|
|
case Opt_memory_localevents:
|
|
|
|
ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
|
|
|
|
return 0;
|
mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:07:07 +00:00
|
|
|
case Opt_memory_recursiveprot:
|
|
|
|
ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
|
|
|
|
return 0;
|
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 18:46:28 +00:00
|
|
|
case Opt_memory_hugetlb_accounting:
|
|
|
|
ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
|
|
|
|
return 0;
|
2019-01-17 05:22:58 +00:00
|
|
|
}
|
|
|
|
return -EINVAL;
|
2017-06-27 18:30:28 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void apply_cgroup_root_flags(unsigned int root_flags)
|
|
|
|
{
|
|
|
|
if (current->nsproxy->cgroup_ns == &init_cgroup_ns) {
|
|
|
|
if (root_flags & CGRP_ROOT_NS_DELEGATE)
|
|
|
|
cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
|
|
|
|
else
|
|
|
|
cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
|
mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour. The same confusion applies to other counters in this file when
debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
After this patch, events are propagated up the hierarchy:
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
[root@ktst ~]# systemd-run -p MemoryMax=1 true
Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 7
oom 1
oom_kill 1
As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set. However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.
akpm: this is a behaviour change, so Cc:stable. THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.
Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 05:30:22 +00:00
|
|
|
|
2022-07-23 14:28:28 +00:00
|
|
|
cgroup_favor_dynmods(&cgrp_dfl_root,
|
|
|
|
root_flags & CGRP_ROOT_FAVOR_DYNMODS);
|
|
|
|
|
mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour. The same confusion applies to other counters in this file when
debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
After this patch, events are propagated up the hierarchy:
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
[root@ktst ~]# systemd-run -p MemoryMax=1 true
Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 7
oom 1
oom_kill 1
As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set. However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.
akpm: this is a behaviour change, so Cc:stable. THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.
Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 05:30:22 +00:00
|
|
|
if (root_flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
|
|
|
|
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS;
|
|
|
|
else
|
|
|
|
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS;
|
mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:07:07 +00:00
|
|
|
|
|
|
|
if (root_flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
|
|
|
|
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT;
|
|
|
|
else
|
|
|
|
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT;
|
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 18:46:28 +00:00
|
|
|
|
|
|
|
if (root_flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
|
|
|
|
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
|
|
|
|
else
|
|
|
|
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
|
2017-06-27 18:30:28 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root)
|
|
|
|
{
|
|
|
|
if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
|
|
|
|
seq_puts(seq, ",nsdelegate");
|
2022-07-23 14:28:28 +00:00
|
|
|
if (cgrp_dfl_root.flags & CGRP_ROOT_FAVOR_DYNMODS)
|
|
|
|
seq_puts(seq, ",favordynmods");
|
mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour. The same confusion applies to other counters in this file when
debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
After this patch, events are propagated up the hierarchy:
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
[root@ktst ~]# systemd-run -p MemoryMax=1 true
Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 7
oom 1
oom_kill 1
As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set. However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.
akpm: this is a behaviour change, so Cc:stable. THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.
Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 05:30:22 +00:00
|
|
|
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS)
|
|
|
|
seq_puts(seq, ",memory_localevents");
|
mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:07:07 +00:00
|
|
|
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)
|
|
|
|
seq_puts(seq, ",memory_recursiveprot");
|
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 18:46:28 +00:00
|
|
|
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
|
|
|
|
seq_puts(seq, ",memory_hugetlb_accounting");
|
2017-06-27 18:30:28 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2019-01-05 05:38:03 +00:00
|
|
|
static int cgroup_reconfigure(struct fs_context *fc)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2019-01-05 05:38:03 +00:00
|
|
|
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
|
2017-06-27 18:30:28 +00:00
|
|
|
|
2019-01-17 04:42:38 +00:00
|
|
|
apply_cgroup_root_flags(ctx->flags);
|
2017-06-27 18:30:28 +00:00
|
|
|
return 0;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2008-10-19 03:28:04 +00:00
|
|
|
static void init_cgroup_housekeeping(struct cgroup *cgrp)
|
|
|
|
{
|
2014-04-23 15:13:15 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
|
|
|
|
2014-05-16 17:22:48 +00:00
|
|
|
INIT_LIST_HEAD(&cgrp->self.sibling);
|
|
|
|
INIT_LIST_HEAD(&cgrp->self.children);
|
2013-06-13 04:04:50 +00:00
|
|
|
INIT_LIST_HEAD(&cgrp->cset_links);
|
2009-09-23 22:56:27 +00:00
|
|
|
INIT_LIST_HEAD(&cgrp->pidlists);
|
|
|
|
mutex_init(&cgrp->pidlist_mutex);
|
2014-05-14 13:15:00 +00:00
|
|
|
cgrp->self.cgroup = cgrp;
|
2014-05-16 17:22:51 +00:00
|
|
|
cgrp->self.flags |= CSS_ONLINE;
|
2017-05-15 13:34:02 +00:00
|
|
|
cgrp->dom_cgrp = cgrp;
|
2017-07-28 17:28:44 +00:00
|
|
|
cgrp->max_descendants = INT_MAX;
|
|
|
|
cgrp->max_depth = INT_MAX;
|
2018-04-26 21:29:05 +00:00
|
|
|
INIT_LIST_HEAD(&cgrp->rstat_css_list);
|
2018-04-26 21:29:04 +00:00
|
|
|
prev_cputime_init(&cgrp->prev_cputime);
|
2014-04-23 15:13:15 +00:00
|
|
|
|
|
|
|
for_each_subsys(ss, ssid)
|
|
|
|
INIT_LIST_HEAD(&cgrp->e_csets[ssid]);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
|
|
|
|
init_waitqueue_head(&cgrp->offline_waitq);
|
2016-12-27 19:49:08 +00:00
|
|
|
INIT_WORK(&cgrp->release_agent_work, cgroup1_release_agent);
|
2008-10-19 03:28:04 +00:00
|
|
|
}
|
2009-09-23 22:56:19 +00:00
|
|
|
|
2019-01-17 07:25:51 +00:00
|
|
|
void init_cgroup_root(struct cgroup_fs_context *ctx)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2019-01-17 07:25:51 +00:00
|
|
|
struct cgroup_root *root = ctx->root;
|
2014-03-19 14:23:54 +00:00
|
|
|
struct cgroup *cgrp = &root->cgrp;
|
2012-04-01 19:09:54 +00:00
|
|
|
|
2023-10-29 06:14:29 +00:00
|
|
|
INIT_LIST_HEAD_RCU(&root->root_list);
|
2014-02-12 14:29:50 +00:00
|
|
|
atomic_set(&root->nr_cgrps, 1);
|
2007-10-19 06:40:44 +00:00
|
|
|
cgrp->root = root;
|
2008-10-19 03:28:04 +00:00
|
|
|
init_cgroup_housekeeping(cgrp);
|
2009-09-23 22:56:19 +00:00
|
|
|
|
2022-07-23 14:28:28 +00:00
|
|
|
/* DYNMODS must be modified through cgroup_favor_dynmods() */
|
|
|
|
root->flags = ctx->flags & ~CGRP_ROOT_FAVOR_DYNMODS;
|
2019-01-17 04:42:38 +00:00
|
|
|
if (ctx->release_agent)
|
|
|
|
strscpy(root->release_agent_path, ctx->release_agent, PATH_MAX);
|
|
|
|
if (ctx->name)
|
|
|
|
strscpy(root->name, ctx->name, MAX_CGROUP_ROOT_NAMELEN);
|
|
|
|
if (ctx->cpuset_clone_children)
|
2014-03-19 14:23:54 +00:00
|
|
|
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
|
2009-09-23 22:56:19 +00:00
|
|
|
}
|
|
|
|
|
2019-01-12 05:20:54 +00:00
|
|
|
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
|
2009-09-23 22:56:23 +00:00
|
|
|
{
|
2014-02-11 16:52:48 +00:00
|
|
|
LIST_HEAD(tmp_links);
|
2014-03-19 14:23:54 +00:00
|
|
|
struct cgroup *root_cgrp = &root->cgrp;
|
2016-12-27 19:49:07 +00:00
|
|
|
struct kernfs_syscall_ops *kf_sops;
|
2014-02-11 16:52:48 +00:00
|
|
|
struct css_set *cset;
|
|
|
|
int i, ret;
|
2009-09-23 22:56:23 +00:00
|
|
|
|
2014-02-11 16:52:48 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
2009-09-23 22:56:19 +00:00
|
|
|
|
2017-04-19 02:15:59 +00:00
|
|
|
ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release,
|
2019-01-12 05:20:54 +00:00
|
|
|
0, GFP_KERNEL);
|
2014-05-14 13:15:02 +00:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
|
2014-02-11 16:52:48 +00:00
|
|
|
/*
|
2015-10-15 20:41:53 +00:00
|
|
|
* We're accessing css_set_count without locking css_set_lock here,
|
2014-02-11 16:52:48 +00:00
|
|
|
* but that's OK - it can only be increased by someone holding
|
2016-03-03 14:58:01 +00:00
|
|
|
* cgroup_lock, and that's us. Later rebinding may disable
|
|
|
|
* controllers on the default hierarchy and thus create new csets,
|
|
|
|
* which can't be more than the existing ones. Allocate 2x.
|
2014-02-11 16:52:48 +00:00
|
|
|
*/
|
2016-03-03 14:58:01 +00:00
|
|
|
ret = allocate_cgrp_cset_links(2 * css_set_count, &tmp_links);
|
2014-02-11 16:52:48 +00:00
|
|
|
if (ret)
|
2014-05-14 13:15:02 +00:00
|
|
|
goto cancel_ref;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2014-03-19 14:23:53 +00:00
|
|
|
ret = cgroup_init_root_id(root);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
if (ret)
|
2014-05-14 13:15:02 +00:00
|
|
|
goto cancel_ref;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2016-12-27 19:49:07 +00:00
|
|
|
kf_sops = root == &cgrp_dfl_root ?
|
|
|
|
&cgroup_kf_syscall_ops : &cgroup1_kf_syscall_ops;
|
|
|
|
|
|
|
|
root->kf_root = kernfs_create_root(kf_sops,
|
2017-07-12 18:49:51 +00:00
|
|
|
KERNFS_ROOT_CREATE_DEACTIVATED |
|
cgroupfs: Support user xattrs
This patch turns on xattr support for cgroupfs. This is useful for
letting non-root owners of delegated subtrees attach metadata to
cgroups.
One use case is for subtree owners to tell a userspace out of memory
killer to bias away from killing specific subtrees.
Tests:
[/sys/fs/cgroup]# for i in $(seq 0 130); \
do setfattr workload.slice -n user.name$i -v wow; done
setfattr: workload.slice: No space left on device
setfattr: workload.slice: No space left on device
setfattr: workload.slice: No space left on device
[/sys/fs/cgroup]# for i in $(seq 0 130); \
do setfattr workload.slice --remove user.name$i; done
setfattr: workload.slice: No such attribute
setfattr: workload.slice: No such attribute
setfattr: workload.slice: No such attribute
[/sys/fs/cgroup]# for i in $(seq 0 130); \
do setfattr workload.slice -n user.name$i -v wow; done
setfattr: workload.slice: No space left on device
setfattr: workload.slice: No space left on device
setfattr: workload.slice: No space left on device
`seq 0 130` is inclusive, and 131 - 128 = 3, which is the number of
errors we expect to see.
[/data]# cat testxattr.c
#include <sys/types.h>
#include <sys/xattr.h>
#include <stdio.h>
#include <stdlib.h>
int main() {
char name[256];
char *buf = malloc(64 << 10);
if (!buf) {
perror("malloc");
return 1;
}
for (int i = 0; i < 4; ++i) {
snprintf(name, 256, "user.bigone%d", i);
if (setxattr("/sys/fs/cgroup/system.slice", name, buf,
64 << 10, 0)) {
printf("setxattr failed on iteration=%d\n", i);
return 1;
}
}
return 0;
}
[/data]# ./a.out
setxattr failed on iteration=2
[/data]# ./a.out
setxattr failed on iteration=0
[/sys/fs/cgroup]# setfattr -x user.bigone0 system.slice/
[/sys/fs/cgroup]# setfattr -x user.bigone1 system.slice/
[/data]# ./a.out
setxattr failed on iteration=2
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Acked-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2020-03-12 20:03:17 +00:00
|
|
|
KERNFS_ROOT_SUPPORT_EXPORTOP |
|
|
|
|
KERNFS_ROOT_SUPPORT_USER_XATTR,
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
root_cgrp);
|
|
|
|
if (IS_ERR(root->kf_root)) {
|
|
|
|
ret = PTR_ERR(root->kf_root);
|
|
|
|
goto exit_root_id;
|
|
|
|
}
|
2022-02-22 07:07:13 +00:00
|
|
|
root_cgrp->kn = kernfs_root_to_node(root->kf_root);
|
2019-11-14 22:46:51 +00:00
|
|
|
WARN_ON_ONCE(cgroup_ino(root_cgrp) != 1);
|
2022-07-29 23:10:16 +00:00
|
|
|
root_cgrp->ancestors[0] = root_cgrp;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2016-03-03 14:58:01 +00:00
|
|
|
ret = css_populate_dir(&root_cgrp->self);
|
2014-02-11 16:52:48 +00:00
|
|
|
if (ret)
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
goto destroy_root;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2021-04-30 05:56:20 +00:00
|
|
|
ret = cgroup_rstat_init(root_cgrp);
|
2014-02-11 16:52:48 +00:00
|
|
|
if (ret)
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
goto destroy_root;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2021-04-30 05:56:20 +00:00
|
|
|
ret = rebind_subsystems(root, ss_mask);
|
|
|
|
if (ret)
|
|
|
|
goto exit_stats;
|
|
|
|
|
2017-10-03 05:50:21 +00:00
|
|
|
ret = cgroup_bpf_inherit(root_cgrp);
|
|
|
|
WARN_ON_ONCE(ret);
|
|
|
|
|
2016-08-10 15:23:44 +00:00
|
|
|
trace_cgroup_setup_root(root);
|
|
|
|
|
2014-02-11 16:52:48 +00:00
|
|
|
/*
|
|
|
|
* There must be no failure case after here, since rebinding takes
|
|
|
|
* care of subsystems' refcounts, which are explicitly dropped in
|
|
|
|
* the failure exit path.
|
|
|
|
*/
|
2023-10-29 06:14:29 +00:00
|
|
|
list_add_rcu(&root->root_list, &cgroup_roots);
|
2014-02-11 16:52:48 +00:00
|
|
|
cgroup_root_count++;
|
2010-12-21 18:29:29 +00:00
|
|
|
|
2014-02-11 16:52:48 +00:00
|
|
|
/*
|
2014-03-19 14:23:54 +00:00
|
|
|
* Link the root cgroup in this hierarchy into all the css_set
|
2014-02-11 16:52:48 +00:00
|
|
|
* objects.
|
|
|
|
*/
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2015-10-15 20:41:49 +00:00
|
|
|
hash_for_each(css_set_table, i, cset, hlist) {
|
2014-02-11 16:52:48 +00:00
|
|
|
link_css_set(&tmp_links, cset, root_cgrp);
|
2015-10-15 20:41:49 +00:00
|
|
|
if (css_set_populated(cset))
|
|
|
|
cgroup_update_populated(root_cgrp, true);
|
|
|
|
}
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2014-05-16 17:22:48 +00:00
|
|
|
BUG_ON(!list_empty(&root_cgrp->self.children));
|
2014-02-12 14:29:50 +00:00
|
|
|
BUG_ON(atomic_read(&root->nr_cgrps) != 1);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2014-02-11 16:52:48 +00:00
|
|
|
ret = 0;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
goto out;
|
2014-02-11 16:52:48 +00:00
|
|
|
|
2021-04-30 05:56:20 +00:00
|
|
|
exit_stats:
|
|
|
|
cgroup_rstat_exit(root_cgrp);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
destroy_root:
|
|
|
|
kernfs_destroy_root(root->kf_root);
|
|
|
|
root->kf_root = NULL;
|
|
|
|
exit_root_id:
|
2014-02-11 16:52:48 +00:00
|
|
|
cgroup_exit_root_id(root);
|
2014-05-14 13:15:02 +00:00
|
|
|
cancel_ref:
|
2014-06-28 12:10:14 +00:00
|
|
|
percpu_ref_exit(&root_cgrp->self.refcnt);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
out:
|
2014-02-11 16:52:48 +00:00
|
|
|
free_cgrp_cset_links(&tmp_links);
|
|
|
|
return ret;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2019-01-17 15:14:26 +00:00
|
|
|
int cgroup_do_get_tree(struct fs_context *fc)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2019-01-17 07:44:07 +00:00
|
|
|
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
int ret;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
ctx->kfc.root = ctx->root->kf_root;
|
2019-01-17 15:14:26 +00:00
|
|
|
if (fc->fs_type == &cgroup2_fs_type)
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
ctx->kfc.magic = CGROUP2_SUPER_MAGIC;
|
2019-01-17 15:14:26 +00:00
|
|
|
else
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
ctx->kfc.magic = CGROUP_SUPER_MAGIC;
|
|
|
|
ret = kernfs_get_tree(fc);
|
2016-01-29 08:54:09 +00:00
|
|
|
|
2014-02-13 11:58:38 +00:00
|
|
|
/*
|
2016-12-27 19:49:07 +00:00
|
|
|
* In non-init cgroup namespace, instead of root cgroup's dentry,
|
|
|
|
* we return the dentry corresponding to the cgroupns->root_cgrp.
|
2014-02-13 11:58:38 +00:00
|
|
|
*/
|
2019-01-17 15:14:26 +00:00
|
|
|
if (!ret && ctx->ns != &init_cgroup_ns) {
|
2016-12-27 19:49:07 +00:00
|
|
|
struct dentry *nsdentry;
|
2019-01-17 07:44:07 +00:00
|
|
|
struct super_block *sb = fc->root->d_sb;
|
2016-12-27 19:49:07 +00:00
|
|
|
struct cgroup *cgrp;
|
cgroup: fix the retry path of cgroup_mount()
If we hit the retry path, we'll call parse_cgroupfs_options() again,
but the string we pass to it has been modified by the previous call
to this function.
This bug can be observed by:
# mount -t cgroup -o name=foo,cpuset xxx /mnt && umount /mnt && \
mount -t cgroup -o name=foo,cpuset xxx /mnt
mount: wrong fs type, bad option, bad superblock on xxx,
missing codepage or helper program, or other error
...
The second mount passed "name=foo,cpuset" to the parser, and then it
hit the retry path and call the parser again, but this time the string
passed to the parser is "name=foo".
To fix this, we avoid calling parse_cgroupfs_options() again in this
case.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-17 05:53:08 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2016-12-27 19:49:07 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
|
2019-01-17 15:14:26 +00:00
|
|
|
cgrp = cset_cgroup_from_root(ctx->ns->root_cset, ctx->root);
|
2016-12-27 19:49:07 +00:00
|
|
|
|
|
|
|
spin_unlock_irq(&css_set_lock);
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2016-12-27 19:49:07 +00:00
|
|
|
|
2019-01-06 16:41:29 +00:00
|
|
|
nsdentry = kernfs_node_dentry(cgrp->kn, sb);
|
2019-01-17 07:44:07 +00:00
|
|
|
dput(fc->root);
|
|
|
|
if (IS_ERR(nsdentry)) {
|
2019-01-06 16:41:29 +00:00
|
|
|
deactivate_locked_super(sb);
|
2019-11-10 16:53:27 +00:00
|
|
|
ret = PTR_ERR(nsdentry);
|
|
|
|
nsdentry = NULL;
|
2019-01-17 07:44:07 +00:00
|
|
|
}
|
2019-11-10 16:53:27 +00:00
|
|
|
fc->root = nsdentry;
|
2015-11-16 16:13:34 +00:00
|
|
|
}
|
|
|
|
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
if (!ctx->kfc.new_sb_created)
|
2019-01-17 07:44:07 +00:00
|
|
|
cgroup_put(&ctx->root->cgrp);
|
2016-12-27 19:49:07 +00:00
|
|
|
|
2019-01-17 07:44:07 +00:00
|
|
|
return ret;
|
2016-12-27 19:49:07 +00:00
|
|
|
}
|
|
|
|
|
2019-01-05 05:38:03 +00:00
|
|
|
/*
|
|
|
|
* Destroy a cgroup filesystem context.
|
|
|
|
*/
|
|
|
|
static void cgroup_fs_context_free(struct fs_context *fc)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2019-01-05 05:38:03 +00:00
|
|
|
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
|
|
|
|
|
2019-01-17 04:42:38 +00:00
|
|
|
kfree(ctx->name);
|
|
|
|
kfree(ctx->release_agent);
|
2019-01-17 15:14:26 +00:00
|
|
|
put_cgroup_ns(ctx->ns);
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
kernfs_free_fs_context(fc);
|
2019-01-05 05:38:03 +00:00
|
|
|
kfree(ctx);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int cgroup_get_tree(struct fs_context *fc)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2019-01-05 05:38:03 +00:00
|
|
|
struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
|
2017-06-27 18:30:28 +00:00
|
|
|
int ret;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2022-09-04 19:16:19 +00:00
|
|
|
WRITE_ONCE(cgrp_dfl_visible, true);
|
2019-01-05 05:38:03 +00:00
|
|
|
cgroup_get_live(&cgrp_dfl_root.cgrp);
|
2019-01-17 07:25:51 +00:00
|
|
|
ctx->root = &cgrp_dfl_root;
|
2016-01-29 08:54:09 +00:00
|
|
|
|
2019-01-17 15:14:26 +00:00
|
|
|
ret = cgroup_do_get_tree(fc);
|
2019-01-17 07:44:07 +00:00
|
|
|
if (!ret)
|
|
|
|
apply_cgroup_root_flags(ctx->flags);
|
|
|
|
return ret;
|
2019-01-05 05:38:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static const struct fs_context_operations cgroup_fs_context_ops = {
|
|
|
|
.free = cgroup_fs_context_free,
|
2019-01-17 05:22:58 +00:00
|
|
|
.parse_param = cgroup2_parse_param,
|
2019-01-05 05:38:03 +00:00
|
|
|
.get_tree = cgroup_get_tree,
|
|
|
|
.reconfigure = cgroup_reconfigure,
|
|
|
|
};
|
|
|
|
|
|
|
|
static const struct fs_context_operations cgroup1_fs_context_ops = {
|
|
|
|
.free = cgroup_fs_context_free,
|
2019-01-17 05:15:11 +00:00
|
|
|
.parse_param = cgroup1_parse_param,
|
2019-01-05 05:38:03 +00:00
|
|
|
.get_tree = cgroup1_get_tree,
|
|
|
|
.reconfigure = cgroup1_reconfigure,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
* Initialise the cgroup filesystem creation/reconfiguration context. Notably,
|
|
|
|
* we select the namespace we're going to use.
|
2019-01-05 05:38:03 +00:00
|
|
|
*/
|
|
|
|
static int cgroup_init_fs_context(struct fs_context *fc)
|
|
|
|
{
|
|
|
|
struct cgroup_fs_context *ctx;
|
|
|
|
|
|
|
|
ctx = kzalloc(sizeof(struct cgroup_fs_context), GFP_KERNEL);
|
|
|
|
if (!ctx)
|
|
|
|
return -ENOMEM;
|
2016-01-29 08:54:09 +00:00
|
|
|
|
2019-01-17 15:14:26 +00:00
|
|
|
ctx->ns = current->nsproxy->cgroup_ns;
|
|
|
|
get_cgroup_ns(ctx->ns);
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
fc->fs_private = &ctx->kfc;
|
2019-01-05 05:38:03 +00:00
|
|
|
if (fc->fs_type == &cgroup2_fs_type)
|
|
|
|
fc->ops = &cgroup_fs_context_ops;
|
|
|
|
else
|
|
|
|
fc->ops = &cgroup1_fs_context_ops;
|
2019-05-12 16:42:58 +00:00
|
|
|
put_user_ns(fc->user_ns);
|
kernfs, sysfs, cgroup, intel_rdt: Support fs_context
Make kernfs support superblock creation/mount/remount with fs_context.
This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
be made to support fs_context also.
Notes:
(1) A kernfs_fs_context struct is created to wrap fs_context and the
kernfs mount parameters are moved in here (or are in fs_context).
(2) kernfs_mount{,_ns}() are made into kernfs_get_tree(). The extra
namespace tag parameter is passed in the context if desired
(3) kernfs_free_fs_context() is provided as a destructor for the
kernfs_fs_context struct, but for the moment it does nothing except
get called in the right places.
(4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
pass, but possibly this should be done anyway in case someone wants to
add a parameter in future.
(5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
the cgroup v1 and v2 mount parameters are all moved there.
(6) cgroup1 parameter parsing error messages are now handled by invalf(),
which allows userspace to collect them directly.
(7) cgroup1 parameter cleanup is now done in the context destructor rather
than in the mount/get_tree and remount functions.
Weirdies:
(*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
but then uses the resulting pointer after dropping the locks. I'm
told this is okay and needs commenting.
(*) The cgroup refcount web. This really needs documenting.
(*) cgroup2 only has one root?
Add a suggestion from Thomas Gleixner in which the RDT enablement code is
placed into its own function.
[folded a leak fix from Andrey Vagin]
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Tejun Heo <tj@kernel.org>
cc: Li Zefan <lizefan@huawei.com>
cc: Johannes Weiner <hannes@cmpxchg.org>
cc: cgroups@vger.kernel.org
cc: fenghua.yu@intel.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01 23:07:26 +00:00
|
|
|
fc->user_ns = get_user_ns(ctx->ns->user_ns);
|
|
|
|
fc->global = true;
|
2022-07-23 14:28:28 +00:00
|
|
|
|
2023-09-27 14:25:40 +00:00
|
|
|
if (have_favordynmods)
|
|
|
|
ctx->flags |= CGRP_ROOT_FAVOR_DYNMODS;
|
|
|
|
|
2019-01-05 05:38:03 +00:00
|
|
|
return 0;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
}
|
2014-02-11 16:52:48 +00:00
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
static void cgroup_kill_sb(struct super_block *sb)
|
|
|
|
{
|
|
|
|
struct kernfs_root *kf_root = kernfs_root_from_sb(sb);
|
2014-03-19 14:23:54 +00:00
|
|
|
struct cgroup_root *root = cgroup_root_from_kf(kf_root);
|
2014-05-14 13:15:00 +00:00
|
|
|
|
2014-06-30 03:49:58 +00:00
|
|
|
/*
|
2019-01-12 05:20:54 +00:00
|
|
|
* If @root doesn't have any children, start killing it.
|
2014-05-14 13:15:02 +00:00
|
|
|
* This prevents new mounts by disabling percpu_ref_tryget_live().
|
2014-06-04 08:48:15 +00:00
|
|
|
*
|
|
|
|
* And don't kill the default root.
|
2014-06-30 03:49:58 +00:00
|
|
|
*/
|
2019-01-12 05:20:54 +00:00
|
|
|
if (list_empty(&root->cgrp.self.children) && root != &cgrp_dfl_root &&
|
2021-10-18 07:56:23 +00:00
|
|
|
!percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
|
|
|
|
cgroup_bpf_offline(&root->cgrp);
|
2014-05-14 13:15:02 +00:00
|
|
|
percpu_ref_kill(&root->cgrp.self.refcnt);
|
2021-10-18 07:56:23 +00:00
|
|
|
}
|
2019-01-12 05:20:54 +00:00
|
|
|
cgroup_put(&root->cgrp);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
kernfs_kill_sb(sb);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
2014-06-30 03:49:58 +00:00
|
|
|
|
2016-12-27 19:49:06 +00:00
|
|
|
struct file_system_type cgroup_fs_type = {
|
2019-01-17 05:15:11 +00:00
|
|
|
.name = "cgroup",
|
|
|
|
.init_fs_context = cgroup_init_fs_context,
|
2019-09-07 11:23:15 +00:00
|
|
|
.parameters = cgroup1_fs_parameters,
|
2019-01-17 05:15:11 +00:00
|
|
|
.kill_sb = cgroup_kill_sb,
|
|
|
|
.fs_flags = FS_USERNS_MOUNT,
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
};
|
2013-06-29 00:07:30 +00:00
|
|
|
|
2015-11-16 16:13:34 +00:00
|
|
|
static struct file_system_type cgroup2_fs_type = {
|
2019-01-17 05:22:58 +00:00
|
|
|
.name = "cgroup2",
|
|
|
|
.init_fs_context = cgroup_init_fs_context,
|
2019-09-07 11:23:15 +00:00
|
|
|
.parameters = cgroup2_fs_parameters,
|
2019-01-17 05:22:58 +00:00
|
|
|
.kill_sb = cgroup_kill_sb,
|
|
|
|
.fs_flags = FS_USERNS_MOUNT,
|
2015-11-16 16:13:34 +00:00
|
|
|
};
|
2013-06-29 00:07:30 +00:00
|
|
|
|
2019-05-13 16:33:22 +00:00
|
|
|
#ifdef CONFIG_CPUSETS
|
|
|
|
static const struct fs_context_operations cpuset_fs_context_ops = {
|
|
|
|
.get_tree = cgroup1_get_tree,
|
|
|
|
.free = cgroup_fs_context_free,
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is ugly, but preserves the userspace API for existing cpuset
|
|
|
|
* users. If someone tries to mount the "cpuset" filesystem, we
|
|
|
|
* silently switch it to mount "cgroup" instead
|
|
|
|
*/
|
|
|
|
static int cpuset_init_fs_context(struct fs_context *fc)
|
|
|
|
{
|
|
|
|
char *agent = kstrdup("/sbin/cpuset_release_agent", GFP_USER);
|
|
|
|
struct cgroup_fs_context *ctx;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
err = cgroup_init_fs_context(fc);
|
|
|
|
if (err) {
|
|
|
|
kfree(agent);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
fc->ops = &cpuset_fs_context_ops;
|
|
|
|
|
|
|
|
ctx = cgroup_fc2context(fc);
|
|
|
|
ctx->subsys_mask = 1 << cpuset_cgrp_id;
|
|
|
|
ctx->flags |= CGRP_ROOT_NOPREFIX;
|
|
|
|
ctx->release_agent = agent;
|
|
|
|
|
|
|
|
get_filesystem(&cgroup_fs_type);
|
|
|
|
put_filesystem(fc->fs_type);
|
|
|
|
fc->fs_type = &cgroup_fs_type;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct file_system_type cpuset_fs_type = {
|
|
|
|
.name = "cpuset",
|
|
|
|
.init_fs_context = cpuset_init_fs_context,
|
|
|
|
.fs_flags = FS_USERNS_MOUNT,
|
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
2016-12-27 19:49:06 +00:00
|
|
|
int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
|
|
|
|
struct cgroup_namespace *ns)
|
2016-01-29 08:54:06 +00:00
|
|
|
{
|
|
|
|
struct cgroup *root = cset_cgroup_from_root(ns->root_cset, cgrp->root);
|
|
|
|
|
2016-08-10 15:23:44 +00:00
|
|
|
return kernfs_path_from_node(cgrp->kn, root->kn, buf, buflen);
|
2016-01-29 08:54:06 +00:00
|
|
|
}
|
|
|
|
|
2016-08-10 15:23:44 +00:00
|
|
|
int cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
|
|
|
|
struct cgroup_namespace *ns)
|
2016-01-29 08:54:06 +00:00
|
|
|
{
|
2016-08-10 15:23:44 +00:00
|
|
|
int ret;
|
2016-01-29 08:54:06 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2016-01-29 08:54:06 +00:00
|
|
|
|
|
|
|
ret = cgroup_path_ns_locked(cgrp, buf, buflen, ns);
|
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2016-01-29 08:54:06 +00:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(cgroup_path_ns);
|
|
|
|
|
2022-08-15 23:27:38 +00:00
|
|
|
/**
|
|
|
|
* cgroup_attach_lock - Lock for ->attach()
|
|
|
|
* @lock_threadgroup: whether to down_write cgroup_threadgroup_rwsem
|
|
|
|
*
|
|
|
|
* cgroup migration sometimes needs to stabilize threadgroups against forks and
|
|
|
|
* exits by write-locking cgroup_threadgroup_rwsem. However, some ->attach()
|
|
|
|
* implementations (e.g. cpuset), also need to disable CPU hotplug.
|
|
|
|
* Unfortunately, letting ->attach() operations acquire cpus_read_lock() can
|
|
|
|
* lead to deadlocks.
|
|
|
|
*
|
|
|
|
* Bringing up a CPU may involve creating and destroying tasks which requires
|
|
|
|
* read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
|
|
|
|
* cpus_read_lock(). If we call an ->attach() which acquires the cpus lock while
|
|
|
|
* write-locking threadgroup_rwsem, the locking order is reversed and we end up
|
|
|
|
* waiting for an on-going CPU hotplug operation which in turn is waiting for
|
|
|
|
* the threadgroup_rwsem to be released to create new tasks. For more details:
|
|
|
|
*
|
|
|
|
* http://lkml.kernel.org/r/20220711174629.uehfmqegcwn2lqzu@wubuntu
|
|
|
|
*
|
|
|
|
* Resolve the situation by always acquiring cpus_read_lock() before optionally
|
|
|
|
* write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
|
|
|
|
* CPU hotplug is disabled on entry.
|
|
|
|
*/
|
2022-08-26 02:48:29 +00:00
|
|
|
void cgroup_attach_lock(bool lock_threadgroup)
|
2022-08-15 23:27:38 +00:00
|
|
|
{
|
|
|
|
cpus_read_lock();
|
|
|
|
if (lock_threadgroup)
|
|
|
|
percpu_down_write(&cgroup_threadgroup_rwsem);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_attach_unlock - Undo cgroup_attach_lock()
|
|
|
|
* @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
|
|
|
|
*/
|
2022-08-26 02:48:29 +00:00
|
|
|
void cgroup_attach_unlock(bool lock_threadgroup)
|
2022-08-15 23:27:38 +00:00
|
|
|
{
|
|
|
|
if (lock_threadgroup)
|
|
|
|
percpu_up_write(&cgroup_threadgroup_rwsem);
|
|
|
|
cpus_read_unlock();
|
|
|
|
}
|
|
|
|
|
2015-09-11 19:00:21 +00:00
|
|
|
/**
|
2017-01-16 00:03:41 +00:00
|
|
|
* cgroup_migrate_add_task - add a migration target task to a migration context
|
2015-09-11 19:00:21 +00:00
|
|
|
* @task: target task
|
2017-01-16 00:03:41 +00:00
|
|
|
* @mgctx: target migration context
|
2015-09-11 19:00:21 +00:00
|
|
|
*
|
2017-01-16 00:03:41 +00:00
|
|
|
* Add @task, which is a migration target, to @mgctx->tset. This function
|
|
|
|
* becomes noop if @task doesn't need to be migrated. @task's css_set
|
|
|
|
* should have been added as a migration source and @task->cg_list will be
|
|
|
|
* moved from the css_set's tasks list to mg_tasks one.
|
2015-09-11 19:00:21 +00:00
|
|
|
*/
|
2017-01-16 00:03:41 +00:00
|
|
|
static void cgroup_migrate_add_task(struct task_struct *task,
|
|
|
|
struct cgroup_mgctx *mgctx)
|
2015-09-11 19:00:21 +00:00
|
|
|
{
|
|
|
|
struct css_set *cset;
|
|
|
|
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2015-09-11 19:00:21 +00:00
|
|
|
|
|
|
|
/* @task either already exited or can't exit until the end */
|
|
|
|
if (task->flags & PF_EXITING)
|
|
|
|
return;
|
|
|
|
|
2019-10-24 19:03:51 +00:00
|
|
|
/* cgroup_threadgroup_rwsem protects racing against forks */
|
|
|
|
WARN_ON_ONCE(list_empty(&task->cg_list));
|
2015-09-11 19:00:21 +00:00
|
|
|
|
|
|
|
cset = task_css_set(task);
|
|
|
|
if (!cset->mg_src_cgrp)
|
|
|
|
return;
|
|
|
|
|
cgroup: don't call migration methods if there are no tasks to migrate
Subsystem migration methods shouldn't be called for empty migrations.
cgroup_migrate_execute() implements this guarantee by bailing early if
there are no source css_sets. This used to be correct before
a79a908fd2b0 ("cgroup: introduce cgroup namespaces"), but no longer
since the commit because css_sets can stay pinned without tasks in
them.
This caused cgroup_migrate_execute() call into cpuset migration
methods with an empty cgroup_taskset. cpuset migration methods
correctly assume that cgroup_taskset_first() never returns NULL;
however, due to the bug, it can, leading to the following oops.
Unable to handle kernel paging request for data at address 0x00000960
Faulting instruction address: 0xc0000000001d6868
Oops: Kernel access of bad area, sig: 11 [#1]
...
CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W
4.12.0-rc4-next-20170609 #2
Workqueue: events cpuset_hotplug_workfn
task: c00000000ca60580 task.stack: c00000000c728000
NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000
CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
Call Trace:
[c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
[c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
[c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
[c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
[c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
[c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
[c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
[c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
Instruction dump:
f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
---[ end trace dcaaf98fb36d9e64 ]---
This patch fixes the bug by adding an explicit nr_tasks counter to
cgroup_taskset and skipping calling the migration methods if the
counter is zero. While at it, remove the now spurious check on no
source css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
2017-07-08 11:17:02 +00:00
|
|
|
mgctx->tset.nr_tasks++;
|
|
|
|
|
2015-09-11 19:00:21 +00:00
|
|
|
list_move_tail(&task->cg_list, &cset->mg_tasks);
|
|
|
|
if (list_empty(&cset->mg_node))
|
2017-01-16 00:03:41 +00:00
|
|
|
list_add_tail(&cset->mg_node,
|
|
|
|
&mgctx->tset.src_csets);
|
2015-09-11 19:00:21 +00:00
|
|
|
if (list_empty(&cset->mg_dst_cset->mg_node))
|
2017-01-16 00:03:40 +00:00
|
|
|
list_add_tail(&cset->mg_dst_cset->mg_node,
|
2017-01-16 00:03:41 +00:00
|
|
|
&mgctx->tset.dst_csets);
|
2015-09-11 19:00:21 +00:00
|
|
|
}
|
|
|
|
|
2011-12-13 02:12:21 +00:00
|
|
|
/**
|
|
|
|
* cgroup_taskset_first - reset taskset and return the first task
|
|
|
|
* @tset: taskset of interest
|
cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 15:18:21 +00:00
|
|
|
* @dst_cssp: output variable for the destination css
|
2011-12-13 02:12:21 +00:00
|
|
|
*
|
|
|
|
* @tset iteration is initialized and the first task is returned.
|
|
|
|
*/
|
cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 15:18:21 +00:00
|
|
|
struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
|
|
|
|
struct cgroup_subsys_state **dst_cssp)
|
2011-12-13 02:12:21 +00:00
|
|
|
{
|
cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:01 +00:00
|
|
|
tset->cur_cset = list_first_entry(tset->csets, struct css_set, mg_node);
|
|
|
|
tset->cur_task = NULL;
|
|
|
|
|
cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 15:18:21 +00:00
|
|
|
return cgroup_taskset_next(tset, dst_cssp);
|
2011-12-13 02:12:21 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_taskset_next - iterate to the next task in taskset
|
|
|
|
* @tset: taskset of interest
|
cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 15:18:21 +00:00
|
|
|
* @dst_cssp: output variable for the destination css
|
2011-12-13 02:12:21 +00:00
|
|
|
*
|
|
|
|
* Return the next task in @tset. Iteration must have been initialized
|
|
|
|
* with cgroup_taskset_first().
|
|
|
|
*/
|
cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 15:18:21 +00:00
|
|
|
struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
|
|
|
|
struct cgroup_subsys_state **dst_cssp)
|
2011-12-13 02:12:21 +00:00
|
|
|
{
|
cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:01 +00:00
|
|
|
struct css_set *cset = tset->cur_cset;
|
|
|
|
struct task_struct *task = tset->cur_task;
|
2011-12-13 02:12:21 +00:00
|
|
|
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
while (CGROUP_HAS_SUBSYS_CONFIG && &cset->mg_node != tset->csets) {
|
cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:01 +00:00
|
|
|
if (!task)
|
|
|
|
task = list_first_entry(&cset->mg_tasks,
|
|
|
|
struct task_struct, cg_list);
|
|
|
|
else
|
|
|
|
task = list_next_entry(task, cg_list);
|
2011-12-13 02:12:21 +00:00
|
|
|
|
cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:01 +00:00
|
|
|
if (&task->cg_list != &cset->mg_tasks) {
|
|
|
|
tset->cur_cset = cset;
|
|
|
|
tset->cur_task = task;
|
cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 15:18:21 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This function may be called both before and
|
2023-07-17 11:28:00 +00:00
|
|
|
* after cgroup_migrate_execute(). The two cases
|
cgroup: fix handling of multi-destination migration from subtree_control enabling
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
2015-12-03 15:18:21 +00:00
|
|
|
* can be distinguished by looking at whether @cset
|
|
|
|
* has its ->mg_dst_cset set.
|
|
|
|
*/
|
|
|
|
if (cset->mg_dst_cset)
|
|
|
|
*dst_cssp = cset->mg_dst_cset->subsys[tset->ssid];
|
|
|
|
else
|
|
|
|
*dst_cssp = cset->subsys[tset->ssid];
|
|
|
|
|
cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:01 +00:00
|
|
|
return task;
|
|
|
|
}
|
2011-12-13 02:12:21 +00:00
|
|
|
|
cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:01 +00:00
|
|
|
cset = list_next_entry(cset, mg_node);
|
|
|
|
task = NULL;
|
|
|
|
}
|
2011-12-13 02:12:21 +00:00
|
|
|
|
cgroup: use css_set->mg_tasks to track target tasks during migration
Currently, while migrating tasks from one cgroup to another,
cgroup_attach_task() builds a flex array of all target tasks;
unfortunately, this has a couple issues.
* Flex array has size limit. On 64bit, struct task_and_cgroup is
24bytes making the flex element limit around 87k. It is a high
number but not impossible to hit. This means that the current
cgroup implementation can't migrate a process with more than 87k
threads.
* Process migration involves memory allocation whose size is dependent
on the number of threads the process has. This means that cgroup
core can't guarantee success or failure of multi-process migrations
as memory allocation failure can happen in the middle. This is in
part because cgroup can't grab threadgroup locks of multiple
processes at the same time, so when there are multiple processes to
migrate, it is imposible to tell how many tasks are to be migrated
beforehand.
Note that this already affects cgroup_transfer_tasks(). cgroup
currently cannot guarantee atomic success or failure of the
operation. It may fail in the middle and after such failure cgroup
doesn't have enough information to roll back properly. It just
aborts with some tasks migrated and others not.
To resolve the situation, this patch updates the migration path to use
task->cg_list to track target tasks. The previous patch already added
css_set->mg_tasks and updated iterations in non-migration paths to
include them during task migration. This patch updates migration path
to actually make use of it.
Instead of putting onto a flex_array, each target task is moved from
its css_set->tasks list to css_set->mg_tasks and the migration path
keeps trace of all the source css_sets and the associated cgroups.
Once all source css_sets are determined, the destination css_set for
each is determined, linked to the matching source css_set and put on a
separate list.
To iterate the target tasks, migration path just needs to iterat
through either the source or target css_sets, depending on whether
migration has been committed or not, and the tasks on their ->mg_tasks
lists. cgroup_taskset is updated to contain the list_heads for source
and target css_sets and the iteration cursor. cgroup_taskset_*() are
accordingly updated to walk through css_sets and their ->mg_tasks.
This resolves the above listed issues with moderate additional
complexity.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:01 +00:00
|
|
|
return NULL;
|
2011-12-13 02:12:21 +00:00
|
|
|
}
|
|
|
|
|
2015-09-11 19:00:21 +00:00
|
|
|
/**
|
2021-05-26 02:49:09 +00:00
|
|
|
* cgroup_migrate_execute - migrate a taskset
|
2017-01-16 00:03:41 +00:00
|
|
|
* @mgctx: migration context
|
2015-09-11 19:00:21 +00:00
|
|
|
*
|
2017-01-16 00:03:41 +00:00
|
|
|
* Migrate tasks in @mgctx as setup by migration preparation functions.
|
2016-03-08 16:51:26 +00:00
|
|
|
* This function fails iff one of the ->can_attach callbacks fails and
|
2017-01-16 00:03:41 +00:00
|
|
|
* guarantees that either all or none of the tasks in @mgctx are migrated.
|
|
|
|
* @mgctx is consumed regardless of success.
|
2015-09-11 19:00:21 +00:00
|
|
|
*/
|
2017-01-16 00:03:41 +00:00
|
|
|
static int cgroup_migrate_execute(struct cgroup_mgctx *mgctx)
|
2015-09-11 19:00:21 +00:00
|
|
|
{
|
2017-01-16 00:03:41 +00:00
|
|
|
struct cgroup_taskset *tset = &mgctx->tset;
|
2016-03-08 16:51:26 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2015-09-11 19:00:21 +00:00
|
|
|
struct task_struct *task, *tmp_task;
|
|
|
|
struct css_set *cset, *tmp_cset;
|
2016-03-08 16:51:26 +00:00
|
|
|
int ssid, failed_ssid, ret;
|
2015-09-11 19:00:21 +00:00
|
|
|
|
|
|
|
/* check that we can legitimately attach to the cgroup */
|
cgroup: don't call migration methods if there are no tasks to migrate
Subsystem migration methods shouldn't be called for empty migrations.
cgroup_migrate_execute() implements this guarantee by bailing early if
there are no source css_sets. This used to be correct before
a79a908fd2b0 ("cgroup: introduce cgroup namespaces"), but no longer
since the commit because css_sets can stay pinned without tasks in
them.
This caused cgroup_migrate_execute() call into cpuset migration
methods with an empty cgroup_taskset. cpuset migration methods
correctly assume that cgroup_taskset_first() never returns NULL;
however, due to the bug, it can, leading to the following oops.
Unable to handle kernel paging request for data at address 0x00000960
Faulting instruction address: 0xc0000000001d6868
Oops: Kernel access of bad area, sig: 11 [#1]
...
CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W
4.12.0-rc4-next-20170609 #2
Workqueue: events cpuset_hotplug_workfn
task: c00000000ca60580 task.stack: c00000000c728000
NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000
CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
Call Trace:
[c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
[c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
[c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
[c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
[c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
[c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
[c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
[c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
Instruction dump:
f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
---[ end trace dcaaf98fb36d9e64 ]---
This patch fixes the bug by adding an explicit nr_tasks counter to
cgroup_taskset and skipping calling the migration methods if the
counter is zero. While at it, remove the now spurious check on no
source css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
2017-07-08 11:17:02 +00:00
|
|
|
if (tset->nr_tasks) {
|
|
|
|
do_each_subsys_mask(ss, ssid, mgctx->ss_mask) {
|
|
|
|
if (ss->can_attach) {
|
|
|
|
tset->ssid = ssid;
|
|
|
|
ret = ss->can_attach(tset);
|
|
|
|
if (ret) {
|
|
|
|
failed_ssid = ssid;
|
|
|
|
goto out_cancel_attach;
|
|
|
|
}
|
2015-09-11 19:00:21 +00:00
|
|
|
}
|
cgroup: don't call migration methods if there are no tasks to migrate
Subsystem migration methods shouldn't be called for empty migrations.
cgroup_migrate_execute() implements this guarantee by bailing early if
there are no source css_sets. This used to be correct before
a79a908fd2b0 ("cgroup: introduce cgroup namespaces"), but no longer
since the commit because css_sets can stay pinned without tasks in
them.
This caused cgroup_migrate_execute() call into cpuset migration
methods with an empty cgroup_taskset. cpuset migration methods
correctly assume that cgroup_taskset_first() never returns NULL;
however, due to the bug, it can, leading to the following oops.
Unable to handle kernel paging request for data at address 0x00000960
Faulting instruction address: 0xc0000000001d6868
Oops: Kernel access of bad area, sig: 11 [#1]
...
CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W
4.12.0-rc4-next-20170609 #2
Workqueue: events cpuset_hotplug_workfn
task: c00000000ca60580 task.stack: c00000000c728000
NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000
CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
Call Trace:
[c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
[c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
[c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
[c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
[c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
[c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
[c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
[c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
Instruction dump:
f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
---[ end trace dcaaf98fb36d9e64 ]---
This patch fixes the bug by adding an explicit nr_tasks counter to
cgroup_taskset and skipping calling the migration methods if the
counter is zero. While at it, remove the now spurious check on no
source css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
2017-07-08 11:17:02 +00:00
|
|
|
} while_each_subsys_mask();
|
|
|
|
}
|
2015-09-11 19:00:21 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Now that we're guaranteed success, proceed to move all tasks to
|
|
|
|
* the new cgroup. There are no failure cases after here, so this
|
|
|
|
* is the commit point.
|
|
|
|
*/
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2015-09-11 19:00:21 +00:00
|
|
|
list_for_each_entry(cset, &tset->src_csets, mg_node) {
|
2015-10-15 20:41:52 +00:00
|
|
|
list_for_each_entry_safe(task, tmp_task, &cset->mg_tasks, cg_list) {
|
|
|
|
struct css_set *from_cset = task_css_set(task);
|
|
|
|
struct css_set *to_cset = cset->mg_dst_cset;
|
|
|
|
|
|
|
|
get_css_set(to_cset);
|
2017-06-13 21:18:01 +00:00
|
|
|
to_cset->nr_tasks++;
|
2015-10-15 20:41:52 +00:00
|
|
|
css_set_move_task(task, from_cset, to_cset, true);
|
2017-06-13 21:18:01 +00:00
|
|
|
from_cset->nr_tasks--;
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
/*
|
|
|
|
* If the source or destination cgroup is frozen,
|
|
|
|
* the task might require to change its state.
|
|
|
|
*/
|
|
|
|
cgroup_freezer_migrate_task(task, from_cset->dfl_cgrp,
|
|
|
|
to_cset->dfl_cgrp);
|
|
|
|
put_css_set_locked(from_cset);
|
|
|
|
|
2015-10-15 20:41:52 +00:00
|
|
|
}
|
2015-09-11 19:00:21 +00:00
|
|
|
}
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2015-09-11 19:00:21 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Migration is committed, all target tasks are now on dst_csets.
|
|
|
|
* Nothing is sensitive to fork() after this point. Notify
|
|
|
|
* controllers that migration is complete.
|
|
|
|
*/
|
|
|
|
tset->csets = &tset->dst_csets;
|
|
|
|
|
cgroup: don't call migration methods if there are no tasks to migrate
Subsystem migration methods shouldn't be called for empty migrations.
cgroup_migrate_execute() implements this guarantee by bailing early if
there are no source css_sets. This used to be correct before
a79a908fd2b0 ("cgroup: introduce cgroup namespaces"), but no longer
since the commit because css_sets can stay pinned without tasks in
them.
This caused cgroup_migrate_execute() call into cpuset migration
methods with an empty cgroup_taskset. cpuset migration methods
correctly assume that cgroup_taskset_first() never returns NULL;
however, due to the bug, it can, leading to the following oops.
Unable to handle kernel paging request for data at address 0x00000960
Faulting instruction address: 0xc0000000001d6868
Oops: Kernel access of bad area, sig: 11 [#1]
...
CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W
4.12.0-rc4-next-20170609 #2
Workqueue: events cpuset_hotplug_workfn
task: c00000000ca60580 task.stack: c00000000c728000
NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000
CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
Call Trace:
[c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
[c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
[c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
[c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
[c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
[c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
[c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
[c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
Instruction dump:
f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
---[ end trace dcaaf98fb36d9e64 ]---
This patch fixes the bug by adding an explicit nr_tasks counter to
cgroup_taskset and skipping calling the migration methods if the
counter is zero. While at it, remove the now spurious check on no
source css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
2017-07-08 11:17:02 +00:00
|
|
|
if (tset->nr_tasks) {
|
|
|
|
do_each_subsys_mask(ss, ssid, mgctx->ss_mask) {
|
|
|
|
if (ss->attach) {
|
|
|
|
tset->ssid = ssid;
|
|
|
|
ss->attach(tset);
|
|
|
|
}
|
|
|
|
} while_each_subsys_mask();
|
|
|
|
}
|
2015-09-11 19:00:21 +00:00
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
goto out_release_tset;
|
|
|
|
|
|
|
|
out_cancel_attach:
|
cgroup: don't call migration methods if there are no tasks to migrate
Subsystem migration methods shouldn't be called for empty migrations.
cgroup_migrate_execute() implements this guarantee by bailing early if
there are no source css_sets. This used to be correct before
a79a908fd2b0 ("cgroup: introduce cgroup namespaces"), but no longer
since the commit because css_sets can stay pinned without tasks in
them.
This caused cgroup_migrate_execute() call into cpuset migration
methods with an empty cgroup_taskset. cpuset migration methods
correctly assume that cgroup_taskset_first() never returns NULL;
however, due to the bug, it can, leading to the following oops.
Unable to handle kernel paging request for data at address 0x00000960
Faulting instruction address: 0xc0000000001d6868
Oops: Kernel access of bad area, sig: 11 [#1]
...
CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W
4.12.0-rc4-next-20170609 #2
Workqueue: events cpuset_hotplug_workfn
task: c00000000ca60580 task.stack: c00000000c728000
NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609)
MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000
CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
Call Trace:
[c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
[c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
[c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
[c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
[c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
[c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
[c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
[c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
Instruction dump:
f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
---[ end trace dcaaf98fb36d9e64 ]---
This patch fixes the bug by adding an explicit nr_tasks counter to
cgroup_taskset and skipping calling the migration methods if the
counter is zero. While at it, remove the now spurious check on no
source css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces")
Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
2017-07-08 11:17:02 +00:00
|
|
|
if (tset->nr_tasks) {
|
|
|
|
do_each_subsys_mask(ss, ssid, mgctx->ss_mask) {
|
|
|
|
if (ssid == failed_ssid)
|
|
|
|
break;
|
|
|
|
if (ss->cancel_attach) {
|
|
|
|
tset->ssid = ssid;
|
|
|
|
ss->cancel_attach(tset);
|
|
|
|
}
|
|
|
|
} while_each_subsys_mask();
|
|
|
|
}
|
2015-09-11 19:00:21 +00:00
|
|
|
out_release_tset:
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2015-09-11 19:00:21 +00:00
|
|
|
list_splice_init(&tset->dst_csets, &tset->src_csets);
|
|
|
|
list_for_each_entry_safe(cset, tmp_cset, &tset->src_csets, mg_node) {
|
|
|
|
list_splice_tail_init(&cset->mg_tasks, &cset->tasks);
|
|
|
|
list_del_init(&cset->mg_node);
|
|
|
|
}
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2017-09-21 13:54:13 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Re-initialize the cgroup_taskset structure in case it is reused
|
|
|
|
* again in another cgroup_migrate_add_task()/cgroup_migrate_execute()
|
|
|
|
* iteration.
|
|
|
|
*/
|
|
|
|
tset->nr_tasks = 0;
|
|
|
|
tset->csets = &tset->src_csets;
|
2015-09-11 19:00:21 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-03-08 16:51:25 +00:00
|
|
|
/**
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
* cgroup_migrate_vet_dst - verify whether a cgroup can be migration destination
|
2016-03-08 16:51:25 +00:00
|
|
|
* @dst_cgrp: destination cgroup to test
|
|
|
|
*
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
* On the default hierarchy, except for the mixable, (possible) thread root
|
|
|
|
* and threaded cgroups, subtree_control must be zero for migration
|
|
|
|
* destination cgroups with tasks so that child cgroups don't compete
|
|
|
|
* against tasks.
|
2016-03-08 16:51:25 +00:00
|
|
|
*/
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
int cgroup_migrate_vet_dst(struct cgroup *dst_cgrp)
|
2016-03-08 16:51:25 +00:00
|
|
|
{
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
/* v1 doesn't have any restriction */
|
|
|
|
if (!cgroup_on_dfl(dst_cgrp))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* verify @dst_cgrp can host resources */
|
|
|
|
if (!cgroup_is_valid_domain(dst_cgrp->dom_cgrp))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If @dst_cgrp is already or can become a thread root or is
|
|
|
|
* threaded, it doesn't matter.
|
|
|
|
*/
|
|
|
|
if (cgroup_can_be_thread_root(dst_cgrp) || cgroup_is_threaded(dst_cgrp))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* apply no-internal-process constraint */
|
|
|
|
if (dst_cgrp->subtree_control)
|
|
|
|
return -EBUSY;
|
|
|
|
|
|
|
|
return 0;
|
2016-03-08 16:51:25 +00:00
|
|
|
}
|
|
|
|
|
2008-02-23 23:24:09 +00:00
|
|
|
/**
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
* cgroup_migrate_finish - cleanup after attach
|
2017-01-16 00:03:41 +00:00
|
|
|
* @mgctx: migration context
|
2011-05-26 23:25:20 +00:00
|
|
|
*
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
* Undo cgroup_migrate_add_src() and cgroup_migrate_prepare_dst(). See
|
|
|
|
* those functions for details.
|
2011-05-26 23:25:20 +00:00
|
|
|
*/
|
2017-01-16 00:03:41 +00:00
|
|
|
void cgroup_migrate_finish(struct cgroup_mgctx *mgctx)
|
2011-05-26 23:25:20 +00:00
|
|
|
{
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
struct css_set *cset, *tmp_cset;
|
2011-05-26 23:25:20 +00:00
|
|
|
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2017-01-16 00:03:41 +00:00
|
|
|
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
list_for_each_entry_safe(cset, tmp_cset, &mgctx->preloaded_src_csets,
|
|
|
|
mg_src_preload_node) {
|
|
|
|
cset->mg_src_cgrp = NULL;
|
|
|
|
cset->mg_dst_cgrp = NULL;
|
|
|
|
cset->mg_dst_cset = NULL;
|
|
|
|
list_del_init(&cset->mg_src_preload_node);
|
|
|
|
put_css_set_locked(cset);
|
|
|
|
}
|
2017-01-16 00:03:41 +00:00
|
|
|
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
list_for_each_entry_safe(cset, tmp_cset, &mgctx->preloaded_dst_csets,
|
|
|
|
mg_dst_preload_node) {
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
cset->mg_src_cgrp = NULL;
|
2016-03-08 16:51:26 +00:00
|
|
|
cset->mg_dst_cgrp = NULL;
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
cset->mg_dst_cset = NULL;
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
list_del_init(&cset->mg_dst_preload_node);
|
2014-09-19 08:51:00 +00:00
|
|
|
put_css_set_locked(cset);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
}
|
2017-01-16 00:03:41 +00:00
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_migrate_add_src - add a migration source css_set
|
|
|
|
* @src_cset: the source css_set to add
|
|
|
|
* @dst_cgrp: the destination cgroup
|
2017-01-16 00:03:41 +00:00
|
|
|
* @mgctx: migration context
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
*
|
|
|
|
* Tasks belonging to @src_cset are about to be migrated to @dst_cgrp. Pin
|
2017-01-16 00:03:41 +00:00
|
|
|
* @src_cset and add it to @mgctx->src_csets, which should later be cleaned
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
* up by cgroup_migrate_finish().
|
|
|
|
*
|
2015-09-16 16:53:17 +00:00
|
|
|
* This function may be called without holding cgroup_threadgroup_rwsem
|
|
|
|
* even if the target is a process. Threads may be created and destroyed
|
|
|
|
* but as long as cgroup_mutex is not dropped, no new css_set can be put
|
|
|
|
* into play and the preloaded css_sets are guaranteed to cover all
|
|
|
|
* migrations.
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
void cgroup_migrate_add_src(struct css_set *src_cset,
|
|
|
|
struct cgroup *dst_cgrp,
|
2017-01-16 00:03:41 +00:00
|
|
|
struct cgroup_mgctx *mgctx)
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
{
|
|
|
|
struct cgroup *src_cgrp;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
|
cgroup: ignore css_sets associated with dead cgroups during migration
Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks. As init_css_set is always valid,
this worked fine.
However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death. Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
becomes a zombie, it would still remain associated with A and B. If A
only contains zombie tasks, it can be removed. On removal, A gets
marked offline but stays pinned until all zombies are drained. At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.
WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
...
Call Trace:
[<ffffffff8127e63c>] dump_stack+0x4e/0x82
[<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
[<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
[<ffffffff810c33e1>] cgroup_get+0x121/0x160
[<ffffffff810c349b>] link_css_set+0x7b/0x90
[<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
[<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
[<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
[<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
[<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
[<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
[<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
[<ffffffff81151673>] __vfs_write+0x23/0xe0
[<ffffffff81152494>] vfs_write+0xa4/0x1a0
[<ffffffff811532d4>] SyS_write+0x44/0xa0
[<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration. This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+
2016-03-16 00:43:04 +00:00
|
|
|
/*
|
|
|
|
* If ->dead, @src_set is associated with one or more dead cgroups
|
|
|
|
* and doesn't contain any migratable tasks. Ignore it early so
|
|
|
|
* that the rest of migration path doesn't get confused by it.
|
|
|
|
*/
|
|
|
|
if (src_cset->dead)
|
|
|
|
return;
|
|
|
|
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
if (!list_empty(&src_cset->mg_src_preload_node))
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
return;
|
|
|
|
|
2021-12-14 00:46:07 +00:00
|
|
|
src_cgrp = cset_cgroup_from_root(src_cset, dst_cgrp->root);
|
|
|
|
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
WARN_ON(src_cset->mg_src_cgrp);
|
2016-03-08 16:51:26 +00:00
|
|
|
WARN_ON(src_cset->mg_dst_cgrp);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
WARN_ON(!list_empty(&src_cset->mg_tasks));
|
|
|
|
WARN_ON(!list_empty(&src_cset->mg_node));
|
|
|
|
|
|
|
|
src_cset->mg_src_cgrp = src_cgrp;
|
2016-03-08 16:51:26 +00:00
|
|
|
src_cset->mg_dst_cgrp = dst_cgrp;
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
get_css_set(src_cset);
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
list_add_tail(&src_cset->mg_src_preload_node, &mgctx->preloaded_src_csets);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_migrate_prepare_dst - prepare destination css_sets for migration
|
2017-01-16 00:03:41 +00:00
|
|
|
* @mgctx: migration context
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
*
|
2016-03-08 16:51:26 +00:00
|
|
|
* Tasks are about to be moved and all the source css_sets have been
|
2017-01-16 00:03:41 +00:00
|
|
|
* preloaded to @mgctx->preloaded_src_csets. This function looks up and
|
|
|
|
* pins all destination css_sets, links each to its source, and append them
|
|
|
|
* to @mgctx->preloaded_dst_csets.
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
*
|
|
|
|
* This function must be called after cgroup_migrate_add_src() has been
|
|
|
|
* called on each migration source css_set. After migration is performed
|
|
|
|
* using cgroup_migrate(), cgroup_migrate_finish() must be called on
|
2017-01-16 00:03:41 +00:00
|
|
|
* @mgctx.
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
*/
|
2017-01-16 00:03:41 +00:00
|
|
|
int cgroup_migrate_prepare_dst(struct cgroup_mgctx *mgctx)
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
{
|
2014-04-23 15:13:16 +00:00
|
|
|
struct css_set *src_cset, *tmp_cset;
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
/* look up the dst cset for each src cset and link it to src */
|
2017-01-16 00:03:41 +00:00
|
|
|
list_for_each_entry_safe(src_cset, tmp_cset, &mgctx->preloaded_src_csets,
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
mg_src_preload_node) {
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
struct css_set *dst_cset;
|
2017-01-16 00:03:41 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
|
2016-03-08 16:51:26 +00:00
|
|
|
dst_cset = find_css_set(src_cset, src_cset->mg_dst_cgrp);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
if (!dst_cset)
|
2019-04-03 23:03:54 +00:00
|
|
|
return -ENOMEM;
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
|
|
|
|
WARN_ON_ONCE(src_cset->mg_dst_cset || dst_cset->mg_dst_cset);
|
2014-04-23 15:13:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If src cset equals dst, it's noop. Drop the src.
|
|
|
|
* cgroup_migrate() will skip the cset too. Note that we
|
|
|
|
* can't handle src == dst as some nodes are used by both.
|
|
|
|
*/
|
|
|
|
if (src_cset == dst_cset) {
|
|
|
|
src_cset->mg_src_cgrp = NULL;
|
2016-03-08 16:51:26 +00:00
|
|
|
src_cset->mg_dst_cgrp = NULL;
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
list_del_init(&src_cset->mg_src_preload_node);
|
2014-09-19 08:51:00 +00:00
|
|
|
put_css_set(src_cset);
|
|
|
|
put_css_set(dst_cset);
|
2014-04-23 15:13:16 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
src_cset->mg_dst_cset = dst_cset;
|
|
|
|
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
if (list_empty(&dst_cset->mg_dst_preload_node))
|
|
|
|
list_add_tail(&dst_cset->mg_dst_preload_node,
|
2017-01-16 00:03:41 +00:00
|
|
|
&mgctx->preloaded_dst_csets);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
else
|
2014-09-19 08:51:00 +00:00
|
|
|
put_css_set(dst_cset);
|
2017-01-16 00:03:41 +00:00
|
|
|
|
|
|
|
for_each_subsys(ss, ssid)
|
|
|
|
if (src_cset->subsys[ssid] != dst_cset->subsys[ssid])
|
|
|
|
mgctx->ss_mask |= 1 << ssid;
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_migrate - migrate a process or task to a cgroup
|
|
|
|
* @leader: the leader of the process or the task to migrate
|
|
|
|
* @threadgroup: whether @leader points to the whole process or a single task
|
2017-01-16 00:03:41 +00:00
|
|
|
* @mgctx: migration context
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
*
|
2016-03-08 16:51:26 +00:00
|
|
|
* Migrate a process or task denoted by @leader. If migrating a process,
|
|
|
|
* the caller must be holding cgroup_threadgroup_rwsem. The caller is also
|
|
|
|
* responsible for invoking cgroup_migrate_add_src() and
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
* cgroup_migrate_prepare_dst() on the targets before invoking this
|
|
|
|
* function and following up with cgroup_migrate_finish().
|
|
|
|
*
|
|
|
|
* As long as a controller's ->can_attach() doesn't fail, this function is
|
|
|
|
* guaranteed to succeed. This means that, excluding ->can_attach()
|
|
|
|
* failure, when migrating multiple targets, the success or failure can be
|
|
|
|
* decided for all targets by invoking group_migrate_prepare_dst() before
|
|
|
|
* actually starting migrating.
|
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
int cgroup_migrate(struct task_struct *leader, bool threadgroup,
|
2017-01-16 00:03:41 +00:00
|
|
|
struct cgroup_mgctx *mgctx)
|
2011-05-26 23:25:20 +00:00
|
|
|
{
|
2015-09-11 19:00:21 +00:00
|
|
|
struct task_struct *task;
|
2011-05-26 23:25:20 +00:00
|
|
|
|
2012-01-04 05:18:31 +00:00
|
|
|
/*
|
2023-05-24 06:54:31 +00:00
|
|
|
* The following thread iteration should be inside an RCU critical
|
|
|
|
* section to prevent tasks from being freed while taking the snapshot.
|
|
|
|
* spin_lock_irq() implies RCU critical section here.
|
2012-01-04 05:18:31 +00:00
|
|
|
*/
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2014-02-13 11:58:43 +00:00
|
|
|
task = leader;
|
2011-05-26 23:25:20 +00:00
|
|
|
do {
|
2017-01-16 00:03:41 +00:00
|
|
|
cgroup_migrate_add_task(task, mgctx);
|
2013-03-13 01:17:09 +00:00
|
|
|
if (!threadgroup)
|
|
|
|
break;
|
2014-02-13 11:58:43 +00:00
|
|
|
} while_each_thread(leader, task);
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2011-05-26 23:25:20 +00:00
|
|
|
|
2017-01-16 00:03:41 +00:00
|
|
|
return cgroup_migrate_execute(mgctx);
|
2011-05-26 23:25:20 +00:00
|
|
|
}
|
|
|
|
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
/**
|
|
|
|
* cgroup_attach_task - attach a task or a whole threadgroup to a cgroup
|
|
|
|
* @dst_cgrp: the cgroup to attach to
|
|
|
|
* @leader: the task or the leader of the threadgroup to be attached
|
|
|
|
* @threadgroup: attach the whole threadgroup?
|
|
|
|
*
|
2015-09-16 16:53:17 +00:00
|
|
|
* Call holding cgroup_mutex and cgroup_threadgroup_rwsem.
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
|
|
|
|
bool threadgroup)
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
{
|
2017-01-16 00:03:41 +00:00
|
|
|
DEFINE_CGROUP_MGCTX(mgctx);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
struct task_struct *task;
|
2020-02-05 13:26:18 +00:00
|
|
|
int ret = 0;
|
2016-03-08 16:51:25 +00:00
|
|
|
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
/* look up all src csets */
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
task = leader;
|
|
|
|
do {
|
2017-01-16 00:03:41 +00:00
|
|
|
cgroup_migrate_add_src(task_css_set(task), dst_cgrp, &mgctx);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
if (!threadgroup)
|
|
|
|
break;
|
|
|
|
} while_each_thread(leader, task);
|
|
|
|
rcu_read_unlock();
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
|
|
|
|
/* prepare dst csets and commit */
|
2017-01-16 00:03:41 +00:00
|
|
|
ret = cgroup_migrate_prepare_dst(&mgctx);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
if (!ret)
|
2017-01-16 00:03:41 +00:00
|
|
|
ret = cgroup_migrate(leader, threadgroup, &mgctx);
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
|
2017-01-16 00:03:41 +00:00
|
|
|
cgroup_migrate_finish(&mgctx);
|
2016-08-10 15:23:44 +00:00
|
|
|
|
|
|
|
if (!ret)
|
cgroup/tracing: Move taking of spin lock out of trace event handlers
It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-07-09 21:48:54 +00:00
|
|
|
TRACE_CGROUP_PATH(attach_task, dst_cgrp, leader, threadgroup);
|
2016-08-10 15:23:44 +00:00
|
|
|
|
cgroup: split process / task migration into four steps
Currently, process / task migration is a single operation which may
fail depending on memory pressure or the involved controllers'
->can_attach() callbacks. One problem with this approach is migration
of multiple targets. It's impossible to tell whether a given target
will be successfully migrated beforehand and cgroup core can't keep
track of enough states to roll back after intermediate failure.
This is already an issue with cgroup_transfer_tasks(). Also, we're
gonna need multiple target migration for unified hierarchy.
This patch splits migration into four stages -
cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
cgroup_migrate() and cgroup_migrate_finish(), where
cgroup_migrate_prepare_dst() performs all the operations which may
fail due to allocation failure without actually migrating the target.
The four separate stages mean that, disregarding ->can_attach()
failures, the success or failure of multi target migration can be
determined before performing any actual migration. If preparations of
all targets succeed, the whole thing will succeed. If not, the whole
operation can fail without any side-effect.
Since the previous patch to use css_set->mg_tasks to keep track of
migration targets, the only thing which may need memory allocation
during migration is the target css_sets. cgroup_migrate_prepare()
pins all source and target css_sets and link them up. Note that this
can be performed without holding threadgroup_lock even if the target
is a process. As long as cgroup_mutex is held, no new css_set can be
put into play.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-02-25 15:04:03 +00:00
|
|
|
return ret;
|
2011-05-26 23:25:20 +00:00
|
|
|
}
|
|
|
|
|
cgroup: Optimize single thread migration
There are reports of users who use thread migrations between cgroups and
they report performance drop after d59cfc09c32a ("sched, cgroup: replace
signal_struct->group_rwsem with a global percpu_rwsem"). The effect is
pronounced on machines with more CPUs.
The migration is affected by forking noise happening in the background,
after the mentioned commit a migrating thread must wait for all
(forking) processes on the system, not only of its threadgroup.
There are several places that need to synchronize with migration:
a) do_exit,
b) de_thread,
c) copy_process,
d) cgroup_update_dfl_csses,
e) parallel migration (cgroup_{proc,thread}s_write).
In the case of self-migrating thread, we relax the synchronization on
cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
excluded with cgroup_mutex, c) does not matter in case of single thread
migration and the executing thread cannot exec(2) or exit(2) while it is
writing into cgroup.threads. In case of do_exit because of signal
delivery, we either exit before the migration or finish the migration
(of not yet PF_EXITING thread) and die afterwards.
This patch handles only the case of self-migration by writing "0" into
cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
with numeric PIDs.
This change improves migration dependent workload performance similar
to per-signal_struct state.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-04 10:57:40 +00:00
|
|
|
struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
|
2022-08-15 23:27:38 +00:00
|
|
|
bool *threadgroup_locked)
|
2007-10-19 06:39:32 +00:00
|
|
|
{
|
|
|
|
struct task_struct *tsk;
|
2014-05-13 16:16:22 +00:00
|
|
|
pid_t pid;
|
2007-10-19 06:39:32 +00:00
|
|
|
|
2014-05-13 16:16:22 +00:00
|
|
|
if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
|
2017-05-15 13:34:00 +00:00
|
|
|
return ERR_PTR(-EINVAL);
|
2011-05-26 23:25:20 +00:00
|
|
|
|
cgroup: Optimize single thread migration
There are reports of users who use thread migrations between cgroups and
they report performance drop after d59cfc09c32a ("sched, cgroup: replace
signal_struct->group_rwsem with a global percpu_rwsem"). The effect is
pronounced on machines with more CPUs.
The migration is affected by forking noise happening in the background,
after the mentioned commit a migrating thread must wait for all
(forking) processes on the system, not only of its threadgroup.
There are several places that need to synchronize with migration:
a) do_exit,
b) de_thread,
c) copy_process,
d) cgroup_update_dfl_csses,
e) parallel migration (cgroup_{proc,thread}s_write).
In the case of self-migrating thread, we relax the synchronization on
cgroup_threadgroup_rwsem to avoid the cost of waiting. d) and e) are
excluded with cgroup_mutex, c) does not matter in case of single thread
migration and the executing thread cannot exec(2) or exit(2) while it is
writing into cgroup.threads. In case of do_exit because of signal
delivery, we either exit before the migration or finish the migration
(of not yet PF_EXITING thread) and die afterwards.
This patch handles only the case of self-migration by writing "0" into
cgroup.threads. For simplicity, we always take cgroup_threadgroup_rwsem
with numeric PIDs.
This change improves migration dependent workload performance similar
to per-signal_struct state.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2019-10-04 10:57:40 +00:00
|
|
|
/*
|
|
|
|
* If we migrate a single thread, we don't care about threadgroup
|
|
|
|
* stability. If the thread is `current`, it won't exit(2) under our
|
|
|
|
* hands or change PID through exec(2). We exclude
|
|
|
|
* cgroup_update_dfl_csses and other cgroup_{proc,thread}s_write
|
|
|
|
* callers by cgroup_mutex.
|
|
|
|
* Therefore, we can skip the global lock.
|
|
|
|
*/
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
2022-08-15 23:27:38 +00:00
|
|
|
*threadgroup_locked = pid || threadgroup;
|
|
|
|
cgroup_attach_lock(*threadgroup_locked);
|
2017-05-15 13:34:00 +00:00
|
|
|
|
2012-01-04 05:18:30 +00:00
|
|
|
rcu_read_lock();
|
2007-10-19 06:39:32 +00:00
|
|
|
if (pid) {
|
2008-02-07 08:14:47 +00:00
|
|
|
tsk = find_task_by_vpid(pid);
|
2011-05-26 23:25:20 +00:00
|
|
|
if (!tsk) {
|
2017-05-15 13:34:00 +00:00
|
|
|
tsk = ERR_PTR(-ESRCH);
|
|
|
|
goto out_unlock_threadgroup;
|
2007-10-19 06:39:32 +00:00
|
|
|
}
|
2015-06-18 20:54:28 +00:00
|
|
|
} else {
|
2012-01-04 05:18:30 +00:00
|
|
|
tsk = current;
|
2015-06-18 20:54:28 +00:00
|
|
|
}
|
2011-12-13 02:12:21 +00:00
|
|
|
|
|
|
|
if (threadgroup)
|
2012-01-04 05:18:30 +00:00
|
|
|
tsk = tsk->group_leader;
|
2012-04-21 07:13:46 +00:00
|
|
|
|
|
|
|
/*
|
cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-16 20:54:24 +00:00
|
|
|
* kthreads may acquire PF_NO_SETAFFINITY during initialization.
|
|
|
|
* If userland migrates such a kthread to a non-root cgroup, it can
|
|
|
|
* become trapped in a cpuset, or RT kthread may be born in a
|
|
|
|
* cgroup with no rt_runtime allocated. Just say no.
|
2012-04-21 07:13:46 +00:00
|
|
|
*/
|
cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-03-16 20:54:24 +00:00
|
|
|
if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
|
2017-05-15 13:34:00 +00:00
|
|
|
tsk = ERR_PTR(-EINVAL);
|
|
|
|
goto out_unlock_threadgroup;
|
2012-04-21 07:13:46 +00:00
|
|
|
}
|
|
|
|
|
2012-01-04 05:18:30 +00:00
|
|
|
get_task_struct(tsk);
|
2017-05-15 13:34:00 +00:00
|
|
|
goto out_unlock_rcu;
|
|
|
|
|
|
|
|
out_unlock_threadgroup:
|
2022-08-15 23:27:38 +00:00
|
|
|
cgroup_attach_unlock(*threadgroup_locked);
|
|
|
|
*threadgroup_locked = false;
|
2017-05-15 13:34:00 +00:00
|
|
|
out_unlock_rcu:
|
2012-01-04 05:18:30 +00:00
|
|
|
rcu_read_unlock();
|
2017-05-15 13:34:00 +00:00
|
|
|
return tsk;
|
|
|
|
}
|
2012-01-04 05:18:30 +00:00
|
|
|
|
2022-08-15 23:27:38 +00:00
|
|
|
void cgroup_procs_write_finish(struct task_struct *task, bool threadgroup_locked)
|
2017-05-15 13:34:00 +00:00
|
|
|
{
|
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
2013-03-13 01:17:09 +00:00
|
|
|
|
2017-05-15 13:34:00 +00:00
|
|
|
/* release reference from cgroup_procs_write_start() */
|
|
|
|
put_task_struct(task);
|
2015-09-16 17:03:02 +00:00
|
|
|
|
2022-08-15 23:27:38 +00:00
|
|
|
cgroup_attach_unlock(threadgroup_locked);
|
|
|
|
|
2016-04-21 23:06:48 +00:00
|
|
|
for_each_subsys(ss, ssid)
|
|
|
|
if (ss->post_attach)
|
|
|
|
ss->post_attach();
|
2008-07-25 08:47:01 +00:00
|
|
|
}
|
|
|
|
|
2016-02-23 03:25:47 +00:00
|
|
|
static void cgroup_print_ss_mask(struct seq_file *seq, u16 ss_mask)
|
2007-10-19 06:39:33 +00:00
|
|
|
{
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
bool printed = false;
|
|
|
|
int ssid;
|
2013-12-05 17:28:03 +00:00
|
|
|
|
2016-02-23 03:25:46 +00:00
|
|
|
do_each_subsys_mask(ss, ssid, ss_mask) {
|
2015-06-06 00:02:15 +00:00
|
|
|
if (printed)
|
|
|
|
seq_putc(seq, ' ');
|
2019-07-02 17:26:59 +00:00
|
|
|
seq_puts(seq, ss->name);
|
2015-06-06 00:02:15 +00:00
|
|
|
printed = true;
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
if (printed)
|
|
|
|
seq_putc(seq, '\n');
|
2007-10-19 06:39:33 +00:00
|
|
|
}
|
|
|
|
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
/* show controllers which are enabled from the parent */
|
|
|
|
static int cgroup_controllers_show(struct seq_file *seq, void *v)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
cgroup_print_ss_mask(seq, cgroup_control(cgrp));
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
return 0;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
/* show controllers which are enabled for a given cgroup's children */
|
|
|
|
static int cgroup_subtree_control_show(struct seq_file *seq, void *v)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
|
|
|
|
2014-07-08 22:02:56 +00:00
|
|
|
cgroup_print_ss_mask(seq, cgrp->subtree_control);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_update_dfl_csses - update css assoc of a subtree in default hierarchy
|
|
|
|
* @cgrp: root of the subtree to update csses for
|
|
|
|
*
|
2016-03-03 14:58:01 +00:00
|
|
|
* @cgrp's control masks have changed and its subtree's css associations
|
|
|
|
* need to be updated accordingly. This function looks up all css_sets
|
|
|
|
* which are attached to the subtree, creates the matching updated css_sets
|
|
|
|
* and migrates the tasks to the new ones.
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
*/
|
|
|
|
static int cgroup_update_dfl_csses(struct cgroup *cgrp)
|
|
|
|
{
|
2017-01-16 00:03:41 +00:00
|
|
|
DEFINE_CGROUP_MGCTX(mgctx);
|
2016-03-03 14:58:01 +00:00
|
|
|
struct cgroup_subsys_state *d_css;
|
|
|
|
struct cgroup *dsct;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
struct css_set *src_cset;
|
2022-07-15 04:38:15 +00:00
|
|
|
bool has_tasks;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
/* look up all csses currently attached to @cgrp's subtree */
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2016-03-03 14:58:01 +00:00
|
|
|
cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
struct cgrp_cset_link *link;
|
|
|
|
|
2022-07-28 00:58:15 +00:00
|
|
|
/*
|
|
|
|
* As cgroup_update_dfl_csses() is only called by
|
|
|
|
* cgroup_apply_control(). The csses associated with the
|
|
|
|
* given cgrp will not be affected by changes made to
|
|
|
|
* its subtree_control file. We can skip them.
|
|
|
|
*/
|
|
|
|
if (dsct == cgrp)
|
|
|
|
continue;
|
|
|
|
|
2016-03-03 14:58:01 +00:00
|
|
|
list_for_each_entry(link, &dsct->cset_links, cset_link)
|
2017-01-16 00:03:41 +00:00
|
|
|
cgroup_migrate_add_src(link->cset, dsct, &mgctx);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
}
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
|
2022-07-15 04:38:15 +00:00
|
|
|
/*
|
|
|
|
* We need to write-lock threadgroup_rwsem while migrating tasks.
|
|
|
|
* However, if there are no source csets for @cgrp, changing its
|
|
|
|
* controllers isn't gonna produce any task migrations and the
|
|
|
|
* write-locking can be skipped safely.
|
|
|
|
*/
|
|
|
|
has_tasks = !list_empty(&mgctx.preloaded_src_csets);
|
2022-08-15 23:27:38 +00:00
|
|
|
cgroup_attach_lock(has_tasks);
|
2022-07-15 04:38:15 +00:00
|
|
|
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
/* NULL dst indicates self on default hierarchy */
|
2017-01-16 00:03:41 +00:00
|
|
|
ret = cgroup_migrate_prepare_dst(&mgctx);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_finish;
|
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
cgroup: Use separate src/dst nodes when preloading css_sets for migration
Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.
Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:
#1> mkdir -p /sys/fs/cgroup/misc/a/b
#2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
#3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
#4> PID=$!
#5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
#6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs
the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.
After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:
1. cgroup_migrate_add_src() is called on B for the leader.
2. cgroup_migrate_add_src() is called on A for the other threads.
3. cgroup_migrate_prepare_dst() is called. It scans the src list.
4. It notices that B wants to migrate to A, so it tries to A to the dst
list but realizes that its ->mg_preload_node is already busy.
5. and then it notices A wants to migrate to A as it's an identity
migration, it culls it by list_del_init()'ing its ->mg_preload_node and
putting references accordingly.
6. The rest of migration takes place with B on the src list but nothing on
the dst list.
This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.
This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.
This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de98513d ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
2022-06-13 22:19:50 +00:00
|
|
|
list_for_each_entry(src_cset, &mgctx.preloaded_src_csets,
|
|
|
|
mg_src_preload_node) {
|
2015-09-11 19:00:22 +00:00
|
|
|
struct task_struct *task, *ntask;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
|
2015-09-11 19:00:22 +00:00
|
|
|
/* all tasks in src_csets need to be migrated */
|
|
|
|
list_for_each_entry_safe(task, ntask, &src_cset->tasks, cg_list)
|
2017-01-16 00:03:41 +00:00
|
|
|
cgroup_migrate_add_task(task, &mgctx);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
}
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
|
2017-01-16 00:03:41 +00:00
|
|
|
ret = cgroup_migrate_execute(&mgctx);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
out_finish:
|
2017-01-16 00:03:41 +00:00
|
|
|
cgroup_migrate_finish(&mgctx);
|
2022-08-15 23:27:38 +00:00
|
|
|
cgroup_attach_unlock(has_tasks);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
/**
|
2016-03-03 14:58:00 +00:00
|
|
|
* cgroup_lock_and_drain_offline - lock cgroup_mutex and drain offlined csses
|
2016-03-03 14:57:59 +00:00
|
|
|
* @cgrp: root of the target subtree
|
2016-03-03 14:57:59 +00:00
|
|
|
*
|
|
|
|
* Because css offlining is asynchronous, userland may try to re-enable a
|
2016-03-03 14:58:00 +00:00
|
|
|
* controller while the previous css is still around. This function grabs
|
|
|
|
* cgroup_mutex and drains the previous css instances of @cgrp's subtree.
|
2016-03-03 14:57:59 +00:00
|
|
|
*/
|
2016-12-27 19:49:06 +00:00
|
|
|
void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
|
2016-03-03 14:58:00 +00:00
|
|
|
__acquires(&cgroup_mutex)
|
2016-03-03 14:57:59 +00:00
|
|
|
{
|
|
|
|
struct cgroup *dsct;
|
2016-03-03 14:57:59 +00:00
|
|
|
struct cgroup_subsys_state *d_css;
|
2016-03-03 14:57:59 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
|
|
|
|
2016-03-03 14:58:00 +00:00
|
|
|
restart:
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2016-03-03 14:57:59 +00:00
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
|
2016-03-03 14:57:59 +00:00
|
|
|
for_each_subsys(ss, ssid) {
|
|
|
|
struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
|
|
|
|
DEFINE_WAIT(wait);
|
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
if (!css || !percpu_ref_is_dying(&css->refcnt))
|
2016-03-03 14:57:59 +00:00
|
|
|
continue;
|
|
|
|
|
2017-04-28 19:14:55 +00:00
|
|
|
cgroup_get_live(dsct);
|
2016-03-03 14:57:59 +00:00
|
|
|
prepare_to_wait(&dsct->offline_waitq, &wait,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2016-03-03 14:57:59 +00:00
|
|
|
schedule();
|
|
|
|
finish_wait(&dsct->offline_waitq, &wait);
|
|
|
|
|
|
|
|
cgroup_put(dsct);
|
2016-03-03 14:58:00 +00:00
|
|
|
goto restart;
|
2016-03-03 14:57:59 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
/**
|
2018-10-04 20:28:08 +00:00
|
|
|
* cgroup_save_control - save control masks and dom_cgrp of a subtree
|
2016-03-03 14:57:59 +00:00
|
|
|
* @cgrp: root of the target subtree
|
|
|
|
*
|
2018-10-04 20:28:08 +00:00
|
|
|
* Save ->subtree_control, ->subtree_ss_mask and ->dom_cgrp to the
|
|
|
|
* respective old_ prefixed fields for @cgrp's subtree including @cgrp
|
|
|
|
* itself.
|
2016-03-03 14:57:59 +00:00
|
|
|
*/
|
|
|
|
static void cgroup_save_control(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup *dsct;
|
|
|
|
struct cgroup_subsys_state *d_css;
|
|
|
|
|
|
|
|
cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
|
|
|
|
dsct->old_subtree_control = dsct->subtree_control;
|
|
|
|
dsct->old_subtree_ss_mask = dsct->subtree_ss_mask;
|
2018-10-04 20:28:08 +00:00
|
|
|
dsct->old_dom_cgrp = dsct->dom_cgrp;
|
2016-03-03 14:57:59 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_propagate_control - refresh control masks of a subtree
|
|
|
|
* @cgrp: root of the target subtree
|
|
|
|
*
|
|
|
|
* For @cgrp and its subtree, ensure ->subtree_ss_mask matches
|
|
|
|
* ->subtree_control and propagate controller availability through the
|
|
|
|
* subtree so that descendants don't have unavailable controllers enabled.
|
|
|
|
*/
|
|
|
|
static void cgroup_propagate_control(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup *dsct;
|
|
|
|
struct cgroup_subsys_state *d_css;
|
|
|
|
|
|
|
|
cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
|
|
|
|
dsct->subtree_control &= cgroup_control(dsct);
|
2016-03-03 14:58:01 +00:00
|
|
|
dsct->subtree_ss_mask =
|
|
|
|
cgroup_calc_subtree_ss_mask(dsct->subtree_control,
|
|
|
|
cgroup_ss_mask(dsct));
|
2016-03-03 14:57:59 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2018-10-04 20:28:08 +00:00
|
|
|
* cgroup_restore_control - restore control masks and dom_cgrp of a subtree
|
2016-03-03 14:57:59 +00:00
|
|
|
* @cgrp: root of the target subtree
|
|
|
|
*
|
2018-10-04 20:28:08 +00:00
|
|
|
* Restore ->subtree_control, ->subtree_ss_mask and ->dom_cgrp from the
|
|
|
|
* respective old_ prefixed fields for @cgrp's subtree including @cgrp
|
|
|
|
* itself.
|
2016-03-03 14:57:59 +00:00
|
|
|
*/
|
|
|
|
static void cgroup_restore_control(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup *dsct;
|
|
|
|
struct cgroup_subsys_state *d_css;
|
|
|
|
|
|
|
|
cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
|
|
|
|
dsct->subtree_control = dsct->old_subtree_control;
|
|
|
|
dsct->subtree_ss_mask = dsct->old_subtree_ss_mask;
|
2018-10-04 20:28:08 +00:00
|
|
|
dsct->dom_cgrp = dsct->old_dom_cgrp;
|
2016-03-03 14:57:59 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-03-08 16:51:26 +00:00
|
|
|
static bool css_visible(struct cgroup_subsys_state *css)
|
|
|
|
{
|
|
|
|
struct cgroup_subsys *ss = css->ss;
|
|
|
|
struct cgroup *cgrp = css->cgroup;
|
|
|
|
|
|
|
|
if (cgroup_control(cgrp) & (1 << ss->id))
|
|
|
|
return true;
|
|
|
|
if (!(cgroup_ss_mask(cgrp) & (1 << ss->id)))
|
|
|
|
return false;
|
|
|
|
return cgroup_on_dfl(cgrp) && ss->implicit_on_dfl;
|
|
|
|
}
|
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
/**
|
|
|
|
* cgroup_apply_control_enable - enable or show csses according to control
|
2016-03-03 14:57:59 +00:00
|
|
|
* @cgrp: root of the target subtree
|
2016-03-03 14:57:59 +00:00
|
|
|
*
|
2016-03-03 14:57:59 +00:00
|
|
|
* Walk @cgrp's subtree and create new csses or make the existing ones
|
2016-03-03 14:57:59 +00:00
|
|
|
* visible. A css is created invisible if it's being implicitly enabled
|
|
|
|
* through dependency. An invisible css is made visible when the userland
|
|
|
|
* explicitly enables it.
|
|
|
|
*
|
|
|
|
* Returns 0 on success, -errno on failure. On failure, csses which have
|
|
|
|
* been processed already aren't cleaned up. The caller is responsible for
|
2017-03-10 00:16:31 +00:00
|
|
|
* cleaning up with cgroup_apply_control_disable().
|
2016-03-03 14:57:59 +00:00
|
|
|
*/
|
|
|
|
static int cgroup_apply_control_enable(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup *dsct;
|
2016-03-03 14:57:59 +00:00
|
|
|
struct cgroup_subsys_state *d_css;
|
2016-03-03 14:57:59 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid, ret;
|
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp) {
|
2016-03-03 14:57:59 +00:00
|
|
|
for_each_subsys(ss, ssid) {
|
|
|
|
struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
|
|
|
|
|
|
|
|
if (!(cgroup_ss_mask(dsct) & (1 << ss->id)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!css) {
|
|
|
|
css = css_create(dsct, ss);
|
|
|
|
if (IS_ERR(css))
|
|
|
|
return PTR_ERR(css);
|
|
|
|
}
|
|
|
|
|
2020-01-09 15:05:59 +00:00
|
|
|
WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
|
|
|
|
|
2016-03-08 16:51:26 +00:00
|
|
|
if (css_visible(css)) {
|
2016-03-03 14:58:01 +00:00
|
|
|
ret = css_populate_dir(css);
|
2016-03-03 14:57:59 +00:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
/**
|
|
|
|
* cgroup_apply_control_disable - kill or hide csses according to control
|
2016-03-03 14:57:59 +00:00
|
|
|
* @cgrp: root of the target subtree
|
2016-03-03 14:57:59 +00:00
|
|
|
*
|
2016-03-03 14:57:59 +00:00
|
|
|
* Walk @cgrp's subtree and kill and hide csses so that they match
|
2016-03-03 14:57:59 +00:00
|
|
|
* cgroup_ss_mask() and cgroup_visible_mask().
|
|
|
|
*
|
|
|
|
* A css is hidden when the userland requests it to be disabled while other
|
|
|
|
* subsystems are still depending on it. The css must not actively control
|
|
|
|
* resources and be in the vanilla state if it's made visible again later.
|
|
|
|
* Controllers which may be depended upon should provide ->css_reset() for
|
|
|
|
* this purpose.
|
|
|
|
*/
|
|
|
|
static void cgroup_apply_control_disable(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup *dsct;
|
2016-03-03 14:57:59 +00:00
|
|
|
struct cgroup_subsys_state *d_css;
|
2016-03-03 14:57:59 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
cgroup_for_each_live_descendant_post(dsct, d_css, cgrp) {
|
2016-03-03 14:57:59 +00:00
|
|
|
for_each_subsys(ss, ssid) {
|
|
|
|
struct cgroup_subsys_state *css = cgroup_css(dsct, ss);
|
|
|
|
|
|
|
|
if (!css)
|
|
|
|
continue;
|
|
|
|
|
2020-01-09 15:05:59 +00:00
|
|
|
WARN_ON_ONCE(percpu_ref_is_dying(&css->refcnt));
|
|
|
|
|
2016-03-03 14:58:01 +00:00
|
|
|
if (css->parent &&
|
|
|
|
!(cgroup_ss_mask(dsct) & (1 << ss->id))) {
|
2016-03-03 14:57:59 +00:00
|
|
|
kill_css(css);
|
2016-03-08 16:51:26 +00:00
|
|
|
} else if (!css_visible(css)) {
|
2016-03-03 14:58:01 +00:00
|
|
|
css_clear_dir(css);
|
2016-03-03 14:57:59 +00:00
|
|
|
if (ss->css_reset)
|
|
|
|
ss->css_reset(css);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-03-03 14:58:00 +00:00
|
|
|
/**
|
|
|
|
* cgroup_apply_control - apply control mask updates to the subtree
|
|
|
|
* @cgrp: root of the target subtree
|
|
|
|
*
|
|
|
|
* subsystems can be enabled and disabled in a subtree using the following
|
|
|
|
* steps.
|
|
|
|
*
|
|
|
|
* 1. Call cgroup_save_control() to stash the current state.
|
|
|
|
* 2. Update ->subtree_control masks in the subtree as desired.
|
|
|
|
* 3. Call cgroup_apply_control() to apply the changes.
|
|
|
|
* 4. Optionally perform other related operations.
|
|
|
|
* 5. Call cgroup_finalize_control() to finish up.
|
|
|
|
*
|
|
|
|
* This function implements step 3 and propagates the mask changes
|
|
|
|
* throughout @cgrp's subtree, updates csses accordingly and perform
|
|
|
|
* process migrations.
|
|
|
|
*/
|
|
|
|
static int cgroup_apply_control(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
cgroup_propagate_control(cgrp);
|
|
|
|
|
|
|
|
ret = cgroup_apply_control_enable(cgrp);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/*
|
2018-12-05 17:10:36 +00:00
|
|
|
* At this point, cgroup_e_css_by_mask() results reflect the new csses
|
2016-03-03 14:58:00 +00:00
|
|
|
* making the following cgroup_update_dfl_csses() properly update
|
|
|
|
* css associations of all tasks in the subtree.
|
|
|
|
*/
|
2022-09-17 08:40:39 +00:00
|
|
|
return cgroup_update_dfl_csses(cgrp);
|
2016-03-03 14:58:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_finalize_control - finalize control mask update
|
|
|
|
* @cgrp: root of the target subtree
|
|
|
|
* @ret: the result of the update
|
|
|
|
*
|
|
|
|
* Finalize control mask update. See cgroup_apply_control() for more info.
|
|
|
|
*/
|
|
|
|
static void cgroup_finalize_control(struct cgroup *cgrp, int ret)
|
|
|
|
{
|
|
|
|
if (ret) {
|
|
|
|
cgroup_restore_control(cgrp);
|
|
|
|
cgroup_propagate_control(cgrp);
|
|
|
|
}
|
|
|
|
|
|
|
|
cgroup_apply_control_disable(cgrp);
|
|
|
|
}
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
static int cgroup_vet_subtree_control_enable(struct cgroup *cgrp, u16 enable)
|
|
|
|
{
|
|
|
|
u16 domain_enable = enable & ~cgrp_dfl_threaded_ss_mask;
|
|
|
|
|
|
|
|
/* if nothing is getting enabled, nothing to worry about */
|
|
|
|
if (!enable)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* can @cgrp host any resources? */
|
|
|
|
if (!cgroup_is_valid_domain(cgrp->dom_cgrp))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
/* mixables don't care */
|
|
|
|
if (cgroup_is_mixable(cgrp))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (domain_enable) {
|
|
|
|
/* can't enable domain controllers inside a thread subtree */
|
|
|
|
if (cgroup_is_thread_root(cgrp) || cgroup_is_threaded(cgrp))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Threaded controllers can handle internal competitions
|
|
|
|
* and are always allowed inside a (prospective) thread
|
|
|
|
* subtree.
|
|
|
|
*/
|
|
|
|
if (cgroup_can_be_thread_root(cgrp) || cgroup_is_threaded(cgrp))
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Controllers can't be enabled for a cgroup with tasks to avoid
|
|
|
|
* child cgroups competing against tasks.
|
|
|
|
*/
|
|
|
|
if (cgroup_has_tasks(cgrp))
|
|
|
|
return -EBUSY;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
/* change the enabled child controllers for a cgroup in the default hierarchy */
|
2014-05-13 16:16:21 +00:00
|
|
|
static ssize_t cgroup_subtree_control_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes,
|
|
|
|
loff_t off)
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
{
|
2016-02-23 03:25:47 +00:00
|
|
|
u16 enable = 0, disable = 0;
|
2014-05-13 16:19:22 +00:00
|
|
|
struct cgroup *cgrp, *child;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2014-05-13 16:16:21 +00:00
|
|
|
char *tok;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
int ssid, ret;
|
|
|
|
|
|
|
|
/*
|
2014-05-13 16:10:59 +00:00
|
|
|
* Parse input - space separated list of subsystem names prefixed
|
|
|
|
* with either + or -.
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
*/
|
2014-05-13 16:16:21 +00:00
|
|
|
buf = strstrip(buf);
|
|
|
|
while ((tok = strsep(&buf, " "))) {
|
2014-05-13 16:10:59 +00:00
|
|
|
if (tok[0] == '\0')
|
|
|
|
continue;
|
2016-02-23 15:00:50 +00:00
|
|
|
do_each_subsys_mask(ss, ssid, ~cgrp_dfl_inhibit_ss_mask) {
|
2015-09-18 15:56:28 +00:00
|
|
|
if (!cgroup_ssid_enabled(ssid) ||
|
|
|
|
strcmp(tok + 1, ss->name))
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
continue;
|
|
|
|
|
|
|
|
if (*tok == '+') {
|
2014-05-13 16:11:00 +00:00
|
|
|
enable |= 1 << ssid;
|
|
|
|
disable &= ~(1 << ssid);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
} else if (*tok == '-') {
|
2014-05-13 16:11:00 +00:00
|
|
|
disable |= 1 << ssid;
|
|
|
|
enable &= ~(1 << ssid);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
} else {
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
break;
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
if (ssid == CGROUP_SUBSYS_COUNT)
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
2016-03-03 14:58:00 +00:00
|
|
|
cgrp = cgroup_kn_lock_live(of->kn, true);
|
2014-05-13 16:19:22 +00:00
|
|
|
if (!cgrp)
|
|
|
|
return -ENODEV;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
|
|
|
|
for_each_subsys(ss, ssid) {
|
|
|
|
if (enable & (1 << ssid)) {
|
2014-07-08 22:02:56 +00:00
|
|
|
if (cgrp->subtree_control & (1 << ssid)) {
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
enable &= ~(1 << ssid);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
if (!(cgroup_control(cgrp) & (1 << ssid))) {
|
2014-07-08 22:02:56 +00:00
|
|
|
ret = -ENOENT;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
} else if (disable & (1 << ssid)) {
|
2014-07-08 22:02:56 +00:00
|
|
|
if (!(cgrp->subtree_control & (1 << ssid))) {
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
disable &= ~(1 << ssid);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* a child has it enabled? */
|
|
|
|
cgroup_for_each_live_child(child, cgrp) {
|
2014-07-08 22:02:56 +00:00
|
|
|
if (child->subtree_control & (1 << ssid)) {
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
ret = -EBUSY;
|
2014-05-13 16:19:22 +00:00
|
|
|
goto out_unlock;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!enable && !disable) {
|
|
|
|
ret = 0;
|
2014-05-13 16:19:22 +00:00
|
|
|
goto out_unlock;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
}
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
ret = cgroup_vet_subtree_control_enable(cgrp, enable);
|
|
|
|
if (ret)
|
2017-07-17 01:44:18 +00:00
|
|
|
goto out_unlock;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
/* save and update control masks and prepare csses */
|
|
|
|
cgroup_save_control(cgrp);
|
2014-07-08 22:02:57 +00:00
|
|
|
|
2016-03-03 14:57:59 +00:00
|
|
|
cgrp->subtree_control |= enable;
|
|
|
|
cgrp->subtree_control &= ~disable;
|
2014-07-08 22:02:56 +00:00
|
|
|
|
2016-03-03 14:58:00 +00:00
|
|
|
ret = cgroup_apply_control(cgrp);
|
|
|
|
cgroup_finalize_control(cgrp, ret);
|
2017-07-23 12:14:15 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_unlock;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
|
|
|
|
kernfs_activate(cgrp->kn);
|
|
|
|
out_unlock:
|
2014-05-13 16:19:22 +00:00
|
|
|
cgroup_kn_unlock(of->kn);
|
2014-05-13 16:16:21 +00:00
|
|
|
return ret ?: nbytes;
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
}
|
|
|
|
|
2017-07-25 17:20:18 +00:00
|
|
|
/**
|
|
|
|
* cgroup_enable_threaded - make @cgrp threaded
|
|
|
|
* @cgrp: the target cgroup
|
|
|
|
*
|
|
|
|
* Called when "threaded" is written to the cgroup.type interface file and
|
|
|
|
* tries to make @cgrp threaded and join the parent's resource domain.
|
|
|
|
* This function is never called on the root cgroup as cgroup.type doesn't
|
|
|
|
* exist on it.
|
|
|
|
*/
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
static int cgroup_enable_threaded(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup *parent = cgroup_parent(cgrp);
|
|
|
|
struct cgroup *dom_cgrp = parent->dom_cgrp;
|
2018-10-04 20:28:08 +00:00
|
|
|
struct cgroup *dsct;
|
|
|
|
struct cgroup_subsys_state *d_css;
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
/* noop if already threaded */
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
return 0;
|
|
|
|
|
2018-02-21 19:39:22 +00:00
|
|
|
/*
|
|
|
|
* If @cgroup is populated or has domain controllers enabled, it
|
|
|
|
* can't be switched. While the below cgroup_can_be_thread_root()
|
|
|
|
* test can catch the same conditions, that's only when @parent is
|
|
|
|
* not mixable, so let's check it explicitly.
|
|
|
|
*/
|
|
|
|
if (cgroup_is_populated(cgrp) ||
|
|
|
|
cgrp->subtree_control & ~cgrp_dfl_threaded_ss_mask)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
/* we're joining the parent's domain, ensure its validity */
|
|
|
|
if (!cgroup_is_valid_domain(dom_cgrp) ||
|
|
|
|
!cgroup_can_be_thread_root(dom_cgrp))
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The following shouldn't cause actual migrations and should
|
|
|
|
* always succeed.
|
|
|
|
*/
|
|
|
|
cgroup_save_control(cgrp);
|
|
|
|
|
2018-10-04 20:28:08 +00:00
|
|
|
cgroup_for_each_live_descendant_pre(dsct, d_css, cgrp)
|
|
|
|
if (dsct == cgrp || cgroup_is_threaded(dsct))
|
|
|
|
dsct->dom_cgrp = dom_cgrp;
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
ret = cgroup_apply_control(cgrp);
|
|
|
|
if (!ret)
|
|
|
|
parent->nr_threaded_children++;
|
|
|
|
|
|
|
|
cgroup_finalize_control(cgrp, ret);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int cgroup_type_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
|
|
|
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
seq_puts(seq, "threaded\n");
|
|
|
|
else if (!cgroup_is_valid_domain(cgrp))
|
|
|
|
seq_puts(seq, "domain invalid\n");
|
|
|
|
else if (cgroup_is_thread_root(cgrp))
|
|
|
|
seq_puts(seq, "domain threaded\n");
|
|
|
|
else
|
|
|
|
seq_puts(seq, "domain\n");
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_type_write(struct kernfs_open_file *of, char *buf,
|
|
|
|
size_t nbytes, loff_t off)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* only switching to threaded mode is supported */
|
|
|
|
if (strcmp(strstrip(buf), "threaded"))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2020-01-09 15:05:59 +00:00
|
|
|
/* drain dying csses before we re-apply (threaded) subtree control */
|
|
|
|
cgrp = cgroup_kn_lock_live(of->kn, true);
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
if (!cgrp)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
/* threaded can only be enabled */
|
|
|
|
ret = cgroup_enable_threaded(cgrp);
|
|
|
|
|
|
|
|
cgroup_kn_unlock(of->kn);
|
|
|
|
return ret ?: nbytes;
|
|
|
|
}
|
|
|
|
|
2017-07-28 17:28:44 +00:00
|
|
|
static int cgroup_max_descendants_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
|
|
|
int descendants = READ_ONCE(cgrp->max_descendants);
|
|
|
|
|
|
|
|
if (descendants == INT_MAX)
|
|
|
|
seq_puts(seq, "max\n");
|
|
|
|
else
|
|
|
|
seq_printf(seq, "%d\n", descendants);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_max_descendants_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes, loff_t off)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp;
|
|
|
|
int descendants;
|
|
|
|
ssize_t ret;
|
|
|
|
|
|
|
|
buf = strstrip(buf);
|
|
|
|
if (!strcmp(buf, "max")) {
|
|
|
|
descendants = INT_MAX;
|
|
|
|
} else {
|
|
|
|
ret = kstrtoint(buf, 0, &descendants);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-08-09 10:25:21 +00:00
|
|
|
if (descendants < 0)
|
2017-07-28 17:28:44 +00:00
|
|
|
return -ERANGE;
|
|
|
|
|
|
|
|
cgrp = cgroup_kn_lock_live(of->kn, false);
|
|
|
|
if (!cgrp)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
cgrp->max_descendants = descendants;
|
|
|
|
|
|
|
|
cgroup_kn_unlock(of->kn);
|
|
|
|
|
|
|
|
return nbytes;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int cgroup_max_depth_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
|
|
|
int depth = READ_ONCE(cgrp->max_depth);
|
|
|
|
|
|
|
|
if (depth == INT_MAX)
|
|
|
|
seq_puts(seq, "max\n");
|
|
|
|
else
|
|
|
|
seq_printf(seq, "%d\n", depth);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_max_depth_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes, loff_t off)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp;
|
|
|
|
ssize_t ret;
|
|
|
|
int depth;
|
|
|
|
|
|
|
|
buf = strstrip(buf);
|
|
|
|
if (!strcmp(buf, "max")) {
|
|
|
|
depth = INT_MAX;
|
|
|
|
} else {
|
|
|
|
ret = kstrtoint(buf, 0, &depth);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-08-09 10:25:21 +00:00
|
|
|
if (depth < 0)
|
2017-07-28 17:28:44 +00:00
|
|
|
return -ERANGE;
|
|
|
|
|
|
|
|
cgrp = cgroup_kn_lock_live(of->kn, false);
|
|
|
|
if (!cgrp)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
cgrp->max_depth = depth;
|
|
|
|
|
|
|
|
cgroup_kn_unlock(of->kn);
|
|
|
|
|
|
|
|
return nbytes;
|
|
|
|
}
|
|
|
|
|
2015-09-18 21:54:22 +00:00
|
|
|
static int cgroup_events_show(struct seq_file *seq, void *v)
|
2014-04-25 22:28:02 +00:00
|
|
|
{
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
|
|
|
|
|
|
|
seq_printf(seq, "populated %d\n", cgroup_is_populated(cgrp));
|
|
|
|
seq_printf(seq, "frozen %d\n", test_bit(CGRP_FROZEN, &cgrp->flags));
|
|
|
|
|
2014-04-25 22:28:02 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2017-08-11 12:49:01 +00:00
|
|
|
static int cgroup_stat_show(struct seq_file *seq, void *v)
|
2017-08-02 16:55:31 +00:00
|
|
|
{
|
|
|
|
struct cgroup *cgroup = seq_css(seq)->cgroup;
|
|
|
|
|
|
|
|
seq_printf(seq, "nr_descendants %d\n",
|
|
|
|
cgroup->nr_descendants);
|
|
|
|
seq_printf(seq, "nr_dying_descendants %d\n",
|
|
|
|
cgroup->nr_dying_descendants);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-07-06 09:42:42 +00:00
|
|
|
#ifdef CONFIG_CGROUP_SCHED
|
2023-07-11 02:38:20 +00:00
|
|
|
/**
|
|
|
|
* cgroup_tryget_css - try to get a cgroup's css for the specified subsystem
|
|
|
|
* @cgrp: the cgroup of interest
|
|
|
|
* @ss: the subsystem of interest
|
|
|
|
*
|
|
|
|
* Find and get @cgrp's css associated with @ss. If the css doesn't exist
|
|
|
|
* or is offline, %NULL is returned.
|
|
|
|
*/
|
|
|
|
static struct cgroup_subsys_state *cgroup_tryget_css(struct cgroup *cgrp,
|
|
|
|
struct cgroup_subsys *ss)
|
2017-10-23 23:18:27 +00:00
|
|
|
{
|
2023-07-11 02:38:20 +00:00
|
|
|
struct cgroup_subsys_state *css;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
css = cgroup_css(cgrp, ss);
|
|
|
|
if (css && !css_tryget_online(css))
|
|
|
|
css = NULL;
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return css;
|
|
|
|
}
|
|
|
|
|
2023-07-06 09:42:42 +00:00
|
|
|
static int cgroup_extra_stat_show(struct seq_file *seq, int ssid)
|
2017-10-23 23:18:27 +00:00
|
|
|
{
|
2023-07-06 09:42:42 +00:00
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
2017-10-23 23:18:27 +00:00
|
|
|
struct cgroup_subsys *ss = cgroup_subsys[ssid];
|
|
|
|
struct cgroup_subsys_state *css;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!ss->css_extra_stat_show)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
css = cgroup_tryget_css(cgrp, ss);
|
|
|
|
if (!css)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = ss->css_extra_stat_show(seq, css);
|
|
|
|
css_put(css);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-09-02 15:27:17 +00:00
|
|
|
static int cgroup_local_stat_show(struct seq_file *seq,
|
|
|
|
struct cgroup *cgrp, int ssid)
|
2023-06-20 18:32:47 +00:00
|
|
|
{
|
|
|
|
struct cgroup_subsys *ss = cgroup_subsys[ssid];
|
|
|
|
struct cgroup_subsys_state *css;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!ss->css_local_stat_show)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
css = cgroup_tryget_css(cgrp, ss);
|
|
|
|
if (!css)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = ss->css_local_stat_show(seq, css);
|
|
|
|
css_put(css);
|
|
|
|
return ret;
|
|
|
|
}
|
2023-09-02 15:27:17 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
static int cpu_stat_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
cgroup_base_stat_cputime_show(seq);
|
|
|
|
#ifdef CONFIG_CGROUP_SCHED
|
|
|
|
ret = cgroup_extra_stat_show(seq, cpu_cgrp_id);
|
|
|
|
#endif
|
|
|
|
return ret;
|
|
|
|
}
|
2023-06-20 18:32:47 +00:00
|
|
|
|
|
|
|
static int cpu_local_stat_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
|
|
|
struct cgroup __maybe_unused *cgrp = seq_css(seq)->cgroup;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
#ifdef CONFIG_CGROUP_SCHED
|
|
|
|
ret = cgroup_local_stat_show(seq, cgrp, cpu_cgrp_id);
|
|
|
|
#endif
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-10-26 22:06:31 +00:00
|
|
|
#ifdef CONFIG_PSI
|
|
|
|
static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
2019-11-04 23:54:30 +00:00
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
2022-08-25 16:41:09 +00:00
|
|
|
struct psi_group *psi = cgroup_psi(cgrp);
|
2019-05-14 22:41:18 +00:00
|
|
|
|
|
|
|
return psi_show(seq, psi, PSI_IO);
|
2018-10-26 22:06:31 +00:00
|
|
|
}
|
|
|
|
static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
2019-11-04 23:54:30 +00:00
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
2022-08-25 16:41:09 +00:00
|
|
|
struct psi_group *psi = cgroup_psi(cgrp);
|
2019-05-14 22:41:18 +00:00
|
|
|
|
|
|
|
return psi_show(seq, psi, PSI_MEM);
|
2018-10-26 22:06:31 +00:00
|
|
|
}
|
|
|
|
static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
2019-11-04 23:54:30 +00:00
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
2022-08-25 16:41:09 +00:00
|
|
|
struct psi_group *psi = cgroup_psi(cgrp);
|
2019-05-14 22:41:18 +00:00
|
|
|
|
|
|
|
return psi_show(seq, psi, PSI_CPU);
|
2018-10-26 22:06:31 +00:00
|
|
|
}
|
2019-05-14 22:41:15 +00:00
|
|
|
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
|
|
|
|
size_t nbytes, enum psi_res res)
|
2019-05-14 22:41:15 +00:00
|
|
|
{
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
2019-05-14 22:41:15 +00:00
|
|
|
struct psi_trigger *new;
|
|
|
|
struct cgroup *cgrp;
|
2021-01-16 17:36:33 +00:00
|
|
|
struct psi_group *psi;
|
2019-05-14 22:41:15 +00:00
|
|
|
|
|
|
|
cgrp = cgroup_kn_lock_live(of->kn, false);
|
|
|
|
if (!cgrp)
|
|
|
|
return -ENODEV;
|
|
|
|
|
|
|
|
cgroup_get(cgrp);
|
|
|
|
cgroup_kn_unlock(of->kn);
|
|
|
|
|
2022-01-11 23:23:09 +00:00
|
|
|
/* Allow only one trigger per file descriptor */
|
|
|
|
if (ctx->psi.trigger) {
|
|
|
|
cgroup_put(cgrp);
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
2022-08-25 16:41:09 +00:00
|
|
|
psi = cgroup_psi(cgrp);
|
2023-06-30 00:56:12 +00:00
|
|
|
new = psi_trigger_create(psi, buf, res, of->file, of);
|
2019-05-14 22:41:15 +00:00
|
|
|
if (IS_ERR(new)) {
|
|
|
|
cgroup_put(cgrp);
|
|
|
|
return PTR_ERR(new);
|
|
|
|
}
|
|
|
|
|
2022-01-11 23:23:09 +00:00
|
|
|
smp_store_release(&ctx->psi.trigger, new);
|
2019-05-14 22:41:15 +00:00
|
|
|
cgroup_put(cgrp);
|
|
|
|
|
|
|
|
return nbytes;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_io_pressure_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes,
|
|
|
|
loff_t off)
|
|
|
|
{
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
return pressure_write(of, buf, nbytes, PSI_IO);
|
2019-05-14 22:41:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_memory_pressure_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes,
|
|
|
|
loff_t off)
|
|
|
|
{
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
return pressure_write(of, buf, nbytes, PSI_MEM);
|
2019-05-14 22:41:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes,
|
|
|
|
loff_t off)
|
|
|
|
{
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
return pressure_write(of, buf, nbytes, PSI_CPU);
|
2019-05-14 22:41:15 +00:00
|
|
|
}
|
|
|
|
|
2022-08-25 16:41:08 +00:00
|
|
|
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
|
|
|
|
static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
2022-08-25 16:41:09 +00:00
|
|
|
struct psi_group *psi = cgroup_psi(cgrp);
|
2022-08-25 16:41:08 +00:00
|
|
|
|
|
|
|
return psi_show(seq, psi, PSI_IRQ);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes,
|
|
|
|
loff_t off)
|
|
|
|
{
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
return pressure_write(of, buf, nbytes, PSI_IRQ);
|
2022-08-25 16:41:08 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
static int cgroup_pressure_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
|
|
|
struct psi_group *psi = cgroup_psi(cgrp);
|
|
|
|
|
|
|
|
seq_printf(seq, "%d\n", psi->enabled);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_pressure_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes,
|
|
|
|
loff_t off)
|
|
|
|
{
|
|
|
|
ssize_t ret;
|
|
|
|
int enable;
|
|
|
|
struct cgroup *cgrp;
|
|
|
|
struct psi_group *psi;
|
|
|
|
|
|
|
|
ret = kstrtoint(strstrip(buf), 0, &enable);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (enable < 0 || enable > 1)
|
|
|
|
return -ERANGE;
|
|
|
|
|
|
|
|
cgrp = cgroup_kn_lock_live(of->kn, false);
|
|
|
|
if (!cgrp)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
psi = cgroup_psi(cgrp);
|
|
|
|
if (psi->enabled != enable) {
|
|
|
|
int i;
|
|
|
|
|
|
|
|
/* show or hide {cpu,memory,io,irq}.pressure files */
|
|
|
|
for (i = 0; i < NR_PSI_RESOURCES; i++)
|
|
|
|
cgroup_file_show(&cgrp->psi_files[i], enable);
|
|
|
|
|
|
|
|
psi->enabled = enable;
|
|
|
|
if (enable)
|
|
|
|
psi_cgroup_restart(psi);
|
|
|
|
}
|
|
|
|
|
|
|
|
cgroup_kn_unlock(of->kn);
|
|
|
|
|
|
|
|
return nbytes;
|
2019-05-14 22:41:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
|
|
|
|
poll_table *pt)
|
|
|
|
{
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
|
|
|
|
|
|
|
return psi_trigger_poll(&ctx->psi.trigger, of->file, pt);
|
2019-05-14 22:41:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void cgroup_pressure_release(struct kernfs_open_file *of)
|
|
|
|
{
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
|
|
|
|
2022-01-11 23:23:09 +00:00
|
|
|
psi_trigger_destroy(ctx->psi.trigger);
|
2019-05-14 22:41:15 +00:00
|
|
|
}
|
2021-05-24 19:53:39 +00:00
|
|
|
|
|
|
|
bool cgroup_psi_enabled(void)
|
|
|
|
{
|
2022-08-25 16:41:03 +00:00
|
|
|
if (static_branch_likely(&psi_disabled))
|
|
|
|
return false;
|
|
|
|
|
2021-05-24 19:53:39 +00:00
|
|
|
return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
#else /* CONFIG_PSI */
|
|
|
|
bool cgroup_psi_enabled(void)
|
|
|
|
{
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2019-05-14 22:41:15 +00:00
|
|
|
#endif /* CONFIG_PSI */
|
2018-10-26 22:06:31 +00:00
|
|
|
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
static int cgroup_freeze_show(struct seq_file *seq, void *v)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = seq_css(seq)->cgroup;
|
|
|
|
|
|
|
|
seq_printf(seq, "%d\n", cgrp->freezer.freeze);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_freeze_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes, loff_t off)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp;
|
|
|
|
ssize_t ret;
|
|
|
|
int freeze;
|
|
|
|
|
|
|
|
ret = kstrtoint(strstrip(buf), 0, &freeze);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (freeze < 0 || freeze > 1)
|
|
|
|
return -ERANGE;
|
|
|
|
|
|
|
|
cgrp = cgroup_kn_lock_live(of->kn, false);
|
|
|
|
if (!cgrp)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
cgroup_freeze(cgrp, freeze);
|
|
|
|
|
|
|
|
cgroup_kn_unlock(of->kn);
|
|
|
|
|
|
|
|
return nbytes;
|
|
|
|
}
|
|
|
|
|
2021-05-08 12:15:38 +00:00
|
|
|
static void __cgroup_kill(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct css_task_iter it;
|
|
|
|
struct task_struct *task;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
set_bit(CGRP_KILL, &cgrp->flags);
|
|
|
|
spin_unlock_irq(&css_set_lock);
|
|
|
|
|
|
|
|
css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED, &it);
|
|
|
|
while ((task = css_task_iter_next(&it))) {
|
|
|
|
/* Ignore kernel threads here. */
|
|
|
|
if (task->flags & PF_KTHREAD)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* Skip tasks that are already dying. */
|
|
|
|
if (__fatal_signal_pending(task))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
send_sig(SIGKILL, task, 0);
|
|
|
|
}
|
|
|
|
css_task_iter_end(&it);
|
|
|
|
|
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
clear_bit(CGRP_KILL, &cgrp->flags);
|
|
|
|
spin_unlock_irq(&css_set_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void cgroup_kill(struct cgroup *cgrp)
|
|
|
|
{
|
|
|
|
struct cgroup_subsys_state *css;
|
|
|
|
struct cgroup *dsct;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
cgroup_for_each_live_descendant_pre(dsct, css, cgrp)
|
|
|
|
__cgroup_kill(dsct);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_kill_write(struct kernfs_open_file *of, char *buf,
|
|
|
|
size_t nbytes, loff_t off)
|
|
|
|
{
|
|
|
|
ssize_t ret = 0;
|
|
|
|
int kill;
|
|
|
|
struct cgroup *cgrp;
|
|
|
|
|
|
|
|
ret = kstrtoint(strstrip(buf), 0, &kill);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (kill != 1)
|
|
|
|
return -ERANGE;
|
|
|
|
|
|
|
|
cgrp = cgroup_kn_lock_live(of->kn, false);
|
|
|
|
if (!cgrp)
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Killing is a process directed operation, i.e. the whole thread-group
|
|
|
|
* is taken down so act like we do for cgroup.procs and only make this
|
|
|
|
* writable in non-threaded cgroups.
|
|
|
|
*/
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
ret = -EOPNOTSUPP;
|
|
|
|
else
|
|
|
|
cgroup_kill(cgrp);
|
|
|
|
|
|
|
|
cgroup_kn_unlock(of->kn);
|
|
|
|
|
|
|
|
return ret ?: nbytes;
|
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:03 +00:00
|
|
|
static int cgroup_file_open(struct kernfs_open_file *of)
|
|
|
|
{
|
2020-11-06 14:47:40 +00:00
|
|
|
struct cftype *cft = of_cft(of);
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx;
|
|
|
|
int ret;
|
2016-12-27 19:49:03 +00:00
|
|
|
|
2022-01-06 21:02:29 +00:00
|
|
|
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
|
|
|
|
if (!ctx)
|
|
|
|
return -ENOMEM;
|
2022-01-06 21:02:29 +00:00
|
|
|
|
|
|
|
ctx->ns = current->nsproxy->cgroup_ns;
|
|
|
|
get_cgroup_ns(ctx->ns);
|
2022-01-06 21:02:29 +00:00
|
|
|
of->priv = ctx;
|
|
|
|
|
|
|
|
if (!cft->open)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = cft->open(of);
|
2022-01-06 21:02:29 +00:00
|
|
|
if (ret) {
|
|
|
|
put_cgroup_ns(ctx->ns);
|
2022-01-06 21:02:29 +00:00
|
|
|
kfree(ctx);
|
2022-01-06 21:02:29 +00:00
|
|
|
}
|
2022-01-06 21:02:29 +00:00
|
|
|
return ret;
|
2016-12-27 19:49:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void cgroup_file_release(struct kernfs_open_file *of)
|
|
|
|
{
|
2020-11-06 14:47:40 +00:00
|
|
|
struct cftype *cft = of_cft(of);
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
2016-12-27 19:49:03 +00:00
|
|
|
|
|
|
|
if (cft->release)
|
|
|
|
cft->release(of);
|
2022-01-06 21:02:29 +00:00
|
|
|
put_cgroup_ns(ctx->ns);
|
2022-01-06 21:02:29 +00:00
|
|
|
kfree(ctx);
|
2016-12-27 19:49:03 +00:00
|
|
|
}
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
|
|
|
|
size_t nbytes, loff_t off)
|
2007-10-19 06:39:33 +00:00
|
|
|
{
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
struct cgroup *cgrp = of->kn->parent->priv;
|
2020-11-06 14:47:40 +00:00
|
|
|
struct cftype *cft = of_cft(of);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
struct cgroup_subsys_state *css;
|
2013-12-05 17:28:03 +00:00
|
|
|
int ret;
|
2007-10-19 06:39:33 +00:00
|
|
|
|
2020-09-30 16:42:42 +00:00
|
|
|
if (!nbytes)
|
|
|
|
return 0;
|
|
|
|
|
2017-06-27 18:30:28 +00:00
|
|
|
/*
|
|
|
|
* If namespaces are delegation boundaries, disallow writes to
|
|
|
|
* files in an non-init namespace root from inside the namespace
|
|
|
|
* except for the files explicitly marked delegatable -
|
|
|
|
* cgroup.procs and cgroup.subtree_control.
|
|
|
|
*/
|
|
|
|
if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
|
|
|
|
!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
|
2022-01-06 21:02:29 +00:00
|
|
|
ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
|
2017-06-27 18:30:28 +00:00
|
|
|
return -EPERM;
|
|
|
|
|
2014-05-13 16:16:21 +00:00
|
|
|
if (cft->write)
|
|
|
|
return cft->write(of, buf, nbytes, off);
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
/*
|
|
|
|
* kernfs guarantees that a file isn't deleted with operations in
|
|
|
|
* flight, which means that the matching css is and stays alive and
|
|
|
|
* doesn't need to be pinned. The RCU locking is not necessary
|
|
|
|
* either. It's just for the convenience of using cgroup_css().
|
|
|
|
*/
|
|
|
|
rcu_read_lock();
|
|
|
|
css = cgroup_css(cgrp, cft->ss);
|
|
|
|
rcu_read_unlock();
|
2013-12-05 17:28:03 +00:00
|
|
|
|
2014-05-13 16:16:21 +00:00
|
|
|
if (cft->write_u64) {
|
2013-12-05 17:28:03 +00:00
|
|
|
unsigned long long v;
|
|
|
|
ret = kstrtoull(buf, 0, &v);
|
|
|
|
if (!ret)
|
|
|
|
ret = cft->write_u64(css, cft, v);
|
|
|
|
} else if (cft->write_s64) {
|
|
|
|
long long v;
|
|
|
|
ret = kstrtoll(buf, 0, &v);
|
|
|
|
if (!ret)
|
|
|
|
ret = cft->write_s64(css, cft, v);
|
2008-04-29 08:00:06 +00:00
|
|
|
} else {
|
2013-12-05 17:28:03 +00:00
|
|
|
ret = -EINVAL;
|
2008-04-29 08:00:06 +00:00
|
|
|
}
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
|
2013-12-05 17:28:03 +00:00
|
|
|
return ret ?: nbytes;
|
2007-10-19 06:39:33 +00:00
|
|
|
}
|
|
|
|
|
2019-03-05 23:45:48 +00:00
|
|
|
static __poll_t cgroup_file_poll(struct kernfs_open_file *of, poll_table *pt)
|
|
|
|
{
|
2020-11-06 14:47:40 +00:00
|
|
|
struct cftype *cft = of_cft(of);
|
2019-03-05 23:45:48 +00:00
|
|
|
|
|
|
|
if (cft->poll)
|
|
|
|
return cft->poll(of, pt);
|
|
|
|
|
|
|
|
return kernfs_generic_poll(of, pt);
|
|
|
|
}
|
|
|
|
|
2013-12-05 17:28:04 +00:00
|
|
|
static void *cgroup_seqfile_start(struct seq_file *seq, loff_t *ppos)
|
2008-07-25 08:46:58 +00:00
|
|
|
{
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
return seq_cft(seq)->seq_start(seq, ppos);
|
2008-07-25 08:46:58 +00:00
|
|
|
}
|
|
|
|
|
2013-12-05 17:28:04 +00:00
|
|
|
static void *cgroup_seqfile_next(struct seq_file *seq, void *v, loff_t *ppos)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
return seq_cft(seq)->seq_next(seq, v, ppos);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2013-12-05 17:28:04 +00:00
|
|
|
static void cgroup_seqfile_stop(struct seq_file *seq, void *v)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2016-12-27 19:49:03 +00:00
|
|
|
if (seq_cft(seq)->seq_stop)
|
|
|
|
seq_cft(seq)->seq_stop(seq, v);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2008-04-29 08:00:01 +00:00
|
|
|
static int cgroup_seqfile_show(struct seq_file *m, void *arg)
|
2008-04-29 08:00:06 +00:00
|
|
|
{
|
2013-12-05 17:28:04 +00:00
|
|
|
struct cftype *cft = seq_cft(m);
|
|
|
|
struct cgroup_subsys_state *css = seq_css(m);
|
2008-04-29 08:00:06 +00:00
|
|
|
|
2013-12-05 17:28:04 +00:00
|
|
|
if (cft->seq_show)
|
|
|
|
return cft->seq_show(m, arg);
|
2008-04-29 08:00:06 +00:00
|
|
|
|
2008-04-29 07:59:56 +00:00
|
|
|
if (cft->read_u64)
|
2013-12-05 17:28:04 +00:00
|
|
|
seq_printf(m, "%llu\n", cft->read_u64(css, cft));
|
|
|
|
else if (cft->read_s64)
|
|
|
|
seq_printf(m, "%lld\n", cft->read_s64(css, cft));
|
|
|
|
else
|
|
|
|
return -EINVAL;
|
|
|
|
return 0;
|
2008-04-29 08:00:01 +00:00
|
|
|
}
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
static struct kernfs_ops cgroup_kf_single_ops = {
|
|
|
|
.atomic_write_len = PAGE_SIZE,
|
2016-12-27 19:49:03 +00:00
|
|
|
.open = cgroup_file_open,
|
|
|
|
.release = cgroup_file_release,
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
.write = cgroup_file_write,
|
2019-03-05 23:45:48 +00:00
|
|
|
.poll = cgroup_file_poll,
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
.seq_show = cgroup_seqfile_show,
|
2008-04-29 08:00:01 +00:00
|
|
|
};
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
static struct kernfs_ops cgroup_kf_ops = {
|
|
|
|
.atomic_write_len = PAGE_SIZE,
|
2016-12-27 19:49:03 +00:00
|
|
|
.open = cgroup_file_open,
|
|
|
|
.release = cgroup_file_release,
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
.write = cgroup_file_write,
|
2019-03-05 23:45:48 +00:00
|
|
|
.poll = cgroup_file_poll,
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
.seq_start = cgroup_seqfile_start,
|
|
|
|
.seq_next = cgroup_seqfile_next,
|
|
|
|
.seq_stop = cgroup_seqfile_stop,
|
|
|
|
.seq_show = cgroup_seqfile_show,
|
|
|
|
};
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2018-04-26 21:29:04 +00:00
|
|
|
static void cgroup_file_notify_timer(struct timer_list *timer)
|
|
|
|
{
|
|
|
|
cgroup_file_notify(container_of(timer, struct cgroup_file,
|
|
|
|
notify_timer));
|
|
|
|
}
|
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
static int cgroup_add_file(struct cgroup_subsys_state *css, struct cgroup *cgrp,
|
|
|
|
struct cftype *cft)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2014-02-11 16:52:48 +00:00
|
|
|
char name[CGROUP_FILE_NAME_MAX];
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
struct kernfs_node *kn;
|
|
|
|
struct lock_class_key *key = NULL;
|
2012-04-01 19:09:56 +00:00
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
|
|
|
key = &cft->lockdep_key;
|
|
|
|
#endif
|
|
|
|
kn = __kernfs_create_file(cgrp->kn, cgroup_file_name(cgrp, cft, name),
|
2018-07-20 21:56:47 +00:00
|
|
|
cgroup_file_mode(cft),
|
2023-12-08 09:33:09 +00:00
|
|
|
current_fsuid(), current_fsgid(),
|
2018-07-20 21:56:47 +00:00
|
|
|
0, cft->kf_ops, cft,
|
2015-02-13 22:36:31 +00:00
|
|
|
NULL, key);
|
2014-04-07 20:44:47 +00:00
|
|
|
if (IS_ERR(kn))
|
|
|
|
return PTR_ERR(kn);
|
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
if (cft->file_offset) {
|
|
|
|
struct cgroup_file *cfile = (void *)css + cft->file_offset;
|
|
|
|
|
2018-04-26 21:29:04 +00:00
|
|
|
timer_setup(&cfile->notify_timer, cgroup_file_notify_timer, 0);
|
|
|
|
|
2015-11-05 05:12:24 +00:00
|
|
|
spin_lock_irq(&cgroup_file_kn_lock);
|
2015-09-18 21:54:23 +00:00
|
|
|
cfile->kn = kn;
|
2015-11-05 05:12:24 +00:00
|
|
|
spin_unlock_irq(&cgroup_file_kn_lock);
|
2015-09-18 21:54:23 +00:00
|
|
|
}
|
|
|
|
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
return 0;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2013-06-28 23:24:10 +00:00
|
|
|
/**
|
|
|
|
* cgroup_addrm_files - add or remove files to a cgroup directory
|
2015-09-18 21:54:23 +00:00
|
|
|
* @css: the target css
|
|
|
|
* @cgrp: the target cgroup (usually css->cgroup)
|
2013-06-28 23:24:10 +00:00
|
|
|
* @cfts: array of cftypes to be added
|
|
|
|
* @is_add: whether to add or remove
|
|
|
|
*
|
|
|
|
* Depending on @is_add, add or remove files defined by @cfts on @cgrp.
|
2015-09-18 21:54:23 +00:00
|
|
|
* For removals, this function never fails.
|
2013-06-28 23:24:10 +00:00
|
|
|
*/
|
2015-09-18 21:54:23 +00:00
|
|
|
static int cgroup_addrm_files(struct cgroup_subsys_state *css,
|
|
|
|
struct cgroup *cgrp, struct cftype cfts[],
|
2013-08-09 00:11:23 +00:00
|
|
|
bool is_add)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2015-09-18 21:54:23 +00:00
|
|
|
struct cftype *cft, *cft_end = NULL;
|
2016-02-23 03:25:45 +00:00
|
|
|
int ret = 0;
|
2013-06-28 23:24:10 +00:00
|
|
|
|
2014-05-13 16:19:23 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
2012-04-01 19:09:55 +00:00
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
restart:
|
|
|
|
for (cft = cfts; cft != cft_end && cft->name[0] != '\0'; cft++) {
|
2012-12-06 06:38:57 +00:00
|
|
|
/* does cft->flags tell us to skip this file on @cgrp? */
|
2014-07-15 15:05:10 +00:00
|
|
|
if ((cft->flags & __CFTYPE_ONLY_ON_DFL) && !cgroup_on_dfl(cgrp))
|
2014-03-19 14:23:55 +00:00
|
|
|
continue;
|
2014-07-15 15:05:10 +00:00
|
|
|
if ((cft->flags & __CFTYPE_NOT_ON_DFL) && cgroup_on_dfl(cgrp))
|
cgroup: introduce sane_behavior mount option
It's a sad fact that at this point various cgroup controllers are
carrying so many idiosyncrasies and pure insanities that it simply
isn't possible to reach any sort of sane consistent behavior while
maintaining staying fully compatible with what already has been
exposed to userland.
As we can't break exposed userland interface, transitioning to sane
behaviors can only be done in steps while maintaining backwards
compatibility. This patch introduces a new mount option -
__DEVEL__sane_behavior - which disables crazy features and enforces
consistent behaviors in cgroup core proper and various controllers.
As exactly which behaviors it changes are still being determined, the
mount option, at this point, is useful only for development of the new
behaviors. As such, the mount option is prefixed with __DEVEL__ and
generates a warning message when used.
Eventually, once we get to the point where all controller's behaviors
are consistent enough to implement unified hierarchy, the __DEVEL__
prefix will be dropped, and more importantly, unified-hierarchy will
enforce sane_behavior by default. Maybe we'll able to completely drop
the crazy stuff after a while, maybe not, but we at least have a
strategy to move on to saner behaviors.
This patch introduces the mount option and changes the following
behaviors in cgroup core.
* Mount options "noprefix" and "clone_children" are disallowed. Also,
cgroupfs file cgroup.clone_children is not created.
* When mounting an existing superblock, mount options should match.
This is currently pretty crazy. If one mounts a cgroup, creates a
subdirectory, unmounts it and then mount it again with different
option, it looks like the new options are applied but they aren't.
* Remount is disallowed.
The behaviors changes are documented in the comment above
CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
controllers are converted and planned improvements progress.
v2: Dropped unnecessary explicit file permission setting sane_behavior
cftype entry as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
2013-04-15 03:15:26 +00:00
|
|
|
continue;
|
2014-05-16 17:22:48 +00:00
|
|
|
if ((cft->flags & CFTYPE_NOT_ON_ROOT) && !cgroup_parent(cgrp))
|
2012-12-06 06:38:57 +00:00
|
|
|
continue;
|
2014-05-16 17:22:48 +00:00
|
|
|
if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp))
|
2012-12-06 06:38:57 +00:00
|
|
|
continue;
|
2018-11-08 15:08:46 +00:00
|
|
|
if ((cft->flags & CFTYPE_DEBUG) && !cgroup_debug)
|
|
|
|
continue;
|
2013-01-21 10:18:33 +00:00
|
|
|
if (is_add) {
|
2015-09-18 21:54:23 +00:00
|
|
|
ret = cgroup_add_file(css, cgrp, cft);
|
2013-06-28 23:24:10 +00:00
|
|
|
if (ret) {
|
2014-04-25 22:28:03 +00:00
|
|
|
pr_warn("%s: failed to add %s, err=%d\n",
|
|
|
|
__func__, cft->name, ret);
|
2015-09-18 21:54:23 +00:00
|
|
|
cft_end = cft;
|
|
|
|
is_add = false;
|
|
|
|
goto restart;
|
2013-06-28 23:24:10 +00:00
|
|
|
}
|
2013-01-21 10:18:33 +00:00
|
|
|
} else {
|
|
|
|
cgroup_rm_file(cgrp, cft);
|
2012-04-01 19:09:55 +00:00
|
|
|
}
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
2016-02-23 03:25:45 +00:00
|
|
|
return ret;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2014-02-12 14:29:49 +00:00
|
|
|
static int cgroup_apply_cftypes(struct cftype *cfts, bool is_add)
|
2012-04-01 19:09:55 +00:00
|
|
|
{
|
2013-08-09 00:11:23 +00:00
|
|
|
struct cgroup_subsys *ss = cfts[0].ss;
|
2014-03-19 14:23:54 +00:00
|
|
|
struct cgroup *root = &ss->root->cgrp;
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup_subsys_state *css;
|
2013-06-28 23:24:11 +00:00
|
|
|
int ret = 0;
|
2012-04-01 19:09:55 +00:00
|
|
|
|
2014-05-13 16:19:23 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
2013-06-18 10:48:37 +00:00
|
|
|
|
|
|
|
/* add/rm files for all cgroups created before */
|
2013-08-26 22:40:56 +00:00
|
|
|
css_for_each_descendant_pre(css, cgroup_css(root, ss)) {
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup *cgrp = css->cgroup;
|
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
if (!(css->flags & CSS_VISIBLE))
|
2013-06-18 10:48:37 +00:00
|
|
|
continue;
|
|
|
|
|
2015-09-18 21:54:23 +00:00
|
|
|
ret = cgroup_addrm_files(css, cgrp, cfts, is_add);
|
2013-06-28 23:24:11 +00:00
|
|
|
if (ret)
|
|
|
|
break;
|
2012-04-01 19:09:55 +00:00
|
|
|
}
|
2014-02-12 14:29:49 +00:00
|
|
|
|
|
|
|
if (is_add && !ret)
|
|
|
|
kernfs_activate(root->kn);
|
2013-06-28 23:24:11 +00:00
|
|
|
return ret;
|
2012-04-01 19:09:55 +00:00
|
|
|
}
|
|
|
|
|
2014-02-11 16:52:48 +00:00
|
|
|
static void cgroup_exit_cftypes(struct cftype *cfts)
|
2012-04-01 19:09:55 +00:00
|
|
|
{
|
2013-08-09 00:11:23 +00:00
|
|
|
struct cftype *cft;
|
2012-04-01 19:09:55 +00:00
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
for (cft = cfts; cft->name[0] != '\0'; cft++) {
|
|
|
|
/* free copy for custom atomic_write_len, see init_cftypes() */
|
|
|
|
if (cft->max_write_len && cft->max_write_len != PAGE_SIZE)
|
|
|
|
kfree(cft->kf_ops);
|
|
|
|
cft->kf_ops = NULL;
|
2014-02-11 16:52:48 +00:00
|
|
|
cft->ss = NULL;
|
2014-07-15 15:05:10 +00:00
|
|
|
|
|
|
|
/* revert flags set by cgroup core while adding @cfts */
|
2022-09-06 19:38:42 +00:00
|
|
|
cft->flags &= ~(__CFTYPE_ONLY_ON_DFL | __CFTYPE_NOT_ON_DFL |
|
|
|
|
__CFTYPE_ADDED);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
}
|
2014-02-11 16:52:48 +00:00
|
|
|
}
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
static int cgroup_init_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
|
2014-02-11 16:52:48 +00:00
|
|
|
{
|
|
|
|
struct cftype *cft;
|
2022-09-06 19:38:42 +00:00
|
|
|
int ret = 0;
|
2014-02-11 16:52:48 +00:00
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
for (cft = cfts; cft->name[0] != '\0'; cft++) {
|
|
|
|
struct kernfs_ops *kf_ops;
|
|
|
|
|
2014-02-12 14:29:48 +00:00
|
|
|
WARN_ON(cft->ss || cft->kf_ops);
|
|
|
|
|
2022-09-06 19:38:42 +00:00
|
|
|
if (cft->flags & __CFTYPE_ADDED) {
|
|
|
|
ret = -EBUSY;
|
|
|
|
break;
|
|
|
|
}
|
2021-05-24 19:53:39 +00:00
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
if (cft->seq_start)
|
|
|
|
kf_ops = &cgroup_kf_ops;
|
|
|
|
else
|
|
|
|
kf_ops = &cgroup_kf_single_ops;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Ugh... if @cft wants a custom max_write_len, we need to
|
|
|
|
* make a copy of kf_ops to set its atomic_write_len.
|
|
|
|
*/
|
|
|
|
if (cft->max_write_len && cft->max_write_len != PAGE_SIZE) {
|
|
|
|
kf_ops = kmemdup(kf_ops, sizeof(*kf_ops), GFP_KERNEL);
|
|
|
|
if (!kf_ops) {
|
2022-09-06 19:38:42 +00:00
|
|
|
ret = -ENOMEM;
|
|
|
|
break;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
}
|
|
|
|
kf_ops->atomic_write_len = cft->max_write_len;
|
|
|
|
}
|
2012-04-01 19:09:55 +00:00
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
cft->kf_ops = kf_ops;
|
2013-08-09 00:11:23 +00:00
|
|
|
cft->ss = ss;
|
2022-09-06 19:38:42 +00:00
|
|
|
cft->flags |= __CFTYPE_ADDED;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
}
|
2013-08-09 00:11:23 +00:00
|
|
|
|
2022-09-06 19:38:42 +00:00
|
|
|
if (ret)
|
|
|
|
cgroup_exit_cftypes(cfts);
|
|
|
|
return ret;
|
2014-02-11 16:52:48 +00:00
|
|
|
}
|
|
|
|
|
2023-07-01 07:38:56 +00:00
|
|
|
static void cgroup_rm_cftypes_locked(struct cftype *cfts)
|
2014-02-12 14:29:49 +00:00
|
|
|
{
|
2014-05-13 16:19:23 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
2014-02-12 14:29:49 +00:00
|
|
|
|
|
|
|
list_del(&cfts->node);
|
|
|
|
cgroup_apply_cftypes(cfts, false);
|
|
|
|
cgroup_exit_cftypes(cfts);
|
2012-04-01 19:09:55 +00:00
|
|
|
}
|
|
|
|
|
2012-04-01 19:09:56 +00:00
|
|
|
/**
|
|
|
|
* cgroup_rm_cftypes - remove an array of cftypes from a subsystem
|
|
|
|
* @cfts: zero-length name terminated array of cftypes
|
|
|
|
*
|
2013-08-09 00:11:23 +00:00
|
|
|
* Unregister @cfts. Files described by @cfts are removed from all
|
|
|
|
* existing cgroups and all future cgroups won't have them either. This
|
|
|
|
* function can be called anytime whether @cfts' subsys is attached or not.
|
2012-04-01 19:09:56 +00:00
|
|
|
*
|
|
|
|
* Returns 0 on successful unregistration, -ENOENT if @cfts is not
|
2013-08-09 00:11:23 +00:00
|
|
|
* registered.
|
2012-04-01 19:09:56 +00:00
|
|
|
*/
|
2013-08-09 00:11:23 +00:00
|
|
|
int cgroup_rm_cftypes(struct cftype *cfts)
|
2012-04-01 19:09:56 +00:00
|
|
|
{
|
2022-09-06 19:38:42 +00:00
|
|
|
if (!cfts || cfts[0].name[0] == '\0')
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (!(cfts[0].flags & __CFTYPE_ADDED))
|
|
|
|
return -ENOENT;
|
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2023-07-01 07:38:56 +00:00
|
|
|
cgroup_rm_cftypes_locked(cfts);
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2023-07-01 07:38:56 +00:00
|
|
|
return 0;
|
2014-02-12 14:29:48 +00:00
|
|
|
}
|
|
|
|
|
2012-04-01 19:09:55 +00:00
|
|
|
/**
|
|
|
|
* cgroup_add_cftypes - add an array of cftypes to a subsystem
|
|
|
|
* @ss: target cgroup subsystem
|
|
|
|
* @cfts: zero-length name terminated array of cftypes
|
|
|
|
*
|
|
|
|
* Register @cfts to @ss. Files described by @cfts are created for all
|
|
|
|
* existing cgroups to which @ss is attached and all future cgroups will
|
|
|
|
* have them too. This function can be called anytime whether @ss is
|
|
|
|
* attached or not.
|
|
|
|
*
|
|
|
|
* Returns 0 on successful registration, -errno on failure. Note that this
|
|
|
|
* function currently returns 0 as long as @cfts registration is successful
|
|
|
|
* even if some file creation attempts on existing cgroups fail.
|
|
|
|
*/
|
2014-07-15 15:05:09 +00:00
|
|
|
static int cgroup_add_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
|
2012-04-01 19:09:55 +00:00
|
|
|
{
|
2013-06-28 23:24:11 +00:00
|
|
|
int ret;
|
2012-04-01 19:09:55 +00:00
|
|
|
|
2015-09-18 15:56:28 +00:00
|
|
|
if (!cgroup_ssid_enabled(ss->id))
|
2014-06-05 09:16:30 +00:00
|
|
|
return 0;
|
|
|
|
|
2014-02-17 02:41:50 +00:00
|
|
|
if (!cfts || cfts[0].name[0] == '\0')
|
|
|
|
return 0;
|
2013-08-09 00:11:23 +00:00
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
ret = cgroup_init_cftypes(ss, cfts);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2012-04-01 19:09:56 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2014-02-12 14:29:49 +00:00
|
|
|
|
2014-02-12 14:29:48 +00:00
|
|
|
list_add_tail(&cfts->node, &ss->cfts);
|
2014-02-12 14:29:49 +00:00
|
|
|
ret = cgroup_apply_cftypes(cfts, true);
|
2013-06-28 23:24:11 +00:00
|
|
|
if (ret)
|
2014-02-12 14:29:49 +00:00
|
|
|
cgroup_rm_cftypes_locked(cfts);
|
2012-04-01 19:09:56 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2013-06-28 23:24:11 +00:00
|
|
|
return ret;
|
2012-04-01 19:09:56 +00:00
|
|
|
}
|
|
|
|
|
2014-07-15 15:05:10 +00:00
|
|
|
/**
|
|
|
|
* cgroup_add_dfl_cftypes - add an array of cftypes for default hierarchy
|
|
|
|
* @ss: target cgroup subsystem
|
|
|
|
* @cfts: zero-length name terminated array of cftypes
|
|
|
|
*
|
|
|
|
* Similar to cgroup_add_cftypes() but the added files are only used for
|
|
|
|
* the default hierarchy.
|
|
|
|
*/
|
|
|
|
int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
|
|
|
|
{
|
|
|
|
struct cftype *cft;
|
|
|
|
|
|
|
|
for (cft = cfts; cft && cft->name[0] != '\0'; cft++)
|
2014-07-15 15:05:10 +00:00
|
|
|
cft->flags |= __CFTYPE_ONLY_ON_DFL;
|
2014-07-15 15:05:10 +00:00
|
|
|
return cgroup_add_cftypes(ss, cfts);
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_add_legacy_cftypes - add an array of cftypes for legacy hierarchies
|
|
|
|
* @ss: target cgroup subsystem
|
|
|
|
* @cfts: zero-length name terminated array of cftypes
|
|
|
|
*
|
|
|
|
* Similar to cgroup_add_cftypes() but the added files are only used for
|
|
|
|
* the legacy hierarchies.
|
|
|
|
*/
|
2014-07-15 15:05:09 +00:00
|
|
|
int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
|
|
|
|
{
|
2014-07-15 15:05:10 +00:00
|
|
|
struct cftype *cft;
|
|
|
|
|
2015-10-15 21:00:43 +00:00
|
|
|
for (cft = cfts; cft && cft->name[0] != '\0'; cft++)
|
|
|
|
cft->flags |= __CFTYPE_NOT_ON_DFL;
|
2014-07-15 15:05:09 +00:00
|
|
|
return cgroup_add_cftypes(ss, cfts);
|
|
|
|
}
|
|
|
|
|
2015-11-05 05:12:24 +00:00
|
|
|
/**
|
|
|
|
* cgroup_file_notify - generate a file modified event for a cgroup_file
|
|
|
|
* @cfile: target cgroup_file
|
|
|
|
*
|
|
|
|
* @cfile must have been obtained by setting cftype->file_offset.
|
|
|
|
*/
|
|
|
|
void cgroup_file_notify(struct cgroup_file *cfile)
|
|
|
|
{
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&cgroup_file_kn_lock, flags);
|
2018-04-26 21:29:04 +00:00
|
|
|
if (cfile->kn) {
|
|
|
|
unsigned long last = cfile->notified_at;
|
|
|
|
unsigned long next = last + CGROUP_FILE_NOTIFY_MIN_INTV;
|
|
|
|
|
|
|
|
if (time_in_range(jiffies, last, next)) {
|
|
|
|
timer_reduce(&cfile->notify_timer, next);
|
|
|
|
} else {
|
|
|
|
kernfs_notify(cfile->kn);
|
|
|
|
cfile->notified_at = jiffies;
|
|
|
|
}
|
|
|
|
}
|
2015-11-05 05:12:24 +00:00
|
|
|
spin_unlock_irqrestore(&cgroup_file_kn_lock, flags);
|
|
|
|
}
|
|
|
|
|
2022-08-28 05:04:40 +00:00
|
|
|
/**
|
|
|
|
* cgroup_file_show - show or hide a hidden cgroup file
|
|
|
|
* @cfile: target cgroup_file obtained by setting cftype->file_offset
|
|
|
|
* @show: whether to show or hide
|
|
|
|
*/
|
|
|
|
void cgroup_file_show(struct cgroup_file *cfile, bool show)
|
|
|
|
{
|
|
|
|
struct kernfs_node *kn;
|
|
|
|
|
|
|
|
spin_lock_irq(&cgroup_file_kn_lock);
|
|
|
|
kn = cfile->kn;
|
|
|
|
kernfs_get(kn);
|
|
|
|
spin_unlock_irq(&cgroup_file_kn_lock);
|
|
|
|
|
|
|
|
if (kn)
|
|
|
|
kernfs_show(kn, show);
|
|
|
|
|
|
|
|
kernfs_put(kn);
|
|
|
|
}
|
|
|
|
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
/**
|
2013-08-09 00:11:25 +00:00
|
|
|
* css_next_child - find the next child of a given css
|
2014-05-16 17:22:51 +00:00
|
|
|
* @pos: the current position (%NULL to initiate traversal)
|
|
|
|
* @parent: css whose children to walk
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
*
|
2014-05-16 17:22:51 +00:00
|
|
|
* This function returns the next child of @parent and should be called
|
2013-12-06 20:11:55 +00:00
|
|
|
* under either cgroup_mutex or RCU read lock. The only requirement is
|
2014-05-16 17:22:51 +00:00
|
|
|
* that @parent and @pos are accessible. The next sibling is guaranteed to
|
|
|
|
* be returned regardless of their states.
|
|
|
|
*
|
|
|
|
* If a subsystem synchronizes ->css_online() and the start of iteration, a
|
|
|
|
* css which finished ->css_online() is guaranteed to be visible in the
|
|
|
|
* future iterations and will stay visible until the last reference is put.
|
|
|
|
* A css which hasn't finished ->css_online() or already finished
|
|
|
|
* ->css_offline() may show up during traversal. It's each subsystem's
|
|
|
|
* responsibility to synchronize against on/offlining.
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
*/
|
2014-05-16 17:22:51 +00:00
|
|
|
struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos,
|
|
|
|
struct cgroup_subsys_state *parent)
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
{
|
2014-05-16 17:22:51 +00:00
|
|
|
struct cgroup_subsys_state *next;
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
|
2014-05-13 16:19:23 +00:00
|
|
|
cgroup_assert_mutex_or_rcu_locked();
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
|
|
|
|
/*
|
2014-05-16 17:22:49 +00:00
|
|
|
* @pos could already have been unlinked from the sibling list.
|
|
|
|
* Once a cgroup is removed, its ->sibling.next is no longer
|
|
|
|
* updated when its next sibling changes. CSS_RELEASED is set when
|
|
|
|
* @pos is taken off list, at which time its next pointer is valid,
|
|
|
|
* and, as releases are serialized, the one pointed to by the next
|
|
|
|
* pointer is guaranteed to not have started release yet. This
|
|
|
|
* implies that if we observe !CSS_RELEASED on @pos in this RCU
|
|
|
|
* critical section, the one pointed to by its next pointer is
|
|
|
|
* guaranteed to not have finished its RCU grace period even if we
|
2020-11-09 10:31:11 +00:00
|
|
|
* have dropped rcu_read_lock() in-between iterations.
|
2013-08-09 00:11:24 +00:00
|
|
|
*
|
2014-05-16 17:22:49 +00:00
|
|
|
* If @pos has CSS_RELEASED set, its next pointer can't be
|
|
|
|
* dereferenced; however, as each css is given a monotonically
|
|
|
|
* increasing unique serial number and always appended to the
|
|
|
|
* sibling list, the next one can be found by walking the parent's
|
|
|
|
* children until the first css with higher serial number than
|
|
|
|
* @pos's. While this path can be slower, it happens iff iteration
|
|
|
|
* races against release and the race window is very small.
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
*/
|
2013-08-09 00:11:24 +00:00
|
|
|
if (!pos) {
|
2014-05-16 17:22:51 +00:00
|
|
|
next = list_entry_rcu(parent->children.next, struct cgroup_subsys_state, sibling);
|
|
|
|
} else if (likely(!(pos->flags & CSS_RELEASED))) {
|
|
|
|
next = list_entry_rcu(pos->sibling.next, struct cgroup_subsys_state, sibling);
|
2013-08-09 00:11:24 +00:00
|
|
|
} else {
|
2020-01-18 03:10:51 +00:00
|
|
|
list_for_each_entry_rcu(next, &parent->children, sibling,
|
|
|
|
lockdep_is_held(&cgroup_mutex))
|
2013-08-09 00:11:24 +00:00
|
|
|
if (next->serial_nr > pos->serial_nr)
|
|
|
|
break;
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
}
|
|
|
|
|
2014-04-23 15:13:15 +00:00
|
|
|
/*
|
|
|
|
* @next, if not pointing to the head, can be dereferenced and is
|
2014-05-16 17:22:51 +00:00
|
|
|
* the next sibling.
|
2014-04-23 15:13:15 +00:00
|
|
|
*/
|
2014-05-16 17:22:51 +00:00
|
|
|
if (&next->sibling != &parent->children)
|
|
|
|
return next;
|
2014-04-23 15:13:15 +00:00
|
|
|
return NULL;
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
}
|
|
|
|
|
2012-11-09 17:12:29 +00:00
|
|
|
/**
|
2013-08-09 00:11:25 +00:00
|
|
|
* css_next_descendant_pre - find the next descendant for pre-order walk
|
2012-11-09 17:12:29 +00:00
|
|
|
* @pos: the current position (%NULL to initiate traversal)
|
2013-08-09 00:11:25 +00:00
|
|
|
* @root: css whose descendants to walk
|
2012-11-09 17:12:29 +00:00
|
|
|
*
|
2013-08-09 00:11:25 +00:00
|
|
|
* To be used by css_for_each_descendant_pre(). Find the next descendant
|
2013-08-09 00:11:27 +00:00
|
|
|
* to visit for pre-order traversal of @root's descendants. @root is
|
|
|
|
* included in the iteration and the first node to be visited.
|
2013-05-24 01:55:38 +00:00
|
|
|
*
|
2013-12-06 20:11:55 +00:00
|
|
|
* While this function requires cgroup_mutex or RCU read locking, it
|
|
|
|
* doesn't require the whole traversal to be contained in a single critical
|
|
|
|
* section. This function will return the correct next descendant as long
|
|
|
|
* as both @pos and @root are accessible and @pos is a descendant of @root.
|
2014-05-16 17:22:51 +00:00
|
|
|
*
|
|
|
|
* If a subsystem synchronizes ->css_online() and the start of iteration, a
|
|
|
|
* css which finished ->css_online() is guaranteed to be visible in the
|
|
|
|
* future iterations and will stay visible until the last reference is put.
|
|
|
|
* A css which hasn't finished ->css_online() or already finished
|
|
|
|
* ->css_offline() may show up during traversal. It's each subsystem's
|
|
|
|
* responsibility to synchronize against on/offlining.
|
2012-11-09 17:12:29 +00:00
|
|
|
*/
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup_subsys_state *
|
|
|
|
css_next_descendant_pre(struct cgroup_subsys_state *pos,
|
|
|
|
struct cgroup_subsys_state *root)
|
2012-11-09 17:12:29 +00:00
|
|
|
{
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup_subsys_state *next;
|
2012-11-09 17:12:29 +00:00
|
|
|
|
2014-05-13 16:19:23 +00:00
|
|
|
cgroup_assert_mutex_or_rcu_locked();
|
2012-11-09 17:12:29 +00:00
|
|
|
|
2013-08-09 00:11:27 +00:00
|
|
|
/* if first iteration, visit @root */
|
2013-05-24 01:50:24 +00:00
|
|
|
if (!pos)
|
2013-08-09 00:11:27 +00:00
|
|
|
return root;
|
2012-11-09 17:12:29 +00:00
|
|
|
|
|
|
|
/* visit the first child if exists */
|
2013-08-09 00:11:25 +00:00
|
|
|
next = css_next_child(NULL, pos);
|
2012-11-09 17:12:29 +00:00
|
|
|
if (next)
|
|
|
|
return next;
|
|
|
|
|
|
|
|
/* no child, visit my or the closest ancestor's next sibling */
|
2013-08-09 00:11:25 +00:00
|
|
|
while (pos != root) {
|
2014-05-16 17:22:48 +00:00
|
|
|
next = css_next_child(pos, pos->parent);
|
2013-05-24 01:55:38 +00:00
|
|
|
if (next)
|
2012-11-09 17:12:29 +00:00
|
|
|
return next;
|
2014-05-16 17:22:48 +00:00
|
|
|
pos = pos->parent;
|
2013-05-24 01:50:24 +00:00
|
|
|
}
|
2012-11-09 17:12:29 +00:00
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
2019-06-21 08:22:48 +00:00
|
|
|
EXPORT_SYMBOL_GPL(css_next_descendant_pre);
|
2012-11-09 17:12:29 +00:00
|
|
|
|
2013-01-07 16:49:33 +00:00
|
|
|
/**
|
2013-08-09 00:11:25 +00:00
|
|
|
* css_rightmost_descendant - return the rightmost descendant of a css
|
|
|
|
* @pos: css of interest
|
2013-01-07 16:49:33 +00:00
|
|
|
*
|
2013-08-09 00:11:25 +00:00
|
|
|
* Return the rightmost descendant of @pos. If there's no descendant, @pos
|
|
|
|
* is returned. This can be used during pre-order traversal to skip
|
2013-01-07 16:49:33 +00:00
|
|
|
* subtree of @pos.
|
2013-05-24 01:55:38 +00:00
|
|
|
*
|
2013-12-06 20:11:55 +00:00
|
|
|
* While this function requires cgroup_mutex or RCU read locking, it
|
|
|
|
* doesn't require the whole traversal to be contained in a single critical
|
|
|
|
* section. This function will return the correct rightmost descendant as
|
|
|
|
* long as @pos is accessible.
|
2013-01-07 16:49:33 +00:00
|
|
|
*/
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup_subsys_state *
|
|
|
|
css_rightmost_descendant(struct cgroup_subsys_state *pos)
|
2013-01-07 16:49:33 +00:00
|
|
|
{
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup_subsys_state *last, *tmp;
|
2013-01-07 16:49:33 +00:00
|
|
|
|
2014-05-13 16:19:23 +00:00
|
|
|
cgroup_assert_mutex_or_rcu_locked();
|
2013-01-07 16:49:33 +00:00
|
|
|
|
|
|
|
do {
|
|
|
|
last = pos;
|
|
|
|
/* ->prev isn't RCU safe, walk ->next till the end */
|
|
|
|
pos = NULL;
|
2013-08-09 00:11:25 +00:00
|
|
|
css_for_each_child(tmp, last)
|
2013-01-07 16:49:33 +00:00
|
|
|
pos = tmp;
|
|
|
|
} while (pos);
|
|
|
|
|
|
|
|
return last;
|
|
|
|
}
|
|
|
|
|
2013-08-09 00:11:25 +00:00
|
|
|
static struct cgroup_subsys_state *
|
|
|
|
css_leftmost_descendant(struct cgroup_subsys_state *pos)
|
2012-11-09 17:12:29 +00:00
|
|
|
{
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup_subsys_state *last;
|
2012-11-09 17:12:29 +00:00
|
|
|
|
|
|
|
do {
|
|
|
|
last = pos;
|
2013-08-09 00:11:25 +00:00
|
|
|
pos = css_next_child(NULL, pos);
|
2012-11-09 17:12:29 +00:00
|
|
|
} while (pos);
|
|
|
|
|
|
|
|
return last;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2013-08-09 00:11:25 +00:00
|
|
|
* css_next_descendant_post - find the next descendant for post-order walk
|
2012-11-09 17:12:29 +00:00
|
|
|
* @pos: the current position (%NULL to initiate traversal)
|
2013-08-09 00:11:25 +00:00
|
|
|
* @root: css whose descendants to walk
|
2012-11-09 17:12:29 +00:00
|
|
|
*
|
2013-08-09 00:11:25 +00:00
|
|
|
* To be used by css_for_each_descendant_post(). Find the next descendant
|
2013-08-09 00:11:27 +00:00
|
|
|
* to visit for post-order traversal of @root's descendants. @root is
|
|
|
|
* included in the iteration and the last node to be visited.
|
2013-05-24 01:55:38 +00:00
|
|
|
*
|
2013-12-06 20:11:55 +00:00
|
|
|
* While this function requires cgroup_mutex or RCU read locking, it
|
|
|
|
* doesn't require the whole traversal to be contained in a single critical
|
|
|
|
* section. This function will return the correct next descendant as long
|
|
|
|
* as both @pos and @cgroup are accessible and @pos is a descendant of
|
|
|
|
* @cgroup.
|
2014-05-16 17:22:51 +00:00
|
|
|
*
|
|
|
|
* If a subsystem synchronizes ->css_online() and the start of iteration, a
|
|
|
|
* css which finished ->css_online() is guaranteed to be visible in the
|
|
|
|
* future iterations and will stay visible until the last reference is put.
|
|
|
|
* A css which hasn't finished ->css_online() or already finished
|
|
|
|
* ->css_offline() may show up during traversal. It's each subsystem's
|
|
|
|
* responsibility to synchronize against on/offlining.
|
2012-11-09 17:12:29 +00:00
|
|
|
*/
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup_subsys_state *
|
|
|
|
css_next_descendant_post(struct cgroup_subsys_state *pos,
|
|
|
|
struct cgroup_subsys_state *root)
|
2012-11-09 17:12:29 +00:00
|
|
|
{
|
2013-08-09 00:11:25 +00:00
|
|
|
struct cgroup_subsys_state *next;
|
2012-11-09 17:12:29 +00:00
|
|
|
|
2014-05-13 16:19:23 +00:00
|
|
|
cgroup_assert_mutex_or_rcu_locked();
|
2012-11-09 17:12:29 +00:00
|
|
|
|
2013-09-06 19:31:08 +00:00
|
|
|
/* if first iteration, visit leftmost descendant which may be @root */
|
|
|
|
if (!pos)
|
|
|
|
return css_leftmost_descendant(root);
|
2012-11-09 17:12:29 +00:00
|
|
|
|
2013-08-09 00:11:27 +00:00
|
|
|
/* if we visited @root, we're done */
|
|
|
|
if (pos == root)
|
|
|
|
return NULL;
|
|
|
|
|
2012-11-09 17:12:29 +00:00
|
|
|
/* if there's an unvisited sibling, visit its leftmost descendant */
|
2014-05-16 17:22:48 +00:00
|
|
|
next = css_next_child(pos, pos->parent);
|
2013-05-24 01:55:38 +00:00
|
|
|
if (next)
|
2013-08-09 00:11:25 +00:00
|
|
|
return css_leftmost_descendant(next);
|
2012-11-09 17:12:29 +00:00
|
|
|
|
|
|
|
/* no sibling left, visit parent */
|
2014-05-16 17:22:48 +00:00
|
|
|
return pos->parent;
|
2012-11-09 17:12:29 +00:00
|
|
|
}
|
|
|
|
|
2014-05-16 17:22:52 +00:00
|
|
|
/**
|
|
|
|
* css_has_online_children - does a css have online children
|
|
|
|
* @css: the target css
|
|
|
|
*
|
|
|
|
* Returns %true if @css has any online children; otherwise, %false. This
|
|
|
|
* function can be called from any context but the caller is responsible
|
|
|
|
* for synchronizing against on/offlining as necessary.
|
|
|
|
*/
|
|
|
|
bool css_has_online_children(struct cgroup_subsys_state *css)
|
2014-05-14 13:15:01 +00:00
|
|
|
{
|
2014-05-16 17:22:52 +00:00
|
|
|
struct cgroup_subsys_state *child;
|
|
|
|
bool ret = false;
|
2014-05-14 13:15:01 +00:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2014-05-16 17:22:52 +00:00
|
|
|
css_for_each_child(child, css) {
|
2014-06-12 06:31:31 +00:00
|
|
|
if (child->flags & CSS_ONLINE) {
|
2014-05-16 17:22:52 +00:00
|
|
|
ret = true;
|
|
|
|
break;
|
2014-05-14 13:15:01 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
2014-05-16 17:22:52 +00:00
|
|
|
return ret;
|
2012-11-09 17:12:29 +00:00
|
|
|
}
|
|
|
|
|
2017-05-15 13:34:03 +00:00
|
|
|
static struct css_set *css_task_iter_next_css_set(struct css_task_iter *it)
|
|
|
|
{
|
|
|
|
struct list_head *l;
|
|
|
|
struct cgrp_cset_link *link;
|
|
|
|
struct css_set *cset;
|
|
|
|
|
|
|
|
lockdep_assert_held(&css_set_lock);
|
|
|
|
|
|
|
|
/* find the next threaded cset */
|
|
|
|
if (it->tcset_pos) {
|
|
|
|
l = it->tcset_pos->next;
|
|
|
|
|
|
|
|
if (l != it->tcset_head) {
|
|
|
|
it->tcset_pos = l;
|
|
|
|
return container_of(l, struct css_set,
|
|
|
|
threaded_csets_node);
|
|
|
|
}
|
|
|
|
|
|
|
|
it->tcset_pos = NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* find the next cset */
|
|
|
|
l = it->cset_pos;
|
|
|
|
l = l->next;
|
|
|
|
if (l == it->cset_head) {
|
|
|
|
it->cset_pos = NULL;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (it->ss) {
|
|
|
|
cset = container_of(l, struct css_set, e_cset_node[it->ss->id]);
|
|
|
|
} else {
|
|
|
|
link = list_entry(l, struct cgrp_cset_link, cset_link);
|
|
|
|
cset = link->cset;
|
|
|
|
}
|
|
|
|
|
|
|
|
it->cset_pos = l;
|
|
|
|
|
|
|
|
/* initialize threaded css_set walking */
|
|
|
|
if (it->flags & CSS_TASK_ITER_THREADED) {
|
|
|
|
if (it->cur_dcset)
|
|
|
|
put_css_set_locked(it->cur_dcset);
|
|
|
|
it->cur_dcset = cset;
|
|
|
|
get_css_set(cset);
|
|
|
|
|
|
|
|
it->tcset_head = &cset->threaded_csets;
|
|
|
|
it->tcset_pos = &cset->threaded_csets;
|
|
|
|
}
|
|
|
|
|
|
|
|
return cset;
|
|
|
|
}
|
|
|
|
|
2013-08-09 00:11:26 +00:00
|
|
|
/**
|
2020-11-09 10:31:11 +00:00
|
|
|
* css_task_iter_advance_css_set - advance a task iterator to the next css_set
|
2013-08-09 00:11:26 +00:00
|
|
|
* @it: the iterator to advance
|
|
|
|
*
|
|
|
|
* Advance @it to the next css_set to walk.
|
2013-08-09 00:11:26 +00:00
|
|
|
*/
|
2015-10-15 20:41:52 +00:00
|
|
|
static void css_task_iter_advance_css_set(struct css_task_iter *it)
|
2013-08-09 00:11:26 +00:00
|
|
|
{
|
|
|
|
struct css_set *cset;
|
|
|
|
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2020-01-24 11:40:16 +00:00
|
|
|
/* Advance to the next non-empty css_set and find first non-empty tasks list*/
|
|
|
|
while ((cset = css_task_iter_next_css_set(it))) {
|
|
|
|
if (!list_empty(&cset->tasks)) {
|
|
|
|
it->cur_tasks_head = &cset->tasks;
|
|
|
|
break;
|
|
|
|
} else if (!list_empty(&cset->mg_tasks)) {
|
|
|
|
it->cur_tasks_head = &cset->mg_tasks;
|
|
|
|
break;
|
|
|
|
} else if (!list_empty(&cset->dying_tasks)) {
|
|
|
|
it->cur_tasks_head = &cset->dying_tasks;
|
|
|
|
break;
|
2013-08-09 00:11:26 +00:00
|
|
|
}
|
2020-01-24 11:40:15 +00:00
|
|
|
}
|
2020-01-24 11:40:16 +00:00
|
|
|
if (!cset) {
|
|
|
|
it->task_pos = NULL;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
it->task_pos = it->cur_tasks_head->next;
|
2015-10-15 20:41:52 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't keep css_sets locked across iteration steps and thus
|
|
|
|
* need to take steps to ensure that iteration can be resumed after
|
|
|
|
* the lock is re-acquired. Iteration is performed at two levels -
|
|
|
|
* css_sets and tasks in them.
|
|
|
|
*
|
|
|
|
* Once created, a css_set never leaves its cgroup lists, so a
|
|
|
|
* pinned css_set is guaranteed to stay put and we can resume
|
|
|
|
* iteration afterwards.
|
|
|
|
*
|
|
|
|
* Tasks may leave @cset across iteration steps. This is resolved
|
|
|
|
* by registering each iterator with the css_set currently being
|
|
|
|
* walked and making css_set_move_task() advance iterators whose
|
|
|
|
* next task is leaving.
|
|
|
|
*/
|
|
|
|
if (it->cur_cset) {
|
|
|
|
list_del(&it->iters_node);
|
|
|
|
put_css_set_locked(it->cur_cset);
|
|
|
|
}
|
|
|
|
get_css_set(cset);
|
|
|
|
it->cur_cset = cset;
|
|
|
|
list_add(&it->iters_node, &cset->task_iters);
|
2013-08-09 00:11:26 +00:00
|
|
|
}
|
|
|
|
|
2019-05-31 17:38:58 +00:00
|
|
|
static void css_task_iter_skip(struct css_task_iter *it,
|
|
|
|
struct task_struct *task)
|
2015-10-15 20:41:52 +00:00
|
|
|
{
|
2019-05-31 17:38:58 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
|
|
|
|
|
|
|
if (it->task_pos == &task->cg_list) {
|
|
|
|
it->task_pos = it->task_pos->next;
|
|
|
|
it->flags |= CSS_TASK_ITER_SKIPPED;
|
|
|
|
}
|
|
|
|
}
|
2015-10-15 20:41:52 +00:00
|
|
|
|
|
|
|
static void css_task_iter_advance(struct css_task_iter *it)
|
|
|
|
{
|
2019-05-31 17:38:58 +00:00
|
|
|
struct task_struct *task;
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2015-10-15 20:41:53 +00:00
|
|
|
lockdep_assert_held(&css_set_lock);
|
2017-05-15 13:34:01 +00:00
|
|
|
repeat:
|
2018-11-08 20:15:15 +00:00
|
|
|
if (it->task_pos) {
|
|
|
|
/*
|
2020-01-24 11:40:16 +00:00
|
|
|
* Advance iterator to find next entry. We go through cset
|
|
|
|
* tasks, mg_tasks and dying_tasks, when consumed we move onto
|
|
|
|
* the next cset.
|
2018-11-08 20:15:15 +00:00
|
|
|
*/
|
2019-05-31 17:38:58 +00:00
|
|
|
if (it->flags & CSS_TASK_ITER_SKIPPED)
|
|
|
|
it->flags &= ~CSS_TASK_ITER_SKIPPED;
|
|
|
|
else
|
|
|
|
it->task_pos = it->task_pos->next;
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2020-01-24 11:40:16 +00:00
|
|
|
if (it->task_pos == &it->cur_cset->tasks) {
|
|
|
|
it->cur_tasks_head = &it->cur_cset->mg_tasks;
|
|
|
|
it->task_pos = it->cur_tasks_head->next;
|
2020-01-24 11:40:15 +00:00
|
|
|
}
|
2020-01-24 11:40:16 +00:00
|
|
|
if (it->task_pos == &it->cur_cset->mg_tasks) {
|
|
|
|
it->cur_tasks_head = &it->cur_cset->dying_tasks;
|
|
|
|
it->task_pos = it->cur_tasks_head->next;
|
2020-01-24 11:40:15 +00:00
|
|
|
}
|
2020-01-24 11:40:16 +00:00
|
|
|
if (it->task_pos == &it->cur_cset->dying_tasks)
|
2018-11-08 20:15:15 +00:00
|
|
|
css_task_iter_advance_css_set(it);
|
|
|
|
} else {
|
|
|
|
/* called from start, proceed to the first cset */
|
2015-10-15 20:41:52 +00:00
|
|
|
css_task_iter_advance_css_set(it);
|
2018-11-08 20:15:15 +00:00
|
|
|
}
|
2017-05-15 13:34:01 +00:00
|
|
|
|
2019-05-31 17:38:58 +00:00
|
|
|
if (!it->task_pos)
|
|
|
|
return;
|
|
|
|
|
|
|
|
task = list_entry(it->task_pos, struct task_struct, cg_list);
|
|
|
|
|
|
|
|
if (it->flags & CSS_TASK_ITER_PROCS) {
|
|
|
|
/* if PROCS, skip over tasks which aren't group leaders */
|
|
|
|
if (!thread_group_leader(task))
|
|
|
|
goto repeat;
|
|
|
|
|
|
|
|
/* and dying leaders w/o live member threads */
|
2020-01-24 11:40:16 +00:00
|
|
|
if (it->cur_tasks_head == &it->cur_cset->dying_tasks &&
|
2020-01-24 11:40:15 +00:00
|
|
|
!atomic_read(&task->signal->live))
|
2019-05-31 17:38:58 +00:00
|
|
|
goto repeat;
|
|
|
|
} else {
|
|
|
|
/* skip all dying ones */
|
2020-01-24 11:40:16 +00:00
|
|
|
if (it->cur_tasks_head == &it->cur_cset->dying_tasks)
|
2019-05-31 17:38:58 +00:00
|
|
|
goto repeat;
|
|
|
|
}
|
2015-10-15 20:41:52 +00:00
|
|
|
}
|
|
|
|
|
2013-08-09 00:11:26 +00:00
|
|
|
/**
|
2013-08-09 00:11:26 +00:00
|
|
|
* css_task_iter_start - initiate task iteration
|
|
|
|
* @css: the css to walk tasks of
|
2017-05-15 13:34:01 +00:00
|
|
|
* @flags: CSS_TASK_ITER_* flags
|
2013-08-09 00:11:26 +00:00
|
|
|
* @it: the task iterator to use
|
|
|
|
*
|
2013-08-09 00:11:26 +00:00
|
|
|
* Initiate iteration through the tasks of @css. The caller can call
|
|
|
|
* css_task_iter_next() to walk through the tasks until the function
|
|
|
|
* returns NULL. On completion of iteration, css_task_iter_end() must be
|
|
|
|
* called.
|
2013-08-09 00:11:26 +00:00
|
|
|
*/
|
2017-05-15 13:34:01 +00:00
|
|
|
void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
|
2013-08-09 00:11:26 +00:00
|
|
|
struct css_task_iter *it)
|
2007-10-19 06:39:36 +00:00
|
|
|
{
|
2023-10-18 06:17:39 +00:00
|
|
|
unsigned long irqflags;
|
|
|
|
|
2015-10-15 20:41:52 +00:00
|
|
|
memset(it, 0, sizeof(*it));
|
|
|
|
|
2023-10-18 06:17:39 +00:00
|
|
|
spin_lock_irqsave(&css_set_lock, irqflags);
|
2013-08-09 00:11:26 +00:00
|
|
|
|
2014-04-23 15:13:15 +00:00
|
|
|
it->ss = css->ss;
|
2017-05-15 13:34:01 +00:00
|
|
|
it->flags = flags;
|
2014-04-23 15:13:15 +00:00
|
|
|
|
cgroup: Avoid compiler warnings with no subsystems
As done before in commit cb4a31675270 ("cgroup: use bitmask to filter
for_each_subsys"), avoid compiler warnings for the pathological case of
having no subsystems (i.e. CGROUP_SUBSYS_COUNT == 0). This condition is
hit for the arm multi_v7_defconfig config under -Wzero-length-bounds:
In file included from ./arch/arm/include/generated/asm/rwonce.h:1,
from include/linux/compiler.h:264,
from include/uapi/linux/swab.h:6,
from include/linux/swab.h:5,
from arch/arm/include/asm/opcodes.h:86,
from arch/arm/include/asm/bug.h:7,
from include/linux/bug.h:5,
from include/linux/thread_info.h:13,
from include/asm-generic/current.h:5,
from ./arch/arm/include/generated/asm/current.h:1,
from include/linux/sched.h:12,
from include/linux/cgroup.h:12,
from kernel/cgroup/cgroup-internal.h:5,
from kernel/cgroup/cgroup.c:31:
kernel/cgroup/cgroup.c: In function 'of_css':
kernel/cgroup/cgroup.c:651:42: warning: array subscript '<unknown>' is outside the bounds of an
interior zero-length array 'struct cgroup_subsys_state *[0]' [-Wzero-length-bounds]
651 | return rcu_dereference_raw(cgrp->subsys[cft->ss->id]);
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-08-28 00:02:55 +00:00
|
|
|
if (CGROUP_HAS_SUBSYS_CONFIG && it->ss)
|
2014-04-23 15:13:15 +00:00
|
|
|
it->cset_pos = &css->cgroup->e_csets[css->ss->id];
|
|
|
|
else
|
|
|
|
it->cset_pos = &css->cgroup->cset_links;
|
|
|
|
|
2014-04-23 15:13:15 +00:00
|
|
|
it->cset_head = it->cset_pos;
|
2013-08-09 00:11:26 +00:00
|
|
|
|
2018-11-08 20:15:15 +00:00
|
|
|
css_task_iter_advance(it);
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2023-10-18 06:17:39 +00:00
|
|
|
spin_unlock_irqrestore(&css_set_lock, irqflags);
|
2007-10-19 06:39:36 +00:00
|
|
|
}
|
|
|
|
|
2013-08-09 00:11:26 +00:00
|
|
|
/**
|
2013-08-09 00:11:26 +00:00
|
|
|
* css_task_iter_next - return the next task for the iterator
|
2013-08-09 00:11:26 +00:00
|
|
|
* @it: the task iterator being iterated
|
|
|
|
*
|
|
|
|
* The "next" function for task iteration. @it should have been
|
2013-08-09 00:11:26 +00:00
|
|
|
* initialized via css_task_iter_start(). Returns NULL when the iteration
|
|
|
|
* reaches the end.
|
2013-08-09 00:11:26 +00:00
|
|
|
*/
|
2013-08-09 00:11:26 +00:00
|
|
|
struct task_struct *css_task_iter_next(struct css_task_iter *it)
|
2007-10-19 06:39:36 +00:00
|
|
|
{
|
2023-10-18 06:17:39 +00:00
|
|
|
unsigned long irqflags;
|
|
|
|
|
2015-10-29 02:43:05 +00:00
|
|
|
if (it->cur_task) {
|
2015-10-15 20:41:52 +00:00
|
|
|
put_task_struct(it->cur_task);
|
2015-10-29 02:43:05 +00:00
|
|
|
it->cur_task = NULL;
|
|
|
|
}
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2023-10-18 06:17:39 +00:00
|
|
|
spin_lock_irqsave(&css_set_lock, irqflags);
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2019-06-05 16:54:34 +00:00
|
|
|
/* @it may be half-advanced by skips, finish advancing */
|
|
|
|
if (it->flags & CSS_TASK_ITER_SKIPPED)
|
|
|
|
css_task_iter_advance(it);
|
|
|
|
|
2015-10-29 02:43:05 +00:00
|
|
|
if (it->task_pos) {
|
|
|
|
it->cur_task = list_entry(it->task_pos, struct task_struct,
|
|
|
|
cg_list);
|
|
|
|
get_task_struct(it->cur_task);
|
|
|
|
css_task_iter_advance(it);
|
|
|
|
}
|
2015-10-15 20:41:52 +00:00
|
|
|
|
2023-10-18 06:17:39 +00:00
|
|
|
spin_unlock_irqrestore(&css_set_lock, irqflags);
|
2015-10-15 20:41:52 +00:00
|
|
|
|
|
|
|
return it->cur_task;
|
2007-10-19 06:39:36 +00:00
|
|
|
}
|
|
|
|
|
2013-08-09 00:11:26 +00:00
|
|
|
/**
|
2013-08-09 00:11:26 +00:00
|
|
|
* css_task_iter_end - finish task iteration
|
2013-08-09 00:11:26 +00:00
|
|
|
* @it: the task iterator to finish
|
|
|
|
*
|
2013-08-09 00:11:26 +00:00
|
|
|
* Finish task iteration started by css_task_iter_start().
|
2013-08-09 00:11:26 +00:00
|
|
|
*/
|
2013-08-09 00:11:26 +00:00
|
|
|
void css_task_iter_end(struct css_task_iter *it)
|
2008-02-07 08:14:42 +00:00
|
|
|
{
|
2023-10-18 06:17:39 +00:00
|
|
|
unsigned long irqflags;
|
|
|
|
|
2015-10-15 20:41:52 +00:00
|
|
|
if (it->cur_cset) {
|
2023-10-18 06:17:39 +00:00
|
|
|
spin_lock_irqsave(&css_set_lock, irqflags);
|
2015-10-15 20:41:52 +00:00
|
|
|
list_del(&it->iters_node);
|
|
|
|
put_css_set_locked(it->cur_cset);
|
2023-10-18 06:17:39 +00:00
|
|
|
spin_unlock_irqrestore(&css_set_lock, irqflags);
|
2015-10-15 20:41:52 +00:00
|
|
|
}
|
|
|
|
|
2017-05-15 13:34:03 +00:00
|
|
|
if (it->cur_dcset)
|
|
|
|
put_css_set(it->cur_dcset);
|
|
|
|
|
2015-10-15 20:41:52 +00:00
|
|
|
if (it->cur_task)
|
|
|
|
put_task_struct(it->cur_task);
|
2008-02-07 08:14:42 +00:00
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:04 +00:00
|
|
|
static void cgroup_procs_release(struct kernfs_open_file *of)
|
2008-02-07 08:14:42 +00:00
|
|
|
{
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
|
|
|
|
|
|
|
if (ctx->procs.started)
|
|
|
|
css_task_iter_end(&ctx->procs.iter);
|
2016-12-27 19:49:04 +00:00
|
|
|
}
|
2016-03-08 16:51:25 +00:00
|
|
|
|
2016-12-27 19:49:04 +00:00
|
|
|
static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct kernfs_open_file *of = s->private;
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
2008-02-07 08:14:42 +00:00
|
|
|
|
2020-01-30 10:34:59 +00:00
|
|
|
if (pos)
|
|
|
|
(*pos)++;
|
|
|
|
|
2022-01-06 21:02:29 +00:00
|
|
|
return css_task_iter_next(&ctx->procs.iter);
|
2016-12-27 19:49:04 +00:00
|
|
|
}
|
2008-02-07 08:14:42 +00:00
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
|
|
|
|
unsigned int iter_flags)
|
2016-12-27 19:49:04 +00:00
|
|
|
{
|
|
|
|
struct kernfs_open_file *of = s->private;
|
|
|
|
struct cgroup *cgrp = seq_css(s)->cgroup;
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
|
|
|
struct css_task_iter *it = &ctx->procs.iter;
|
2013-04-07 16:29:50 +00:00
|
|
|
|
2014-02-25 15:04:03 +00:00
|
|
|
/*
|
2016-12-27 19:49:04 +00:00
|
|
|
* When a seq_file is seeked, it's always traversed sequentially
|
|
|
|
* from position 0, so we can simply keep iterating on !0 *pos.
|
2014-02-25 15:04:03 +00:00
|
|
|
*/
|
2022-01-06 21:02:29 +00:00
|
|
|
if (!ctx->procs.started) {
|
2020-01-30 10:34:59 +00:00
|
|
|
if (WARN_ON_ONCE((*pos)))
|
2016-12-27 19:49:04 +00:00
|
|
|
return ERR_PTR(-EINVAL);
|
2017-05-15 13:34:03 +00:00
|
|
|
css_task_iter_start(&cgrp->self, iter_flags, it);
|
2022-01-06 21:02:29 +00:00
|
|
|
ctx->procs.started = true;
|
2020-01-30 10:34:59 +00:00
|
|
|
} else if (!(*pos)) {
|
2016-12-27 19:49:04 +00:00
|
|
|
css_task_iter_end(it);
|
2017-05-15 13:34:03 +00:00
|
|
|
css_task_iter_start(&cgrp->self, iter_flags, it);
|
2020-01-30 10:34:59 +00:00
|
|
|
} else
|
|
|
|
return it->cur_task;
|
2007-10-19 06:39:32 +00:00
|
|
|
|
2016-12-27 19:49:04 +00:00
|
|
|
return cgroup_procs_next(s, NULL, NULL);
|
|
|
|
}
|
2012-01-20 03:58:43 +00:00
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = seq_css(s)->cgroup;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* All processes of a threaded subtree belong to the domain cgroup
|
|
|
|
* of the subtree. Only threads can be distributed across the
|
|
|
|
* subtree. Reject reads on cgroup.procs in the subtree proper.
|
|
|
|
* They're always empty anyway.
|
|
|
|
*/
|
|
|
|
if (cgroup_is_threaded(cgrp))
|
|
|
|
return ERR_PTR(-EOPNOTSUPP);
|
|
|
|
|
|
|
|
return __cgroup_procs_start(s, pos, CSS_TASK_ITER_PROCS |
|
|
|
|
CSS_TASK_ITER_THREADED);
|
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:04 +00:00
|
|
|
static int cgroup_procs_show(struct seq_file *s, void *v)
|
2007-10-19 06:39:32 +00:00
|
|
|
{
|
2017-05-15 13:34:01 +00:00
|
|
|
seq_printf(s, "%d\n", task_pid_vnr(v));
|
2010-10-27 22:33:35 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-02-05 13:26:21 +00:00
|
|
|
static int cgroup_may_write(const struct cgroup *cgrp, struct super_block *sb)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct inode *inode;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
inode = kernfs_get_inode(sb, cgrp->procs_file.kn);
|
|
|
|
if (!inode)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2023-01-13 11:49:22 +00:00
|
|
|
ret = inode_permission(&nop_mnt_idmap, inode, MAY_WRITE);
|
2020-02-05 13:26:21 +00:00
|
|
|
iput(inode);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-05-15 13:34:00 +00:00
|
|
|
static int cgroup_procs_write_permission(struct cgroup *src_cgrp,
|
|
|
|
struct cgroup *dst_cgrp,
|
2022-01-06 21:02:29 +00:00
|
|
|
struct super_block *sb,
|
|
|
|
struct cgroup_namespace *ns)
|
2017-05-15 13:34:00 +00:00
|
|
|
{
|
|
|
|
struct cgroup *com_cgrp = src_cgrp;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
/* find the common ancestor */
|
|
|
|
while (!cgroup_is_descendant(dst_cgrp, com_cgrp))
|
|
|
|
com_cgrp = cgroup_parent(com_cgrp);
|
|
|
|
|
|
|
|
/* %current should be authorized to migrate to the common ancestor */
|
2020-02-05 13:26:21 +00:00
|
|
|
ret = cgroup_may_write(com_cgrp, sb);
|
2017-05-15 13:34:00 +00:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If namespaces are delegation boundaries, %current must be able
|
|
|
|
* to see both source and destination cgroups from its namespace.
|
|
|
|
*/
|
|
|
|
if ((cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE) &&
|
|
|
|
(!cgroup_is_descendant(src_cgrp, ns->root_cset->dfl_cgrp) ||
|
|
|
|
!cgroup_is_descendant(dst_cgrp, ns->root_cset->dfl_cgrp)))
|
|
|
|
return -ENOENT;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-02-05 13:26:18 +00:00
|
|
|
static int cgroup_attach_permissions(struct cgroup *src_cgrp,
|
|
|
|
struct cgroup *dst_cgrp,
|
2022-01-06 21:02:29 +00:00
|
|
|
struct super_block *sb, bool threadgroup,
|
|
|
|
struct cgroup_namespace *ns)
|
2020-02-05 13:26:18 +00:00
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
2022-01-06 21:02:29 +00:00
|
|
|
ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb, ns);
|
2020-02-05 13:26:18 +00:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = cgroup_migrate_vet_dst(dst_cgrp);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
if (!threadgroup && (src_cgrp->dom_cgrp != dst_cgrp->dom_cgrp))
|
|
|
|
ret = -EOPNOTSUPP;
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-01-14 12:44:27 +00:00
|
|
|
static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
|
|
|
|
bool threadgroup)
|
2017-05-15 13:34:00 +00:00
|
|
|
{
|
2022-01-06 21:02:29 +00:00
|
|
|
struct cgroup_file_ctx *ctx = of->priv;
|
2017-05-15 13:34:00 +00:00
|
|
|
struct cgroup *src_cgrp, *dst_cgrp;
|
|
|
|
struct task_struct *task;
|
2022-01-06 21:02:28 +00:00
|
|
|
const struct cred *saved_cred;
|
2017-05-15 13:34:00 +00:00
|
|
|
ssize_t ret;
|
2022-08-15 23:27:38 +00:00
|
|
|
bool threadgroup_locked;
|
2017-05-15 13:34:00 +00:00
|
|
|
|
|
|
|
dst_cgrp = cgroup_kn_lock_live(of->kn, false);
|
|
|
|
if (!dst_cgrp)
|
|
|
|
return -ENODEV;
|
|
|
|
|
2022-08-15 23:27:38 +00:00
|
|
|
task = cgroup_procs_write_start(buf, threadgroup, &threadgroup_locked);
|
2017-05-15 13:34:00 +00:00
|
|
|
ret = PTR_ERR_OR_ZERO(task);
|
|
|
|
if (ret)
|
|
|
|
goto out_unlock;
|
|
|
|
|
|
|
|
/* find the source cgroup */
|
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
|
|
|
|
spin_unlock_irq(&css_set_lock);
|
|
|
|
|
2022-01-06 21:02:28 +00:00
|
|
|
/*
|
|
|
|
* Process and thread migrations follow same delegation rule. Check
|
|
|
|
* permissions using the credentials from file open to protect against
|
|
|
|
* inherited fd attacks.
|
|
|
|
*/
|
|
|
|
saved_cred = override_creds(of->file->f_cred);
|
2020-02-05 13:26:18 +00:00
|
|
|
ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
|
2022-01-06 21:02:29 +00:00
|
|
|
of->file->f_path.dentry->d_sb,
|
|
|
|
threadgroup, ctx->ns);
|
2022-01-06 21:02:28 +00:00
|
|
|
revert_creds(saved_cred);
|
2017-05-15 13:34:00 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_finish;
|
|
|
|
|
2021-01-14 12:44:27 +00:00
|
|
|
ret = cgroup_attach_task(dst_cgrp, task, threadgroup);
|
2017-05-15 13:34:00 +00:00
|
|
|
|
|
|
|
out_finish:
|
2022-08-15 23:27:38 +00:00
|
|
|
cgroup_procs_write_finish(task, threadgroup_locked);
|
2017-05-15 13:34:00 +00:00
|
|
|
out_unlock:
|
|
|
|
cgroup_kn_unlock(of->kn);
|
|
|
|
|
2021-01-14 12:44:27 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_procs_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes, loff_t off)
|
|
|
|
{
|
|
|
|
return __cgroup_procs_write(of, buf, true) ?: nbytes;
|
2017-05-15 13:34:00 +00:00
|
|
|
}
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
static void *cgroup_threads_start(struct seq_file *s, loff_t *pos)
|
|
|
|
{
|
|
|
|
return __cgroup_procs_start(s, pos, 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t cgroup_threads_write(struct kernfs_open_file *of,
|
|
|
|
char *buf, size_t nbytes, loff_t off)
|
|
|
|
{
|
2021-01-14 12:44:27 +00:00
|
|
|
return __cgroup_procs_write(of, buf, false) ?: nbytes;
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
}
|
|
|
|
|
2014-07-15 15:05:09 +00:00
|
|
|
/* cgroup core interface files for the default hierarchy */
|
2016-12-27 19:49:08 +00:00
|
|
|
static struct cftype cgroup_base_files[] = {
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
{
|
|
|
|
.name = "cgroup.type",
|
|
|
|
.flags = CFTYPE_NOT_ON_ROOT,
|
|
|
|
.seq_show = cgroup_type_show,
|
|
|
|
.write = cgroup_type_write,
|
|
|
|
},
|
2007-10-19 06:39:38 +00:00
|
|
|
{
|
2013-06-04 02:14:34 +00:00
|
|
|
.name = "cgroup.procs",
|
2017-06-27 18:30:28 +00:00
|
|
|
.flags = CFTYPE_NS_DELEGATABLE,
|
2015-09-18 21:54:23 +00:00
|
|
|
.file_offset = offsetof(struct cgroup, procs_file),
|
2016-12-27 19:49:04 +00:00
|
|
|
.release = cgroup_procs_release,
|
|
|
|
.seq_start = cgroup_procs_start,
|
|
|
|
.seq_next = cgroup_procs_next,
|
|
|
|
.seq_show = cgroup_procs_show,
|
2014-05-13 16:16:22 +00:00
|
|
|
.write = cgroup_procs_write,
|
2009-09-23 22:56:26 +00:00
|
|
|
},
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
{
|
|
|
|
.name = "cgroup.threads",
|
2018-01-10 12:35:12 +00:00
|
|
|
.flags = CFTYPE_NS_DELEGATABLE,
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
.release = cgroup_procs_release,
|
|
|
|
.seq_start = cgroup_threads_start,
|
|
|
|
.seq_next = cgroup_procs_next,
|
|
|
|
.seq_show = cgroup_procs_show,
|
|
|
|
.write = cgroup_threads_write,
|
|
|
|
},
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
{
|
|
|
|
.name = "cgroup.controllers",
|
|
|
|
.seq_show = cgroup_controllers_show,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.name = "cgroup.subtree_control",
|
2017-06-27 18:30:28 +00:00
|
|
|
.flags = CFTYPE_NS_DELEGATABLE,
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
.seq_show = cgroup_subtree_control_show,
|
2014-05-13 16:16:21 +00:00
|
|
|
.write = cgroup_subtree_control_write,
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
},
|
2014-04-25 22:28:02 +00:00
|
|
|
{
|
2015-09-18 21:54:22 +00:00
|
|
|
.name = "cgroup.events",
|
2014-07-15 15:05:09 +00:00
|
|
|
.flags = CFTYPE_NOT_ON_ROOT,
|
2015-09-18 21:54:23 +00:00
|
|
|
.file_offset = offsetof(struct cgroup, events_file),
|
2015-09-18 21:54:22 +00:00
|
|
|
.seq_show = cgroup_events_show,
|
2014-04-25 22:28:02 +00:00
|
|
|
},
|
2017-07-28 17:28:44 +00:00
|
|
|
{
|
|
|
|
.name = "cgroup.max.descendants",
|
|
|
|
.seq_show = cgroup_max_descendants_show,
|
|
|
|
.write = cgroup_max_descendants_write,
|
|
|
|
},
|
|
|
|
{
|
|
|
|
.name = "cgroup.max.depth",
|
|
|
|
.seq_show = cgroup_max_depth_show,
|
|
|
|
.write = cgroup_max_depth_write,
|
|
|
|
},
|
2017-08-02 16:55:31 +00:00
|
|
|
{
|
|
|
|
.name = "cgroup.stat",
|
2017-08-11 12:49:01 +00:00
|
|
|
.seq_show = cgroup_stat_show,
|
2017-08-02 16:55:31 +00:00
|
|
|
},
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
{
|
|
|
|
.name = "cgroup.freeze",
|
|
|
|
.flags = CFTYPE_NOT_ON_ROOT,
|
|
|
|
.seq_show = cgroup_freeze_show,
|
|
|
|
.write = cgroup_freeze_write,
|
|
|
|
},
|
2021-05-08 12:15:38 +00:00
|
|
|
{
|
|
|
|
.name = "cgroup.kill",
|
|
|
|
.flags = CFTYPE_NOT_ON_ROOT,
|
|
|
|
.write = cgroup_kill_write,
|
|
|
|
},
|
2017-10-23 23:18:27 +00:00
|
|
|
{
|
|
|
|
.name = "cpu.stat",
|
|
|
|
.seq_show = cpu_stat_show,
|
|
|
|
},
|
2023-06-20 18:32:47 +00:00
|
|
|
{
|
|
|
|
.name = "cpu.stat.local",
|
|
|
|
.seq_show = cpu_local_stat_show,
|
|
|
|
},
|
2022-09-06 19:38:55 +00:00
|
|
|
{ } /* terminate */
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct cftype cgroup_psi_files[] = {
|
2018-10-26 22:06:31 +00:00
|
|
|
#ifdef CONFIG_PSI
|
|
|
|
{
|
|
|
|
.name = "io.pressure",
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
.file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
|
2018-10-26 22:06:31 +00:00
|
|
|
.seq_show = cgroup_io_pressure_show,
|
2019-05-14 22:41:15 +00:00
|
|
|
.write = cgroup_io_pressure_write,
|
|
|
|
.poll = cgroup_pressure_poll,
|
|
|
|
.release = cgroup_pressure_release,
|
2018-10-26 22:06:31 +00:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.name = "memory.pressure",
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
.file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
|
2018-10-26 22:06:31 +00:00
|
|
|
.seq_show = cgroup_memory_pressure_show,
|
2019-05-14 22:41:15 +00:00
|
|
|
.write = cgroup_memory_pressure_write,
|
|
|
|
.poll = cgroup_pressure_poll,
|
|
|
|
.release = cgroup_pressure_release,
|
2018-10-26 22:06:31 +00:00
|
|
|
},
|
|
|
|
{
|
|
|
|
.name = "cpu.pressure",
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
.file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
|
2018-10-26 22:06:31 +00:00
|
|
|
.seq_show = cgroup_cpu_pressure_show,
|
2019-05-14 22:41:15 +00:00
|
|
|
.write = cgroup_cpu_pressure_write,
|
|
|
|
.poll = cgroup_pressure_poll,
|
|
|
|
.release = cgroup_pressure_release,
|
2018-10-26 22:06:31 +00:00
|
|
|
},
|
2022-08-25 16:41:08 +00:00
|
|
|
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
|
|
|
|
{
|
|
|
|
.name = "irq.pressure",
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
.file_offset = offsetof(struct cgroup, psi_files[PSI_IRQ]),
|
2022-08-25 16:41:08 +00:00
|
|
|
.seq_show = cgroup_irq_pressure_show,
|
|
|
|
.write = cgroup_irq_pressure_write,
|
|
|
|
.poll = cgroup_pressure_poll,
|
|
|
|
.release = cgroup_pressure_release,
|
|
|
|
},
|
|
|
|
#endif
|
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
2022-09-07 09:03:32 +00:00
|
|
|
{
|
|
|
|
.name = "cgroup.pressure",
|
|
|
|
.seq_show = cgroup_pressure_show,
|
|
|
|
.write = cgroup_pressure_write,
|
|
|
|
},
|
2019-05-14 22:41:15 +00:00
|
|
|
#endif /* CONFIG_PSI */
|
2014-07-15 15:05:09 +00:00
|
|
|
{ } /* terminate */
|
|
|
|
};
|
2013-06-04 02:14:34 +00:00
|
|
|
|
cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup. To enable
such usages, css destruction is being decoupled from cgroup
destruction. Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup. When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup. After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path. css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period. css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-14 00:22:51 +00:00
|
|
|
/*
|
|
|
|
* css destruction is four-stage process.
|
|
|
|
*
|
|
|
|
* 1. Destruction starts. Killing of the percpu_ref is initiated.
|
|
|
|
* Implemented in kill_css().
|
|
|
|
*
|
|
|
|
* 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
|
2014-05-13 16:11:01 +00:00
|
|
|
* and thus css_tryget_online() is guaranteed to fail, the css can be
|
|
|
|
* offlined by invoking offline_css(). After offlining, the base ref is
|
|
|
|
* put. Implemented in css_killed_work_fn().
|
cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup. To enable
such usages, css destruction is being decoupled from cgroup
destruction. Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup. When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup. After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path. css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period. css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-14 00:22:51 +00:00
|
|
|
*
|
|
|
|
* 3. When the percpu_ref reaches zero, the only possible remaining
|
|
|
|
* accessors are inside RCU read sections. css_release() schedules the
|
|
|
|
* RCU callback.
|
|
|
|
*
|
|
|
|
* 4. After the grace period, the css can be freed. Implemented in
|
2023-08-01 12:40:34 +00:00
|
|
|
* css_free_rwork_fn().
|
cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup. To enable
such usages, css destruction is being decoupled from cgroup
destruction. Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup. When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup. After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path. css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period. css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-14 00:22:51 +00:00
|
|
|
*
|
|
|
|
* It is actually hairier because both step 2 and 4 require process context
|
|
|
|
* and thus involve punting to css->destroy_work adding two additional
|
|
|
|
* steps to the already complex sequence.
|
|
|
|
*/
|
2018-03-14 19:45:14 +00:00
|
|
|
static void css_free_rwork_fn(struct work_struct *work)
|
cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references. If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir). To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior. This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch. Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2012-04-01 19:09:56 +00:00
|
|
|
{
|
2018-03-14 19:45:14 +00:00
|
|
|
struct cgroup_subsys_state *css = container_of(to_rcu_work(work),
|
|
|
|
struct cgroup_subsys_state, destroy_rwork);
|
2015-02-12 22:59:26 +00:00
|
|
|
struct cgroup_subsys *ss = css->ss;
|
cgroup: RCU protect each cgroup_subsys_state release
With the planned unified hierarchy, individual css's will be created
and destroyed dynamically across the lifetime of a cgroup. To enable
such usages, css destruction is being decoupled from cgroup
destruction. Most of the destruction path has been decoupled but the
actual free of css still depends on cgroup free path.
When all css refs are drained, css_release() kicks off
css_free_work_fn() which puts the cgroup. When the cgroup refcnt
reaches zero, cgroup_diput() is invoked which in turn schedules RCU
free of the cgroup. After a grace period, all css's are freed along
with the cgroup itself.
This patch moves the RCU grace period and css freeing from cgroup
release path to css release path. css_release(), instead of kicking
off css_free_work_fn() directly, schedules RCU callback
css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
RCU grace period. css_free_work_fn() is updated to free the css
directly.
The five-way punting - percpu ref kill confirmation, a work item,
percpu ref release, RCU grace period, and again a work item - is quite
hairy but the work items are there only to provide process context and
the actual sequence is kill confirm -> release -> RCU free, which
isn't simple but not too crazy.
This removes cgroup_css() usage after offline_css() allowing clearing
cgroup->subsys[] from offline_css(), which makes it consistent with
online_css() and brings it closer to proper lifetime management for
individual css's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-14 00:22:51 +00:00
|
|
|
struct cgroup *cgrp = css->cgroup;
|
cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references. If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir). To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior. This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch. Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2012-04-01 19:09:56 +00:00
|
|
|
|
2014-06-28 12:10:14 +00:00
|
|
|
percpu_ref_exit(&css->refcnt);
|
|
|
|
|
2015-02-12 22:59:26 +00:00
|
|
|
if (ss) {
|
2014-05-14 13:15:02 +00:00
|
|
|
/* css free path */
|
2016-01-21 20:32:15 +00:00
|
|
|
struct cgroup_subsys_state *parent = css->parent;
|
2015-02-12 22:59:26 +00:00
|
|
|
int id = css->id;
|
|
|
|
|
|
|
|
ss->css_free(css);
|
|
|
|
cgroup_idr_remove(&ss->css_idr, id);
|
2014-05-14 13:15:02 +00:00
|
|
|
cgroup_put(cgrp);
|
2016-01-21 20:32:15 +00:00
|
|
|
|
|
|
|
if (parent)
|
|
|
|
css_put(parent);
|
2014-05-14 13:15:02 +00:00
|
|
|
} else {
|
|
|
|
/* cgroup free path */
|
|
|
|
atomic_dec(&cgrp->root->nr_cgrps);
|
2016-12-27 19:49:08 +00:00
|
|
|
cgroup1_pidlist_destroy_all(cgrp);
|
2014-09-18 08:06:19 +00:00
|
|
|
cancel_work_sync(&cgrp->release_agent_work);
|
bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Similar to sk/inode/task storage, implement similar cgroup local storage.
There already exists a local storage implementation for cgroup-attached
bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases such that non-cgroup
attached bpf progs wants to access cgroup local storage data. For example,
tc egress prog has access to sk and cgroup. It is possible to use
sk local storage to emulate cgroup local storage by storing data in socket.
But this is a waste as it could be lots of sockets belonging to a particular
cgroup. Alternatively, a separate map can be created with cgroup id as the key.
But this will introduce additional overhead to manipulate the new map.
A cgroup local storage, similar to existing sk/inode/task storage,
should help for this use case.
The life-cycle of storage is managed with the life-cycle of the
cgroup struct. i.e. the storage is destroyed along with the owning cgroup
with a call to bpf_cgrp_storage_free() when cgroup itself
is deleted.
The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.
Typically, the following code is used to get the current cgroup:
struct task_struct *task = bpf_get_current_task_btf();
... task->cgroups->dfl_cgrp ...
and in structure task_struct definition:
struct task_struct {
....
struct css_set __rcu *cgroups;
....
}
With sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
So the current implementation only supports non-sleepable program and supporting
sleepable program will be the next step together with adding rcu_read_lock
protection for rcu tagged structures.
Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
for cgroup storage available to non-cgroup-attached bpf programs. The old
cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
functionality. While old cgroup storage pre-allocates storage memory, the new
mechanism can also pre-allocate with a user space bpf_map_update_elem() call
to avoid potential run-time memory allocation failure.
Therefore, the new cgroup storage can provide all functionality w.r.t.
the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to
BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
be deprecated since the new one can provide the same functionality.
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-26 04:28:50 +00:00
|
|
|
bpf_cgrp_storage_free(cgrp);
|
2014-05-14 13:15:02 +00:00
|
|
|
|
2014-05-16 17:22:48 +00:00
|
|
|
if (cgroup_parent(cgrp)) {
|
2014-05-14 13:15:02 +00:00
|
|
|
/*
|
|
|
|
* We get a ref to the parent, and put the ref when
|
|
|
|
* this cgroup is being freed, so it's guaranteed
|
|
|
|
* that the parent won't be destroyed before its
|
|
|
|
* children.
|
|
|
|
*/
|
2014-05-16 17:22:48 +00:00
|
|
|
cgroup_put(cgroup_parent(cgrp));
|
2014-05-14 13:15:02 +00:00
|
|
|
kernfs_put(cgrp->kn);
|
2018-10-26 22:06:31 +00:00
|
|
|
psi_cgroup_free(cgrp);
|
2021-04-30 05:56:20 +00:00
|
|
|
cgroup_rstat_exit(cgrp);
|
2014-05-14 13:15:02 +00:00
|
|
|
kfree(cgrp);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* This is root cgroup's refcnt reaching zero,
|
|
|
|
* which indicates that the root should be
|
|
|
|
* released.
|
|
|
|
*/
|
|
|
|
cgroup_destroy_root(cgrp->root);
|
|
|
|
}
|
|
|
|
}
|
cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references. If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir). To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior. This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch. Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2012-04-01 19:09:56 +00:00
|
|
|
}
|
|
|
|
|
2014-05-14 13:15:02 +00:00
|
|
|
static void css_release_work_fn(struct work_struct *work)
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
{
|
|
|
|
struct cgroup_subsys_state *css =
|
2014-05-14 13:15:02 +00:00
|
|
|
container_of(work, struct cgroup_subsys_state, destroy_work);
|
2014-05-04 19:09:14 +00:00
|
|
|
struct cgroup_subsys *ss = css->ss;
|
2014-05-14 13:15:02 +00:00
|
|
|
struct cgroup *cgrp = css->cgroup;
|
2014-05-04 19:09:14 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2014-05-16 17:22:49 +00:00
|
|
|
|
2014-05-16 17:22:49 +00:00
|
|
|
css->flags |= CSS_RELEASED;
|
2014-05-16 17:22:49 +00:00
|
|
|
list_del_rcu(&css->sibling);
|
|
|
|
|
2014-05-14 13:15:02 +00:00
|
|
|
if (ss) {
|
|
|
|
/* css release path */
|
2018-04-26 21:29:05 +00:00
|
|
|
if (!list_empty(&css->rstat_css_node)) {
|
|
|
|
cgroup_rstat_flush(cgrp);
|
|
|
|
list_del_rcu(&css->rstat_css_node);
|
|
|
|
}
|
|
|
|
|
2015-02-12 22:59:26 +00:00
|
|
|
cgroup_idr_replace(&ss->css_idr, NULL, css->id);
|
2014-11-18 07:49:51 +00:00
|
|
|
if (ss->css_released)
|
|
|
|
ss->css_released(css);
|
2014-05-14 13:15:02 +00:00
|
|
|
} else {
|
2017-08-02 16:55:29 +00:00
|
|
|
struct cgroup *tcgrp;
|
|
|
|
|
2014-05-14 13:15:02 +00:00
|
|
|
/* cgroup release path */
|
cgroup/tracing: Move taking of spin lock out of trace event handlers
It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-07-09 21:48:54 +00:00
|
|
|
TRACE_CGROUP_PATH(release, cgrp);
|
2016-08-10 15:23:44 +00:00
|
|
|
|
2021-04-30 05:56:20 +00:00
|
|
|
cgroup_rstat_flush(cgrp);
|
2017-09-25 15:12:05 +00:00
|
|
|
|
2019-04-19 17:03:03 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2017-08-02 16:55:29 +00:00
|
|
|
for (tcgrp = cgroup_parent(cgrp); tcgrp;
|
|
|
|
tcgrp = cgroup_parent(tcgrp))
|
|
|
|
tcgrp->nr_dying_descendants--;
|
2019-04-19 17:03:03 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2017-08-02 16:55:29 +00:00
|
|
|
|
2014-09-04 06:43:07 +00:00
|
|
|
/*
|
|
|
|
* There are two control paths which try to determine
|
|
|
|
* cgroup from dentry without going through kernfs -
|
|
|
|
* cgroupstats_build() and css_tryget_online_from_dir().
|
|
|
|
* Those are supported by RCU protecting clearing of
|
|
|
|
* cgrp->kn->priv backpointer.
|
|
|
|
*/
|
2016-03-03 14:57:58 +00:00
|
|
|
if (cgrp->kn)
|
|
|
|
RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv,
|
|
|
|
NULL);
|
2014-05-14 13:15:02 +00:00
|
|
|
}
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2014-05-16 17:22:49 +00:00
|
|
|
|
2018-03-14 19:45:14 +00:00
|
|
|
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
|
|
|
|
queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void css_release(struct percpu_ref *ref)
|
|
|
|
{
|
|
|
|
struct cgroup_subsys_state *css =
|
|
|
|
container_of(ref, struct cgroup_subsys_state, refcnt);
|
|
|
|
|
2014-05-14 13:15:02 +00:00
|
|
|
INIT_WORK(&css->destroy_work, css_release_work_fn);
|
|
|
|
queue_work(cgroup_destroy_wq, &css->destroy_work);
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
}
|
|
|
|
|
2014-05-04 19:09:14 +00:00
|
|
|
static void init_and_link_css(struct cgroup_subsys_state *css,
|
|
|
|
struct cgroup_subsys *ss, struct cgroup *cgrp)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2014-05-16 17:22:49 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2017-04-28 19:14:55 +00:00
|
|
|
cgroup_get_live(cgrp);
|
2014-05-04 19:09:14 +00:00
|
|
|
|
2014-05-16 17:22:48 +00:00
|
|
|
memset(css, 0, sizeof(*css));
|
2007-10-19 06:40:44 +00:00
|
|
|
css->cgroup = cgrp;
|
2013-08-09 00:11:22 +00:00
|
|
|
css->ss = ss;
|
2016-05-26 19:42:13 +00:00
|
|
|
css->id = -1;
|
2014-05-16 17:22:48 +00:00
|
|
|
INIT_LIST_HEAD(&css->sibling);
|
|
|
|
INIT_LIST_HEAD(&css->children);
|
2018-04-26 21:29:05 +00:00
|
|
|
INIT_LIST_HEAD(&css->rstat_css_node);
|
2014-05-16 17:22:49 +00:00
|
|
|
css->serial_nr = css_serial_nr_next++;
|
2016-01-21 20:31:11 +00:00
|
|
|
atomic_set(&css->online_cnt, 0);
|
2013-08-13 15:01:54 +00:00
|
|
|
|
2014-05-16 17:22:48 +00:00
|
|
|
if (cgroup_parent(cgrp)) {
|
|
|
|
css->parent = cgroup_css(cgroup_parent(cgrp), ss);
|
2014-05-04 19:09:14 +00:00
|
|
|
css_get(css->parent);
|
|
|
|
}
|
cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references. If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.
This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir). To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).
Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior. This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.
Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch. Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.
http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2012-04-01 19:09:56 +00:00
|
|
|
|
2021-04-30 05:56:20 +00:00
|
|
|
if (ss->css_rstat_flush)
|
2018-04-26 21:29:05 +00:00
|
|
|
list_add_rcu(&css->rstat_css_node, &cgrp->rstat_css_list);
|
|
|
|
|
2013-08-26 22:40:56 +00:00
|
|
|
BUG_ON(cgroup_css(cgrp, ss));
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2013-07-31 08:16:40 +00:00
|
|
|
/* invoke ->css_online() on a new CSS and mark it online if successful */
|
2013-08-13 15:01:55 +00:00
|
|
|
static int online_css(struct cgroup_subsys_state *css)
|
2012-11-19 16:13:37 +00:00
|
|
|
{
|
2013-08-13 15:01:55 +00:00
|
|
|
struct cgroup_subsys *ss = css->ss;
|
2012-11-19 16:13:38 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
2012-11-19 16:13:37 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2012-11-19 16:13:38 +00:00
|
|
|
if (ss->css_online)
|
2013-08-09 00:11:23 +00:00
|
|
|
ret = ss->css_online(css);
|
2013-08-14 00:22:50 +00:00
|
|
|
if (!ret) {
|
2013-08-09 00:11:23 +00:00
|
|
|
css->flags |= CSS_ONLINE;
|
2014-02-08 15:36:58 +00:00
|
|
|
rcu_assign_pointer(css->cgroup->subsys[ss->id], css);
|
2016-01-21 20:31:11 +00:00
|
|
|
|
|
|
|
atomic_inc(&css->online_cnt);
|
|
|
|
if (css->parent)
|
|
|
|
atomic_inc(&css->parent->online_cnt);
|
2013-08-14 00:22:50 +00:00
|
|
|
}
|
2012-11-19 16:13:38 +00:00
|
|
|
return ret;
|
2012-11-19 16:13:37 +00:00
|
|
|
}
|
|
|
|
|
2013-07-31 08:16:40 +00:00
|
|
|
/* if the CSS is online, invoke ->css_offline() on it and mark it offline */
|
2013-08-13 15:01:55 +00:00
|
|
|
static void offline_css(struct cgroup_subsys_state *css)
|
2012-11-19 16:13:37 +00:00
|
|
|
{
|
2013-08-13 15:01:55 +00:00
|
|
|
struct cgroup_subsys *ss = css->ss;
|
2012-11-19 16:13:37 +00:00
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
if (!(css->flags & CSS_ONLINE))
|
|
|
|
return;
|
|
|
|
|
2013-03-12 22:35:59 +00:00
|
|
|
if (ss->css_offline)
|
2013-08-09 00:11:23 +00:00
|
|
|
ss->css_offline(css);
|
2012-11-19 16:13:37 +00:00
|
|
|
|
2013-08-09 00:11:23 +00:00
|
|
|
css->flags &= ~CSS_ONLINE;
|
2014-04-23 15:13:15 +00:00
|
|
|
RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
|
cgroup: implement dynamic subtree controller enable/disable on the default hierarchy
cgroup is switching away from multiple hierarchies and will use one
unified default hierarchy where controllers can be dynamically enabled
and disabled per subtree. The default hierarchy will serve as the
unified hierarchy to which all controllers are attached and a css on
the default hierarchy would need to also serve the tasks of descendant
cgroups which don't have the controller enabled - ie. the tree may be
collapsed from leaf towards root when viewed from specific
controllers. This has been implemented through effective css in the
previous patches.
This patch finally implements dynamic subtree controller
enable/disable on the default hierarchy via a new knob -
"cgroup.subtree_control" which controls which controllers are enabled
on the child cgroups. Let's assume a hierarchy like the following.
root - A - B - C
\ D
root's "cgroup.subtree_control" determines which controllers are
enabled on A. A's on B. B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in "cgroup.subtree_control" declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.
The default hierarchy has an extra restriction - only cgroups which
don't contain any task may have controllers enabled in
"cgroup.subtree_control". Combined with the other properties of the
default hierarchy, this guarantees that, from the view point of
controllers, tasks are only on the leaf cgroups. In other words, only
leaf csses may contain tasks. This rules out situations where child
cgroups compete against internal tasks of the parent, which is a
competition between two different types of entities without any clear
way to determine resource distribution between the two. Different
controllers handle it differently and all the implemented behaviors
are ambiguous, ad-hoc, cumbersome and/or just wrong. Having this
structural constraints imposed from cgroup core removes the burden
from controller implementations and enables showing one consistent
behavior across all controllers.
When a controller is enabled or disabled, css associations for the
controller in the subtrees of each child should be updated. After
enabling, the whole subtree of a child should point to the new css of
the child. After disabling, the whole subtree of a child should point
to the cgroup's css. This is implemented by first updating cgroup
states such that cgroup_e_css() result points to the appropriate css
and then invoking cgroup_update_dfl_csses() which migrates all tasks
in the affected subtrees to the self cgroup on the default hierarchy.
* When read, "cgroup.subtree_control" lists all the currently enabled
controllers on the children of the cgroup.
* White-space separated list of controller names prefixed with either
'+' or '-' can be written to "cgroup.subtree_control". The ones
prefixed with '+' are enabled on the controller and '-' disabled.
* A controller can be enabled iff the parent's
"cgroup.subtree_control" enables it and disabled iff no child's
"cgroup.subtree_control" has it enabled.
* If a cgroup has tasks, no controller can be enabled via
"cgroup.subtree_control". Likewise, if "cgroup.subtree_control" has
some controllers enabled, tasks can't be migrated into the cgroup.
* All controllers which aren't bound on other hierarchies are
automatically associated with the root cgroup of the default
hierarchy. All the controllers which are bound to the default
hierarchy are listed in the read-only file "cgroup.controllers" in
the root directory.
* "cgroup.controllers" in all non-root cgroups is read-only file whose
content is equal to that of "cgroup.subtree_control" of the parent.
This indicates which controllers can be used in the cgroup's
"cgroup.subtree_control".
This is still experimental and there are some holes, one of which is
that ->can_attach() failure during cgroup_update_dfl_csses() may leave
the cgroups in an undefined state. The issues will be addressed by
future patches.
v2: Non-root cgroups now also have "cgroup.controllers".
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2014-04-23 15:13:16 +00:00
|
|
|
|
|
|
|
wake_up_all(&css->cgroup->offline_waitq);
|
2012-11-19 16:13:37 +00:00
|
|
|
}
|
|
|
|
|
2013-12-06 20:11:56 +00:00
|
|
|
/**
|
2016-03-03 14:57:58 +00:00
|
|
|
* css_create - create a cgroup_subsys_state
|
2013-12-06 20:11:56 +00:00
|
|
|
* @cgrp: the cgroup new css will be associated with
|
|
|
|
* @ss: the subsys of new css
|
|
|
|
*
|
|
|
|
* Create a new css associated with @cgrp - @ss pair. On success, the new
|
2016-03-03 14:57:58 +00:00
|
|
|
* css is online and installed in @cgrp. This function doesn't create the
|
|
|
|
* interface files. Returns 0 on success, -errno on failure.
|
2013-12-06 20:11:56 +00:00
|
|
|
*/
|
2016-03-03 14:57:58 +00:00
|
|
|
static struct cgroup_subsys_state *css_create(struct cgroup *cgrp,
|
|
|
|
struct cgroup_subsys *ss)
|
2013-12-06 20:11:56 +00:00
|
|
|
{
|
2014-05-16 17:22:48 +00:00
|
|
|
struct cgroup *parent = cgroup_parent(cgrp);
|
2014-05-16 17:22:49 +00:00
|
|
|
struct cgroup_subsys_state *parent_css = cgroup_css(parent, ss);
|
2013-12-06 20:11:56 +00:00
|
|
|
struct cgroup_subsys_state *css;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2014-05-16 17:22:49 +00:00
|
|
|
css = ss->css_alloc(parent_css);
|
2016-06-21 17:06:24 +00:00
|
|
|
if (!css)
|
|
|
|
css = ERR_PTR(-ENOMEM);
|
2013-12-06 20:11:56 +00:00
|
|
|
if (IS_ERR(css))
|
2016-03-03 14:57:58 +00:00
|
|
|
return css;
|
2013-12-06 20:11:56 +00:00
|
|
|
|
2014-05-04 19:09:14 +00:00
|
|
|
init_and_link_css(css, ss, cgrp);
|
2014-05-04 19:09:14 +00:00
|
|
|
|
2014-09-24 17:31:50 +00:00
|
|
|
err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
|
2013-12-06 20:11:56 +00:00
|
|
|
if (err)
|
2014-03-18 09:02:36 +00:00
|
|
|
goto err_free_css;
|
2013-12-06 20:11:56 +00:00
|
|
|
|
2015-08-03 12:32:26 +00:00
|
|
|
err = cgroup_idr_alloc(&ss->css_idr, NULL, 2, 0, GFP_KERNEL);
|
2014-05-04 19:09:14 +00:00
|
|
|
if (err < 0)
|
2016-05-13 14:59:20 +00:00
|
|
|
goto err_free_css;
|
2014-05-04 19:09:14 +00:00
|
|
|
css->id = err;
|
2013-12-06 20:11:56 +00:00
|
|
|
|
2014-05-04 19:09:14 +00:00
|
|
|
/* @css is ready to be brought online now, make it visible */
|
2014-05-16 17:22:49 +00:00
|
|
|
list_add_tail_rcu(&css->sibling, &parent_css->children);
|
2014-05-04 19:09:14 +00:00
|
|
|
cgroup_idr_replace(&ss->css_idr, css, css->id);
|
2013-12-06 20:11:56 +00:00
|
|
|
|
|
|
|
err = online_css(css);
|
|
|
|
if (err)
|
2014-05-16 17:22:49 +00:00
|
|
|
goto err_list_del;
|
2014-03-19 14:23:54 +00:00
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
return css;
|
2013-12-06 20:11:56 +00:00
|
|
|
|
2014-05-16 17:22:49 +00:00
|
|
|
err_list_del:
|
|
|
|
list_del_rcu(&css->sibling);
|
2014-03-18 09:02:36 +00:00
|
|
|
err_free_css:
|
2018-04-26 21:29:05 +00:00
|
|
|
list_del_rcu(&css->rstat_css_node);
|
2018-03-14 19:45:14 +00:00
|
|
|
INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
|
|
|
|
queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
|
2016-03-03 14:57:58 +00:00
|
|
|
return ERR_PTR(err);
|
2013-12-06 20:11:56 +00:00
|
|
|
}
|
|
|
|
|
2017-01-26 21:47:28 +00:00
|
|
|
/*
|
|
|
|
* The returned cgroup is fully initialized including its control mask, but
|
2023-07-19 09:06:40 +00:00
|
|
|
* it doesn't have the control mask applied.
|
2017-01-26 21:47:28 +00:00
|
|
|
*/
|
2019-11-04 23:54:30 +00:00
|
|
|
static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
|
|
|
|
umode_t mode)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2016-03-03 14:57:58 +00:00
|
|
|
struct cgroup_root *root = parent->root;
|
|
|
|
struct cgroup *cgrp, *tcgrp;
|
2019-11-04 23:54:30 +00:00
|
|
|
struct kernfs_node *kn;
|
2016-03-03 14:57:58 +00:00
|
|
|
int level = parent->level + 1;
|
2016-03-03 14:58:00 +00:00
|
|
|
int ret;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2012-11-19 17:02:12 +00:00
|
|
|
/* allocate the cgroup and its ID, 0 is reserved for the root */
|
2022-07-29 23:10:16 +00:00
|
|
|
cgrp = kzalloc(struct_size(cgrp, ancestors, (level + 1)), GFP_KERNEL);
|
2016-03-03 14:57:58 +00:00
|
|
|
if (!cgrp)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2014-02-11 08:05:46 +00:00
|
|
|
|
2014-09-24 17:31:50 +00:00
|
|
|
ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
|
2014-05-14 13:15:02 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_free_cgrp;
|
|
|
|
|
2021-04-30 05:56:20 +00:00
|
|
|
ret = cgroup_rstat_init(cgrp);
|
|
|
|
if (ret)
|
|
|
|
goto out_cancel_ref;
|
2017-09-25 15:12:05 +00:00
|
|
|
|
2019-11-04 23:54:30 +00:00
|
|
|
/* create the directory */
|
2023-12-08 09:33:09 +00:00
|
|
|
kn = kernfs_create_dir_ns(parent->kn, name, mode,
|
|
|
|
current_fsuid(), current_fsgid(),
|
|
|
|
cgrp, NULL);
|
2019-11-04 23:54:30 +00:00
|
|
|
if (IS_ERR(kn)) {
|
|
|
|
ret = PTR_ERR(kn);
|
2017-09-25 15:12:05 +00:00
|
|
|
goto out_stat_exit;
|
2012-11-05 17:16:59 +00:00
|
|
|
}
|
2019-11-04 23:54:30 +00:00
|
|
|
cgrp->kn = kn;
|
2012-11-05 17:16:59 +00:00
|
|
|
|
2008-10-19 03:28:04 +00:00
|
|
|
init_cgroup_housekeeping(cgrp);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2014-05-14 13:15:00 +00:00
|
|
|
cgrp->self.parent = &parent->self;
|
2014-05-13 16:19:22 +00:00
|
|
|
cgrp->root = root;
|
2015-11-20 20:55:52 +00:00
|
|
|
cgrp->level = level;
|
2018-10-26 22:06:31 +00:00
|
|
|
|
|
|
|
ret = psi_cgroup_alloc(cgrp);
|
2017-10-03 05:50:21 +00:00
|
|
|
if (ret)
|
2019-11-04 23:54:30 +00:00
|
|
|
goto out_kernfs_remove;
|
2015-11-20 20:55:52 +00:00
|
|
|
|
2018-10-26 22:06:31 +00:00
|
|
|
ret = cgroup_bpf_inherit(cgrp);
|
|
|
|
if (ret)
|
|
|
|
goto out_psi_free;
|
|
|
|
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
/*
|
|
|
|
* New cgroup inherits effective freeze counter, and
|
|
|
|
* if the parent has to be frozen, the child has too.
|
|
|
|
*/
|
|
|
|
cgrp->freezer.e_freeze = parent->freezer.e_freeze;
|
2019-09-12 17:56:45 +00:00
|
|
|
if (cgrp->freezer.e_freeze) {
|
|
|
|
/*
|
|
|
|
* Set the CGRP_FREEZE flag, so when a process will be
|
|
|
|
* attached to the child cgroup, it will become frozen.
|
|
|
|
* At this point the new cgroup is unpopulated, so we can
|
|
|
|
* consider it frozen immediately.
|
|
|
|
*/
|
|
|
|
set_bit(CGRP_FREEZE, &cgrp->flags);
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
set_bit(CGRP_FROZEN, &cgrp->flags);
|
2019-09-12 17:56:45 +00:00
|
|
|
}
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
|
2019-04-19 17:03:03 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2017-08-02 16:55:29 +00:00
|
|
|
for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
|
2022-07-29 23:10:16 +00:00
|
|
|
cgrp->ancestors[tcgrp->level] = tcgrp;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
if (tcgrp != cgrp) {
|
2017-08-02 16:55:29 +00:00
|
|
|
tcgrp->nr_descendants++;
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the new cgroup is frozen, all ancestor cgroups
|
|
|
|
* get a new frozen descendant, but their state can't
|
|
|
|
* change because of this.
|
|
|
|
*/
|
|
|
|
if (cgrp->freezer.e_freeze)
|
|
|
|
tcgrp->freezer.nr_frozen_descendants++;
|
|
|
|
}
|
2017-08-02 16:55:29 +00:00
|
|
|
}
|
2019-04-19 17:03:03 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2017-08-02 16:55:29 +00:00
|
|
|
|
2008-03-04 22:28:19 +00:00
|
|
|
if (notify_on_release(parent))
|
|
|
|
set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
|
|
|
|
|
2012-11-19 16:13:38 +00:00
|
|
|
if (test_bit(CGRP_CPUSET_CLONE_CHILDREN, &parent->flags))
|
|
|
|
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &cgrp->flags);
|
2010-10-27 22:33:35 +00:00
|
|
|
|
2014-05-16 17:22:49 +00:00
|
|
|
cgrp->self.serial_nr = css_serial_nr_next++;
|
cgroup: add cgroup->serial_nr and implement cgroup_next_sibling()
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
2013-05-24 01:55:38 +00:00
|
|
|
|
2012-11-19 16:13:36 +00:00
|
|
|
/* allocation complete, commit to creation */
|
2014-05-16 17:22:48 +00:00
|
|
|
list_add_tail_rcu(&cgrp->self.sibling, &cgroup_parent(cgrp)->self.children);
|
2014-02-12 14:29:50 +00:00
|
|
|
atomic_inc(&root->nr_cgrps);
|
2017-04-28 19:14:55 +00:00
|
|
|
cgroup_get_live(parent);
|
2013-04-08 06:35:02 +00:00
|
|
|
|
2014-04-23 15:13:16 +00:00
|
|
|
/*
|
|
|
|
* On the default hierarchy, a child doesn't automatically inherit
|
2014-07-08 22:02:56 +00:00
|
|
|
* subtree_control from the parent. Each is configured manually.
|
2014-04-23 15:13:16 +00:00
|
|
|
*/
|
2016-03-03 14:58:00 +00:00
|
|
|
if (!cgroup_on_dfl(cgrp))
|
2016-03-03 14:57:58 +00:00
|
|
|
cgrp->subtree_control = cgroup_control(cgrp);
|
2016-03-03 14:58:00 +00:00
|
|
|
|
|
|
|
cgroup_propagate_control(cgrp);
|
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
return cgrp;
|
|
|
|
|
2018-10-26 22:06:31 +00:00
|
|
|
out_psi_free:
|
|
|
|
psi_cgroup_free(cgrp);
|
2019-11-04 23:54:30 +00:00
|
|
|
out_kernfs_remove:
|
|
|
|
kernfs_remove(cgrp->kn);
|
2017-09-25 15:12:05 +00:00
|
|
|
out_stat_exit:
|
2021-04-30 05:56:20 +00:00
|
|
|
cgroup_rstat_exit(cgrp);
|
2016-03-03 14:57:58 +00:00
|
|
|
out_cancel_ref:
|
|
|
|
percpu_ref_exit(&cgrp->self.refcnt);
|
|
|
|
out_free_cgrp:
|
|
|
|
kfree(cgrp);
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
|
|
|
|
2017-07-28 17:28:44 +00:00
|
|
|
static bool cgroup_check_hierarchy_limits(struct cgroup *parent)
|
|
|
|
{
|
|
|
|
struct cgroup *cgroup;
|
|
|
|
int ret = false;
|
|
|
|
int level = 1;
|
|
|
|
|
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
|
|
|
for (cgroup = parent; cgroup; cgroup = cgroup_parent(cgroup)) {
|
|
|
|
if (cgroup->nr_descendants >= cgroup->max_descendants)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
if (level > cgroup->max_depth)
|
|
|
|
goto fail;
|
|
|
|
|
|
|
|
level++;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = true;
|
|
|
|
fail:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-12-27 19:49:08 +00:00
|
|
|
int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
|
2016-03-03 14:57:58 +00:00
|
|
|
{
|
|
|
|
struct cgroup *parent, *cgrp;
|
2016-03-03 14:58:00 +00:00
|
|
|
int ret;
|
2016-03-03 14:57:58 +00:00
|
|
|
|
|
|
|
/* do not accept '\n' to prevent making /proc/<pid>/cgroup unparsable */
|
|
|
|
if (strchr(name, '\n'))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2016-03-03 14:58:00 +00:00
|
|
|
parent = cgroup_kn_lock_live(parent_kn, false);
|
2016-03-03 14:57:58 +00:00
|
|
|
if (!parent)
|
|
|
|
return -ENODEV;
|
|
|
|
|
2017-07-28 17:28:44 +00:00
|
|
|
if (!cgroup_check_hierarchy_limits(parent)) {
|
|
|
|
ret = -EAGAIN;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2019-11-04 23:54:30 +00:00
|
|
|
cgrp = cgroup_create(parent, name, mode);
|
2016-03-03 14:57:58 +00:00
|
|
|
if (IS_ERR(cgrp)) {
|
|
|
|
ret = PTR_ERR(cgrp);
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
/*
|
|
|
|
* This extra ref will be put in cgroup_free_fn() and guarantees
|
|
|
|
* that @cgrp->kn is always accessible.
|
|
|
|
*/
|
2019-11-04 23:54:30 +00:00
|
|
|
kernfs_get(cgrp->kn);
|
2016-03-03 14:57:58 +00:00
|
|
|
|
2016-03-03 14:58:01 +00:00
|
|
|
ret = css_populate_dir(&cgrp->self);
|
2016-03-03 14:57:58 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_destroy;
|
|
|
|
|
2016-03-03 14:58:00 +00:00
|
|
|
ret = cgroup_apply_control_enable(cgrp);
|
|
|
|
if (ret)
|
|
|
|
goto out_destroy;
|
2016-03-03 14:57:58 +00:00
|
|
|
|
cgroup/tracing: Move taking of spin lock out of trace event handlers
It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-07-09 21:48:54 +00:00
|
|
|
TRACE_CGROUP_PATH(mkdir, cgrp);
|
2016-08-10 15:23:44 +00:00
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
/* let's create and online css's */
|
2019-11-04 23:54:30 +00:00
|
|
|
kernfs_activate(cgrp->kn);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2014-05-13 16:19:22 +00:00
|
|
|
ret = 0;
|
|
|
|
goto out_unlock;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2016-03-03 14:57:58 +00:00
|
|
|
out_destroy:
|
|
|
|
cgroup_destroy_locked(cgrp);
|
2014-05-13 16:19:22 +00:00
|
|
|
out_unlock:
|
2014-05-13 16:19:22 +00:00
|
|
|
cgroup_kn_unlock(parent_kn);
|
2014-05-13 16:19:22 +00:00
|
|
|
return ret;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
|
2013-08-14 00:22:50 +00:00
|
|
|
/*
|
|
|
|
* This is called when the refcnt of a css is confirmed to be killed.
|
2014-05-14 13:15:01 +00:00
|
|
|
* css_tryget_online() is now guaranteed to fail. Tell the subsystem to
|
2021-05-24 08:29:43 +00:00
|
|
|
* initiate destruction and put the css ref from kill_css().
|
2013-08-14 00:22:50 +00:00
|
|
|
*/
|
|
|
|
static void css_killed_work_fn(struct work_struct *work)
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
{
|
2013-08-14 00:22:50 +00:00
|
|
|
struct cgroup_subsys_state *css =
|
|
|
|
container_of(work, struct cgroup_subsys_state, destroy_work);
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2013-08-14 00:22:50 +00:00
|
|
|
|
2016-01-21 20:31:11 +00:00
|
|
|
do {
|
|
|
|
offline_css(css);
|
|
|
|
css_put(css);
|
|
|
|
/* @css can't go away while we're holding cgroup_mutex */
|
|
|
|
css = css->parent;
|
|
|
|
} while (css && atomic_dec_and_test(&css->online_cnt));
|
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
}
|
|
|
|
|
2013-08-14 00:22:50 +00:00
|
|
|
/* css kill confirmation processing requires process context, bounce */
|
|
|
|
static void css_killed_ref_fn(struct percpu_ref *ref)
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
{
|
|
|
|
struct cgroup_subsys_state *css =
|
|
|
|
container_of(ref, struct cgroup_subsys_state, refcnt);
|
|
|
|
|
2016-01-21 20:31:11 +00:00
|
|
|
if (atomic_dec_and_test(&css->online_cnt)) {
|
|
|
|
INIT_WORK(&css->destroy_work, css_killed_work_fn);
|
|
|
|
queue_work(cgroup_destroy_wq, &css->destroy_work);
|
|
|
|
}
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
}
|
|
|
|
|
2014-04-23 15:13:14 +00:00
|
|
|
/**
|
|
|
|
* kill_css - destroy a css
|
|
|
|
* @css: css to destroy
|
|
|
|
*
|
|
|
|
* This function initiates destruction of @css by removing cgroup interface
|
|
|
|
* files and putting its base reference. ->css_offline() will be invoked
|
2014-05-13 16:11:01 +00:00
|
|
|
* asynchronously once css_tryget_online() is guaranteed to fail and when
|
|
|
|
* the reference count reaches zero, @css will be released.
|
2014-04-23 15:13:14 +00:00
|
|
|
*/
|
|
|
|
static void kill_css(struct cgroup_subsys_state *css)
|
2013-08-14 00:22:51 +00:00
|
|
|
{
|
2014-05-13 16:19:23 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
2014-03-19 14:23:54 +00:00
|
|
|
|
2017-05-15 13:34:06 +00:00
|
|
|
if (css->flags & CSS_DYING)
|
|
|
|
return;
|
|
|
|
|
|
|
|
css->flags |= CSS_DYING;
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
/*
|
|
|
|
* This must happen before css is disassociated with its cgroup.
|
|
|
|
* See seq_css() for details.
|
|
|
|
*/
|
2016-03-03 14:58:01 +00:00
|
|
|
css_clear_dir(css);
|
2013-08-14 00:22:51 +00:00
|
|
|
|
2013-08-14 00:22:51 +00:00
|
|
|
/*
|
|
|
|
* Killing would put the base ref, but we need to keep it alive
|
|
|
|
* until after ->css_offline().
|
|
|
|
*/
|
|
|
|
css_get(css);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* cgroup core guarantees that, by the time ->css_offline() is
|
|
|
|
* invoked, no new css reference will be given out via
|
2014-05-13 16:11:01 +00:00
|
|
|
* css_tryget_online(). We can't simply call percpu_ref_kill() and
|
2013-08-14 00:22:51 +00:00
|
|
|
* proceed to offlining css's because percpu_ref_kill() doesn't
|
|
|
|
* guarantee that the ref is seen as killed on all CPUs on return.
|
|
|
|
*
|
|
|
|
* Use percpu_ref_kill_and_confirm() to get notifications as each
|
|
|
|
* css is confirmed to be seen as killed on all CPUs.
|
|
|
|
*/
|
|
|
|
percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_destroy_locked - the first stage of cgroup destruction
|
|
|
|
* @cgrp: cgroup to be destroyed
|
|
|
|
*
|
|
|
|
* css's make use of percpu refcnts whose killing latency shouldn't be
|
|
|
|
* exposed to userland and are RCU protected. Also, cgroup core needs to
|
2014-05-13 16:11:01 +00:00
|
|
|
* guarantee that css_tryget_online() won't succeed by the time
|
|
|
|
* ->css_offline() is invoked. To satisfy all the requirements,
|
|
|
|
* destruction is implemented in the following two steps.
|
cgroup: use percpu refcnt for cgroup_subsys_states
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.
One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability. For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For highops configurations which try to do as much per-cpu as
possible, the global frequent refcnting can be very expensive.
In general, given the various and hugely diverse paths css's end up
being used from, we need to make it cheap and highly scalable. In its
usage, css refcnting isn't very different from module refcnting.
This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.
The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s. This is resolved collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs. The previous patches already splitted destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.
This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id(). While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it. Just drop the sanity check and use
rcu_dereference_raw() instead.
v2: - init_cgroup_css() was calling percpu_ref_init() without checking
the return value. This causes two problems - the obvious lack
of error handling and percpu_ref_init() being called from
cgroup_init_subsys() before the allocators are up, which
triggers warnings but doesn't cause actual problems as the
refcnt isn't used for roots anyway. Fix both by moving
percpu_ref_init() to cgroup_create().
- The base references were put too early by
percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
refs one extra time. This wasn't noticeable because css's go
through another RCU grace period before being freed. Update
cgroup_destroy_locked() to grab an extra reference before
killing the refcnts. This problem was noticed by Kent.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "Alasdair G. Kergon" <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
2013-06-14 02:39:16 +00:00
|
|
|
*
|
|
|
|
* s1. Verify @cgrp can be destroyed and mark it dying. Remove all
|
|
|
|
* userland visible parts and start killing the percpu refcnts of
|
|
|
|
* css's. Set up so that the next stage will be kicked off once all
|
|
|
|
* the percpu refcnts are confirmed to be killed.
|
|
|
|
*
|
|
|
|
* s2. Invoke ->css_offline(), mark the cgroup dead and proceed with the
|
|
|
|
* rest of destruction. Once all cgroup references are gone, the
|
|
|
|
* cgroup is RCU-freed.
|
|
|
|
*
|
|
|
|
* This function implements s1. After this step, @cgrp is gone as far as
|
|
|
|
* the userland is concerned and a new cgroup with the same name may be
|
|
|
|
* created. As cgroup doesn't care about the names internally, this
|
|
|
|
* doesn't cause any problem.
|
|
|
|
*/
|
2012-11-19 16:13:37 +00:00
|
|
|
static int cgroup_destroy_locked(struct cgroup *cgrp)
|
|
|
|
__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
2017-08-02 16:55:29 +00:00
|
|
|
struct cgroup *tcgrp, *parent = cgroup_parent(cgrp);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
struct cgroup_subsys_state *css;
|
cgroup: ignore css_sets associated with dead cgroups during migration
Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks. As init_css_set is always valid,
this worked fine.
However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death. Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
becomes a zombie, it would still remain associated with A and B. If A
only contains zombie tasks, it can be removed. On removal, A gets
marked offline but stays pinned until all zombies are drained. At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.
WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
...
Call Trace:
[<ffffffff8127e63c>] dump_stack+0x4e/0x82
[<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
[<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
[<ffffffff810c33e1>] cgroup_get+0x121/0x160
[<ffffffff810c349b>] link_css_set+0x7b/0x90
[<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
[<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
[<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
[<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
[<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
[<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
[<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
[<ffffffff81151673>] __vfs_write+0x23/0xe0
[<ffffffff81152494>] vfs_write+0xa4/0x1a0
[<ffffffff811532d4>] SyS_write+0x44/0xa0
[<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration. This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+
2016-03-16 00:43:04 +00:00
|
|
|
struct cgrp_cset_link *link;
|
2013-12-06 20:11:56 +00:00
|
|
|
int ssid;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2012-11-19 16:13:37 +00:00
|
|
|
lockdep_assert_held(&cgroup_mutex);
|
|
|
|
|
2015-10-15 20:41:51 +00:00
|
|
|
/*
|
|
|
|
* Only migration can raise populated from zero and we're already
|
|
|
|
* holding cgroup_mutex.
|
|
|
|
*/
|
|
|
|
if (cgroup_is_populated(cgrp))
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
return -EBUSY;
|
2008-02-23 23:24:09 +00:00
|
|
|
|
2013-08-28 23:31:23 +00:00
|
|
|
/*
|
2014-05-16 17:22:48 +00:00
|
|
|
* Make sure there's no live children. We can't test emptiness of
|
|
|
|
* ->self.children as dead children linger on it while being
|
|
|
|
* drained; otherwise, "rmdir parent/child parent" may fail.
|
2013-08-28 23:31:23 +00:00
|
|
|
*/
|
2014-05-16 17:22:52 +00:00
|
|
|
if (css_has_online_children(&cgrp->self))
|
2013-08-28 23:31:23 +00:00
|
|
|
return -EBUSY;
|
|
|
|
|
2013-06-14 02:27:41 +00:00
|
|
|
/*
|
cgroup: ignore css_sets associated with dead cgroups during migration
Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks. As init_css_set is always valid,
this worked fine.
However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death. Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
becomes a zombie, it would still remain associated with A and B. If A
only contains zombie tasks, it can be removed. On removal, A gets
marked offline but stays pinned until all zombies are drained. At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.
WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
...
Call Trace:
[<ffffffff8127e63c>] dump_stack+0x4e/0x82
[<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
[<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
[<ffffffff810c33e1>] cgroup_get+0x121/0x160
[<ffffffff810c349b>] link_css_set+0x7b/0x90
[<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
[<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
[<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
[<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
[<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
[<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
[<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
[<ffffffff81151673>] __vfs_write+0x23/0xe0
[<ffffffff81152494>] vfs_write+0xa4/0x1a0
[<ffffffff811532d4>] SyS_write+0x44/0xa0
[<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration. This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+
2016-03-16 00:43:04 +00:00
|
|
|
* Mark @cgrp and the associated csets dead. The former prevents
|
|
|
|
* further task migration and child creation by disabling
|
2023-08-03 11:57:02 +00:00
|
|
|
* cgroup_kn_lock_live(). The latter makes the csets ignored by
|
cgroup: ignore css_sets associated with dead cgroups during migration
Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks. As init_css_set is always valid,
this worked fine.
However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death. Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
becomes a zombie, it would still remain associated with A and B. If A
only contains zombie tasks, it can be removed. On removal, A gets
marked offline but stays pinned until all zombies are drained. At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.
WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
...
Call Trace:
[<ffffffff8127e63c>] dump_stack+0x4e/0x82
[<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
[<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
[<ffffffff810c33e1>] cgroup_get+0x121/0x160
[<ffffffff810c349b>] link_css_set+0x7b/0x90
[<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
[<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
[<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
[<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
[<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
[<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
[<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
[<ffffffff81151673>] __vfs_write+0x23/0xe0
[<ffffffff81152494>] vfs_write+0xa4/0x1a0
[<ffffffff811532d4>] SyS_write+0x44/0xa0
[<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration. This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+
2016-03-16 00:43:04 +00:00
|
|
|
* the migration path.
|
2013-06-14 02:27:41 +00:00
|
|
|
*/
|
2014-05-16 17:22:51 +00:00
|
|
|
cgrp->self.flags &= ~CSS_ONLINE;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
cgroup: ignore css_sets associated with dead cgroups during migration
Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks. As init_css_set is always valid,
this worked fine.
However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death. Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
becomes a zombie, it would still remain associated with A and B. If A
only contains zombie tasks, it can be removed. On removal, A gets
marked offline but stays pinned until all zombies are drained. At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.
WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
...
Call Trace:
[<ffffffff8127e63c>] dump_stack+0x4e/0x82
[<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
[<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
[<ffffffff810c33e1>] cgroup_get+0x121/0x160
[<ffffffff810c349b>] link_css_set+0x7b/0x90
[<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
[<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
[<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
[<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
[<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
[<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
[<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
[<ffffffff81151673>] __vfs_write+0x23/0xe0
[<ffffffff81152494>] vfs_write+0xa4/0x1a0
[<ffffffff811532d4>] SyS_write+0x44/0xa0
[<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration. This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+
2016-03-16 00:43:04 +00:00
|
|
|
list_for_each_entry(link, &cgrp->cset_links, cset_link)
|
|
|
|
link->cset->dead = true;
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
cgroup: ignore css_sets associated with dead cgroups during migration
Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks. As init_css_set is always valid,
this worked fine.
However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was
associated with at the time of death. Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T
becomes a zombie, it would still remain associated with A and B. If A
only contains zombie tasks, it can be removed. On removal, A gets
marked offline but stays pinned until all zombies are drained. At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.
WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
...
Call Trace:
[<ffffffff8127e63c>] dump_stack+0x4e/0x82
[<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
[<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
[<ffffffff810c33e1>] cgroup_get+0x121/0x160
[<ffffffff810c349b>] link_css_set+0x7b/0x90
[<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
[<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
[<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
[<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
[<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
[<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
[<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
[<ffffffff81151673>] __vfs_write+0x23/0xe0
[<ffffffff81152494>] vfs_write+0xa4/0x1a0
[<ffffffff811532d4>] SyS_write+0x44/0xa0
[<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration. This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+
2016-03-16 00:43:04 +00:00
|
|
|
|
2014-05-14 13:15:01 +00:00
|
|
|
/* initiate massacre of all css's */
|
2013-12-06 20:11:56 +00:00
|
|
|
for_each_css(css, ssid, cgrp)
|
|
|
|
kill_css(css);
|
2013-06-14 02:27:41 +00:00
|
|
|
|
2018-04-26 21:29:04 +00:00
|
|
|
/* clear and remove @cgrp dir, @cgrp has an extra ref on its kn */
|
|
|
|
css_clear_dir(&cgrp->self);
|
2014-05-13 16:19:23 +00:00
|
|
|
kernfs_remove(cgrp->kn);
|
2013-08-14 00:22:50 +00:00
|
|
|
|
2022-05-18 01:36:47 +00:00
|
|
|
if (cgroup_is_threaded(cgrp))
|
2017-05-15 13:34:02 +00:00
|
|
|
parent->nr_threaded_children--;
|
|
|
|
|
2019-04-19 17:03:03 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2023-07-15 03:08:29 +00:00
|
|
|
for (tcgrp = parent; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
|
2017-08-02 16:55:29 +00:00
|
|
|
tcgrp->nr_descendants--;
|
|
|
|
tcgrp->nr_dying_descendants++;
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
/*
|
|
|
|
* If the dying cgroup is frozen, decrease frozen descendants
|
|
|
|
* counters of ancestor cgroups.
|
|
|
|
*/
|
|
|
|
if (test_bit(CGRP_FROZEN, &cgrp->flags))
|
|
|
|
tcgrp->freezer.nr_frozen_descendants--;
|
2017-08-02 16:55:29 +00:00
|
|
|
}
|
2019-04-19 17:03:03 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2017-08-02 16:55:29 +00:00
|
|
|
|
2017-08-02 16:55:32 +00:00
|
|
|
cgroup1_check_for_release(parent);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
|
bpf: decouple the lifetime of cgroup_bpf from cgroup itself
Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets (or intentionally avoids) to detach a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks a possibility to implement
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will be ever released.
A dying cgroup can not contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.
Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.
The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, had such a need arise in the future.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-25 16:37:39 +00:00
|
|
|
cgroup_bpf_offline(cgrp);
|
|
|
|
|
2014-05-14 13:15:01 +00:00
|
|
|
/* put the base reference */
|
2014-05-14 13:15:02 +00:00
|
|
|
percpu_ref_kill(&cgrp->self.refcnt);
|
2013-06-14 02:27:41 +00:00
|
|
|
|
2013-06-14 02:27:42 +00:00
|
|
|
return 0;
|
|
|
|
};
|
|
|
|
|
2016-12-27 19:49:08 +00:00
|
|
|
int cgroup_rmdir(struct kernfs_node *kn)
|
2012-11-19 16:13:37 +00:00
|
|
|
{
|
2014-05-13 16:19:22 +00:00
|
|
|
struct cgroup *cgrp;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
int ret = 0;
|
2012-11-19 16:13:37 +00:00
|
|
|
|
2016-03-03 14:58:00 +00:00
|
|
|
cgrp = cgroup_kn_lock_live(kn, false);
|
2014-05-13 16:19:22 +00:00
|
|
|
if (!cgrp)
|
|
|
|
return 0;
|
2012-11-19 16:13:37 +00:00
|
|
|
|
2014-05-13 16:19:22 +00:00
|
|
|
ret = cgroup_destroy_locked(cgrp);
|
2016-08-10 15:23:44 +00:00
|
|
|
if (!ret)
|
cgroup/tracing: Move taking of spin lock out of trace event handlers
It is unwise to take spin locks from the handlers of trace events.
Mainly, because they can introduce lockups, because it introduces locks
in places that are normally not tested. Worse yet, because trace events
are tucked away in the include/trace/events/ directory, locks that are
taken there are forgotten about.
As a general rule, I tell people never to take any locks in a trace
event handler.
Several cgroup trace event handlers call cgroup_path() which eventually
takes the kernfs_rename_lock spinlock. This injects the spinlock in the
code without people realizing it. It also can cause issues for the
PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
handlers are called with preemption disabled.
By moving the calculation of the cgroup_path() out of the trace event
handlers and into a macro (surrounded by a
trace_cgroup_##type##_enabled()), then we could place the cgroup_path
into a string, and pass that to the trace event. Not only does this
remove the taking of the spinlock out of the trace event handler, but
it also means that the cgroup_path() only needs to be called once (it
is currently called twice, once to get the length to reserver the
buffer for, and once again to get the path itself. Now it only needs to
be done once.
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2018-07-09 21:48:54 +00:00
|
|
|
TRACE_CGROUP_PATH(rmdir, cgrp);
|
2016-08-10 15:23:44 +00:00
|
|
|
|
2014-05-13 16:19:22 +00:00
|
|
|
cgroup_kn_unlock(kn);
|
2012-11-19 16:13:37 +00:00
|
|
|
return ret;
|
2012-04-01 19:09:55 +00:00
|
|
|
}
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
|
2017-06-27 18:30:28 +00:00
|
|
|
.show_options = cgroup_show_options,
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
.mkdir = cgroup_mkdir,
|
|
|
|
.rmdir = cgroup_rmdir,
|
cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:
When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.
Short explanation (courtesy of mkerrisk):
If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point. Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.
Long version:
When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b. The root dentry field of the mountinfo
entry will show '/a/b'.
cat > /tmp/do1 << EOF
mount -t cgroup -o freezer freezer /mnt
grep freezer /proc/self/mountinfo
EOF
unshare -Gm bash /tmp/do1
> 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
> 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':
grep freezer /proc/self/cgroup
9:freezer:/
If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:
mount --bind /sys/fs/cgroup/freezer/a/b /mnt
130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.
Example (by mkerrisk):
First, a little script to save some typing and verbiage:
echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:
2653
2653 # Our shell
14254 # cat(1)
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):
Look at the state of the /proc files:
/proc/self/cgroup: 10:freezer:/
mountinfo: / /sys/fs/cgroup/freezer
The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.
However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:
# propagating to other mountns
/proc/self/cgroup: 7:freezer:/
mountinfo: /a/b /mnt/freezer
The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.
With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace. So the same steps as above:
/proc/self/cgroup: 10:freezer:/a/b
mountinfo: / /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: /../.. /sys/fs/cgroup/freezer
/proc/self/cgroup: 10:freezer:/
mountinfo: / /mnt/freezer
cgroup.clone_children freezer.parent_freezing freezer.state tasks
cgroup.procs freezer.self_freezing notify_on_release
3164
2653 # First shell that placed in this cgroup
3164 # Shell started by 'unshare'
14197 # cat(1)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 14:59:55 +00:00
|
|
|
.show_path = cgroup_show_path,
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
};
|
|
|
|
|
2014-05-04 19:09:14 +00:00
|
|
|
static void __init cgroup_init_subsys(struct cgroup_subsys *ss, bool early)
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
{
|
|
|
|
struct cgroup_subsys_state *css;
|
2007-11-15 00:58:54 +00:00
|
|
|
|
2015-12-29 19:53:56 +00:00
|
|
|
pr_debug("Initializing cgroup subsys %s\n", ss->name);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2012-11-19 16:13:36 +00:00
|
|
|
|
2014-05-04 19:09:14 +00:00
|
|
|
idr_init(&ss->css_idr);
|
2014-02-12 14:29:48 +00:00
|
|
|
INIT_LIST_HEAD(&ss->cfts);
|
2012-04-01 19:09:55 +00:00
|
|
|
|
2014-03-19 14:23:54 +00:00
|
|
|
/* Create the root cgroup state for this subsystem */
|
|
|
|
ss->root = &cgrp_dfl_root;
|
2021-11-27 14:59:19 +00:00
|
|
|
css = ss->css_alloc(NULL);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
/* We don't handle early failures gracefully */
|
|
|
|
BUG_ON(IS_ERR(css));
|
2014-05-04 19:09:14 +00:00
|
|
|
init_and_link_css(css, ss, &cgrp_dfl_root.cgrp);
|
2014-05-16 17:22:47 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Root csses are never destroyed and we can't initialize
|
|
|
|
* percpu_ref during early init. Disable refcnting.
|
|
|
|
*/
|
|
|
|
css->flags |= CSS_NO_REF;
|
|
|
|
|
2014-05-04 19:09:14 +00:00
|
|
|
if (early) {
|
2014-05-14 13:15:02 +00:00
|
|
|
/* allocation can't be done safely during early init */
|
2014-05-04 19:09:14 +00:00
|
|
|
css->id = 1;
|
|
|
|
} else {
|
|
|
|
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2, GFP_KERNEL);
|
|
|
|
BUG_ON(css->id < 0);
|
|
|
|
}
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2008-04-29 08:00:13 +00:00
|
|
|
/* Update the init_css_set to contain a subsys
|
2007-10-19 06:39:36 +00:00
|
|
|
* pointer to this state - since the subsystem is
|
2008-04-29 08:00:13 +00:00
|
|
|
* newly registered, all tasks and hence the
|
2014-03-19 14:23:54 +00:00
|
|
|
* init_css_set is in the subsystem's root cgroup. */
|
2014-02-08 15:36:58 +00:00
|
|
|
init_css_set.subsys[ss->id] = css;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2015-06-06 00:02:14 +00:00
|
|
|
have_fork_callback |= (bool)ss->fork << ss->id;
|
|
|
|
have_exit_callback |= (bool)ss->exit << ss->id;
|
2019-01-28 16:00:13 +00:00
|
|
|
have_release_callback |= (bool)ss->release << ss->id;
|
2015-06-09 11:32:09 +00:00
|
|
|
have_canfork_callback |= (bool)ss->can_fork << ss->id;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2008-04-29 08:00:13 +00:00
|
|
|
/* At system boot, before all subsystems have been
|
|
|
|
* registered, no tasks have been forked, so we don't
|
|
|
|
* need to invoke fork callbacks here. */
|
|
|
|
BUG_ON(!list_empty(&init_task.tasks));
|
|
|
|
|
2013-08-14 00:22:50 +00:00
|
|
|
BUG_ON(online_css(css));
|
2012-11-09 17:12:29 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2010-03-10 23:22:09 +00:00
|
|
|
}
|
|
|
|
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
/**
|
2008-02-23 23:24:09 +00:00
|
|
|
* cgroup_init_early - cgroup initialization at system boot
|
|
|
|
*
|
|
|
|
* Initialize cgroups at system boot, and initialize any
|
|
|
|
* subsystems that request early init.
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
*/
|
|
|
|
int __init cgroup_init_early(void)
|
|
|
|
{
|
2019-01-17 04:42:38 +00:00
|
|
|
static struct cgroup_fs_context __initdata ctx;
|
2013-06-25 18:53:37 +00:00
|
|
|
struct cgroup_subsys *ss;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
int i;
|
2013-06-25 18:53:37 +00:00
|
|
|
|
2019-01-17 07:25:51 +00:00
|
|
|
ctx.root = &cgrp_dfl_root;
|
|
|
|
init_cgroup_root(&ctx);
|
2014-05-16 17:22:47 +00:00
|
|
|
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
|
|
|
|
|
2013-06-21 22:52:33 +00:00
|
|
|
RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2014-02-08 15:36:58 +00:00
|
|
|
for_each_subsys(ss, i) {
|
2014-02-08 15:36:58 +00:00
|
|
|
WARN(!ss->css_alloc || !ss->css_free || ss->name || ss->id,
|
2016-02-26 05:07:38 +00:00
|
|
|
"invalid cgroup_subsys %d:%s css_alloc=%p css_free=%p id:name=%d:%s\n",
|
2014-02-08 15:36:58 +00:00
|
|
|
i, cgroup_subsys_name[i], ss->css_alloc, ss->css_free,
|
2014-02-08 15:36:58 +00:00
|
|
|
ss->id, ss->name);
|
2014-02-08 15:36:58 +00:00
|
|
|
WARN(strlen(cgroup_subsys_name[i]) > MAX_CGROUP_TYPE_NAMELEN,
|
|
|
|
"cgroup_subsys_name %s too long\n", cgroup_subsys_name[i]);
|
|
|
|
|
2014-02-08 15:36:58 +00:00
|
|
|
ss->id = i;
|
2014-02-08 15:36:58 +00:00
|
|
|
ss->name = cgroup_subsys_name[i];
|
2015-08-18 20:58:16 +00:00
|
|
|
if (!ss->legacy_name)
|
|
|
|
ss->legacy_name = cgroup_subsys_name[i];
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
|
|
|
if (ss->early_init)
|
2014-05-04 19:09:14 +00:00
|
|
|
cgroup_init_subsys(ss, true);
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2008-02-23 23:24:09 +00:00
|
|
|
* cgroup_init - cgroup initialization
|
|
|
|
*
|
|
|
|
* Register cgroup filesystem and /proc file, and initialize
|
|
|
|
* any subsystems that didn't request early init.
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
*/
|
|
|
|
int __init cgroup_init(void)
|
|
|
|
{
|
2013-06-25 18:53:37 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2015-10-15 21:00:43 +00:00
|
|
|
int ssid;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2016-02-23 03:25:47 +00:00
|
|
|
BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16);
|
2016-12-27 19:49:08 +00:00
|
|
|
BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
|
2022-09-06 19:38:55 +00:00
|
|
|
BUG_ON(cgroup_init_cftypes(NULL, cgroup_psi_files));
|
2016-12-27 19:49:08 +00:00
|
|
|
BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
2018-04-26 21:29:04 +00:00
|
|
|
cgroup_rstat_boot();
|
2017-09-25 15:12:05 +00:00
|
|
|
|
2016-01-29 08:54:06 +00:00
|
|
|
get_user_ns(init_cgroup_ns.user_ns);
|
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2013-04-14 18:36:57 +00:00
|
|
|
|
2016-03-03 14:57:57 +00:00
|
|
|
/*
|
|
|
|
* Add init_css_set to the hash table so that dfl_root can link to
|
|
|
|
* it during init.
|
|
|
|
*/
|
|
|
|
hash_add(css_set_table, &init_css_set.hlist,
|
|
|
|
css_set_hash(init_css_set.subsys));
|
2013-06-25 18:53:37 +00:00
|
|
|
|
2019-01-12 05:20:54 +00:00
|
|
|
BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
|
2013-07-31 01:50:50 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2013-04-14 18:36:57 +00:00
|
|
|
|
2014-03-19 14:23:53 +00:00
|
|
|
for_each_subsys(ss, ssid) {
|
2014-05-04 19:09:14 +00:00
|
|
|
if (ss->early_init) {
|
|
|
|
struct cgroup_subsys_state *css =
|
|
|
|
init_css_set.subsys[ss->id];
|
|
|
|
|
|
|
|
css->id = cgroup_idr_alloc(&ss->css_idr, css, 1, 2,
|
|
|
|
GFP_KERNEL);
|
|
|
|
BUG_ON(css->id < 0);
|
|
|
|
} else {
|
|
|
|
cgroup_init_subsys(ss, false);
|
|
|
|
}
|
2014-03-19 14:23:53 +00:00
|
|
|
|
2014-04-23 15:13:15 +00:00
|
|
|
list_add_tail(&init_css_set.e_cset_node[ssid],
|
|
|
|
&cgrp_dfl_root.cgrp.e_csets[ssid]);
|
2014-03-19 14:23:53 +00:00
|
|
|
|
|
|
|
/*
|
2014-06-05 09:16:30 +00:00
|
|
|
* Setting dfl_root subsys_mask needs to consider the
|
|
|
|
* disabled flag and cftype registration needs kmalloc,
|
|
|
|
* both of which aren't available during early_init.
|
2014-03-19 14:23:53 +00:00
|
|
|
*/
|
2021-05-12 20:19:46 +00:00
|
|
|
if (!cgroup_ssid_enabled(ssid))
|
2014-07-15 15:05:10 +00:00
|
|
|
continue;
|
|
|
|
|
2016-12-27 19:49:08 +00:00
|
|
|
if (cgroup1_ssid_disabled(ssid))
|
2023-08-01 07:22:14 +00:00
|
|
|
pr_info("Disabling %s control group subsystem in v1 mounts\n",
|
2023-10-06 11:50:32 +00:00
|
|
|
ss->legacy_name);
|
2016-02-11 18:34:49 +00:00
|
|
|
|
2014-07-15 15:05:10 +00:00
|
|
|
cgrp_dfl_root.subsys_mask |= 1 << ss->id;
|
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
/* implicit controllers must be threaded too */
|
|
|
|
WARN_ON(ss->implicit_on_dfl && !ss->threaded);
|
|
|
|
|
2016-03-08 16:51:26 +00:00
|
|
|
if (ss->implicit_on_dfl)
|
|
|
|
cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
|
|
|
|
else if (!ss->dfl_cftypes)
|
2016-02-23 15:00:50 +00:00
|
|
|
cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;
|
2014-07-15 15:05:10 +00:00
|
|
|
|
cgroup: implement cgroup v2 thread support
This patch implements cgroup v2 thread support. The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.
A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file. When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.
The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint. Note that the threads aren't allowed
to escape to a different threaded subtree. To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.
The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree. This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted. This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.
As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.
Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.
This patch enables threaded mode for the pids and perf_events
controllers. Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.
For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.
v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
Spotted by Waiman.
- Documentation updated as suggested by Waiman.
- cgroup.type content slightly reformatted.
- Mark the debug controller threaded.
v4: - Updated to the general idea of marking specific cgroups
domain/threaded as suggested by PeterZ.
v3: - Dropped "join" and always make mixed children join the parent's
threaded subtree.
v2: - After discussions with Waiman, support for mixed thread mode is
added. This should address the issue that Peter pointed out
where any nesting should be avoided for thread subtrees while
coexisting with other domain cgroups.
- Enabling / disabling thread mode now piggy backs on the existing
control mask update mechanism.
- Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2017-07-21 15:14:51 +00:00
|
|
|
if (ss->threaded)
|
|
|
|
cgrp_dfl_threaded_ss_mask |= 1 << ss->id;
|
|
|
|
|
2014-07-15 15:05:10 +00:00
|
|
|
if (ss->dfl_cftypes == ss->legacy_cftypes) {
|
|
|
|
WARN_ON(cgroup_add_cftypes(ss, ss->dfl_cftypes));
|
|
|
|
} else {
|
|
|
|
WARN_ON(cgroup_add_dfl_cftypes(ss, ss->dfl_cftypes));
|
|
|
|
WARN_ON(cgroup_add_legacy_cftypes(ss, ss->legacy_cftypes));
|
2014-06-05 09:16:30 +00:00
|
|
|
}
|
2015-02-19 14:34:46 +00:00
|
|
|
|
|
|
|
if (ss->bind)
|
|
|
|
ss->bind(init_css_set.subsys[ssid]);
|
2017-07-18 21:57:46 +00:00
|
|
|
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2017-07-18 21:57:46 +00:00
|
|
|
css_populate_dir(init_css_set.subsys[ssid]);
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2010-08-05 20:53:35 +00:00
|
|
|
}
|
|
|
|
|
2016-03-03 14:57:57 +00:00
|
|
|
/* init_css_set.subsys[] has been updated, re-hash */
|
|
|
|
hash_del(&init_css_set.hlist);
|
|
|
|
hash_add(css_set_table, &init_css_set.hlist,
|
|
|
|
css_set_hash(init_css_set.subsys));
|
|
|
|
|
2015-10-15 21:00:43 +00:00
|
|
|
WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
|
|
|
|
WARN_ON(register_filesystem(&cgroup_fs_type));
|
2015-11-16 16:13:34 +00:00
|
|
|
WARN_ON(register_filesystem(&cgroup2_fs_type));
|
2018-05-15 13:57:23 +00:00
|
|
|
WARN_ON(!proc_create_single("cgroups", 0, NULL, proc_cgroupstats_show));
|
2019-05-13 16:33:22 +00:00
|
|
|
#ifdef CONFIG_CPUSETS
|
|
|
|
WARN_ON(register_filesystem(&cpuset_fs_type));
|
|
|
|
#endif
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
return 0;
|
Task Control Groups: basic task cgroup framework
Generic Process Control Groups
--------------------------
There have recently been various proposals floating around for
resource management/accounting and other task grouping subsystems in
the kernel, including ResGroups, User BeanCounters, NSProxy
cgroups, and others. These all need the basic abstraction of being
able to group together multiple processes in an aggregate, in order to
track/limit the resources permitted to those processes, or control
other behaviour of the processes, and all implement this grouping in
different ways.
This patchset provides a framework for tracking and grouping processes
into arbitrary "cgroups" and assigning arbitrary state to those
groupings, in order to control the behaviour of the cgroup as an
aggregate.
The intention is that the various resource management and
virtualization/cgroup efforts can also become task cgroup
clients, with the result that:
- the userspace APIs are (somewhat) normalised
- it's easier to test e.g. the ResGroups CPU controller in
conjunction with the BeanCounters memory controller, or use either of
them as the resource-control portion of a virtual server system.
- the additional kernel footprint of any of the competing resource
management systems is substantially reduced, since it doesn't need
to provide process grouping/containment, hence improving their
chances of getting into the kernel
This patch:
Add the main task cgroups framework - the cgroup filesystem, and the
basic structures for tracking membership and associating subsystem state
objects to tasks.
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-19 06:39:30 +00:00
|
|
|
}
|
2007-10-19 06:39:33 +00:00
|
|
|
|
cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue. css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex. As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu(). If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue. This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose. As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
separate core_initcall(). In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
2013-11-22 22:14:39 +00:00
|
|
|
static int __init cgroup_wq_init(void)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* There isn't much point in executing destruction path in
|
|
|
|
* parallel. Good chunk is serialized with cgroup_mutex anyway.
|
2014-02-13 00:06:19 +00:00
|
|
|
* Use 1 for @max_active.
|
cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue. css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex. As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu(). If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue. This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose. As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
separate core_initcall(). In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
2013-11-22 22:14:39 +00:00
|
|
|
*
|
|
|
|
* We would prefer to do this in cgroup_init() above, but that
|
|
|
|
* is called before init_workqueues(): so leave this until after.
|
|
|
|
*/
|
2014-02-13 00:06:19 +00:00
|
|
|
cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
|
cgroup: use a dedicated workqueue for cgroup destruction
Since be44562613851 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue. css
freeing is performed from a work item from that point on and a later
commit, ea15f8ccdb430 ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.
As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex. As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.
Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu(). If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
This can be fixed by queueing destruction work items on a separate
workqueue. This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose. As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.
Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init(). Do it from a
separate core_initcall(). In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+
2013-11-22 22:14:39 +00:00
|
|
|
BUG_ON(!cgroup_destroy_wq);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
core_initcall(cgroup_wq_init);
|
|
|
|
|
2019-11-04 23:54:30 +00:00
|
|
|
void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
|
2017-07-12 18:49:55 +00:00
|
|
|
{
|
|
|
|
struct kernfs_node *kn;
|
|
|
|
|
2019-11-04 23:54:30 +00:00
|
|
|
kn = kernfs_find_and_get_node_by_id(cgrp_dfl_root.kf_root, id);
|
2017-07-12 18:49:55 +00:00
|
|
|
if (!kn)
|
|
|
|
return;
|
|
|
|
kernfs_path(kn, buf, buflen);
|
|
|
|
kernfs_put(kn);
|
|
|
|
}
|
|
|
|
|
2021-06-08 04:35:44 +00:00
|
|
|
/*
|
|
|
|
* cgroup_get_from_id : get the cgroup associated with cgroup id
|
|
|
|
* @id: cgroup id
|
2022-08-26 16:52:37 +00:00
|
|
|
* On success return the cgrp or ERR_PTR on failure
|
2022-08-26 16:52:36 +00:00
|
|
|
* Only cgroups within current task's cgroup NS are valid.
|
2021-06-08 04:35:44 +00:00
|
|
|
*/
|
|
|
|
struct cgroup *cgroup_get_from_id(u64 id)
|
|
|
|
{
|
|
|
|
struct kernfs_node *kn;
|
2022-09-23 17:23:06 +00:00
|
|
|
struct cgroup *cgrp, *root_cgrp;
|
2021-06-08 04:35:44 +00:00
|
|
|
|
|
|
|
kn = kernfs_find_and_get_node_by_id(cgrp_dfl_root.kf_root, id);
|
|
|
|
if (!kn)
|
2022-09-23 17:23:06 +00:00
|
|
|
return ERR_PTR(-ENOENT);
|
2021-06-08 04:35:44 +00:00
|
|
|
|
2022-09-23 17:23:06 +00:00
|
|
|
if (kernfs_type(kn) != KERNFS_DIR) {
|
|
|
|
kernfs_put(kn);
|
|
|
|
return ERR_PTR(-ENOENT);
|
|
|
|
}
|
2022-09-23 11:51:19 +00:00
|
|
|
|
2021-10-25 06:19:14 +00:00
|
|
|
rcu_read_lock();
|
2021-06-08 04:35:44 +00:00
|
|
|
|
2021-10-25 06:19:14 +00:00
|
|
|
cgrp = rcu_dereference(*(void __rcu __force **)&kn->priv);
|
|
|
|
if (cgrp && !cgroup_tryget(cgrp))
|
2021-06-08 04:35:44 +00:00
|
|
|
cgrp = NULL;
|
2021-10-25 06:19:14 +00:00
|
|
|
|
|
|
|
rcu_read_unlock();
|
2021-06-08 04:35:44 +00:00
|
|
|
kernfs_put(kn);
|
2022-08-26 16:52:36 +00:00
|
|
|
|
|
|
|
if (!cgrp)
|
2022-09-23 17:23:06 +00:00
|
|
|
return ERR_PTR(-ENOENT);
|
2022-08-26 16:52:36 +00:00
|
|
|
|
2022-10-10 08:29:18 +00:00
|
|
|
root_cgrp = current_cgns_cgroup_dfl();
|
2022-08-26 16:52:36 +00:00
|
|
|
if (!cgroup_is_descendant(cgrp, root_cgrp)) {
|
|
|
|
cgroup_put(cgrp);
|
2022-09-23 17:23:06 +00:00
|
|
|
return ERR_PTR(-ENOENT);
|
2022-08-26 16:52:36 +00:00
|
|
|
}
|
2022-09-23 17:23:06 +00:00
|
|
|
|
2021-06-08 04:35:44 +00:00
|
|
|
return cgrp;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(cgroup_get_from_id);
|
|
|
|
|
2007-10-19 06:39:35 +00:00
|
|
|
/*
|
|
|
|
* proc_cgroup_show()
|
|
|
|
* - Print task's cgroup paths into seq_file, one line for each hierarchy
|
|
|
|
* - Used for /proc/<pid>/cgroup.
|
|
|
|
*/
|
2014-09-18 08:03:15 +00:00
|
|
|
int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
|
|
|
|
struct pid *pid, struct task_struct *tsk)
|
2007-10-19 06:39:35 +00:00
|
|
|
{
|
2016-08-10 15:23:44 +00:00
|
|
|
char *buf;
|
2007-10-19 06:39:35 +00:00
|
|
|
int retval;
|
2014-03-19 14:23:54 +00:00
|
|
|
struct cgroup_root *root;
|
2007-10-19 06:39:35 +00:00
|
|
|
|
|
|
|
retval = -ENOMEM;
|
2014-02-12 14:29:50 +00:00
|
|
|
buf = kmalloc(PATH_MAX, GFP_KERNEL);
|
2007-10-19 06:39:35 +00:00
|
|
|
if (!buf)
|
|
|
|
goto out;
|
|
|
|
|
2023-10-29 06:14:30 +00:00
|
|
|
rcu_read_lock();
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2007-10-19 06:39:35 +00:00
|
|
|
|
2014-03-19 14:23:53 +00:00
|
|
|
for_each_root(root) {
|
2007-10-19 06:39:35 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2007-10-19 06:40:44 +00:00
|
|
|
struct cgroup *cgrp;
|
2013-12-06 20:11:57 +00:00
|
|
|
int ssid, count = 0;
|
2007-10-19 06:39:35 +00:00
|
|
|
|
2022-09-04 19:16:19 +00:00
|
|
|
if (root == &cgrp_dfl_root && !READ_ONCE(cgrp_dfl_visible))
|
2014-03-19 14:23:53 +00:00
|
|
|
continue;
|
|
|
|
|
2023-10-29 06:14:30 +00:00
|
|
|
cgrp = task_cgroup_from_root(tsk, root);
|
|
|
|
/* The root has already been unmounted. */
|
|
|
|
if (!cgrp)
|
|
|
|
continue;
|
|
|
|
|
2009-09-23 22:56:23 +00:00
|
|
|
seq_printf(m, "%d:", root->hierarchy_id);
|
2015-08-18 20:58:16 +00:00
|
|
|
if (root != &cgrp_dfl_root)
|
|
|
|
for_each_subsys(ss, ssid)
|
|
|
|
if (root->subsys_mask & (1 << ssid))
|
|
|
|
seq_printf(m, "%s%s", count++ ? "," : "",
|
2015-08-18 20:58:16 +00:00
|
|
|
ss->legacy_name);
|
2009-09-23 22:56:19 +00:00
|
|
|
if (strlen(root->name))
|
|
|
|
seq_printf(m, "%sname=%s", count ? "," : "",
|
|
|
|
root->name);
|
2007-10-19 06:39:35 +00:00
|
|
|
seq_putc(m, ':');
|
2015-10-15 20:41:53 +00:00
|
|
|
/*
|
|
|
|
* On traditional hierarchies, all zombie tasks show up as
|
|
|
|
* belonging to the root cgroup. On the default hierarchy,
|
|
|
|
* while a zombie doesn't show up in "cgroup.procs" and
|
|
|
|
* thus can't be migrated, its /proc/PID/cgroup keeps
|
|
|
|
* reporting the cgroup it belonged to before exiting. If
|
|
|
|
* the cgroup is removed before the zombie is reaped,
|
|
|
|
* " (deleted)" is appended to the cgroup path.
|
|
|
|
*/
|
|
|
|
if (cgroup_on_dfl(cgrp) || !(tsk->flags & PF_EXITING)) {
|
2016-08-10 15:23:44 +00:00
|
|
|
retval = cgroup_path_ns_locked(cgrp, buf, PATH_MAX,
|
2016-01-29 08:54:06 +00:00
|
|
|
current->nsproxy->cgroup_ns);
|
2023-12-12 21:17:40 +00:00
|
|
|
if (retval == -E2BIG)
|
2015-10-15 20:41:53 +00:00
|
|
|
retval = -ENAMETOOLONG;
|
2016-09-29 13:49:40 +00:00
|
|
|
if (retval < 0)
|
2015-10-15 20:41:53 +00:00
|
|
|
goto out_unlock;
|
2016-08-10 15:23:44 +00:00
|
|
|
|
|
|
|
seq_puts(m, buf);
|
2015-10-15 20:41:53 +00:00
|
|
|
} else {
|
2016-08-10 15:23:44 +00:00
|
|
|
seq_puts(m, "/");
|
2014-02-12 14:29:50 +00:00
|
|
|
}
|
2015-10-15 20:41:53 +00:00
|
|
|
|
|
|
|
if (cgroup_on_dfl(cgrp) && cgroup_is_dead(cgrp))
|
|
|
|
seq_puts(m, " (deleted)\n");
|
|
|
|
else
|
|
|
|
seq_putc(m, '\n');
|
2007-10-19 06:39:35 +00:00
|
|
|
}
|
|
|
|
|
2014-09-18 08:03:15 +00:00
|
|
|
retval = 0;
|
2007-10-19 06:39:35 +00:00
|
|
|
out_unlock:
|
2016-06-22 20:28:41 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2023-10-29 06:14:30 +00:00
|
|
|
rcu_read_unlock();
|
2007-10-19 06:39:35 +00:00
|
|
|
kfree(buf);
|
|
|
|
out:
|
|
|
|
return retval;
|
|
|
|
}
|
|
|
|
|
2007-10-19 06:39:33 +00:00
|
|
|
/**
|
2014-02-25 15:04:03 +00:00
|
|
|
* cgroup_fork - initialize cgroup related fields during copy_process()
|
2008-02-23 23:24:09 +00:00
|
|
|
* @child: pointer to task_struct of forking parent process.
|
2007-10-19 06:39:33 +00:00
|
|
|
*
|
2014-02-25 15:04:03 +00:00
|
|
|
* A task is associated with the init_css_set until cgroup_post_fork()
|
2020-02-05 13:26:22 +00:00
|
|
|
* attaches it to the target css_set.
|
2007-10-19 06:39:33 +00:00
|
|
|
*/
|
|
|
|
void cgroup_fork(struct task_struct *child)
|
|
|
|
{
|
2014-02-25 15:04:03 +00:00
|
|
|
RCU_INIT_POINTER(child->cgroups, &init_css_set);
|
2007-10-19 06:39:36 +00:00
|
|
|
INIT_LIST_HEAD(&child->cg_list);
|
2007-10-19 06:39:33 +00:00
|
|
|
}
|
|
|
|
|
2022-10-11 00:33:58 +00:00
|
|
|
/**
|
|
|
|
* cgroup_v1v2_get_from_file - get a cgroup pointer from a file pointer
|
|
|
|
* @f: file corresponding to cgroup_dir
|
|
|
|
*
|
|
|
|
* Find the cgroup from a file pointer associated with a cgroup directory.
|
|
|
|
* Returns a pointer to the cgroup on success. ERR_PTR is returned if the
|
|
|
|
* cgroup cannot be found.
|
|
|
|
*/
|
|
|
|
static struct cgroup *cgroup_v1v2_get_from_file(struct file *f)
|
2020-02-05 13:26:19 +00:00
|
|
|
{
|
|
|
|
struct cgroup_subsys_state *css;
|
|
|
|
|
|
|
|
css = css_tryget_online_from_dir(f->f_path.dentry, NULL);
|
|
|
|
if (IS_ERR(css))
|
|
|
|
return ERR_CAST(css);
|
|
|
|
|
2022-10-11 00:33:58 +00:00
|
|
|
return css->cgroup;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_get_from_file - same as cgroup_v1v2_get_from_file, but only supports
|
|
|
|
* cgroup2.
|
2022-10-11 22:51:55 +00:00
|
|
|
* @f: file corresponding to cgroup2_dir
|
2022-10-11 00:33:58 +00:00
|
|
|
*/
|
|
|
|
static struct cgroup *cgroup_get_from_file(struct file *f)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = cgroup_v1v2_get_from_file(f);
|
|
|
|
|
|
|
|
if (IS_ERR(cgrp))
|
|
|
|
return ERR_CAST(cgrp);
|
|
|
|
|
2022-10-10 21:08:17 +00:00
|
|
|
if (!cgroup_on_dfl(cgrp)) {
|
|
|
|
cgroup_put(cgrp);
|
|
|
|
return ERR_PTR(-EBADF);
|
|
|
|
}
|
|
|
|
|
2020-02-05 13:26:19 +00:00
|
|
|
return cgrp;
|
|
|
|
}
|
|
|
|
|
2020-02-05 13:26:22 +00:00
|
|
|
/**
|
|
|
|
* cgroup_css_set_fork - find or create a css_set for a child process
|
|
|
|
* @kargs: the arguments passed to create the child process
|
|
|
|
*
|
|
|
|
* This functions finds or creates a new css_set which the child
|
|
|
|
* process will be attached to in cgroup_post_fork(). By default,
|
|
|
|
* the child process will be given the same css_set as its parent.
|
|
|
|
*
|
|
|
|
* If CLONE_INTO_CGROUP is specified this function will try to find an
|
|
|
|
* existing css_set which includes the requested cgroup and if not create
|
|
|
|
* a new css_set that the child will be attached to later. If this function
|
|
|
|
* succeeds it will hold cgroup_threadgroup_rwsem on return. If
|
|
|
|
* CLONE_INTO_CGROUP is requested this function will grab cgroup mutex
|
|
|
|
* before grabbing cgroup_threadgroup_rwsem and will hold a reference
|
|
|
|
* to the target cgroup.
|
|
|
|
*/
|
|
|
|
static int cgroup_css_set_fork(struct kernel_clone_args *kargs)
|
|
|
|
__acquires(&cgroup_mutex) __acquires(&cgroup_threadgroup_rwsem)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct cgroup *dst_cgrp = NULL;
|
|
|
|
struct css_set *cset;
|
|
|
|
struct super_block *sb;
|
|
|
|
struct file *f;
|
|
|
|
|
|
|
|
if (kargs->flags & CLONE_INTO_CGROUP)
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_lock();
|
2020-02-05 13:26:22 +00:00
|
|
|
|
|
|
|
cgroup_threadgroup_change_begin(current);
|
|
|
|
|
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
cset = task_css_set(current);
|
|
|
|
get_css_set(cset);
|
|
|
|
spin_unlock_irq(&css_set_lock);
|
|
|
|
|
|
|
|
if (!(kargs->flags & CLONE_INTO_CGROUP)) {
|
|
|
|
kargs->cset = cset;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
f = fget_raw(kargs->cgroup);
|
|
|
|
if (!f) {
|
|
|
|
ret = -EBADF;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
sb = f->f_path.dentry->d_sb;
|
|
|
|
|
|
|
|
dst_cgrp = cgroup_get_from_file(f);
|
|
|
|
if (IS_ERR(dst_cgrp)) {
|
|
|
|
ret = PTR_ERR(dst_cgrp);
|
|
|
|
dst_cgrp = NULL;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (cgroup_is_dead(dst_cgrp)) {
|
|
|
|
ret = -ENODEV;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Verify that we the target cgroup is writable for us. This is
|
|
|
|
* usually done by the vfs layer but since we're not going through
|
|
|
|
* the vfs layer here we need to do it "manually".
|
|
|
|
*/
|
|
|
|
ret = cgroup_may_write(dst_cgrp, sb);
|
|
|
|
if (ret)
|
|
|
|
goto err;
|
|
|
|
|
2022-02-21 15:16:39 +00:00
|
|
|
/*
|
|
|
|
* Spawning a task directly into a cgroup works by passing a file
|
|
|
|
* descriptor to the target cgroup directory. This can even be an O_PATH
|
|
|
|
* file descriptor. But it can never be a cgroup.procs file descriptor.
|
|
|
|
* This was done on purpose so spawning into a cgroup could be
|
|
|
|
* conceptualized as an atomic
|
|
|
|
*
|
|
|
|
* fd = openat(dfd_cgroup, "cgroup.procs", ...);
|
|
|
|
* write(fd, <child-pid>, ...);
|
|
|
|
*
|
|
|
|
* sequence, i.e. it's a shorthand for the caller opening and writing
|
|
|
|
* cgroup.procs of the cgroup indicated by @dfd_cgroup. This allows us
|
|
|
|
* to always use the caller's credentials.
|
|
|
|
*/
|
2020-02-05 13:26:22 +00:00
|
|
|
ret = cgroup_attach_permissions(cset->dfl_cgrp, dst_cgrp, sb,
|
2022-01-06 21:02:29 +00:00
|
|
|
!(kargs->flags & CLONE_THREAD),
|
|
|
|
current->nsproxy->cgroup_ns);
|
2020-02-05 13:26:22 +00:00
|
|
|
if (ret)
|
|
|
|
goto err;
|
|
|
|
|
|
|
|
kargs->cset = find_css_set(cset, dst_cgrp);
|
|
|
|
if (!kargs->cset) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
|
|
|
|
put_css_set(cset);
|
|
|
|
fput(f);
|
|
|
|
kargs->cgrp = dst_cgrp;
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
err:
|
|
|
|
cgroup_threadgroup_change_end(current);
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2020-02-05 13:26:22 +00:00
|
|
|
if (f)
|
|
|
|
fput(f);
|
|
|
|
if (dst_cgrp)
|
|
|
|
cgroup_put(dst_cgrp);
|
|
|
|
put_css_set(cset);
|
|
|
|
if (kargs->cset)
|
|
|
|
put_css_set(kargs->cset);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_css_set_put_fork - drop references we took during fork
|
|
|
|
* @kargs: the arguments passed to create the child process
|
|
|
|
*
|
|
|
|
* Drop references to the prepared css_set and target cgroup if
|
|
|
|
* CLONE_INTO_CGROUP was requested.
|
|
|
|
*/
|
|
|
|
static void cgroup_css_set_put_fork(struct kernel_clone_args *kargs)
|
|
|
|
__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
|
|
|
|
{
|
2023-05-21 19:29:53 +00:00
|
|
|
struct cgroup *cgrp = kargs->cgrp;
|
|
|
|
struct css_set *cset = kargs->cset;
|
|
|
|
|
2020-02-05 13:26:22 +00:00
|
|
|
cgroup_threadgroup_change_end(current);
|
|
|
|
|
2023-05-21 19:29:53 +00:00
|
|
|
if (cset) {
|
|
|
|
put_css_set(cset);
|
|
|
|
kargs->cset = NULL;
|
|
|
|
}
|
2020-02-05 13:26:22 +00:00
|
|
|
|
2023-05-21 19:29:53 +00:00
|
|
|
if (kargs->flags & CLONE_INTO_CGROUP) {
|
2023-03-03 09:53:10 +00:00
|
|
|
cgroup_unlock();
|
2020-02-05 13:26:22 +00:00
|
|
|
if (cgrp) {
|
|
|
|
cgroup_put(cgrp);
|
|
|
|
kargs->cgrp = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-06-09 11:32:09 +00:00
|
|
|
/**
|
|
|
|
* cgroup_can_fork - called on a new task before the process is exposed
|
2020-02-05 13:26:20 +00:00
|
|
|
* @child: the child process
|
2022-01-08 06:38:12 +00:00
|
|
|
* @kargs: the arguments passed to create the child process
|
2015-06-09 11:32:09 +00:00
|
|
|
*
|
2020-02-05 13:26:22 +00:00
|
|
|
* This prepares a new css_set for the child process which the child will
|
|
|
|
* be attached to in cgroup_post_fork().
|
2020-02-05 13:26:20 +00:00
|
|
|
* This calls the subsystem can_fork() callbacks. If the cgroup_can_fork()
|
|
|
|
* callback returns an error, the fork aborts with that error code. This
|
|
|
|
* allows for a cgroup subsystem to conditionally allow or deny new forks.
|
2015-06-09 11:32:09 +00:00
|
|
|
*/
|
2020-02-05 13:26:22 +00:00
|
|
|
int cgroup_can_fork(struct task_struct *child, struct kernel_clone_args *kargs)
|
2015-06-09 11:32:09 +00:00
|
|
|
{
|
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int i, j, ret;
|
|
|
|
|
2020-02-05 13:26:22 +00:00
|
|
|
ret = cgroup_css_set_fork(kargs);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2020-02-05 13:26:20 +00:00
|
|
|
|
2016-02-23 03:25:46 +00:00
|
|
|
do_each_subsys_mask(ss, i, have_canfork_callback) {
|
2020-02-05 13:26:22 +00:00
|
|
|
ret = ss->can_fork(child, kargs->cset);
|
2015-06-09 11:32:09 +00:00
|
|
|
if (ret)
|
|
|
|
goto out_revert;
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
2015-06-09 11:32:09 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
out_revert:
|
|
|
|
for_each_subsys(ss, j) {
|
|
|
|
if (j >= i)
|
|
|
|
break;
|
|
|
|
if (ss->cancel_fork)
|
2020-02-05 13:26:22 +00:00
|
|
|
ss->cancel_fork(child, kargs->cset);
|
2015-06-09 11:32:09 +00:00
|
|
|
}
|
|
|
|
|
2020-02-05 13:26:22 +00:00
|
|
|
cgroup_css_set_put_fork(kargs);
|
2020-02-05 13:26:20 +00:00
|
|
|
|
2015-06-09 11:32:09 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_cancel_fork - called if a fork failed after cgroup_can_fork()
|
2020-02-05 13:26:22 +00:00
|
|
|
* @child: the child process
|
|
|
|
* @kargs: the arguments passed to create the child process
|
2015-06-09 11:32:09 +00:00
|
|
|
*
|
|
|
|
* This calls the cancel_fork() callbacks if a fork failed *after*
|
2021-05-24 08:29:43 +00:00
|
|
|
* cgroup_can_fork() succeeded and cleans up references we took to
|
2020-02-05 13:26:22 +00:00
|
|
|
* prepare a new css_set for the child process in cgroup_can_fork().
|
2015-06-09 11:32:09 +00:00
|
|
|
*/
|
2020-02-05 13:26:22 +00:00
|
|
|
void cgroup_cancel_fork(struct task_struct *child,
|
|
|
|
struct kernel_clone_args *kargs)
|
2015-06-09 11:32:09 +00:00
|
|
|
{
|
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for_each_subsys(ss, i)
|
|
|
|
if (ss->cancel_fork)
|
2020-02-05 13:26:22 +00:00
|
|
|
ss->cancel_fork(child, kargs->cset);
|
2020-02-05 13:26:20 +00:00
|
|
|
|
2020-02-05 13:26:22 +00:00
|
|
|
cgroup_css_set_put_fork(kargs);
|
2015-06-09 11:32:09 +00:00
|
|
|
}
|
|
|
|
|
2007-10-19 06:39:36 +00:00
|
|
|
/**
|
2020-02-05 13:26:20 +00:00
|
|
|
* cgroup_post_fork - finalize cgroup setup for the child process
|
|
|
|
* @child: the child process
|
2022-01-08 06:38:12 +00:00
|
|
|
* @kargs: the arguments passed to create the child process
|
2008-02-23 23:24:09 +00:00
|
|
|
*
|
2020-02-05 13:26:20 +00:00
|
|
|
* Attach the child process to its css_set calling the subsystem fork()
|
|
|
|
* callbacks.
|
2008-02-23 23:24:09 +00:00
|
|
|
*/
|
2020-02-05 13:26:22 +00:00
|
|
|
void cgroup_post_fork(struct task_struct *child,
|
|
|
|
struct kernel_clone_args *kargs)
|
|
|
|
__releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex)
|
2007-10-19 06:39:36 +00:00
|
|
|
{
|
2021-05-08 12:15:38 +00:00
|
|
|
unsigned long cgrp_flags = 0;
|
|
|
|
bool kill = false;
|
2013-06-25 18:53:37 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2019-10-24 19:03:51 +00:00
|
|
|
struct css_set *cset;
|
2012-10-16 22:03:14 +00:00
|
|
|
int i;
|
|
|
|
|
2020-02-05 13:26:22 +00:00
|
|
|
cset = kargs->cset;
|
|
|
|
kargs->cset = NULL;
|
|
|
|
|
2019-10-24 19:03:51 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
|
2020-01-30 16:37:33 +00:00
|
|
|
/* init tasks are special, only link regular threads */
|
|
|
|
if (likely(child->pid)) {
|
2021-05-08 12:15:38 +00:00
|
|
|
if (kargs->cgrp)
|
|
|
|
cgrp_flags = kargs->cgrp->flags;
|
|
|
|
else
|
|
|
|
cgrp_flags = cset->dfl_cgrp->flags;
|
|
|
|
|
2020-01-30 16:37:33 +00:00
|
|
|
WARN_ON_ONCE(!list_empty(&child->cg_list));
|
|
|
|
cset->nr_tasks++;
|
|
|
|
css_set_move_task(child, NULL, cset, false);
|
2020-02-05 13:26:22 +00:00
|
|
|
} else {
|
|
|
|
put_css_set(cset);
|
|
|
|
cset = NULL;
|
2020-01-30 16:37:33 +00:00
|
|
|
}
|
2019-10-24 19:03:51 +00:00
|
|
|
|
2021-05-08 12:15:38 +00:00
|
|
|
if (!(child->flags & PF_KTHREAD)) {
|
|
|
|
if (unlikely(test_bit(CGRP_FREEZE, &cgrp_flags))) {
|
|
|
|
/*
|
|
|
|
* If the cgroup has to be frozen, the new task has
|
|
|
|
* too. Let's set the JOBCTL_TRAP_FREEZE jobctl bit to
|
|
|
|
* get the task into the frozen state.
|
|
|
|
*/
|
|
|
|
spin_lock(&child->sighand->siglock);
|
|
|
|
WARN_ON_ONCE(child->frozen);
|
|
|
|
child->jobctl |= JOBCTL_TRAP_FREEZE;
|
|
|
|
spin_unlock(&child->sighand->siglock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calling cgroup_update_frozen() isn't required here,
|
|
|
|
* because it will be called anyway a bit later from
|
|
|
|
* do_freezer_trap(). So we avoid cgroup's transient
|
|
|
|
* switch from the frozen state and back.
|
|
|
|
*/
|
|
|
|
}
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
|
|
|
|
/*
|
2021-05-08 12:15:38 +00:00
|
|
|
* If the cgroup is to be killed notice it now and take the
|
|
|
|
* child down right after we finished preparing it for
|
|
|
|
* userspace.
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
*/
|
2021-05-08 12:15:38 +00:00
|
|
|
kill = test_bit(CGRP_KILL, &cgrp_flags);
|
2007-10-19 06:39:36 +00:00
|
|
|
}
|
2012-10-16 22:03:14 +00:00
|
|
|
|
2019-10-24 19:03:51 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
|
|
|
|
2012-10-16 22:03:14 +00:00
|
|
|
/*
|
|
|
|
* Call ss->fork(). This must happen after @child is linked on
|
|
|
|
* css_set; otherwise, @child might change state between ->fork()
|
|
|
|
* and addition to css_set.
|
|
|
|
*/
|
2016-02-23 03:25:46 +00:00
|
|
|
do_each_subsys_mask(ss, i, have_fork_callback) {
|
2015-12-03 15:24:08 +00:00
|
|
|
ss->fork(child);
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
2020-02-05 13:26:20 +00:00
|
|
|
|
2020-02-05 13:26:22 +00:00
|
|
|
/* Make the new cset the root_cset of the new cgroup namespace. */
|
|
|
|
if (kargs->flags & CLONE_NEWCGROUP) {
|
|
|
|
struct css_set *rcset = child->nsproxy->cgroup_ns->root_cset;
|
|
|
|
|
|
|
|
get_css_set(cset);
|
|
|
|
child->nsproxy->cgroup_ns->root_cset = cset;
|
|
|
|
put_css_set(rcset);
|
|
|
|
}
|
|
|
|
|
2021-05-08 12:15:38 +00:00
|
|
|
/* Cgroup has to be killed so take down child immediately. */
|
|
|
|
if (unlikely(kill))
|
|
|
|
do_send_sig_info(SIGKILL, SEND_SIG_NOINFO, child, PIDTYPE_TGID);
|
|
|
|
|
2020-02-05 13:26:22 +00:00
|
|
|
cgroup_css_set_put_fork(kargs);
|
2007-10-19 06:39:36 +00:00
|
|
|
}
|
2012-10-16 22:03:14 +00:00
|
|
|
|
2007-10-19 06:39:33 +00:00
|
|
|
/**
|
|
|
|
* cgroup_exit - detach cgroup from exiting task
|
|
|
|
* @tsk: pointer to task_struct of exiting process
|
|
|
|
*
|
2019-10-04 10:57:39 +00:00
|
|
|
* Description: Detach cgroup from @tsk.
|
2007-10-19 06:39:33 +00:00
|
|
|
*
|
|
|
|
*/
|
2014-03-28 07:22:19 +00:00
|
|
|
void cgroup_exit(struct task_struct *tsk)
|
2007-10-19 06:39:33 +00:00
|
|
|
{
|
2013-06-25 18:53:37 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2013-06-13 04:04:49 +00:00
|
|
|
struct css_set *cset;
|
2011-02-07 16:02:20 +00:00
|
|
|
int i;
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2019-10-24 19:03:51 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
2015-10-15 20:41:49 +00:00
|
|
|
|
2019-10-24 19:03:51 +00:00
|
|
|
WARN_ON_ONCE(list_empty(&tsk->cg_list));
|
|
|
|
cset = task_css_set(tsk);
|
|
|
|
css_set_move_task(tsk, cset, NULL, false);
|
|
|
|
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
|
|
|
|
cset->nr_tasks--;
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
|
2023-05-08 07:58:51 +00:00
|
|
|
if (dl_task(tsk))
|
|
|
|
dec_dl_tasks_cs(tsk);
|
|
|
|
|
2019-10-24 19:03:51 +00:00
|
|
|
WARN_ON_ONCE(cgroup_task_frozen(tsk));
|
2021-05-10 21:39:46 +00:00
|
|
|
if (unlikely(!(tsk->flags & PF_KTHREAD) &&
|
|
|
|
test_bit(CGRP_FREEZE, &task_dfl_cgroup(tsk)->flags)))
|
2019-10-24 19:03:51 +00:00
|
|
|
cgroup_update_frozen(task_dfl_cgroup(tsk));
|
cgroup: cgroup v2 freezer
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
2019-04-19 17:03:04 +00:00
|
|
|
|
2019-10-24 19:03:51 +00:00
|
|
|
spin_unlock_irq(&css_set_lock);
|
2007-10-19 06:39:36 +00:00
|
|
|
|
2015-06-06 00:02:14 +00:00
|
|
|
/* see cgroup_post_fork() for details */
|
2016-02-23 03:25:46 +00:00
|
|
|
do_each_subsys_mask(ss, i, have_exit_callback) {
|
2015-10-15 20:41:53 +00:00
|
|
|
ss->exit(tsk);
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
2015-10-15 20:41:53 +00:00
|
|
|
}
|
2013-06-25 18:53:37 +00:00
|
|
|
|
2019-01-28 16:00:13 +00:00
|
|
|
void cgroup_release(struct task_struct *task)
|
2015-10-15 20:41:53 +00:00
|
|
|
{
|
2015-10-15 20:41:53 +00:00
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
|
|
|
|
2019-01-28 16:00:13 +00:00
|
|
|
do_each_subsys_mask(ss, ssid, have_release_callback) {
|
|
|
|
ss->release(task);
|
2016-02-23 03:25:46 +00:00
|
|
|
} while_each_subsys_mask();
|
2019-05-31 17:38:58 +00:00
|
|
|
|
2019-10-24 19:03:51 +00:00
|
|
|
spin_lock_irq(&css_set_lock);
|
|
|
|
css_set_skip_task_iters(task_css_set(task), task);
|
|
|
|
list_del_init(&task->cg_list);
|
|
|
|
spin_unlock_irq(&css_set_lock);
|
2019-01-28 16:00:13 +00:00
|
|
|
}
|
2011-02-07 16:02:20 +00:00
|
|
|
|
2019-01-28 16:00:13 +00:00
|
|
|
void cgroup_free(struct task_struct *task)
|
|
|
|
{
|
|
|
|
struct css_set *cset = task_css_set(task);
|
2015-10-15 20:41:53 +00:00
|
|
|
put_css_set(cset);
|
2007-10-19 06:39:33 +00:00
|
|
|
}
|
2007-10-19 06:39:34 +00:00
|
|
|
|
2008-04-04 21:29:57 +00:00
|
|
|
static int __init cgroup_disable(char *str)
|
|
|
|
{
|
2013-06-25 18:53:37 +00:00
|
|
|
struct cgroup_subsys *ss;
|
2008-04-04 21:29:57 +00:00
|
|
|
char *token;
|
2013-06-25 18:53:37 +00:00
|
|
|
int i;
|
2008-04-04 21:29:57 +00:00
|
|
|
|
|
|
|
while ((token = strsep(&str, ",")) != NULL) {
|
|
|
|
if (!*token)
|
|
|
|
continue;
|
2012-09-13 07:50:55 +00:00
|
|
|
|
2014-02-08 15:36:58 +00:00
|
|
|
for_each_subsys(ss, i) {
|
2015-08-18 20:58:16 +00:00
|
|
|
if (strcmp(token, ss->name) &&
|
|
|
|
strcmp(token, ss->legacy_name))
|
|
|
|
continue;
|
2021-05-12 20:19:46 +00:00
|
|
|
|
|
|
|
static_branch_disable(cgroup_subsys_enabled_key[i]);
|
|
|
|
pr_info("Disabling %s control group subsystem\n",
|
|
|
|
ss->name);
|
2008-04-04 21:29:57 +00:00
|
|
|
}
|
2021-05-24 19:53:39 +00:00
|
|
|
|
|
|
|
for (i = 0; i < OPT_FEATURE_COUNT; i++) {
|
|
|
|
if (strcmp(token, cgroup_opt_feature_names[i]))
|
|
|
|
continue;
|
|
|
|
cgroup_feature_disable_mask |= 1 << i;
|
|
|
|
pr_info("Disabling %s control group feature\n",
|
|
|
|
cgroup_opt_feature_names[i]);
|
|
|
|
break;
|
|
|
|
}
|
2008-04-04 21:29:57 +00:00
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
__setup("cgroup_disable=", cgroup_disable);
|
cgroup: CSS ID support
Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.
This patch attaches unique ID to each css and provides following.
- css_lookup(subsys, id)
returns pointer to struct cgroup_subysys_state of id.
- css_get_next(subsys, id, rootid, depth, foundid)
returns the next css under "root" by scanning
When cgroup_subsys->use_id is set, an id for css is maintained.
The cgroup framework only parepares
- css_id of root css for subsys
- id is automatically attached at creation of css.
- id is *not* freed automatically. Because the cgroup framework
don't know lifetime of cgroup_subsys_state.
free_css_id() function is provided. This must be called by subsys.
There are several reasons to develop this.
- Saving space .... For example, memcg's swap_cgroup is array of
pointers to cgroup. But it is not necessary to be very fast.
By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
reduce much amount of memory usage.
- Scanning without lock.
CSS_ID provides "scan id under this ROOT" function. By this, scanning
css under root can be written without locks.
ex)
do {
rcu_read_lock();
next = cgroup_get_next(subsys, id, root, &found);
/* check sanity of next here */
css_tryget();
rcu_read_unlock();
id = found + 1
} while(...)
Characteristics:
- Each css has unique ID under subsys.
- Lifetime of ID is controlled by subsys.
- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
- Allowed ID is 1-65535, ID 0 is UNUSED ID.
Design Choices:
- scan-by-ID v.s. scan-by-tree-walk.
As /proc's pid scan does, scan-by-ID is robust when scanning is done
by following kind of routine.
scan -> rest a while(release a lock) -> conitunue from interrupted
memcg's hierarchical reclaim does this.
- When subsys->use_id is set, # of css in the system is limited to
65535.
[bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 23:57:25 +00:00
|
|
|
|
2018-11-08 15:08:46 +00:00
|
|
|
void __init __weak enable_debug_cgroup(void) { }
|
|
|
|
|
|
|
|
static int __init enable_cgroup_debug(char *str)
|
|
|
|
{
|
|
|
|
cgroup_debug = true;
|
|
|
|
enable_debug_cgroup();
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
__setup("cgroup_debug", enable_cgroup_debug);
|
|
|
|
|
2023-09-27 14:25:40 +00:00
|
|
|
static int __init cgroup_favordynmods_setup(char *str)
|
|
|
|
{
|
|
|
|
return (kstrtobool(str, &have_favordynmods) == 0);
|
|
|
|
}
|
|
|
|
__setup("cgroup_favordynmods=", cgroup_favordynmods_setup);
|
|
|
|
|
2013-08-13 15:01:54 +00:00
|
|
|
/**
|
2014-05-13 16:11:01 +00:00
|
|
|
* css_tryget_online_from_dir - get corresponding css from a cgroup dentry
|
2013-08-26 22:40:56 +00:00
|
|
|
* @dentry: directory dentry of interest
|
|
|
|
* @ss: subsystem of interest
|
2013-08-13 15:01:54 +00:00
|
|
|
*
|
2014-02-11 16:52:47 +00:00
|
|
|
* If @dentry is a directory for a cgroup which has @ss enabled on it, try
|
|
|
|
* to get the corresponding css and return it. If such css doesn't exist
|
|
|
|
* or can't be pinned, an ERR_PTR value is returned.
|
2011-02-14 09:20:01 +00:00
|
|
|
*/
|
2014-05-13 16:11:01 +00:00
|
|
|
struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
|
|
|
|
struct cgroup_subsys *ss)
|
2011-02-14 09:20:01 +00:00
|
|
|
{
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
struct kernfs_node *kn = kernfs_node_from_dentry(dentry);
|
2016-02-23 15:00:51 +00:00
|
|
|
struct file_system_type *s_type = dentry->d_sb->s_type;
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
struct cgroup_subsys_state *css = NULL;
|
2011-02-14 09:20:01 +00:00
|
|
|
struct cgroup *cgrp;
|
|
|
|
|
2013-08-26 22:40:56 +00:00
|
|
|
/* is @dentry a cgroup dir? */
|
2016-02-23 15:00:51 +00:00
|
|
|
if ((s_type != &cgroup_fs_type && s_type != &cgroup2_fs_type) ||
|
|
|
|
!kn || kernfs_type(kn) != KERNFS_DIR)
|
2011-02-14 09:20:01 +00:00
|
|
|
return ERR_PTR(-EBADF);
|
|
|
|
|
2014-02-11 16:52:47 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
/*
|
|
|
|
* This path doesn't originate from kernfs and @kn could already
|
|
|
|
* have been or be removed at any point. @kn->priv is RCU
|
2014-09-04 06:43:07 +00:00
|
|
|
* protected for this access. See css_release_work_fn() for details.
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
*/
|
2016-12-27 19:49:09 +00:00
|
|
|
cgrp = rcu_dereference(*(void __rcu __force **)&kn->priv);
|
cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs. Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.
Locking and object lifetime management being entangled with vfs is
probably the most egregious. vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work. Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.
Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs. This patch converts cgroup to use kernfs,
which will bring the following benefits.
* Separation from vfs internals. Locking and object lifetime
management is contained in cgroup proper making things a lot
simpler. This removes significant amount of locking convolutions,
hairy object lifetime rules and the restriction on multi-cgroup
operations.
* Can drop a lot of code to implement filesystem interface as most are
provided by kernfs.
* Proper "severing" semantics, which allows controllers to not worry
about lingering file accesses after offline.
While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large. The rest of the commit
message lists notable changes in different areas.
Overall
-------
* vfs constructs replaced with kernfs ones. cgroup->dentry w/ ->kn,
cgroupfs_root->sb w/ ->kf_root.
* All dentry accessors are removed. Helpers to map from kernfs
constructs are added.
* All vfs plumbing around dentry, inode and bdi removed.
* cgroup_mount() now directly looks for matching root and then
proceeds to create a new one if not found.
Synchronization and object lifetime
-----------------------------------
* vfs inode locking removed. Among other things, this removes the
need for the convolution in cgroup_cfts_commit(). Future patches
will further simplify it.
* vfs refcnting replaced with cgroup internal ones. cgroup->refcnt,
cgroupfs_root->refcnt added. cgroup_put_root() now directly puts
root->refcnt and when it reaches zero proceeds to destroy it thus
merging cgroup_put_root() and the former cgroup_kill_sb().
Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
when refcnt reaches zero.
* Unlike before, kernfs objects don't hold onto cgroup objects. When
cgroup destroys a kernfs node, all existing operations are drained
and the association is broken immediately. The same for
cgroupfs_roots and mounts.
* All operations which come through kernfs guarantee that the
associated cgroup is and stays valid for the duration of operation;
however, there are two paths which need to find out the associated
cgroup from dentry without going through kernfs -
css_tryget_from_dir() and cgroupstats_build(). For these two,
kernfs_node->priv is RCU managed so that they can dereference it
under RCU read lock.
File and directory handling
---------------------------
* File and directory operations converted to kernfs_ops and
kernfs_syscall_ops.
* xattrs is implicitly supported by kernfs. No need to worry about it
from cgroup. This means that "xattr" mount option is no longer
necessary. A future patch will add a deprecated warning message
when sane_behavior.
* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
private copy of one of the kernfs_ops to set its atomic_write_len.
cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
to handle it.
* cftype->lockdep_key added so that kernfs lockdep annotation can be
per cftype.
* Inidividual file entries and open states are now managed by kernfs.
No need to worry about them from cgroup. cfent, cgroup_open_file
and their friends are removed.
* kernfs_nodes are created deactivated and kernfs_activate()
invocations added to places where creation of new nodes are
committed.
* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
self-removal.
v2: - Li pointed out in an earlier patch that specifying "name="
during mount without subsystem specification should succeed if
there's an existing hierarchy with a matching name although it
should fail with -EINVAL if a new hierarchy should be created.
Prior to the conversion, this used by handled by deferring
failure from NULL return from cgroup_root_from_opts(), which was
necessary because root was being created before checking for
existing ones. Note that cgroup_root_from_opts() returned an
ERR_PTR() value for error conditions which require immediate
mount failure.
As we now have separate search and creation steps, deferring
failure from cgroup_root_from_opts() is no longer necessary.
cgroup_root_from_opts() is updated to always return ERR_PTR()
value on failure.
- The logic to match existing roots is updated so that a mount
attempt with a matching name but different subsys_mask are
rejected. This was handled by a separate matching loop under
the comment "Check for name clashes with existing mounts" but
got lost during conversion. Merge the check into the main
search loop.
- Add __rcu __force casting in RCU_INIT_POINTER() in
cgroup_destroy_locked() to avoid the sparse address space
warning reported by kbuild test bot. Maybe we want an explicit
interface to use kn->priv as RCU protected pointer?
v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to
cgroup_idr with cgroup_mutex").
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 16:52:49 +00:00
|
|
|
if (cgrp)
|
|
|
|
css = cgroup_css(cgrp, ss);
|
2014-02-11 16:52:47 +00:00
|
|
|
|
2014-05-13 16:11:01 +00:00
|
|
|
if (!css || !css_tryget_online(css))
|
2014-02-11 16:52:47 +00:00
|
|
|
css = ERR_PTR(-ENOENT);
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
return css;
|
2011-02-14 09:20:01 +00:00
|
|
|
}
|
|
|
|
|
2013-08-19 02:05:24 +00:00
|
|
|
/**
|
|
|
|
* css_from_id - lookup css by id
|
|
|
|
* @id: the cgroup id
|
|
|
|
* @ss: cgroup subsys to be looked into
|
|
|
|
*
|
|
|
|
* Returns the css if there's valid one with @id, otherwise returns NULL.
|
|
|
|
* Should be called under rcu_read_lock().
|
|
|
|
*/
|
|
|
|
struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
|
|
|
|
{
|
2014-05-04 19:09:13 +00:00
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held());
|
2016-06-17 16:24:27 +00:00
|
|
|
return idr_find(&ss->css_idr, id);
|
2011-02-14 09:20:01 +00:00
|
|
|
}
|
|
|
|
|
2015-11-20 20:55:52 +00:00
|
|
|
/**
|
|
|
|
* cgroup_get_from_path - lookup and get a cgroup from its default hierarchy path
|
|
|
|
* @path: path on the default hierarchy
|
|
|
|
*
|
|
|
|
* Find the cgroup at @path on the default hierarchy, increment its
|
|
|
|
* reference count and return it. Returns pointer to the found cgroup on
|
2021-10-25 06:19:14 +00:00
|
|
|
* success, ERR_PTR(-ENOENT) if @path doesn't exist or if the cgroup has already
|
|
|
|
* been released and ERR_PTR(-ENOTDIR) if @path points to a non-directory.
|
2015-11-20 20:55:52 +00:00
|
|
|
*/
|
|
|
|
struct cgroup *cgroup_get_from_path(const char *path)
|
|
|
|
{
|
|
|
|
struct kernfs_node *kn;
|
2021-10-25 06:19:14 +00:00
|
|
|
struct cgroup *cgrp = ERR_PTR(-ENOENT);
|
2022-08-26 16:52:35 +00:00
|
|
|
struct cgroup *root_cgrp;
|
2015-11-20 20:55:52 +00:00
|
|
|
|
2022-10-10 08:29:18 +00:00
|
|
|
root_cgrp = current_cgns_cgroup_dfl();
|
2022-08-26 16:52:35 +00:00
|
|
|
kn = kernfs_walk_and_get(root_cgrp->kn, path);
|
2021-10-25 06:19:14 +00:00
|
|
|
if (!kn)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (kernfs_type(kn) != KERNFS_DIR) {
|
|
|
|
cgrp = ERR_PTR(-ENOTDIR);
|
|
|
|
goto out_kernfs;
|
2015-11-20 20:55:52 +00:00
|
|
|
}
|
|
|
|
|
2021-10-25 06:19:14 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
|
|
|
|
cgrp = rcu_dereference(*(void __rcu __force **)&kn->priv);
|
|
|
|
if (!cgrp || !cgroup_tryget(cgrp))
|
|
|
|
cgrp = ERR_PTR(-ENOENT);
|
|
|
|
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
out_kernfs:
|
|
|
|
kernfs_put(kn);
|
|
|
|
out:
|
2015-11-20 20:55:52 +00:00
|
|
|
return cgrp;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(cgroup_get_from_path);
|
|
|
|
|
2016-06-30 17:28:42 +00:00
|
|
|
/**
|
2022-10-11 22:51:55 +00:00
|
|
|
* cgroup_v1v2_get_from_fd - get a cgroup pointer from a fd
|
2022-10-11 00:33:58 +00:00
|
|
|
* @fd: fd obtained by open(cgroup_dir)
|
2016-06-30 17:28:42 +00:00
|
|
|
*
|
|
|
|
* Find the cgroup from a fd which should be obtained
|
|
|
|
* by opening a cgroup directory. Returns a pointer to the
|
|
|
|
* cgroup on success. ERR_PTR is returned if the cgroup
|
|
|
|
* cannot be found.
|
|
|
|
*/
|
2022-10-11 00:33:58 +00:00
|
|
|
struct cgroup *cgroup_v1v2_get_from_fd(int fd)
|
2016-06-30 17:28:42 +00:00
|
|
|
{
|
|
|
|
struct cgroup *cgrp;
|
2022-08-21 16:01:26 +00:00
|
|
|
struct fd f = fdget_raw(fd);
|
|
|
|
if (!f.file)
|
2016-06-30 17:28:42 +00:00
|
|
|
return ERR_PTR(-EBADF);
|
|
|
|
|
2022-08-21 16:01:26 +00:00
|
|
|
cgrp = cgroup_v1v2_get_from_file(f.file);
|
|
|
|
fdput(f);
|
2016-06-30 17:28:42 +00:00
|
|
|
return cgrp;
|
|
|
|
}
|
2022-10-11 00:33:58 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_get_from_fd - same as cgroup_v1v2_get_from_fd, but only supports
|
|
|
|
* cgroup2.
|
2022-10-11 22:51:55 +00:00
|
|
|
* @fd: fd obtained by open(cgroup2_dir)
|
2022-10-11 00:33:58 +00:00
|
|
|
*/
|
|
|
|
struct cgroup *cgroup_get_from_fd(int fd)
|
|
|
|
{
|
|
|
|
struct cgroup *cgrp = cgroup_v1v2_get_from_fd(fd);
|
|
|
|
|
|
|
|
if (IS_ERR(cgrp))
|
|
|
|
return ERR_CAST(cgrp);
|
|
|
|
|
|
|
|
if (!cgroup_on_dfl(cgrp)) {
|
|
|
|
cgroup_put(cgrp);
|
|
|
|
return ERR_PTR(-EBADF);
|
|
|
|
}
|
|
|
|
return cgrp;
|
|
|
|
}
|
2016-06-30 17:28:42 +00:00
|
|
|
EXPORT_SYMBOL_GPL(cgroup_get_from_fd);
|
|
|
|
|
2019-06-14 17:12:45 +00:00
|
|
|
static u64 power_of_ten(int power)
|
|
|
|
{
|
|
|
|
u64 v = 1;
|
|
|
|
while (power--)
|
|
|
|
v *= 10;
|
|
|
|
return v;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* cgroup_parse_float - parse a floating number
|
|
|
|
* @input: input string
|
|
|
|
* @dec_shift: number of decimal digits to shift
|
|
|
|
* @v: output
|
|
|
|
*
|
|
|
|
* Parse a decimal floating point number in @input and store the result in
|
|
|
|
* @v with decimal point right shifted @dec_shift times. For example, if
|
|
|
|
* @input is "12.3456" and @dec_shift is 3, *@v will be set to 12345.
|
|
|
|
* Returns 0 on success, -errno otherwise.
|
|
|
|
*
|
|
|
|
* There's nothing cgroup specific about this function except that it's
|
|
|
|
* currently the only user.
|
|
|
|
*/
|
|
|
|
int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
|
|
|
|
{
|
|
|
|
s64 whole, frac = 0;
|
|
|
|
int fstart = 0, fend = 0, flen;
|
|
|
|
|
|
|
|
if (!sscanf(input, "%lld.%n%lld%n", &whole, &fstart, &frac, &fend))
|
|
|
|
return -EINVAL;
|
|
|
|
if (frac < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
flen = fend > fstart ? fend - fstart : 0;
|
|
|
|
if (flen < dec_shift)
|
|
|
|
frac *= power_of_ten(dec_shift - flen);
|
|
|
|
else
|
|
|
|
frac = DIV_ROUND_CLOSEST_ULL(frac, power_of_ten(flen - dec_shift));
|
|
|
|
|
|
|
|
*v = whole * power_of_ten(dec_shift) + frac;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 22:38:53 +00:00
|
|
|
/*
|
|
|
|
* sock->sk_cgrp_data handling. For more info, see sock_cgroup_data
|
|
|
|
* definition in cgroup-defs.h.
|
|
|
|
*/
|
|
|
|
#ifdef CONFIG_SOCK_CGROUP_DATA
|
|
|
|
|
|
|
|
void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
|
|
|
|
{
|
bpf, cgroup: Assign cgroup in cgroup_sk_alloc when called from interrupt
If cgroup_sk_alloc() is called from interrupt context, then just assign the
root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups:
Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later
on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and
iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather
than re-adding the NULL-test to the fast-path we can just assign it once from
cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from
NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp
directly does /not/ change behavior for callers of sock_cgroup_ptr().
syzkaller was able to trigger a splat in the legacy netrom code base, where
the RX handler in nr_rx_frame() calls nr_make_new() which calls sk_alloc()
and therefore cgroup_sk_alloc() with in_interrupt() condition. Thus the NULL
skcd->cgroup, where it trips over on cgroup_sk_free() side given it expects
a non-NULL object. There are a few other candidates aside from netrom which
have similar pattern where in their accept-like implementation, they just call
to sk_alloc() and thus cgroup_sk_alloc() instead of sk_clone_lock() with the
corresponding cgroup_sk_clone() which then inherits the cgroup from the parent
socket. None of them are related to core protocols where BPF cgroup programs
are running from. However, in future, they should follow to implement a similar
inheritance mechanism.
Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID
configuration, the same issue was exposed also prior to 8520e224f547 due to
commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated
cgroup") which added the early in_interrupt() return back then.
Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode")
Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup")
Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net
2021-09-27 12:39:20 +00:00
|
|
|
struct cgroup *cgroup;
|
2020-03-10 05:16:05 +00:00
|
|
|
|
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 22:38:53 +00:00
|
|
|
rcu_read_lock();
|
bpf, cgroup: Assign cgroup in cgroup_sk_alloc when called from interrupt
If cgroup_sk_alloc() is called from interrupt context, then just assign the
root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups:
Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later
on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and
iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather
than re-adding the NULL-test to the fast-path we can just assign it once from
cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from
NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp
directly does /not/ change behavior for callers of sock_cgroup_ptr().
syzkaller was able to trigger a splat in the legacy netrom code base, where
the RX handler in nr_rx_frame() calls nr_make_new() which calls sk_alloc()
and therefore cgroup_sk_alloc() with in_interrupt() condition. Thus the NULL
skcd->cgroup, where it trips over on cgroup_sk_free() side given it expects
a non-NULL object. There are a few other candidates aside from netrom which
have similar pattern where in their accept-like implementation, they just call
to sk_alloc() and thus cgroup_sk_alloc() instead of sk_clone_lock() with the
corresponding cgroup_sk_clone() which then inherits the cgroup from the parent
socket. None of them are related to core protocols where BPF cgroup programs
are running from. However, in future, they should follow to implement a similar
inheritance mechanism.
Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID
configuration, the same issue was exposed also prior to 8520e224f547 due to
commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated
cgroup") which added the early in_interrupt() return back then.
Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode")
Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup")
Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net
2021-09-27 12:39:20 +00:00
|
|
|
/* Don't associate the sock with unrelated interrupted task's cgroup. */
|
|
|
|
if (in_interrupt()) {
|
|
|
|
cgroup = &cgrp_dfl_root.cgrp;
|
|
|
|
cgroup_get(cgroup);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 22:38:53 +00:00
|
|
|
while (true) {
|
|
|
|
struct css_set *cset;
|
|
|
|
|
|
|
|
cset = task_css_set(current);
|
|
|
|
if (likely(cgroup_tryget(cset->dfl_cgrp))) {
|
bpf, cgroup: Assign cgroup in cgroup_sk_alloc when called from interrupt
If cgroup_sk_alloc() is called from interrupt context, then just assign the
root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups:
Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later
on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and
iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather
than re-adding the NULL-test to the fast-path we can just assign it once from
cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from
NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp
directly does /not/ change behavior for callers of sock_cgroup_ptr().
syzkaller was able to trigger a splat in the legacy netrom code base, where
the RX handler in nr_rx_frame() calls nr_make_new() which calls sk_alloc()
and therefore cgroup_sk_alloc() with in_interrupt() condition. Thus the NULL
skcd->cgroup, where it trips over on cgroup_sk_free() side given it expects
a non-NULL object. There are a few other candidates aside from netrom which
have similar pattern where in their accept-like implementation, they just call
to sk_alloc() and thus cgroup_sk_alloc() instead of sk_clone_lock() with the
corresponding cgroup_sk_clone() which then inherits the cgroup from the parent
socket. None of them are related to core protocols where BPF cgroup programs
are running from. However, in future, they should follow to implement a similar
inheritance mechanism.
Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID
configuration, the same issue was exposed also prior to 8520e224f547 due to
commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated
cgroup") which added the early in_interrupt() return back then.
Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode")
Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup")
Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net
2021-09-27 12:39:20 +00:00
|
|
|
cgroup = cset->dfl_cgrp;
|
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 22:38:53 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
cpu_relax();
|
|
|
|
}
|
bpf, cgroup: Assign cgroup in cgroup_sk_alloc when called from interrupt
If cgroup_sk_alloc() is called from interrupt context, then just assign the
root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups:
Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later
on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and
iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather
than re-adding the NULL-test to the fast-path we can just assign it once from
cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from
NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp
directly does /not/ change behavior for callers of sock_cgroup_ptr().
syzkaller was able to trigger a splat in the legacy netrom code base, where
the RX handler in nr_rx_frame() calls nr_make_new() which calls sk_alloc()
and therefore cgroup_sk_alloc() with in_interrupt() condition. Thus the NULL
skcd->cgroup, where it trips over on cgroup_sk_free() side given it expects
a non-NULL object. There are a few other candidates aside from netrom which
have similar pattern where in their accept-like implementation, they just call
to sk_alloc() and thus cgroup_sk_alloc() instead of sk_clone_lock() with the
corresponding cgroup_sk_clone() which then inherits the cgroup from the parent
socket. None of them are related to core protocols where BPF cgroup programs
are running from. However, in future, they should follow to implement a similar
inheritance mechanism.
Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID
configuration, the same issue was exposed also prior to 8520e224f547 due to
commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated
cgroup") which added the early in_interrupt() return back then.
Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode")
Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup")
Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com
Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net
2021-09-27 12:39:20 +00:00
|
|
|
out:
|
|
|
|
skcd->cgroup = cgroup;
|
|
|
|
cgroup_bpf_get(cgroup);
|
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 22:38:53 +00:00
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
|
cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.
sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.
The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
This bug seems to be introduced since the beginning, commit
d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit 090e28b229af
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02 18:52:56 +00:00
|
|
|
void cgroup_sk_clone(struct sock_cgroup_data *skcd)
|
|
|
|
{
|
bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.
Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.
In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.
Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.
[0] https://github.com/nestybox/sysbox/
[1] https://kind.sigs.k8s.io/
Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
2021-09-13 23:07:57 +00:00
|
|
|
struct cgroup *cgrp = sock_cgroup_ptr(skcd);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We might be cloning a socket which is left in an empty
|
|
|
|
* cgroup and the cgroup might have already been rmdir'd.
|
|
|
|
* Don't use cgroup_get_live().
|
|
|
|
*/
|
|
|
|
cgroup_get(cgrp);
|
|
|
|
cgroup_bpf_get(cgrp);
|
cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.
sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.
The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
This bug seems to be introduced since the beginning, commit
d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit 090e28b229af
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-02 18:52:56 +00:00
|
|
|
}
|
|
|
|
|
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 22:38:53 +00:00
|
|
|
void cgroup_sk_free(struct sock_cgroup_data *skcd)
|
|
|
|
{
|
bpf: decouple the lifetime of cgroup_bpf from cgroup itself
Currently the lifetime of bpf programs attached to a cgroup is bound
to the lifetime of the cgroup itself. It means that if a user
forgets (or intentionally avoids) to detach a bpf program before
removing the cgroup, it will stay attached up to the release of the
cgroup. Since the cgroup can stay in the dying state (the state
between being rmdir()'ed and being released) for a very long time, it
leads to a waste of memory. Also, it blocks a possibility to implement
the memcg-based memory accounting for bpf objects, because a circular
reference dependency will occur. Charged memory pages are pinning the
corresponding memory cgroup, and if the memory cgroup is pinning
the attached bpf program, nothing will be ever released.
A dying cgroup can not contain any processes, so the only chance for
an attached bpf program to be executed is a live socket associated
with the cgroup. So in order to release all bpf data early, let's
count associated sockets using a new percpu refcounter. On cgroup
removal the counter is transitioned to the atomic mode, and as soon
as it reaches 0, all bpf programs are detached.
Because cgroup_bpf_release() can block, it can't be called from
the percpu ref counter callback directly, so instead an asynchronous
work is scheduled.
The reference counter is not socket specific, and can be used for any
other types of programs, which can be executed from a cgroup-bpf hook
outside of the process context, had such a need arise in the future.
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: jolsa@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-25 16:37:39 +00:00
|
|
|
struct cgroup *cgrp = sock_cgroup_ptr(skcd);
|
|
|
|
|
|
|
|
cgroup_bpf_put(cgrp);
|
|
|
|
cgroup_put(cgrp);
|
sock, cgroup: add sock->sk_cgroup
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-07 22:38:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
#endif /* CONFIG_SOCK_CGROUP_DATA */
|
|
|
|
|
2017-11-06 18:30:28 +00:00
|
|
|
#ifdef CONFIG_SYSFS
|
|
|
|
static ssize_t show_delegatable_files(struct cftype *files, char *buf,
|
|
|
|
ssize_t size, const char *prefix)
|
|
|
|
{
|
|
|
|
struct cftype *cft;
|
|
|
|
ssize_t ret = 0;
|
|
|
|
|
|
|
|
for (cft = files; cft && cft->name[0] != '\0'; cft++) {
|
|
|
|
if (!(cft->flags & CFTYPE_NS_DELEGATABLE))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (prefix)
|
|
|
|
ret += snprintf(buf + ret, size - ret, "%s.", prefix);
|
|
|
|
|
|
|
|
ret += snprintf(buf + ret, size - ret, "%s\n", cft->name);
|
|
|
|
|
2018-11-04 02:27:41 +00:00
|
|
|
if (WARN_ON(ret >= size))
|
2017-11-06 18:30:28 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static ssize_t delegate_show(struct kobject *kobj, struct kobj_attribute *attr,
|
|
|
|
char *buf)
|
|
|
|
{
|
|
|
|
struct cgroup_subsys *ss;
|
|
|
|
int ssid;
|
|
|
|
ssize_t ret = 0;
|
|
|
|
|
2022-09-06 19:38:55 +00:00
|
|
|
ret = show_delegatable_files(cgroup_base_files, buf + ret,
|
|
|
|
PAGE_SIZE - ret, NULL);
|
|
|
|
if (cgroup_psi_enabled())
|
|
|
|
ret += show_delegatable_files(cgroup_psi_files, buf + ret,
|
|
|
|
PAGE_SIZE - ret, NULL);
|
2017-11-06 18:30:28 +00:00
|
|
|
|
|
|
|
for_each_subsys(ss, ssid)
|
|
|
|
ret += show_delegatable_files(ss->dfl_cftypes, buf + ret,
|
|
|
|
PAGE_SIZE - ret,
|
|
|
|
cgroup_subsys_name[ssid]);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
|
|
|
|
|
2017-11-06 18:30:29 +00:00
|
|
|
static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
|
|
|
|
char *buf)
|
|
|
|
{
|
mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:07:07 +00:00
|
|
|
return snprintf(buf, PAGE_SIZE,
|
|
|
|
"nsdelegate\n"
|
2022-07-23 14:28:28 +00:00
|
|
|
"favordynmods\n"
|
mm: memcontrol: recursive memory.low protection
Right now, the effective protection of any given cgroup is capped by its
own explicit memory.low setting, regardless of what the parent says. The
reasons for this are mostly historical and ease of implementation: to make
delegation of memory.low safe, effective protection is the min() of all
memory.low up the tree.
Unfortunately, this limitation makes it impossible to protect an entire
subtree from another without forcing the user to make explicit protection
allocations all the way to the leaf cgroups - something that is highly
undesirable in real life scenarios.
Consider memory in a data center host. At the cgroup top level, we have a
distinction between system management software and the actual workload the
system is executing. Both branches are further subdivided into individual
services, job components etc.
We want to protect the workload as a whole from the system management
software, but that doesn't mean we want to protect and prioritize
individual workload wrt each other. Their memory demand can vary over
time, and we'd want the VM to simply cache the hottest data within the
workload subtree. Yet, the current memory.low limitations force us to
allocate a fixed amount of protection to each workload component in order
to get protection from system management software in general. This
results in very inefficient resource distribution.
Another concern with mandating downward allocation is that, as the
complexity of the cgroup tree grows, it gets harder for the lower levels
to be informed about decisions made at the host-level. Consider a
container inside a namespace that in turn creates its own nested tree of
cgroups to run multiple workloads. It'd be extremely difficult to
configure memory.low parameters in those leaf cgroups that on one hand
balance pressure among siblings as the container desires, while also
reflecting the host-level protection from e.g. rpm upgrades, that lie
beyond one or more delegation and namespacing points in the tree.
It's highly unusual from a cgroup interface POV that nested levels have to
be aware of and reflect decisions made at higher levels for them to be
effective.
To enable such use cases and scale configurability for complex trees, this
patch implements a resource inheritance model for memory that is similar
to how the CPU and the IO controller implement work-conserving resource
allocations: a share of a resource allocated to a subree always applies to
the entire subtree recursively, while allowing, but not mandating,
children to further specify distribution rules.
That means that if protection is explicitly allocated among siblings,
those configured shares are being followed during page reclaim just like
they are now. However, if the memory.low set at a higher level is not
fully claimed by the children in that subtree, the "floating" remainder is
applied to each cgroup in the tree in proportion to its size. Since
reclaim pressure is applied in proportion to size as well, each child in
that tree gets the same boost, and the effect is neutral among siblings -
with respect to each other, they behave as if no memory control was
enabled at all, and the VM simply balances the memory demands optimally
within the subtree. But collectively those cgroups enjoy a boost over the
cgroups in neighboring trees.
E.g. a leaf cgroup with a memory.low setting of 0 no longer means that
it's not getting a share of the hierarchically assigned resource, just
that it doesn't claim a fixed amount of it to protect from its siblings.
This allows us to recursively protect one subtree (workload) from another
(system management), while letting subgroups compete freely among each
other - without having to assign fixed shares to each leaf, and without
nested groups having to echo higher-level settings.
The floating protection composes naturally with fixed protection.
Consider the following example tree:
A A: low = 2G
/ \ A1: low = 1G
A1 A2 A2: low = 0G
As outside pressure is applied to this tree, A1 will enjoy a fixed
protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
evenly among A1 and A2, coming out to 1.5G and 0.5G.
There is a slight risk of regressing theoretical setups where the
top-level cgroups don't know about the true budgeting and set bogusly high
"bypass" values that are meaningfully allocated down the tree. Such
setups would rely on unclaimed protection to be discarded, and
distributing it would change the intended behavior. Be safe and hide the
new behavior behind a mount option, 'memory_recursiveprot'.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-02 04:07:07 +00:00
|
|
|
"memory_localevents\n"
|
hugetlb: memcg: account hugetlb-backed memory in memory controller
Currently, hugetlb memory usage is not acounted for in the memory
controller, which could lead to memory overprotection for cgroups with
hugetlb-backed memory. This has been observed in our production system.
For instance, here is one of our usecases: suppose there are two 32G
containers. The machine is booted with hugetlb_cma=6G, and each container
may or may not use up to 3 gigantic page, depending on the workload within
it. The rest is anon, cache, slab, etc. We can set the hugetlb cgroup
limit of each cgroup to 3G to enforce hugetlb fairness. But it is very
difficult to configure memory.max to keep overall consumption, including
anon, cache, slab etc. fair.
What we have had to resort to is to constantly poll hugetlb usage and
readjust memory.max. Similar procedure is done to other memory limits
(memory.low for e.g). However, this is rather cumbersome and buggy.
Furthermore, when there is a delay in memory limits correction, (for e.g
when hugetlb usage changes within consecutive runs of the userspace
agent), the system could be in an over/underprotected state.
This patch rectifies this issue by charging the memcg when the hugetlb
folio is utilized, and uncharging when the folio is freed (analogous to
the hugetlb controller). Note that we do not charge when the folio is
allocated to the hugetlb pool, because at this point it is not owned by
any memcg.
Some caveats to consider:
* This feature is only available on cgroup v2.
* There is no hugetlb pool management involved in the memory
controller. As stated above, hugetlb folios are only charged towards
the memory controller when it is used. Host overcommit management
has to consider it when configuring hard limits.
* Failure to charge towards the memcg results in SIGBUS. This could
happen even if the hugetlb pool still has pages (but the cgroup
limit is hit and reclaim attempt fails).
* When this feature is enabled, hugetlb pages contribute to memory
reclaim protection. low, min limits tuning must take into account
hugetlb memory.
* Hugetlb pages utilized while this option is not selected will not
be tracked by the memory controller (even if cgroup v2 is remounted
later on).
Link: https://lkml.kernel.org/r/20231006184629.155543-4-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tejun heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-06 18:46:28 +00:00
|
|
|
"memory_recursiveprot\n"
|
|
|
|
"memory_hugetlb_accounting\n");
|
2017-11-06 18:30:29 +00:00
|
|
|
}
|
|
|
|
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
|
|
|
|
|
2017-11-06 18:30:28 +00:00
|
|
|
static struct attribute *cgroup_sysfs_attrs[] = {
|
|
|
|
&cgroup_delegate_attr.attr,
|
2017-11-06 18:30:29 +00:00
|
|
|
&cgroup_features_attr.attr,
|
2017-11-06 18:30:28 +00:00
|
|
|
NULL,
|
|
|
|
};
|
|
|
|
|
|
|
|
static const struct attribute_group cgroup_sysfs_attr_group = {
|
|
|
|
.attrs = cgroup_sysfs_attrs,
|
|
|
|
.name = "cgroup",
|
|
|
|
};
|
|
|
|
|
|
|
|
static int __init cgroup_sysfs_init(void)
|
|
|
|
{
|
|
|
|
return sysfs_create_group(kernel_kobj, &cgroup_sysfs_attr_group);
|
|
|
|
}
|
|
|
|
subsys_initcall(cgroup_sysfs_init);
|
2019-05-13 19:37:17 +00:00
|
|
|
|
2017-11-06 18:30:28 +00:00
|
|
|
#endif /* CONFIG_SYSFS */
|