Factor out css enabling and showing into cgroup_apply_control_enable().
* Nest subsystem walk inside child walk. The child walk will later be
converted to subtree walk which is a bit more expensive.
* Instead of operating on the differential masks @css_enable, simply
enable or show csses according to the current cgroup_control() and
cgroup_ss_mask(). This leads to the same result and is simpler and
more robust.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Factor out css disabling and hiding into cgroup_apply_control_disable().
* Nest subsystem walk inside child walk. The child walk will later be
converted to subtree walk which is a bit more expensive.
* Instead of operating on the differential masks @css_enable and
@css_disable, simply disable or hide csses according to the current
cgroup_control() and cgroup_ss_mask(). This leads to the same
result and is simpler and more robust.
* This allows error handling path to share the same code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Factor out async css offline draining into cgroup_drain_offline().
* Nest subsystem walk inside child walk. The child walk will later be
converted to subtree walk which is a bit more expensive.
* Relocate the draining above subsystem mask preparation, which
doesn't create any behavior differences but helps further
refactoring.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
When a controller is enabled and visible on a non-root cgroup is
determined by subtree_control and subtree_ss_mask of the parent
cgroup. For a root cgroup, by the type of the hierarchy and which
controllers are attached to it. Deciding the above on each usage is
fragile and unnecessarily complicates the users.
This patch introduces cgroup_control() and cgroup_ss_mask() which
calculate and return the [visibly] enabled subsyste mask for the
specified cgroup and conver the existing usages.
* cgroup_e_css() is restructured for simplicity.
* cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
longer need to distinguish root and non-root cases.
* With cgroup_control(), cgroup_controllers_show() can now handle both
root and non-root cases. cgroup_root_controllers_show() is removed.
v2: cgroup_control() updated to yield the correct result on v1
hierarchies too. cgroup_subtree_control_write() converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
We're in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs. This patch factors out cgroup_create() out of
cgroup_mkdir(). cgroup_create() contains all internal object creation
and initialization. cgroup_mkdir() uses cgroup_create() to create the
internal cgroup and adds interface directory and file creation.
This patch doesn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Currently, operations to initialize internal objects and create
interface directory and files are intermixed in cgroup_mkdir(). We're
in the process of refactoring cgroup and css management paths to
separate them out to eventually allow cgroups which aren't visible
through cgroup fs.
This patch reorders operations inside cgroup_mkdir() so that interface
directory and file handling comes after internal object
initialization. This will enable further refactoring.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Currently, whether a css (cgroup_subsys_state) has its interface files
created is not tracked and assumed to change together with the owning
cgroup's lifecycle. cgroup directory and interface creation is being
separated out from internal object creation to help refactoring and
eventually allow cgroups which are not visible through cgroupfs.
This patch adds CSS_VISIBLE to track whether a css has its interface
files created and perform management operations only when necessary
which helps decoupling interface file handling from internal object
lifecycle. After this patch, all css interface file management
functions can be called regardless of the current state and will
achieve the expected result.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Currently, interface files are created when a css is created depending
on whether @visible is set. This patch separates out the two into
separate steps to help code refactoring and eventually allow cgroups
which aren't visible through cgroup fs.
Move css_populate_dir() out of create_css() and drop @visible. While
at it, rename the function to css_create() for consistency.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
During task migration, tasks may transfer between two css_sets which
are associated with the same cgroup. If those tasks are the only
tasks in the cgroup, this currently triggers a spurious de-populated
event on the cgroup.
Fix it by bumping up populated count before bumping it down during
migration to ensure that it doesn't reach zero spuriously.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
css_sets are hashed by their subsys[] contents and in cgroup_init()
init_css_set is hashed early, before subsystem inits, when all entries
in its subsys[] are NULL, so that cgroup_dfl_root initialization can
find and link to it. As subsystems are initialized,
init_css_set.subsys[] is filled up but the hashing is never updated
making init_css_set hashed in the wrong place. While incorrect, this
doesn't cause a critical failure as css_set management code would
create an identical css_set dynamically.
Fix it by rehashing init_css_set after subsystems are initialized.
While at it, drop unnecessary @key local variable.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
An associated css can be around for quite a while after a cgroup
directory has been removed. In general, it makes sense to reset it to
defaults so as not to worry about any remnants. For instance, memory
cgroup needs to reset memory.low, otherwise pages charged to a dead
cgroup might never get reclaimed. There's ->css_reset callback, which
would fit perfectly for the purpose. Currently, it's only called when a
subsystem is disabled in the unified hierarchy and there are other
subsystems dependant on it. Let's call it on css destruction as well.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is a mistake about the print format name:id <--> %d:%s, which
the name is 'char *' type and id is 'int' type. Change "name:id" to
"id:name" instead to be consistent with "cgroup_subsys %d:%s".
Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No internal process rule is enforced by cgroup_migrate_prepare_dst()
during process migration. It tests whether the target cgroup's
->child_subsys_mask is zero which is different from "subtree_control"
write path which tests ->subtree_control. This hasn't mattered
because up until now, both ->child_subsys_mask and ->subtree_control
are zero or non-zero at the same time. However, with the planned
addition of implicit controllers, this will no longer be true.
This patch prepares for the change by making
cgorup_migrate_prepare_dst() test ->subtree_control instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
The function currently returns -EBADF for a directory on the default
hierarchy. Make it also recognize cgroup2_fs_type. This will be used
for perf_event cgroup2 support.
Signed-off-by: Tejun Heo <tj@kernel.org>
After the recent do_each_subsys_mask() conversion, there's no reason
to use ulong for subsystem masks. We'll be adding more subsystem
masks to persistent data structures, let's reduce its size to u16
which should be enough for now and the foreseeable future.
This doesn't create any noticeable behavior differences.
v2: Johannes spotted that the initial patch missed cgroup_no_v1_mask.
Converted.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
There are several places in cgroup_subtree_control_write() which can
use do_each_subsys_mask() instead of manual mask testing. Use it.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
for_each_subsys_which() allows iterating subsystems specified in a
subsystem bitmask; unfortunately, it requires the mask to be an
unsigned long l-value which can be inconvenient and makes it awkward
to use a smaller type for subsystem masks.
This patch converts for_each_subsy_which() to do-while style which
allows it to drop the l-value requirement. The new iterator is named
do_each_subsys_mask() / while_each_subsys_mask().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
This reverts commit 56c807ba4e.
cgroup_subsys->css_e_css_changed() was supposed to be used by cgroup
writeback support; however, the change to per-inode cgroup association
made it unnecessary and the callback doesn't have any user. Remove
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
cgroup_addrm_files() incorrectly returned 0 after add failure. Fix
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Testing cgroup2 can be painful with system software automatically
mounting and populating all cgroup controllers in v1 mode. Sometimes
they can be unmounted from rc.local, sometimes even that is too late.
Provide a commandline option to disable certain controllers in v1
mounts, so that they remain available for cgroup2 mounts.
Example use:
cgroup_no_v1=memory,cpu
cgroup_no_v1=all
Disabling will be confirmed at boot-time as such:
[ 0.013770] Disabling cpu control group subsystem in v1 mounts
[ 0.016004] Disabling memory control group subsystem in v1 mounts
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free(). Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children. This behavior is unexpected and led to bugs in cpu and
memory controller.
The previous patch updated ordering for css_offline() which fixes the
cpu controller issue. While there currently isn't a known bug caused
by misordering of css_free() invocations, let's fix it too for
consistency.
css_free() ordering can be trivially fixed by moving putting of the
parent css below css_free() invocation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free(). Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children. This behavior is unexpected and led to bugs in cpu and
memory controller.
This patch updates offline path so that a parent css is never offlined
before its children. Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.
This fixes the memory controller bug and allows the fix for cpu
controller.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
If "cpuset.memory_migrate" is set, when a process is moved from one
cpuset to another with a different memory node mask, pages in used by
the process are migrated to the new set of nodes. This was performed
synchronously in the ->attach() callback, which is synchronized
against process management. Recently, the synchronization was changed
from per-process rwsem to global percpu rwsem for simplicity and
optimization.
Combined with the synchronous mm migration, this led to deadlocks
because mm migration could schedule a work item which may in turn try
to create a new worker blocking on the process management lock held
from cgroup process migration path.
This heavy an operation shouldn't be performed synchronously from that
deep inside cgroup migration in the first place. This patch punts the
actual migration to an ordered workqueue and updates cgroup process
migration and cpuset config update paths to flush the workqueue after
all locks are released. This way, the operations still seem
synchronous to userland without entangling mm migration with process
management synchronization. CPU hotplug can also invoke mm migration
but there's no reason for it to wait for mm migrations and thus
doesn't synchronize against their completions.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
Pull cgroup updates from Tejun Heo:
- cgroup v2 interface is now official. It's no longer hidden behind a
devel flag and can be mounted using the new cgroup2 fs type.
Unfortunately, cpu v2 interface hasn't made it yet due to the
discussion around in-process hierarchical resource distribution and
only memory and io controllers can be used on the v2 interface at the
moment.
- The existing documentation which has always been a bit of mess is
relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
is added as the authoritative documentation for the v2 interface.
- Some features are added through for-4.5-ancestor-test branch to
enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
netfilter changes will be merged through the net tree which pulled in
the said branch.
- Various cleanups
* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rename cgroup documentations
cgroup: fix a typo.
cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
cgroup: demote subsystem init messages to KERN_DEBUG
cgroup: Fix uninitialized variable warning
cgroup: put controller Kconfig options in meaningful order
cgroup: clean up the kernel configuration menu nomenclature
cgroup_pids: fix a typo.
Subject: cgroup: Fix incomplete dd command in blkio documentation
cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
cpuset: Replace all instances of time_t with time64_t
cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <davem@davemloft.net>
bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup") added global
spinlock cgroup_sk_update_lock but erroneously skipped initializer
leading to uninitialized spinlock warning. Fix it by using
DEFINE_SPINLOCK().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dexuan Cui <decui@microsoft.com>
Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: David S. Miller <davem@davemloft.net>
In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound. As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.
net_cls and net_prio controllers are examples of the latter. They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.
Both net_cls and net_prio aren't properly hierarchical. Both inherit
configuration from the parent on creation but there's no interaction
afterwards. An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level. net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.
While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.
In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used. Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid. This is to avoid adding yet another cgroup related field
to struct sock.
As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead. It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs. Non-critical inaccuracies from small race windows won't
make any noticeable difference.
This patch doesn't make use of the pointer yet. The following patch
will implement netfilter match for cgroup2 membership.
v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
cgroup specific field.
v3: Add comments explaining why sock_data_prioidx() and
sock_data_classid() use different fallback values.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses. pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated. All controllers are
updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's single source
and destination and thus doesn't support v2 hierarchy already. The
only change made by this patchset is how that single destination css
is obtained.
* memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated and the prep stage of
mem_cgroup_can_attach() is reordered to accomodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
A css_set represents the relationship between a set of tasks and
css's. css_set never pinned the associated css's. This was okay
because tasks used to always disassociate immediately (in RCU sense) -
either a task is moved to a different css_set or exits and never
accesses css_set again.
Unfortunately, afcf6c8b75 ("cgroup: add cgroup_subsys->free() method
and use it to fix pids controller") and patches leading up to it made
a zombie hold onto its css_set and deref the associated css's on its
release. Nothing pins the css's after exit and it might have already
been freed leading to use-after-free.
general protection fault: 0000 [#1] PREEMPT SMP
task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
...
Call Trace:
<IRQ>
[<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
[<ffffffff810f8893>] cgroup_free+0x53/0xe0
[<ffffffff8104ed62>] __put_task_struct+0x42/0x130
[<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
[<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
[<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
[<ffffffff81056e54>] __do_softirq+0xd4/0x460
[<ffffffff81057369>] irq_exit+0x89/0xa0
[<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
[<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
<EOI>
...
Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
RSP <ffff88001fc03e20>
---[ end trace 89a4a4b916b90c49 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
Fix it by making css_set pin the associate css's until its release.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
Fixes: afcf6c8b75 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
Implement cgroup_get_from_path() using kernfs_walk_and_get() which
obtains a default hierarchy cgroup from its path. This will be used
to allow cgroup path based matching from outside cgroup proper -
e.g. networking and perf.
v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).
Signed-off-by: Tejun Heo <tj@kernel.org>
cgroup_is_descendant() currently walks up the hierarchy and compares
each ancestor to the cgroup in question. While enough for cgroup core
usages, this can't be used in hot paths to test cgroup membership.
This patch adds cgroup->ancestor_ids[] which records the IDs of all
ancestors including self and cgroup->level for the nesting level.
This allows testing whether a given cgroup is a descendant of another
in three finite steps - testing whether the two belong to the same
hierarchy, whether the descendant candidate is at the same or a higher
level than the ancestor and comparing the recorded ancestor_id at the
matching level. cgroup_is_descendant() is accordingly reimplmented
and made inline.
Signed-off-by: Tejun Heo <tj@kernel.org>
With major controllers - cpu, memory and io - shaping up for the
unified hierarchy, cgroup2 is about ready to be, gradually, released
into the wild. Replace __DEVEL__sane_behavior flag which was used to
select the unified hierarchy with a separate filesystem type "cgroup2"
so that unified hierarchy can be mounted as follows.
mount -t cgroup2 none $MOUNT_POINT
The cgroup2 fs has its own magic number - 0x63677270 ("cgrp").
v2: Assign a different magic number to cgroup2 fs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
6f60eade24 ("cgroup: generalize obtaining the handles of and
notifying cgroup files") introduced cftype->file_offset so that the
handles for per-css file instances can be recorded. These handles
then can be used, for example, to generate file modified
notifications.
Unfortunately, it made the wrong assumption that files are created
once for a given css and removed on its destruction. Due to the
dependencies among subsystems, a css may be hidden from userland and
then later shown again. This is implemented by removing and
re-creating the affected files, so the associated kernfs_node for a
given cgroup file may change over time. This incorrect assumption led
to the corruption of css->files lists.
Reimplement cftype->file_offset handling so that cgroup_file->kn is
protected by a lock and updated as files are created and destroyed.
This also makes keeping them on per-cgroup list unnecessary.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: James Sedgwick <jsedgwick@fb.com>
Fixes: 6f60eade24 ("cgroup: generalize obtaining the handles of and notifying cgroup files")
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zefan Li <lizefan@huawei.com>
__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts. They are expected to be high priority and
have access one of two watermarks lower than "min" which can be referred
to as the "atomic reserve". __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".
Over time, callers had a requirement to not block when fallback options
were available. Some have abused __GFP_WAIT leading to a situation where
an optimisitic allocation with a fallback option can access atomic
reserves.
This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify
callers that want to wake kswapd for background reclaim. __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.
This patch then converts a number of sites
o __GFP_ATOMIC is used by callers that are high priority and have memory
pools for those requests. GFP_ATOMIC uses this flag.
o Callers that have a limited mempool to guarantee forward progress clear
__GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
into this category where kswapd will still be woken but atomic reserves
are not used as there is a one-entry mempool to guarantee progress.
o Callers that are checking if they are non-blocking should use the
helper gfpflags_allow_blocking() where possible. This is because
checking for __GFP_WAIT as was done historically now can trigger false
positives. Some exceptions like dm-crypt.c exist where the code intent
is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
flag manipulations.
o Callers that built their own GFP flags instead of starting with GFP_KERNEL
and friends now also need to specify __GFP_KSWAPD_RECLAIM.
The first key hazard to watch out for is callers that removed __GFP_WAIT
and was depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.
The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
css_task_iter_next() checked @it->cur_task before grabbing
css_set_lock and assumed that the result won't change afterwards;
however, tasks could leave the cgroup being iterated terminating the
iterator before css_task_lock is acquired. If this happens,
css_task_iter_next() tries to calculate the current task from NULL
cg_list pointer leading to the following oops.
BUG: unable to handle kernel paging request at fffffffffffff7d0
IP: [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
...
CPU: 4 PID: 6391 Comm: JobQDisp2 Not tainted 4.0.9-22_fbk4_rc3_81616_ge8d9cb6 #1
Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B08 03/04/2014
task: ffff880868e46400 ti: ffff88083404c000 task.ti: ffff88083404c000
RIP: 0010:[<ffffffff810d5f22>] [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
RSP: 0018:ffff88083404fd28 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88083404fd68 RCX: ffff8804697fb8b0
RDX: fffffffffffff7c0 RSI: ffff8803b7dff800 RDI: ffffffff822c0278
RBP: ffff88083404fd38 R08: 0000000000017160 R09: ffff88046f4070c0
R10: ffffffff810d61f7 R11: 0000000000000293 R12: ffff880863bf8400
R13: ffff88046b87fd80 R14: 0000000000000000 R15: ffff88083404fe58
FS: 00007fa0567e2700(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffff7d0 CR3: 0000000469568000 CR4: 00000000001406e0
Stack:
0000000000000246 0000000000000000 ffff88083404fde8 ffffffff810d6248
ffff88083404fd68 0000000000000000 ffff8803b7dff800 000001ef000001ee
0000000000000000 0000000000000000 ffff880863bf8568 0000000000000000
Call Trace:
[<ffffffff810d6248>] cgroup_pidlist_start+0x258/0x550
[<ffffffff810cf66d>] cgroup_seqfile_start+0x1d/0x20
[<ffffffff8121f8ef>] kernfs_seq_start+0x5f/0xa0
[<ffffffff811cab76>] seq_read+0x166/0x380
[<ffffffff812200fd>] kernfs_fop_read+0x11d/0x180
[<ffffffff811a7398>] __vfs_read+0x18/0x50
[<ffffffff811a745d>] vfs_read+0x8d/0x150
[<ffffffff811a756f>] SyS_read+0x4f/0xb0
[<ffffffff818d4772>] system_call_fastpath+0x12/0x17
Fix it by moving the termination condition check inside css_set_lock.
@it->cur_task is now cleared after being put and @it->task_pos is
tested for termination instead of @it->cset_pos as they indicate the
same condition and @it->task_pos is what's being dereferenced.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Calvin Owens <calvinowens@fb.com>
Fixes: ed27b9f7a1 ("cgroup: don't hold css_set_rwsem across css task iteration")
Acked-by: Zefan Li <lizefan@huawei.com>
Now that interfaces for the major three controllers - cpu, memory, io
- are shaping up, there's no reason to have an option to force legacy
files to show up on the unified hierarchy for testing. Drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
The init sequence shouldn't fail short of bugs and even when it does
it's better to continue with the rest of initialization and we were
silently ignoring /proc/cgroups creation failure.
Drop the explicit error handling and wrap sysfs_create_mount_point(),
register_filesystem() and proc_create() with WARN_ON()s.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
pids controller is completely broken in that it uncharges when a task
exits allowing zombies to escape resource control. With the recent
updates, cgroup core now maintains cgroup association till task free
and pids controller can be fixed by uncharging on free instead of
exit.
This patch adds cgroup_subsys->free() method and update pids
controller to use it instead of ->exit() for uncharging.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
cgroup_exit() is called when a task exits and disassociates the
exiting task from its cgroups and half-attach it to the root cgroup.
This is unnecessary and undesirable.
No controller actually needs an exiting task to be disassociated with
non-root cgroups. Both cpu and perf_event controllers update the
association to the root cgroup from their exit callbacks just to keep
consistent with the cgroup core behavior.
Also, this disassociation makes it difficult to track resources held
by zombies or determine where the zombies came from. Currently, pids
controller is completely broken as it uncharges on exit and zombies
always escape the resource restriction. With cgroup association being
reset on exit, fixing it is pretty painful.
There's no reason to reset cgroup membership on exit. The zombie can
be removed from its css_set so that it doesn't show up on
"cgroup.procs" and thus can't be migrated or interfere with cgroup
removal. It can still pin and point to the css_set so that its cgroup
membership is maintained. This patch makes cgroup core keep zombies
associated with their cgroups at the time of exit.
* Previous patches decoupled populated_cnt tracking from css_set
lifetime, so a dying task can be simply unlinked from its css_set
while pinning and pointing to the css_set. This keeps css_set
association from task side alive while hiding it from "cgroup.procs"
and populated_cnt tracking. The css_set reference is dropped when
the task_struct is freed.
* ->exit() callback no longer needs the css arguments as the
associated css never changes once PF_EXITING is set. Removed.
* cpu and perf_events controllers no longer need ->exit() callbacks.
There's no reason to explicitly switch away on exit. The final
schedule out is enough. The callbacks are removed.
* On traditional hierarchies, nothing changes. "/proc/PID/cgroup"
still reports "/" for all zombies. On the default hierarchy,
"/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
to at the time of exit. If the cgroup gets removed before the task
is reaped, " (deleted)" is appended.
v2: Build brekage due to missing dummy cgroup_free() when
!CONFIG_CGROUP fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
css_set_rwsem is the inner lock protecting css_sets and is accessed
from hot paths such as fork and exit. Internally, it has no reason to
be a rwsem or even mutex. There are no internal blocking operations
while holding it. This was rwsem because css task iteration used to
expose it to external iterator users. As the previous patch updated
css task iteration such that the locking is not leaked to its users,
there's no reason to keep it a rwsem.
This patch converts css_set_rwsem to a spinlock and rename it to
css_set_lock. It uses bh-safe operations as a planned usage needs to
access it from RCU callback context.
Signed-off-by: Tejun Heo <tj@kernel.org>
css_sets are synchronized through css_set_rwsem but the locking scheme
is kinda bizarre. The hot paths - fork and exit - have to write lock
the rwsem making the rw part pointless; furthermore, many readers
already hold cgroup_mutex.
One of the readers is css task iteration. It read locks the rwsem
over the entire duration of iteration. This leads to silly locking
behavior. When cpuset tries to migrate processes of a cgroup to a
different NUMA node, css_set_rwsem is held across the entire migration
attempt which can take a long time locking out forking, exiting and
other cgroup operations.
This patch updates css task iteration so that it locks css_set_rwsem
only while the iterator is being advanced. css task iteration
involves two levels - css_set and task iteration. As css_sets in use
are practically immutable, simply pinning the current one is enough
for resuming iteration afterwards. Task iteration is tricky as tasks
may leave their css_set while iteration is in progress. This is
solved by keeping track of active iterators and advancing them if
their next task leaves its css_set.
v2: put_task_struct() in css_task_iter_next() moved outside
css_set_rwsem. A later patch will add cgroup operations to
task_struct free path which may grab the same lock and this avoids
deadlock possibilities.
css_set_move_task() updated to use list_for_each_entry_safe() when
walking task_iters and advancing them. This is necessary as
advancing an iter may remove it from the list.
Signed-off-by: Tejun Heo <tj@kernel.org>
* Rename css_advance_task_iter() to css_task_iter_advance_css_set()
and make it clear it->task_pos too at the end of the iteration.
* Factor out css_task_iter_advance() from css_task_iter_next(). The
new function whines if called on a terminated iterator.
Except for the termination check, this is pure reorganization and
doesn't introduce any behavior changes. This will help the planned
locking update for css_task_iter.
Signed-off-by: Tejun Heo <tj@kernel.org>
A task is associated and disassociated with its css_set in three
places - during migration, after a new task is created and when a task
exits. The first is handled by cgroup_task_migrate() and the latter
two are open-coded.
These are similar operations and spreading them over multiple places
makes it harder to follow and update. This patch collects all task
css_set [dis]association operations into css_set_move_task().
While css_set_move_task() may check whether populated state needs to
be updated when not strictly necessary, the behavior is essentially
equivalent before and after this patch.
Signed-off-by: Tejun Heo <tj@kernel.org>