linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-25 21:51:40 +00:00

A mirror of the official Linux kernel repository just in case

Christian Brauner 661ee62809 cgroup: introduce cgroup.kill

Introduce the cgroup.kill file. It does what it says on the tin and allows a caller to kill a cgroup by writing "1" into cgroup.kill. The file is available in non-root cgroups.

Killing cgroups is a process-directed operation, i.e. the whole thread-group is affected. Consequently, trying to write to cgroup.kill in threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior aligns with cgroup.procs, where reads in threaded cgroups are rejected with EOPNOTSUPP.

The cgroup.kill file is write-only since killing a cgroup is an event, not a state, which makes it different from e.g. the freezer, where a cgroup transitions between two states.

As with all new cgroup features, cgroup.kill is recursive by default.

Killing a cgroup is protected against concurrent migrations through the cgroup mutex. To protect against forkbombs and to mitigate the effect of racing forks, a new CGRP_KILL flag, protected by the css set lock, is introduced. It is set prior to killing a cgroup and unset after the cgroup has been killed. We can then check in cgroup_post_fork(), where we already hold the css set lock, whether the cgroup is currently being killed. If so, we send the child a SIGKILL signal immediately, taking it down as soon as it returns to userspace. To make the killing of the child semantically clean, it is killed after all cgroup attachment operations have been finalized.

There are various use-cases of this interface:

- Containers usually have a conservative layout where each container has a delegated cgroup. For such layouts there is a 1:1 mapping between container and cgroup. If the container in addition uses a separate pid namespace, then killing a container usually becomes a simple kill -9 <container-init-pid> from an ancestor pid namespace. However, there are quite a few scenarios where that isn't true. For example, there are containers that purposely share the cgroup with other processes that are supposed to be bound to the lifetime of the container but are not in the same pidns as the container. There are also containers that are in a delegated cgroup but share the pid namespace with the host or other containers.
- Service managers such as systemd use cgroups to group and organize processes belonging to a service. They currently rely on a recursive algorithm to kill a service. With cgroup.kill this becomes a simple write to cgroup.kill.
- Userspace OOM implementations can make good use of this feature to take down whole cgroups quickly.
- The kill program can gain a new kill --cgroup /sys/fs/cgroup/delegated flag to take down cgroups.

A few observations about the semantics:

- If parent and child are in the same cgroup and CLONE_INTO_CGROUP is not specified, we are not taking the cgroup mutex, meaning the cgroup can be killed while a process in that cgroup is forking. If the kill request happens right before cgroup_can_fork() and before the parent grabs its siglock, the parent is guaranteed to see the pending SIGKILL. In addition, we perform another check in cgroup_post_fork() whether the cgroup is being killed and if so take down the child (see above). This is robust enough and protects against forkbombs. If userspace really wants stricter protection, the simple solution would be to grab the write side of the cgroup threadgroup rwsem, which will force all ongoing forks to complete before killing starts. We concluded that this is not necessary, as the semantics for concurrent forking should simply align with the freezer, where a check similar to the one in cgroup_post_fork() is performed. For all other cases CLONE_INTO_CGROUP is required. In this case we will grab the cgroup mutex so the cgroup can't be killed while we fork. Once we're done with the fork and have dropped the cgroup mutex, we are visible and will be found by any subsequent kill request.
- We obviously don't kill kthreads. This means a cgroup that has a kthread will not become empty after killing and consequently no unpopulated event will be generated. The assumption is that kthreads should be in the root cgroup only anyway, so this is not an issue.
- We skip killing tasks that already have pending fatal signals.
- The freezer doesn't care about tasks in different pid namespaces, i.e. if you have two tasks in different pid namespaces the cgroup would still be frozen. The cgroup.kill mechanism consequently behaves the same way, i.e. we kill all processes and ignore which pid namespace they exist in.
- If the caller is located in a cgroup that is killed, the caller will obviously be killed as well.

Link: https://lore.kernel.org/r/20210503143922.3093755-1-brauner@kernel.org
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-05-10 10:41:10 -04:00
arch	Handle power-gating of AMD IOMMU perf counters properly when they are used.	2021-05-09 13:00:26 -07:00
block	block-5.13-2021-05-09	2021-05-09 13:25:14 -07:00
certs	Kbuild updates for v5.13 (2nd)	2021-05-08 10:00:11 -07:00
crypto	for-5.13/drivers-2021-04-27	2021-04-28 14:39:37 -07:00
Documentation	A set of scheduler updates:	2021-05-09 13:14:34 -07:00
drivers	fbmem: fix horribly incorrect placement of __maybe_unused	2021-05-09 14:03:33 -07:00
fs	3 small SMB3 chmultichannel related changesets (also for stable)	2021-05-09 13:19:29 -07:00
include	cgroup: introduce cgroup.kill	2021-05-10 10:41:10 -04:00
init	Merge branch 'akpm' (patches from Andrew)	2021-05-07 00:34:51 -07:00
ipc	ipc/sem.c: spelling fix	2021-05-07 00:26:34 -07:00
kernel	cgroup: introduce cgroup.kill	2021-05-10 10:41:10 -04:00
lib	Kbuild updates for v5.13 (2nd)	2021-05-08 10:00:11 -07:00
LICENSES	LICENSES: Add the CC-BY-4.0 license	2020-12-08 10:33:27 -07:00
mm	mm: fix typos in comments	2021-05-07 00:26:35 -07:00
net	Networking fixes for 5.13-rc1, including fixes from bpf, can	2021-05-08 08:31:46 -07:00
samples	Kbuild updates for v5.13 (2nd)	2021-05-08 10:00:11 -07:00
scripts	Kbuild updates for v5.13 (2nd)	2021-05-08 10:00:11 -07:00
security	Simple code cleanup	2021-05-05 12:08:06 -07:00
sound	sound fixes for 5.13-rc1	2021-05-07 11:40:18 -07:00
tools	Kbuild updates for v5.13 (2nd)	2021-05-08 10:00:11 -07:00
usr	.gitignore: prefix local generated files with a slash	2021-05-02 00:43:35 +09:00
virt	KVM: Boost vCPU candidate in user mode which is delivering interrupt	2021-04-21 12:20:03 -04:00
.clang-format	cxl for 5.12	2021-02-24 09:38:36 -08:00
.cocciconfig
.get_maintainer.ignore	Opt out of scripts/get_maintainer.pl	2019-05-16 10:53:40 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	.gitignore: ignore only top-level modules.builtin	2021-05-02 00:43:35 +09:00
.mailmap	It's been a relatively busy cycle in docsland, though more than usually	2021-04-26 13:22:43 -07:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: move Murali Karicheri to credits	2021-04-29 15:47:30 -07:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	kbuild: ensure full rebuild when the compiler is updated	2020-05-12 13:28:33 +09:00
MAINTAINERS	Networking fixes for 5.13-rc1, including fixes from bpf, can	2021-05-08 08:31:46 -07:00
Makefile	Linux 5.13-rc1	2021-05-09 14:17:44 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result from upgrading your kernel.