linux

mainlining shenanigans

Christian Brauner 303cc571d1 nsproxy: attach to namespaces via pidfds

For quite a while we have been thinking about using pidfds to attach to namespaces. This patchset has existed for about a year already but we've wanted to wait to see how the general API would be received and adopted. Now that more and more programs in userspace have started using pidfds for process management it's time to send this one out.

This patch makes it possible to use pidfds to attach to the namespaces of another process, i.e. they can be passed as the first argument to the setns() syscall. When only a single namespace type is specified the semantics are equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, when a pidfd is passed, multiple namespace flags can be specified in the second setns() argument and setns() will attach the caller to all the specified namespaces at once or to none of them. Specifying 0 is not valid together with a pidfd. Here are just two obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

Allowing callers to also attach to subsets of namespaces supports various use-cases where callers setns to a subset of namespaces to retain privilege, perform an action and then re-attach to another subset of namespaces.

If the need arises, as Eric suggested, we can extend this patchset to assume even more context than just attaching all namespaces. His suggestion specifically was about assuming the process' root directory when setns(pidfd, 0) or setns(pidfd, SETNS_PIDFD) is specified. For now, just keep it flexible in terms of supporting subsets of namespaces but let's wait until we have users asking for even more context to be assumed. At that point we can add an extension.
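The call pattern described above can be sketched from userspace roughly as follows. This is a minimal illustration, not code from the patch. The helper names `pidfd_setns_flags_ok()` and `attach_via_pidfd()` and the `ALL_NS_FLAGS` mask are hypothetical. The userspace pre-check only mirrors the rule stated in the commit message (flags must be non-zero and name namespaces only); the kernel performs its own validation. It assumes a kernel that carries this patch and has pidfd_open().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: syscall number on most architectures */
#endif
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000 /* fallback for older libc headers */
#endif

/* Hypothetical mask: the seven namespace types the commit message counts. */
#define ALL_NS_FLAGS (CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID | \
                      CLONE_NEWNET | CLONE_NEWIPC | CLONE_NEWUTS | \
                      CLONE_NEWCGROUP)

/* Returns 1 if flags is non-zero and contains only namespace bits, else 0. */
int pidfd_setns_flags_ok(int flags)
{
    return flags != 0 && (flags & ~ALL_NS_FLAGS) == 0;
}

/*
 * Open a pidfd for pid and attach to the requested namespaces with a
 * single setns() call. Returns 0 on success, -1 with errno set on failure.
 */
int attach_via_pidfd(pid_t pid, int flags)
{
    if (!pidfd_setns_flags_ok(flags)) {
        errno = EINVAL; /* rejected here, before any syscall is made */
        return -1;
    }

    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    /* All-or-nothing: either every requested namespace is joined or none. */
    int ret = setns(pidfd, flags);
    int saved_errno = errno;

    close(pidfd);
    errno = saved_errno;
    return ret;
}
```

A caller would use something like `attach_via_pidfd(container_pid, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET)`. A long-lived manager would more likely keep the pidfd around as its stable reference instead of reopening it by pid each time, since reopening by pid reintroduces the pid-reuse race.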

The obvious example where this is useful is a standard container manager interacting with a running container: pushing and pulling files or directories, injecting mounts, attaching/execing any kind of process, managing network devices. All these operations require attaching to all or at least multiple namespaces at the same time. Given that nowadays most containers are spawned with all namespaces enabled we're currently looking at at least 14 syscalls: 7 to open the /proc/<pid>/ns/<ns> nsfds, another 7 to actually perform the namespace switch. With time namespaces we're looking at about 16 syscalls.

(We could amortize the first 7 or 8 syscalls for opening the nsfds by stashing them in each container's monitor process but that would mean we need to send around those file descriptors through unix sockets every time we want to interact with the container or keep on-disk state. Even in scenarios where a caller wants to join a particular namespace in a particular order, callers still profit from batching other namespaces. That mostly applies to the user namespace but all container runtimes I found join the user namespace first no matter if it privileges or deprivileges the container, similar to how unshare behaves.)

With pidfds this becomes a single syscall no matter how many namespaces are supposed to be attached to.

A decently designed, large-scale container manager usually isn't the parent of any of the containers it spawns so the containers don't die when it crashes or needs to update or reinitialize. This means that for the manager to interact with containers through pids is inherently racy, especially on systems where the maximum pid number is not significantly bumped. This is even more problematic since we often spawn and manage thousands or tens of thousands of containers. Interacting with a container through a pid thus can become risky quite quickly.
Especially since we allow for an administrator to enable advanced features such as syscall interception where we're performing syscalls in lieu of the container. In all of those cases we use pidfds if they are available and we pass them around as stable references. Using them to setns() to the target process' namespaces is as reliable as using nsfds. Either the target process is already dead and we get ESRCH or we manage to attach to its namespaces but we can't accidentally attach to another process' namespaces. So pidfds lend themselves to being used with this API.

The other main advantage is that with this change the pidfd becomes the only relevant token for most container interactions and it's the only token we need to create and send around.

Apart from significantly reducing the number of syscalls from double digit to single digit, which is a decent reason post-spectre/meltdown, this also allows switching to a set of namespaces atomically, i.e. either attaching to all the specified namespaces succeeds or we fail. If we fail we haven't changed a single namespace. There are currently three namespaces that can fail (other than for ENOMEM which really is not very interesting since we then have other problems anyway) for non-trivial reasons: user, mount, and pid namespaces. We can fail to attach to a pid namespace if it is not our current active pid namespace or a descendant of it. We can fail to attach to a user namespace because we are multi-threaded or because our current mount namespace shares filesystem state with other tasks, or because we're trying to setns() to the same user namespace, i.e. the target task has the same user namespace as we do. We can fail to attach to a mount namespace because it shares filesystem state with other tasks or because we fail to look up the new root for the new mount namespace. In most non-pathological scenarios these issues can be somewhat mitigated.
But there are cases where we're half-attached to some namespace and failing to attach to another one. I've talked about some of these problems during the hallway track (something only the pre-COVID-19 generation will remember) of Plumbers in Los Angeles in 2018(?). Even if all these issues could be avoided with super careful userspace coding it would be nicer to have this done in-kernel. Pidfds seem to lend themselves nicely to this.

The other neat thing about this is that setns() becomes an actual counterpart to the namespace bits of unshare().

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-3-christian.brauner@ubuntu.com

2020-05-13 11:41:22 +02:00
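
For contrast, the nsfd-based sequence that the commit message counts (one open() of /proc/<pid>/ns/<ns> plus one setns() per namespace) might look roughly like the sketch below. It is an illustration only: `ns_path()` and `attach_legacy()` are hypothetical helper names, and the order in which namespaces are joined is left to the caller. The return value shows the half-attached problem: if a later step fails, the earlier switches have already happened.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Build "/proc/<pid>/ns/<name>" into buf. Returns the path length on
 * success, or -1 if the path does not fit into len bytes.
 */
int ns_path(char *buf, size_t len, pid_t pid, const char *name)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Join each named namespace of pid in the given order, using one open()
 * and one setns() per namespace. Returns how many namespaces were joined
 * before the first failure. Any result greater than 0 and smaller than
 * count means the caller is left half-attached.
 */
int attach_legacy(pid_t pid, const char *const names[], int count)
{
    int joined = 0;

    for (int i = 0; i < count; i++) {
        char path[64];

        if (ns_path(path, sizeof(path), pid, names[i]) < 0)
            break;

        int nsfd = open(path, O_RDONLY | O_CLOEXEC);
        if (nsfd < 0)
            break;

        /* nstype 0: accept whatever namespace type the nsfd refers to. */
        int ret = setns(nsfd, 0);

        close(nsfd);
        if (ret < 0)
            break;
        joined++;
    }
    return joined;
}
```

With seven names this issues the 14 syscalls mentioned above (not counting the close() calls), and nothing rolls back the namespaces already joined when a later setns() fails.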
arch	- Add -fasynchronous-unwind-tables to the vDSO CFLAGS.	2020-05-01 17:09:31 -07:00
block	block: remove the bd_openers checks in blk_drop_partitions	2020-04-30 10:25:43 -06:00
certs	.gitignore: add SPDX License Identifier	2020-03-25 11:50:48 +01:00
crypto	x86: update AS_* macros to binutils >=2.23, supporting ADX and AVX2	2020-04-09 00:12:48 +09:00
Documentation	dmaengine fixes for v5.7-rc4	2020-05-02 11:16:14 -07:00
drivers	IOMMU Fixes for Linux v5.7-rc3	2020-05-03 11:04:57 -07:00
fs	nsproxy: attach to namespaces via pidfds	2020-05-13 11:41:22 +02:00
include	nsproxy: attach to namespaces via pidfds	2020-05-13 11:41:22 +02:00
init	Kbuild updates for v5.7 (2nd)	2020-04-11 09:46:12 -07:00
ipc	nsproxy: add struct nsset	2020-05-09 13:57:12 +02:00
kernel	nsproxy: attach to namespaces via pidfds	2020-05-13 11:41:22 +02:00
lib	linux-kselftest-kunit-5.7-rc4	2020-04-30 16:32:47 -07:00
LICENSES	LICENSES: Rename other to deprecated	2019-05-03 06:34:32 -06:00
mm	mm: check that mm is still valid in madvise()	2020-04-24 13:28:03 -07:00
net	nsproxy: add struct nsset	2020-05-09 13:57:12 +02:00
samples	vmalloc: fix remap_vmalloc_range() bounds checks	2020-04-21 11:11:56 -07:00
scripts	Kbuild fixes for v5.7	2020-04-24 10:39:32 -07:00
security	selinux/stable-5.7 PR 20200430	2020-04-30 16:35:45 -07:00
sound	sound fixes for 5.7-rc4	2020-05-01 11:05:28 -07:00
tools	linux-kselftest-5.7-rc4	2020-04-30 16:28:49 -07:00
usr	kbuild: fix comment about missing include guard detection	2020-04-11 12:09:48 +09:00
virt	KVM: Pass kvm_init()'s opaque param to additional arch funcs	2020-03-31 10:48:03 -04:00
.clang-format	clang-format: Update with the latest for_each macro list	2020-04-18 13:49:33 +02:00
.cocciconfig
.get_maintainer.ignore	Opt out of scripts/get_maintainer.pl	2019-05-16 10:53:40 -07:00
.gitattributes	.gitattributes: use 'dts' diff driver for dts files	2019-12-04 19:44:11 -08:00
.gitignore	.gitignore: add SPDX License Identifier	2020-03-25 11:50:48 +01:00
.mailmap	mailmap: Add Sedat Dilek (replacement for expired email address)	2020-04-11 09:28:34 -07:00
COPYING	COPYING: state that all contributions really are covered by this file	2020-02-10 13:32:20 -08:00
CREDITS	MAINTAINERS: Hand MIPS over to Thomas	2020-02-24 22:43:18 -08:00
Kbuild	kbuild: rename hostprogs-y/always to hostprogs/always-y	2020-02-04 01:53:07 +09:00
Kconfig	docs: kbuild: convert docs to ReST and rename to *.rst	2019-06-14 14:21:21 -06:00
MAINTAINERS	for-5.7-rc3-tag	2020-05-03 11:30:08 -07:00
Makefile	Linux 5.7-rc4	2020-05-03 14:56:04 -07:00
README	Drop all 00-INDEX files from Documentation/	2018-09-09 15:08:58 -06:00

README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems that may result from upgrading your kernel.