linux/kernel
Christian Brauner b3e5838252
clone: add CLONE_PIDFD
This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the callers
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <christian@brauner.io>
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2019-05-07 14:31:03 +02:00
..
bpf xdp: fix cpumap redirect SKB creation bug 2019-03-29 12:15:02 -07:00
cgroup Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 14:08:19 -07:00
configs kvm_config: add CONFIG_VIRTIO_MENU 2018-10-24 20:55:56 -04:00
debug kdb: use bool for binary state indicators 2018-12-30 08:31:52 +00:00
dma memblock: drop memblock_alloc_*_nopanic() variants 2019-03-12 10:04:02 -07:00
events perf/core improvements and fixes: 2019-03-22 22:50:41 +01:00
gcov kernel/gcov/gcc_3_4.c: use struct_size() in kzalloc() 2019-03-07 18:32:02 -08:00
irq genirq: Mark expected switch case fall-through 2019-03-23 12:32:01 +01:00
livepatch Merge branch 'for-5.1/atomic-replace' into for-linus 2019-03-05 15:56:59 +01:00
locking locking/lockdep: Only call init_rcu_head() after RCU has been initialized 2019-03-09 14:15:51 +01:00
power treewide: add checks for the return value of memblock_alloc*() 2019-03-12 10:04:02 -07:00
printk fbdev changes for v5.1: 2019-03-15 14:22:59 -07:00
rcu Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-03-06 07:59:36 -08:00
sched Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-03-24 11:42:10 -07:00
time time/jiffies: Make refined_jiffies static 2019-03-22 13:38:26 +01:00
trace syscalls: Remove start and number from syscall_get_arguments() args 2019-04-05 09:26:43 -04:00
.gitignore kernel/configs: use .incbin directive to embed config_data.gz 2019-03-07 18:32:02 -08:00
acct.c
async.c async: Add support for queueing on specific NUMA node 2019-01-31 14:20:54 +01:00
audit_fsnotify.c audit: add syscall information to CONFIG_CHANGE records 2019-01-18 17:53:29 -05:00
audit_tree.c audit: hand taken context to audit_kill_trees for syscall logging 2019-01-14 18:01:05 -05:00
audit_watch.c audit: add syscall information to CONFIG_CHANGE records 2019-01-18 17:53:29 -05:00
audit.c audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL 2019-02-03 17:49:35 -05:00
audit.h audit: hide auditsc_get_stamp and audit_serial prototypes 2019-02-07 21:44:27 -05:00
auditfilter.c audit: mark expected switch fall-through 2019-02-12 20:17:13 -05:00
auditsc.c audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL 2019-02-03 17:49:35 -05:00
backtracetest.c
bounds.c kbuild: fix kernel/bounds.c 'W=1' warning 2018-10-31 08:54:14 -07:00
capability.c LSM: add SafeSetID module that gates setid calls 2019-01-25 11:22:43 -08:00
compat.c time: make adjtime compat handling available for 32 bit 2019-02-07 00:13:27 +01:00
configs.c kernel/configs: use .incbin directive to embed config_data.gz 2019-03-07 18:32:02 -08:00
context_tracking.c
cpu_pm.c
cpu.c cpu/hotplug: Prevent crash when CPU bringup fails on CONFIG_HOTPLUG_CPU=n 2019-03-28 13:34:58 +01:00
crash_core.c kexec: export PG_offline to VMCOREINFO 2019-03-05 21:07:14 -08:00
crash_dump.c
cred.c SELinux: Remove cred security blob poisoning 2019-01-08 13:18:44 -08:00
delayacct.c delayacct: track delays from thrashing cache pages 2018-10-26 16:26:32 -07:00
dma.c
elfcore.c
exec_domain.c
exit.c Merge branch 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2019-03-07 10:11:41 -08:00
extable.c
fail_function.c kernel/fail_function.c: remove meaningless null pointer check before debugfs_remove_recursive 2018-10-31 08:54:12 -07:00
fork.c clone: add CLONE_PIDFD 2019-05-07 14:31:03 +02:00
freezer.c PM / reboot: Eliminate race between reboot and suspend 2018-08-06 12:35:20 +02:00
futex.c futex: Ensure that futex address is aligned in handle_futex_death() 2019-03-22 13:05:26 +01:00
groups.c
hung_task.c kernel/hung_task.c: Use continuously blocked time when reporting. 2019-03-07 18:31:59 -08:00
iomem.c
irq_work.c
jump_label.c jump_label: move 'asm goto' support test to Kconfig 2019-01-06 09:46:51 +09:00
kallsyms.c bpf: Add module name [bpf] to ksymbols for bpf programs 2019-01-21 17:38:56 -03:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks bpf: introduce bpf_spin_lock 2019-02-01 20:55:38 +01:00
Kconfig.preempt kconfig: warn no new line at end of file 2018-12-15 17:44:35 +09:00
kcov.c kcov: convert kcov.refcount to refcount_t 2019-03-07 18:32:02 -08:00
kexec_core.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
kexec_file.c kexec_file: kexec_walk_memblock() only walks a dedicated region at kdump 2018-12-06 14:38:50 +00:00
kexec_internal.h
kexec.c
kmod.c
kprobes.c kprobes: Search non-suffixed symbol in blacklist 2019-02-13 08:16:40 +01:00
ksysfs.c
kthread.c Merge branch 'akpm' (patches from Andrew) 2019-03-06 10:31:36 -08:00
latencytop.c
Makefile kernel/configs: use .incbin directive to embed config_data.gz 2019-03-07 18:32:02 -08:00
memremap.c mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm 2018-12-28 12:11:52 -08:00
module_signing.c modsign: use all trusted keys to verify module signature 2018-11-07 14:41:41 +01:00
module-internal.h
module.c dynamic_debug: add static inline stub for ddebug_add_module 2019-03-07 18:32:00 -08:00
notifier.c
nsproxy.c
padata.c padata: clean an indentation issue, remove extraneous space 2018-11-16 14:11:04 +08:00
panic.c kernel/panic.c: taint: fix debugfs_simple_attr.cocci warnings 2019-03-07 18:31:59 -08:00
params.c
pid_namespace.c signal: Use group_send_sig_info to kill all processes in a pid namespace 2018-09-16 16:08:25 +02:00
pid.c Fix failure path in alloc_pid() 2018-12-28 12:42:30 -08:00
profile.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
ptrace.c ptrace: take into account saved_sigmask in PTRACE{GET,SET}SIGMASK 2019-03-29 10:01:37 -07:00
range.c
reboot.c kernel/reboot.c: export pm_power_off_prepare 2018-09-11 16:13:24 +01:00
relay.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-03-12 13:27:20 -07:00
resource.c device-dax for 5.1 2019-03-16 13:05:32 -07:00
rseq.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
seccomp.c syscalls: Remove start and number from syscall_get_arguments() args 2019-04-05 09:26:43 -04:00
signal.c signal: don't silently convert SI_USER signals to non-current pidfd 2019-04-01 23:03:18 +02:00
smp.c cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM 2019-01-30 19:27:00 +01:00
smpboot.c
smpboot.h
softirq.c softirq: Don't skip softirq execution when softirq thread is parking 2019-02-10 21:51:39 +01:00
stackleak.c stackleak: Mark stackleak_track_stack() as notrace 2018-12-05 19:31:44 -08:00
stacktrace.c
stop_machine.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-08-13 11:25:07 -07:00
sys_ni.c pidfd patches for v5.1-rc1 2019-03-16 13:47:14 -07:00
sys.c Merge branch 'akpm' (patches from Andrew) 2019-03-07 19:25:37 -08:00
sysctl_binary.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
sysctl.c kernel/sysctl.c: fix out-of-bounds access when setting file-max 2019-04-05 16:02:31 -10:00
task_work.c
taskstats.c
test_kprobes.c
torture.c Merge branches 'doc.2019.01.26a', 'fixes.2019.01.26a', 'sil.2019.01.26a', 'spdx.2019.02.09a', 'srcu.2019.01.26a' and 'torture.2019.01.26a' into HEAD 2019-02-09 08:47:52 -08:00
tracepoint.c tracing: Replace synchronize_sched() and call_rcu_sched() 2018-11-27 09:21:41 -08:00
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c umh: add exit routine for UMH process 2019-01-11 18:05:40 -08:00
up.c smp,cpumask: introduce on_each_cpu_cond_mask 2018-10-09 16:51:11 +02:00
user_namespace.c userns: also map extents in the reverse map to kernel IDs 2018-11-07 23:51:16 -06:00
user-return-notifier.c
user.c userns: use irqsave variant of refcount_dec_and_lock() 2018-08-22 10:52:47 -07:00
utsname_sysctl.c sys: don't hold uts_sem while accessing userspace memory 2018-08-11 02:05:53 -05:00
utsname.c
watchdog_hld.c watchdog: Mark watchdog touch functions as notrace 2018-08-30 12:56:40 +02:00
watchdog.c watchdog: Respect watchdog cpumask on CPU hotplug 2019-03-28 13:32:01 +01:00
workqueue_internal.h psi: fix aggregation idle shut-off 2019-02-01 15:46:23 -08:00
workqueue.c workqueue: Only unregister a registered lockdep key 2019-03-21 12:00:18 +01:00