linux/kernel
Serge E. Hallyn 4f41fc5962 cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:

When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk):

If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point.  Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.

Long version:

When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b.  The root dentry field of the mountinfo
entry will show '/a/b'.

 cat > /tmp/do1 << EOF
 mount -t cgroup -o freezer freezer /mnt
 grep freezer /proc/self/mountinfo
 EOF

 unshare -Gm  bash /tmp/do1
 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
 > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':

 grep freezer /proc/self/cgroup
 9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

 mount --bind /sys/fs/cgroup/freezer/a/b /mnt
 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
        awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

2653
2653                         # Our shell
14254                        # cat(1)
        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):

Look at the state of the /proc files:

        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:

                                      # propagating to other mountns
        /proc/self/cgroup:      7:freezer:/
        mountinfo:              /a/b    /mnt/freezer

The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace.  So the same steps as above:

        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /../..  /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /mnt/freezer

cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
cgroup.procs           freezer.self_freezing    notify_on_release
3164
2653                   # First shell that placed in this cgroup
3164                   # Shell started by 'unshare'
14197                  # cat(1)

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 12:15:03 -04:00
..
bpf bpf/verifier: reject invalid LD_ABS | BPF_DW instruction 2016-04-14 01:31:50 -04:00
configs kconfig: add xenconfig defconfig helper 2015-06-16 11:04:29 +01:00
debug mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings 2016-02-22 08:51:37 +01:00
events Merge branch 'PAGE_CACHE_SIZE-removal' 2016-04-04 10:50:24 -07:00
gcov gcov: use within_module() helper. 2015-12-04 22:46:25 +01:00
irq genirq: Dont allow affinity mask to be updated on IPIs 2016-04-21 12:05:15 +02:00
livepatch livepatch/module: remove livepatch module notifier 2016-03-17 09:45:10 +01:00
locking locking/pvqspinlock: Fix division by zero in qstat_read() 2016-04-19 10:49:19 +02:00
power Power management and ACPI material for v4.6-rc1, part 2 2016-03-24 22:59:58 -07:00
printk printk: add clear_idx symbol to vmcoreinfo 2016-03-17 15:09:34 -07:00
rcu kernel: add kcov code coverage 2016-03-22 15:36:02 -07:00
sched locking/atomic, sched: Unexport fetch_or() 2016-03-29 11:52:11 +02:00
time timers/nohz: Convert tick dependency mask to atomic_t 2016-03-29 11:52:11 +02:00
trace arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections 2016-03-25 16:37:42 -07:00
.gitignore certs: add .gitignore to stop git nagging about x509_certificate_list 2015-10-21 15:18:35 +01:00
acct.c
async.c async: export current_is_async() 2015-11-19 17:51:48 +01:00
audit_fsnotify.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
audit_tree.c audit: audit_tree_match can be boolean 2015-11-04 08:23:51 -05:00
audit_watch.c Merge branch 'stable-4.6' of git://git.infradead.org/users/pcmoore/audit 2016-03-19 17:52:49 -07:00
audit.c Merge branch 'stable-4.6' of git://git.infradead.org/users/pcmoore/audit 2016-03-19 17:52:49 -07:00
audit.h security: Make inode argument of inode_getsecid non-const 2015-12-24 11:09:39 -05:00
auditfilter.c audit: Fix typo in comment 2016-02-08 11:25:39 -05:00
auditsc.c auditsc: for seccomp events, log syscall compat state using in_compat_syscall 2016-03-22 15:36:02 -07:00
backtracetest.c
bounds.c
capability.c
cgroup_freezer.c cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends 2015-12-03 10:24:08 -05:00
cgroup_pids.c cgroup_pids: fix a typo. 2015-12-14 14:54:37 -05:00
cgroup.c cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces 2016-05-09 12:15:03 -04:00
compat.c
configs.c
context_tracking.c context_tracking: Switch to new static_branch API 2015-11-24 09:56:43 +01:00
cpu_pm.c kernel/cpu_pm: fix cpu_cluster_pm_exit comment 2015-09-03 02:42:20 +02:00
cpu.c cpu/hotplug: Fix rollback during error-out in __cpu_disable() 2016-04-22 09:49:49 +02:00
cpuset.c cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback 2016-04-25 15:45:14 -04:00
crash_dump.c
cred.c kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
delayacct.c kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
dma.c
elfcore.c
exec_domain.c
exit.c oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space 2016-03-25 16:37:42 -07:00
extable.c kernel/extable.c: remove duplicated include 2015-09-10 13:29:01 -07:00
fork.c kernel: add kcov code coverage 2016-03-22 15:36:02 -07:00
freezer.c
futex_compat.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-01-20 17:09:18 -08:00
futex.c futex: Acknowledge a new waiter in counter before plist 2016-04-21 11:06:09 +02:00
groups.c
hung_task.c kernel/hung_task.c: use timeout diff when timeout is updated 2016-03-22 15:36:02 -07:00
irq_work.c treewide: Remove old email address 2015-11-23 09:44:58 +01:00
jump_label.c treewide: Remove old email address 2015-11-23 09:44:58 +01:00
kallsyms.c kallsyms: add support for relative offsets in kallsyms address table 2016-03-15 16:55:16 -07:00
kcmp.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-01-20 17:09:18 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kernel: add kcov code coverage 2016-03-22 15:36:02 -07:00
kexec_core.c kexec: Set IORESOURCE_SYSTEM_RAM for System RAM 2016-01-30 09:49:57 +01:00
kexec_file.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security 2016-03-17 11:33:45 -07:00
kexec_internal.h kexec: move some memembers and definitions within the scope of CONFIG_KEXEC_FILE 2016-01-20 17:09:18 -08:00
kexec.c kexec: set KEXEC_TYPE_CRASH before sanity_check_segment_list() 2016-01-20 17:09:18 -08:00
kmod.c kmod: don't run async usermode helper as a child of kworker thread 2015-10-23 17:55:10 +09:00
kprobes.c perf/x86/hw_breakpoints: Disallow kernel breakpoints unless kprobe-safe 2015-08-04 10:16:54 +02:00
ksysfs.c rcu: Remove TINY_RCU bloat from pointless boot parameters 2015-12-07 16:59:37 -08:00
kthread.c kernel/kthread.c:kthread_create_on_node(): clarify documentation 2015-09-04 16:54:41 -07:00
latencytop.c sched/debug: Make schedstats a runtime tunable that is disabled by default 2016-02-09 11:54:23 +01:00
Makefile kernel: add kcov code coverage 2016-03-22 15:36:02 -07:00
membarrier.c sys_membarrier(): system-wide memory barrier (generic, x86) 2015-09-11 15:21:34 -07:00
memremap.c memremap: add MEMREMAP_WC flag 2016-03-22 15:36:02 -07:00
module_signing.c X.509: Make algo identifiers text instead of enum 2016-03-03 21:49:27 +00:00
module-internal.h
module.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching 2016-03-17 21:46:32 -07:00
notifier.c Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-09-01 08:40:25 -07:00
nsproxy.c cgroup: introduce cgroup namespaces 2016-02-16 13:04:58 -05:00
padata.c
panic.c panic: change nmi_panic from macro to function 2016-03-22 15:36:02 -07:00
params.c Nothing exciting, minor tweaks and cleanups. 2015-11-09 15:53:39 -08:00
pid_namespace.c
pid.c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-01-31 15:44:04 -08:00
profile.c profile: hide unused functions when !CONFIG_PROC_FS 2016-03-22 15:36:02 -07:00
ptrace.c ptrace: change __ptrace_unlink() to clear ->ptrace under ->siglock 2016-03-22 15:36:02 -07:00
range.c
reboot.c kexec: split kexec_load syscall from kexec core code 2015-09-10 13:29:01 -07:00
relay.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
resource.c /proc/iomem: only expose physical resource addresses to privileged users 2016-04-14 12:56:09 -07:00
seccomp.c seccomp: check in_compat_syscall, not is_compat_task, in strict mode 2016-03-22 15:36:02 -07:00
signal.c kernel/signal.c: add compile-time check for __ARCH_SI_PREAMBLE_SIZE 2016-03-22 15:36:02 -07:00
smp.c Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-03-15 13:50:29 -07:00
smpboot.c cpu/hotplug: Unpark smpboot threads from the state machine 2016-03-01 20:36:56 +01:00
smpboot.h cpu/hotplug: Create hotplug threads 2016-03-01 20:36:56 +01:00
softirq.c arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections 2016-03-25 16:37:42 -07:00
stacktrace.c
stop_machine.c kernel/stop_machine.c: remove CONFIG_SMP dependencies 2016-01-16 11:17:24 -08:00
sys_ni.c vfs: add copy_file_range syscall and vfs helper 2015-12-01 14:00:53 -05:00
sys.c timer: convert timer_slack_ns from unsigned long to u64 2016-03-17 15:09:34 -07:00
sysctl_binary.c fs/coredump: prevent fsuid=0 dumps into user-controlled directories 2016-03-22 15:36:02 -07:00
sysctl.c mm: scale kswapd watermarks in proportion to memory 2016-03-17 15:09:34 -07:00
task_work.c task_work: remove fifo ordering guarantee 2015-09-05 13:46:58 -07:00
taskstats.c
test_kprobes.c
torture.c torture: Consolidate cond_resched_rcu_qs() into stutter_wait() 2015-10-06 11:25:01 -07:00
tracepoint.c kernel/...: convert pr_warning to pr_warn 2016-03-22 15:36:02 -07:00
tsacct.c time, acct: Drop irq save & restore from __acct_update_integrals() 2016-02-29 09:53:09 +01:00
uid16.c
up.c
user_namespace.c kernel/*: switch to memdup_user_nul() 2016-01-04 10:27:55 -05:00
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c
watchdog.c watchdog: don't run proc_watchdog_update if new value is same as old 2016-03-17 15:09:34 -07:00
workqueue_internal.h sched/core: Get rid of 'cpu' argument in wq_worker_sleeping() 2016-03-02 10:28:47 -05:00
workqueue.c Merge branch 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2016-03-18 20:05:39 -07:00