linux/kernel
Tejun Heo 2bd59d48eb cgroup: convert to kernfs
cgroup filesystem code was derived from the original sysfs
implementation which was heavily intertwined with vfs objects and
locking with the goal of re-using the existing vfs infrastructure.
That experiment turned out rather disastrous and sysfs switched, a
long time ago, to distributed filesystem model where a separate
representation is maintained which is queried by vfs.  Unfortunately,
cgroup stuck with the failed experiment all these years and
accumulated even more problems over time.

Locking and object lifetime management being entangled with vfs is
probably the most egregious.  vfs is never designed to be misused like
this and cgroup ends up jumping through various convoluted dancing to
make things work.  Even then, operations across multiple cgroups can't
be done safely as it'll deadlock with rename locking.

Recently, kernfs is separated out from sysfs so that it can be used by
users other than sysfs.  This patch converts cgroup to use kernfs,
which will bring the following benefits.

* Separation from vfs internals.  Locking and object lifetime
  management is contained in cgroup proper making things a lot
  simpler.  This removes significant amount of locking convolutions,
  hairy object lifetime rules and the restriction on multi-cgroup
  operations.

* Can drop a lot of code to implement filesystem interface as most are
  provided by kernfs.

* Proper "severing" semantics, which allows controllers to not worry
  about lingering file accesses after offline.

While the preceding patches did as much as possible to make the
transition less painful, large part of the conversion has to be one
discrete step making this patch rather large.  The rest of the commit
message lists notable changes in different areas.

Overall
-------

* vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
  cgroupfs_root->sb w/ ->kf_root.

* All dentry accessors are removed.  Helpers to map from kernfs
  constructs are added.

* All vfs plumbing around dentry, inode and bdi removed.

* cgroup_mount() now directly looks for matching root and then
  proceeds to create a new one if not found.

Synchronization and object lifetime
-----------------------------------

* vfs inode locking removed.  Among other things, this removes the
  need for the convolution in cgroup_cfts_commit().  Future patches
  will further simplify it.

* vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
  cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
  root->refcnt and when it reaches zero proceeds to destroy it thus
  merging cgroup_put_root() and the former cgroup_kill_sb().
  Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
  when refcnt reaches zero.

* Unlike before, kernfs objects don't hold onto cgroup objects.  When
  cgroup destroys a kernfs node, all existing operations are drained
  and the association is broken immediately.  The same for
  cgroupfs_roots and mounts.

* All operations which come through kernfs guarantee that the
  associated cgroup is and stays valid for the duration of operation;
  however, there are two paths which need to find out the associated
  cgroup from dentry without going through kernfs -
  css_tryget_from_dir() and cgroupstats_build().  For these two,
  kernfs_node->priv is RCU managed so that they can dereference it
  under RCU read lock.

File and directory handling
---------------------------

* File and directory operations converted to kernfs_ops and
  kernfs_syscall_ops.

* xattrs is implicitly supported by kernfs.  No need to worry about it
  from cgroup.  This means that "xattr" mount option is no longer
  necessary.  A future patch will add a deprecated warning message
  when sane_behavior.

* When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
  private copy of one of the kernfs_ops to set its atomic_write_len.
  cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
  to handle it.

* cftype->lockdep_key added so that kernfs lockdep annotation can be
  per cftype.

* Inidividual file entries and open states are now managed by kernfs.
  No need to worry about them from cgroup.  cfent, cgroup_open_file
  and their friends are removed.

* kernfs_nodes are created deactivated and kernfs_activate()
  invocations added to places where creation of new nodes are
  committed.

* cgroup_rmdir() uses kernfs_[un]break_active_protection() for
  self-removal.

v2: - Li pointed out in an earlier patch that specifying "name="
      during mount without subsystem specification should succeed if
      there's an existing hierarchy with a matching name although it
      should fail with -EINVAL if a new hierarchy should be created.
      Prior to the conversion, this used by handled by deferring
      failure from NULL return from cgroup_root_from_opts(), which was
      necessary because root was being created before checking for
      existing ones.  Note that cgroup_root_from_opts() returned an
      ERR_PTR() value for error conditions which require immediate
      mount failure.

      As we now have separate search and creation steps, deferring
      failure from cgroup_root_from_opts() is no longer necessary.
      cgroup_root_from_opts() is updated to always return ERR_PTR()
      value on failure.

    - The logic to match existing roots is updated so that a mount
      attempt with a matching name but different subsys_mask are
      rejected.  This was handled by a separate matching loop under
      the comment "Check for name clashes with existing mounts" but
      got lost during conversion.  Merge the check into the main
      search loop.

    - Add __rcu __force casting in RCU_INIT_POINTER() in
      cgroup_destroy_locked() to avoid the sparse address space
      warning reported by kbuild test bot.  Maybe we want an explicit
      interface to use kn->priv as RCU protected pointer?

v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.

v4: Rebased on top of 0ab02ca8f8 ("cgroup: protect modifications to
    cgroup_idr with cgroup_mutex").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot fengguang.wu@intel.com>
2014-02-11 11:52:49 -05:00
..
cpu sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding 2014-01-13 17:38:55 +01:00
debug kgdb/kdb: Fix no KDB config problem 2014-01-25 08:55:09 +01:00
events cgroup: improve css_from_dir() into css_tryget_from_dir() 2014-02-11 11:52:47 -05:00
gcov gcov: reuse kbasename helper 2013-11-13 12:09:34 +09:00
irq Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-12-02 10:15:39 -08:00
locking Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-20 10:42:08 -08:00
power arm, pm, vmpressure: add missing slab.h includes 2014-02-03 13:24:01 -05:00
printk printk: flush conflicting continuation line 2014-01-23 16:36:56 -08:00
rcu Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-01-28 08:38:04 -08:00
sched cgroup: clean up cgroup_subsys names and initialization 2014-02-08 10:36:58 -05:00
time Merge branch 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/urgent 2014-01-25 08:27:26 +01:00
trace Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block 2014-01-30 11:19:05 -08:00
.gitignore Ignore generated file kernel/x509_certificate_list 2013-12-10 18:21:34 +00:00
acct.c
async.c
audit_tree.c fsnotify: remove pointless NULL initializers 2014-01-21 16:19:41 -08:00
audit_watch.c fsnotify: remove pointless NULL initializers 2014-01-21 16:19:41 -08:00
audit.c audit: fix location of __net_initdata for audit_net_ops 2014-01-17 17:14:32 -05:00
audit.h audit: Convert int limit uses to u32 2014-01-14 14:54:00 -05:00
auditfilter.c audit: log on errors from filter user rules 2014-01-13 22:32:31 -05:00
auditsc.c audit: fix dangling keywords in audit_log_set_loginuid() output 2014-01-13 22:32:38 -05:00
backtracetest.c
bounds.c mm: do not allocate page->ptl dynamically, if spinlock_t fits to long 2013-12-20 12:25:45 -08:00
capability.c audit: Simplify and correct audit_log_capset 2014-01-13 22:26:48 -05:00
cgroup_freezer.c cgroup: clean up cgroup_subsys names and initialization 2014-02-08 10:36:58 -05:00
cgroup.c cgroup: convert to kernfs 2014-02-11 11:52:49 -05:00
compat.c
configs.c
context_tracking.c context_tracking: Wrap static key check into more intuitive function name 2013-12-02 20:43:14 +01:00
cpu_pm.c
cpu.c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-11-14 16:55:11 +09:00
cpuset.c cgroup: clean up cgroup_subsys names and initialization 2014-02-08 10:36:58 -05:00
crash_dump.c
cred.c
delayacct.c kernel/delayacct.c: remove redundant checking in __delayacct_add_tsk() 2013-11-13 12:09:12 +09:00
dma.c
elfcore.c switch elf_core_write_extra_phdrs() to dump_emit() 2013-11-09 00:16:23 -05:00
exec_domain.c
exit.c introduce for_each_thread() to replace the buggy while_each_thread() 2014-01-21 16:19:46 -08:00
extable.c kernel/extable: fix address-checks for core_kernel and init areas 2013-11-28 09:49:41 -08:00
fork.c exec: kill task_struct->did_exec 2014-01-23 16:37:02 -08:00
freezer.c libata, freezer: avoid block device removal while system is frozen 2013-12-19 13:50:32 -05:00
futex_compat.c
futex.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-20 10:42:08 -08:00
groups.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
hrtimer.c sched/deadline: Add SCHED_DEADLINE structures & implementation 2014-01-13 13:41:06 +01:00
hung_task.c hung_task: Display every hung task warning 2014-01-25 12:13:33 +01:00
irq_work.c
itimer.c
jump_label.c static_key: WARN on usage before jump_label_init was called 2013-10-19 19:45:35 -04:00
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS 2013-11-15 09:32:22 +09:00
Kconfig.locks
Kconfig.preempt
kexec.c kernel/kexec.c: use vscnprintf() instead of vsnprintf() in vmcoreinfo_append_str() 2014-01-27 21:02:40 -08:00
kmod.c kernel/kmod.c: check for NULL in call_usermodehelper_exec() 2013-09-30 14:31:02 -07:00
kprobes.c kprobes: use KSYM_NAME_LEN to size identifier buffers 2013-11-13 12:09:26 +09:00
ksysfs.c kdump: fix exported size of vmcoreinfo note 2014-01-23 16:37:03 -08:00
kthread.c kthread: make kthread_create() killable 2013-11-13 12:08:59 +09:00
latencytop.c
Makefile KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y 2013-12-13 15:59:11 +00:00
module_signing.c keys: change asymmetric keys to use common hash definitions 2013-10-25 17:15:18 -04:00
module-internal.h KEYS: Separate the kernel signature checking keyring from module signing 2013-09-25 17:17:01 +01:00
module.c module: Add missing newline in printk call. 2014-01-21 09:59:16 +10:30
notifier.c
nsproxy.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2013-09-07 14:35:32 -07:00
padata.c padata: Fix wrong usage of rcu_dereference() 2013-12-05 21:28:42 +08:00
panic.c panic: Make panic_timeout configurable 2013-11-26 12:12:26 +01:00
params.c params: improve standard definitions 2013-12-04 14:09:46 +10:30
pid_namespace.c pid_namespace: make freeing struct pid_namespace rcu-delayed 2013-10-24 23:43:29 -04:00
pid.c pidns: fix free_pid() to handle the first fork failure 2013-09-30 14:31:03 -07:00
posix-cpu-timers.c posix-timers: Convert abuses of BUG_ON to WARN_ON 2013-12-09 16:56:29 +01:00
posix-timers.c
profile.c kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
ptrace.c exec/ptrace: fix get_dumpable() incorrect tests 2013-11-13 12:09:33 +09:00
range.c
reboot.c kexec: migrate to reboot cpu 2013-12-18 19:04:50 -08:00
relay.c kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
res_counter.c memcg: reduce function dereference 2013-09-12 15:38:02 -07:00
resource.c
seccomp.c
signal.c kernel/signal.c: change do_signal_stop/do_sigaction to use while_each_thread() 2014-01-23 16:37:02 -08:00
smp.c kernel/smp.c: remove cpumask_ipi 2014-01-30 16:56:54 -08:00
smpboot.c kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
smpboot.h
softirq.c Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-31 09:02:51 -08:00
stacktrace.c
stop_machine.c stop_machine: Fix race between stop_two_cpus() and stop_cpus() 2013-11-11 12:43:38 +01:00
sys_ni.c
sys.c kernel/sys.c: k_getrusage() can use while_each_thread() 2014-01-23 16:37:02 -08:00
sysctl_binary.c kernel/sysctl_binary.c: use scnprintf() instead of snprintf() 2013-11-13 12:09:33 +09:00
sysctl.c Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-31 08:59:46 -08:00
system_certificates.S KEYS: correct alignment of system_certificate_list content in assembly file 2013-12-10 18:25:28 +00:00
system_keyring.c KEYS: correct alignment of system_certificate_list content in assembly file 2013-12-10 18:25:28 +00:00
task_work.c task_work: documentation 2013-09-11 15:58:27 -07:00
taskstats.c genetlink: only pass array to genl_register_family_with_ops() 2013-11-19 16:39:05 -05:00
test_kprobes.c
time.c
timeconst.bc
timer.c timer: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) 2013-11-19 14:59:50 +01:00
tracepoint.c
tsacct.c
uid16.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
up.c kernel: provide a __smp_call_function_single stub for !CONFIG_SMP 2013-11-15 09:32:22 +09:00
user_namespace.c KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches 2013-09-24 10:35:19 +01:00
user-return-notifier.c
user.c KEYS: fix uninitialized persistent_keyring_register_sem 2013-12-13 15:59:11 +00:00
utsname_sysctl.c
utsname.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
watchdog.c watchdog: update watchdog_thresh properly 2013-09-24 17:00:25 -07:00
workqueue_internal.h
workqueue.c Merge branch 'for-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2014-01-21 17:46:31 -08:00