linux/fs/proc
Nick Piggin 31e6b01f41 fs: rcu-walk for path lookup
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.

This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.

The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
  of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
  not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
  access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
  refcounts are not required for persistence. Also we are free to perform mount
  lookups, and to assume dentry mount points and mount roots are stable up and
  down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
  so we can load this tuple atomically, and also check whether any of its
  members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
  sequence after the child is found in case anything changed in the parent
  during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
  limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.

When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.

Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).

The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links

In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.

Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.

Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07 17:50:27 +11:00
..
array.c CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials 2010-07-29 15:16:17 -07:00
base.c fs: change d_delete semantics 2011-01-07 17:50:18 +11:00
cmdline.c proc: switch /proc/cmdline to seq_file 2008-10-23 14:29:04 +04:00
cpuinfo.c proc: move /proc/cpuinfo code to fs/proc/cpuinfo.c 2008-10-23 15:05:11 +04:00
devices.c proc: move /proc/devices code to fs/proc/devices.c 2008-10-23 15:02:18 +04:00
generic.c fs: change d_delete semantics 2011-01-07 17:50:18 +11:00
inode.c fs: icache RCU free inodes 2011-01-07 17:50:26 +11:00
internal.h proc: rename de_get() to pde_get() and inline it 2009-12-16 07:19:57 -08:00
interrupts.c proc: move /proc/interrupts boilerplate code to fs/proc/interrupts.c 2008-10-23 15:15:46 +04:00
Kconfig vmcore: it is not experimental any more 2010-10-26 16:52:05 -07:00
kcore.c kcore: add _text to KCORE_TEXT 2010-05-27 09:12:47 -07:00
kmsg.c procfs: Use generic_file_llseek in /proc/kmsg 2010-04-09 16:35:41 +02:00
loadavg.c sched, timers: cleanup avenrun users 2009-05-15 15:32:45 +02:00
Makefile procfs: simplify conditional processing of fs/proc.o. 2010-08-11 08:59:20 -07:00
meminfo.c hwpoison: fix/proc/meminfo alignment 2009-10-29 07:39:25 -07:00
mmu.c fs/proc/mmu.c: headers butchery 2007-10-17 08:42:48 -07:00
nommu.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
page.c proc: export uncached bit properly in /proc/kpageflags 2010-09-09 18:57:23 -07:00
proc_devtree.c of: Drop properties with "/" in their name 2010-06-13 18:12:24 -06:00
proc_net.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
proc_sysctl.c fs: rcu-walk for path lookup 2011-01-07 17:50:27 +11:00
proc_tty.c Revert "tty: Add a new file /proc/tty/consoles" 2010-10-23 08:14:12 -07:00
root.c switch procfs to ->mount() 2010-10-29 04:17:01 -04:00
softirqs.c procfs: fix /proc/softirqs formatting 2010-10-27 18:03:13 -07:00
stat.c /proc/stat: fix scalability of irq sum of all cpu 2010-10-27 18:03:13 -07:00
task_mmu.c pagemap: set pagemap walk limit to PMD boundary 2010-11-25 06:50:46 +09:00
task_nommu.c nommu: add '[stack]' label to /proc/pid/maps output 2010-06-29 15:29:30 -07:00
uptime.c [PATCH] Fix idle time field in /proc/uptime 2009-09-24 10:16:24 +02:00
version.c proc: switch /proc/version to seq_file 2008-10-23 14:19:58 +04:00
vmcore.c /proc/vmcore: fix seeking 2010-09-22 17:22:38 -07:00