linux/fs
Nick Piggin 6416ccb789 fs: scale files_lock

Improve scalability of files_lock by adding per-cpu, per-sb files lists,
protected with an lglock. The lglock provides fast access to the per-cpu lists
to add and remove files. It also provides a snapshot of all the per-cpu lists
(although this is very slow).
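
The idea above can be sketched in user space. This is only an illustration, assuming plain pthread mutexes stand in for the kernel's lglock and a fixed CPU count; the names `file_node`, `percpu_list`, `list_add_local` and `count_all` are hypothetical and are not taken from the patch.

```c
#include <pthread.h>
#include <stddef.h>

#define NR_CPUS 4   /* assumption: small fixed CPU count for the sketch */

struct file_node {
    struct file_node *prev, *next;
};

struct percpu_list {
    pthread_mutex_t lock;     /* stands in for one CPU's part of the lglock */
    struct file_node head;    /* sentinel of a circular doubly linked list */
};

static struct percpu_list lists[NR_CPUS];

static void lists_init(void)
{
    for (int i = 0; i < NR_CPUS; i++) {
        pthread_mutex_init(&lists[i].lock, NULL);
        lists[i].head.prev = lists[i].head.next = &lists[i].head;
    }
}

/* Fast path: only the lock of the given CPU's list is taken. */
static void list_add_local(int cpu, struct file_node *f)
{
    struct percpu_list *l = &lists[cpu];

    pthread_mutex_lock(&l->lock);
    f->next = l->head.next;
    f->prev = &l->head;
    l->head.next->prev = f;
    l->head.next = f;
    pthread_mutex_unlock(&l->lock);
}

/* Slow path: take every per-CPU lock to see a consistent snapshot. */
static size_t count_all(void)
{
    size_t n = 0;

    for (int i = 0; i < NR_CPUS; i++)
        pthread_mutex_lock(&lists[i].lock);
    for (int i = 0; i < NR_CPUS; i++)
        for (struct file_node *p = lists[i].head.next;
             p != &lists[i].head; p = p->next)
            n++;
    for (int i = NR_CPUS - 1; i >= 0; i--)
        pthread_mutex_unlock(&lists[i].lock);
    return n;
}
```

Adding a file touches one lock and one list, while the snapshot pays for all NR_CPUS locks, which is why it is the slow operation.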

One difficulty with this approach is that a file can be removed from the list
by another CPU. We must track which per-cpu list the file is on with a new
variable in the file struct (packed into a hole on 64-bit archs). Scalability
could suffer if files are frequently removed from a different CPU's list.
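
A rough user-space sketch of that tracking follows. It assumes pthread mutexes in place of the lglock and that only one thread ever removes a given file; `sketch_file`, `list_cpu`, `sketch_add` and `sketch_del` are hypothetical names chosen for illustration, not the identifiers used in the patch.

```c
#include <pthread.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4   /* assumption: small fixed CPU count */

struct sketch_file {
    struct sketch_file *prev, *next;
    int list_cpu;              /* which per-CPU list this file was added to */
};

struct sketch_list {
    pthread_mutex_t lock;
    struct sketch_file head;   /* circular list sentinel */
    unsigned long len;
};

static struct sketch_list sketch_lists[SKETCH_NR_CPUS];

static void sketch_init(void)
{
    for (int i = 0; i < SKETCH_NR_CPUS; i++) {
        struct sketch_list *l = &sketch_lists[i];

        pthread_mutex_init(&l->lock, NULL);
        l->head.prev = l->head.next = &l->head;
        l->len = 0;
    }
}

/* Add on the current CPU's list and remember which list that was. */
static void sketch_add(int this_cpu, struct sketch_file *f)
{
    struct sketch_list *l = &sketch_lists[this_cpu];

    pthread_mutex_lock(&l->lock);
    f->list_cpu = this_cpu;
    f->next = l->head.next;
    f->prev = &l->head;
    l->head.next->prev = f;
    l->head.next = f;
    l->len++;
    pthread_mutex_unlock(&l->lock);
}

/*
 * Remove from whichever list the file is on, which may belong to another
 * CPU. Returns 1 if the remover is the CPU that added the file, else 0.
 */
static int sketch_del(int this_cpu, struct sketch_file *f)
{
    int owner = f->list_cpu;
    struct sketch_list *l = &sketch_lists[owner];

    pthread_mutex_lock(&l->lock);
    f->prev->next = f->next;
    f->next->prev = f->prev;
    f->prev = f->next = f;
    l->len--;
    pthread_mutex_unlock(&l->lock);
    return owner == this_cpu;
}
```

The removal always locks the list recorded in the file, so a cross-CPU removal contends only with the one CPU that owns that list rather than with every CPU.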

However, loads with frequent removal of files imply a short interval between
adding and removing the files, and the scheduler attempts to avoid moving
processes too far away. Also, even in the case of cross-CPU removal, the
hardware has much more opportunity to parallelise cacheline transfers with N
cachelines than with 1.

A worst-case test of 1 CPU allocating files that are subsequently freed by N
CPUs degenerates to contending on a single lock, which is no worse than before.
When more than one CPU is allocating files, even if they are always freed by
different CPUs, there will be more parallelism than in the single-lock case.

Testing results:

On a 2-socket, 8-core Opteron, I measured the number of times the lock is taken
to remove the file, the number of times it is removed by the same CPU that
added it, and the number of times it is removed by the same node that added it.

Booting:    locks=  25049 cpu-hits=  23174 (92.5%) node-hits=  23945 (95.6%)
kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
dbench 64   locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

So a file is removed by the same CPU that added it over 90% of the time.
It remains within the same node over 95% of the time.
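
The percentages in the table are simply hits divided by lock acquisitions. A small helper to recompute them might look like this; `hit_permille` is a hypothetical name, and it returns tenths of a percent (rounded to nearest) to stay in integer arithmetic.

```c
/*
 * Hit rate in tenths of a percent, rounded to nearest.
 * Example: 925 means 92.5%.
 */
static unsigned long long hit_permille(unsigned long long hits,
                                       unsigned long long locks)
{
    if (locks == 0)
        return 0;   /* avoid division by zero when nothing was measured */
    return (hits * 1000ULL + locks / 2) / locks;
}
```

With the boot figures, `hit_permille(23174, 25049)` gives 925, matching the 92.5% cpu-hit rate reported above.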

Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                throughput
2.6.34-rc2      24.5
+patch          24.9

                us      sys     idle    IO wait (in %)
2.6.34-rc2      51.25   28.25   17.25   3.25
+patch          53.75   18.5    19      8.75

So significantly less CPU time is spent in kernel code, with higher idle time
and slightly higher throughput.

The single-threaded performance difference was within the noise of
microbenchmarks. That is not to say a penalty does not exist: the code is
larger and more memory accesses are required, so it will be slightly slower.

Cc: linux-kernel@vger.kernel.org
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-18 08:35:48 -04:00
9p v9fs: fixup for inode_setattr being removed 2010-08-11 00:08:00 -04:00
adfs check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
affs AFFS: wait for sb synchronization when needed 2010-08-09 16:48:51 -04:00
afs Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 2010-08-13 10:37:30 -07:00
autofs autofs/autofs4: Move compat_ioctl handling into fs 2010-08-09 00:13:34 +02:00
autofs4 autofs4: remove unneeded null check in try_to_fill_dentry() 2010-08-11 08:59:06 -07:00
befs fix typos concerning "initiali[zs]e" 2010-06-16 18:05:05 +02:00
bfs BFS: clean up the superblock usage 2010-08-09 16:48:53 -04:00
btrfs Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block 2010-08-10 15:22:42 -07:00
cachefiles Add a dummy printk function for the maintenance of unused printks 2010-08-12 09:51:35 -07:00
ceph ceph: generalize mon requests, add pool op support 2010-08-10 14:41:25 -07:00
cifs cifs: update README to include details about 'fsc' option 2010-08-11 17:11:28 +00:00
coda Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block 2010-08-10 15:22:42 -07:00
configfs fix setattr error handling in sysfs, configfs 2010-06-04 17:16:29 -04:00
cramfs cramfs: only unlock new inodes 2010-08-18 01:01:33 -04:00
debugfs Add x64 support to debugfs 2010-05-19 22:41:57 -04:00
devpts Simplify devpts_get_sb() failure exits 2010-05-21 18:31:12 -04:00
dlm fs/dlm: Drop unnecessary null test 2010-08-05 14:23:45 -05:00
ecryptfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6 2010-08-10 12:14:39 -07:00
efs
exofs Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd 2010-08-11 09:19:43 -07:00
exportfs
ext2 mbcache: Remove unused features 2010-08-09 16:48:45 -04:00
ext3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
ext4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
fat remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
freevxfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
fscache Add a dummy printk function for the maintenance of unused printks 2010-08-12 09:51:35 -07:00
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
gfs2 Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block 2010-08-10 15:22:42 -07:00
hfs convert remaining ->clear_inode() to ->evict_inode() 2010-08-09 16:48:37 -04:00
hfsplus convert remaining ->clear_inode() to ->evict_inode() 2010-08-09 16:48:37 -04:00
hostfs hostfs ->follow_link() braino 2010-08-18 06:21:10 -04:00
hpfs switch hpfs to ->evict_inode() 2010-08-09 16:48:17 -04:00
hppfs switch hppfs to ->evict_inode() 2010-08-09 16:48:16 -04:00
hugetlbfs new helper: end_writeback() 2010-08-09 16:47:49 -04:00
isofs isofs: Fix lseek() to position beyond 4 GB 2010-08-11 00:29:47 -04:00
jbd remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
jbd2 remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
jffs2 Merge git://git.infradead.org/mtd-2.6 2010-08-10 11:49:21 -07:00
jfs jfs: don't allow os2 xattr namespace overlap with others 2010-08-10 15:33:09 -07:00
lockd
logfs logfs: kill BKL 2010-08-14 00:24:24 +02:00
minix switch minix to ->evict_inode(), fix write_inode/delete_inode race 2010-08-09 16:47:53 -04:00
ncpfs Merge branch 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing 2010-08-10 13:58:28 -07:00
nfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 2010-08-13 10:37:30 -07:00
nfs_common
nfsd Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify 2010-08-10 11:39:13 -07:00
nilfs2 kill BH_Ordered flag 2010-08-18 01:09:00 -04:00
nls
notify Revert "fsnotify: store struct file not struct path" 2010-08-12 14:23:04 -07:00
ntfs convert remaining ->clear_inode() to ->evict_inode() 2010-08-09 16:48:37 -04:00
ocfs2 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 2010-08-13 10:43:50 -07:00
omfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bcopeland/omfs 2010-08-10 11:47:36 -07:00
openpromfs
partitions [S390] partitions: fix build error in ibm partition detection code 2010-08-13 10:06:55 +02:00
proc mm: fix up some user-visible effects of the stack guard page 2010-08-15 11:35:52 -07:00
qnx4 get rid of cont_write_begin_newtrunc 2010-08-09 16:47:31 -04:00
quota Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
ramfs check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
reiserfs remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
romfs
smbfs switch smbfs to evict_inode() 2010-08-09 16:48:00 -04:00
squashfs Squashfs: fix checkpatch.pl warnings 2010-08-08 22:29:33 +00:00
sysfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
sysv fs/sysv/super.c: add support for non-PDP11 v7 filesystems 2010-08-11 08:59:23 -07:00
ubifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
udf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
ufs remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
xfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
aio.c aio: fix wrong subsystem comments 2010-08-05 13:21:23 -07:00
anon_inodes.c Revert "anon_inode: set S_IFREG on the anon_inode" 2010-05-27 22:03:05 -04:00
attr.c check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
bad_inode.c bkl: Remove locked .ioctl file operation 2010-08-14 00:24:24 +02:00
binfmt_aout.c
binfmt_elf_fdpic.c binfmt_elf_fdpic: Fix clear_user() error handling 2010-06-01 08:11:06 -07:00
binfmt_elf.c
binfmt_em86.c
binfmt_flat.c flat: tweak default stack alignment 2010-06-29 15:29:31 -07:00
binfmt_misc.c convert remaining ->clear_inode() to ->evict_inode() 2010-08-09 16:48:37 -04:00
binfmt_script.c
binfmt_som.c
bio-integrity.c
bio.c block: unify flags for struct bio and struct request 2010-08-07 18:20:39 +02:00
block_dev.c blkdev: cgroup whitelist permission fix 2010-08-11 08:59:18 -07:00
buffer.c remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
char_dev.c Fix init ordering of /dev/console vs callers of modprobe 2010-08-06 09:17:02 -07:00
compat_binfmt_elf.c
compat_ioctl.c bkl: Remove locked .ioctl file operation 2010-08-14 00:24:24 +02:00
compat.c Mark arguments to certain syscalls as being const 2010-08-13 16:53:13 -07:00
dcache.c fs: remove extra lookup in __lookup_hash 2010-08-18 08:35:47 -04:00
dcookies.c
direct-io.c sort out blockdev_direct_IO variants 2010-08-09 16:47:29 -04:00
drop_caches.c simplify checks for I_CLEAR/I_FREEING 2010-08-09 16:47:44 -04:00
eventfd.c
eventpoll.c sched, wait: Use wrapper functions 2010-05-11 17:43:58 +02:00
exec.c fs: fs_struct rwlock to spinlock 2010-08-18 08:35:46 -04:00
fcntl.c vfs: O_* bit numbers uniqueness check 2010-08-11 08:59:02 -07:00
fifo.c
file_table.c fs: scale files_lock 2010-08-18 08:35:48 -04:00
file.c vfs: use kmalloc() to allocate fdmem if possible 2010-08-11 08:59:02 -07:00
filesystems.c
fs_struct.c fs: fs_struct rwlock to spinlock 2010-08-18 08:35:46 -04:00
fs-writeback.c mm: fix writeback_in_progress() 2010-08-12 08:43:30 -07:00
generic_acl.c vfs: update ctime when changing the file's permission by setfacl 2010-08-18 01:04:22 -04:00
inode.c Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify 2010-08-10 11:39:13 -07:00
internal.h tty: fix fu_list abuse 2010-08-18 08:35:47 -04:00
ioctl.c bkl: Remove locked .ioctl file operation 2010-08-14 00:24:24 +02:00
ioprio.c
Kconfig fs/Kconfig: Fix typo Userpace -> Userspace 2010-07-20 17:30:22 +02:00
Kconfig.binfmt
libfs.c check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
locks.c
Makefile Take statfs variants to fs/statfs.c 2010-05-21 18:31:17 -04:00
mbcache.c mbcache: Limit the maximum number of cache entries 2010-08-18 06:24:41 -04:00
mpage.c
namei.c fs: remove extra lookup in __lookup_hash 2010-08-18 08:35:47 -04:00
namespace.c vfs: remove unused MNT_STRICTATIME 2010-08-11 00:29:47 -04:00
nfsctl.c
no-block.c
open.c fs: cleanup files_lock locking 2010-08-18 08:35:47 -04:00
pipe.c pipe: fix check in "set size" fcntl 2010-06-10 19:08:34 +02:00
pnode.c
pnode.h
posix_acl.c
read_write.c fsnotify: pass a file instead of an inode to open, read, and write 2010-07-28 09:58:32 -04:00
read_write.h
readdir.c vfs: fix warning: 'dirent' is used uninitialized in this function 2010-08-09 20:45:05 -07:00
select.c
seq_file.c
signalfd.c signalfd: fill in ssi_int for posix timers and message queues 2010-08-11 08:59:20 -07:00
splice.c splice: fix misuse of SPLICE_F_NONBLOCK 2010-08-07 18:52:56 +02:00
stack.c
stat.c Mark arguments to certain syscalls as being const 2010-08-13 16:53:13 -07:00
statfs.c add f_flags to struct statfs(64) 2010-08-09 16:48:44 -04:00
super.c fs: scale files_lock 2010-08-18 08:35:48 -04:00
sync.c get rid of file_fsync() 2010-08-09 16:47:43 -04:00
timerfd.c fs/timerfd.c: make use of wait_event_interruptible_locked_irq() 2010-05-20 13:21:42 -07:00
utimes.c Mark arguments to certain syscalls as being const 2010-08-13 16:53:13 -07:00
xattr_acl.c
xattr.c fs: xattr_handler table should be const 2010-05-21 18:31:18 -04:00