Commit Graph

14587 Commits

Author SHA1 Message Date
Al Viro
224c886643 Fix adfs GET_FRAG_ID() on big-endian
Missing conversion to host-endian before doing shifts

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:14 -04:00
Al Viro
ffdc9064f8 repair adfs ->write_inode(), switch to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:13 -04:00
Al Viro
bea6b64c27 switch omfs to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:13 -04:00
Al Viro
90de066443 switch udf to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:13 -04:00
Al Viro
a932801543 switch ufs to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:13 -04:00
Al Viro
05459ca81a repair sysv_write_inode(), switch sysv to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
0d7916d7e9 switch minix to simple_fsync()
* get minix_write_inode() to honour the second argument
* now we can use simple_fsync() for minixfs

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
e1740a462e switch ext2 to simple_fsync()
kill ext2_sync_file() (along with ext2/fsync.c), get rid of
ext2_update_inode() - it's an alias of ext2_write_inode().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
b522412aea Sanitize ->fsync() for FAT
* mark directory data blocks as assoc. metadata
* add new inode to deal with FAT, mark FAT blocks as assoc. metadata of that
* now ->fsync() is trivial both for files and directories

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
964f536966 fs/qnx4: sanitize includes
fs-internal parts of qnx4_fs.h taken to fs/qnx4/qnx4.h, includes adjusted,
qnx4_fs.h doesn't need unifdef anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
79d2576758 Sanitize qnx4 fsync handling
* have directory operations use mark_buffer_dirty_inode(),
  so that sync_mapping_buffers() would get those.
* make qnx4_write_inode() honour its last argument.
* get rid of insane copies of very ancient "walk the indirect blocks"
  in qnx4/fsync - they never matched the actual fs layout and, fortunately,
  never'd been called.  Again, all this junk is not needed; ->fsync()
  should just do sync_mapping_buffers + sync_inode (and if we implement
  block allocation for qnx4, we'll need to use mark_buffer_dirty_inode()
  for extent blocks)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Al Viro
d5aacad548 New helper - simple_fsync()
writes associated buffers, then does sync_inode() to write
the inode itself (and to make it clean).  Depends on
->write_inode() honouring the second argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Alessio Igor Bogani
337eb00a2c Push BKL down into ->remount_fs()
[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Nick Piggin
4195f73d13 fs: block_dump missing dentry locking
I think the block_dump output in __mark_inode_dirty is missing dentry locking.
Surely the i_dentry list can change any time, so we may not even *get* a
dentry there. If we do get one by chance, then it would appear to be able to
go away or get renamed at any time...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:10 -04:00
Nick Piggin
545b9fd3d7 fs: remove incorrect I_NEW warnings
Some filesystems can call in to sync an inode that is still in the
I_NEW state (eg. ext family, when mounted with -osync). This is OK
because the filesystem has sole access to the new inode, so it can
modify i_state without races (because no other thread should be
modifying it, by definition of I_NEW). Ie. a false positive, so
remove the warnings.

The races are described here 7ef0d7377c,
which is also where the warnings were introduced.

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:10 -04:00
Christoph Hellwig
f95022161d xfs: remove ->write_super and stop maintaining ->s_dirt
the write_super method is used for

 (1) writing back the superblock periodically from pdflush
 (2) called just before ->sync_fs for data integerity syncs

We don't need (1) because we have our own peridoc writeout through xfssyncd,
and we don't need (2) because xfs_fs_sync_fs performs a proper synchronous
superblock writeout after all other data and metadata has been written out.

Also remove ->s_dirt tracking as it's only used to decide when too call
->write_super.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:10 -04:00
Jens Axboe
13205fb926 ntfs: remove old debug check for dirty data in ntfs_put_super()
This should not trigger anymore, so kill it.

Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:10 -04:00
Theodore Ts'o
9fd5746fd3 fs: Remove i_cindex from struct inode
The only user of the i_cindex element in the inode structure is used
is by the firewire drivers.  As part of an attempt to slim down the
inode structure to save memory --- since a typical Linux system will
have hundreds of thousands if not millions of inodes cached, a
reduction in the size inode has high leverage.

The firewire driver does not need i_cindex in any fast path, so it's
simple enough to calculate when it is needed, instead of wasting space
in the inode structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: krh@redhat.com
Cc: stefanr@s5r6.in-berlin.de
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Christoph Hellwig
ebc1ac1645 ->write_super lock_super pushdown
Push down lock_super into ->write_super instances and remove it from the
caller.

Following filesystem don't need ->s_lock in ->write_super and are skipped:

 * bfs, nilfs2 - no other uses of s_lock and have internal locks in
	->write_super
 * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
 * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
 	->write_super
 * xfs - no other uses of s_lock and uses internal lock (buffer lock on
	superblock buffer) to serialize ->write_super.  Also xfs_fs_write_super
	is superflous and will go away in the next merge window

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Christoph Hellwig
01ba687577 jffs2: move jffs2_write_super to super.c
jffs2_write_super is only called from super.c and doesn't use any
functionality from fs.c.  So move it over to super.c and make it
static there.

[should go in through the vfs tree as it is a requirement for the
 next patch]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Al Viro
4aa98cf768 Push BKL down into do_remount_sb()
[folded fix from Jiri Slaby]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Al Viro
7f78d4cd4c Push BKL down beyond VFS-only parts of do_mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Al Viro
6fac98dd21 Push BKL into do_mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Al Viro
bbd6851a32 Push lock_super() into the ->remount_fs() of filesystems that care about it
Note that since we can't run into contention between remount_fs and write_super
(due to exclusion on s_umount), we have to care only about filesystems that
touch lock_super() on their own.  Out of those ext3, ext4, hpfs, sysv and ufs
do need it; fat doesn't since its ->remount_fs() only accesses assign-once
data (basically, it's "we have no atime on directories and only have atime on
files for vfat; force nodiratime and possibly noatime into *flags").

[folded a build fix from hch]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Christoph Hellwig
6cfd014842 push BKL down into ->put_super
Move BKL into ->put_super from the only caller.  A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment.  Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.

[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro
a9e220f832 No need to do lock_super() for exclusion in generic_shutdown_super()
We can't run into contention on it.  All other callers of lock_super()
either hold s_umount (and we have it exclusive) or hold an active
reference to superblock in question, which prevents the call of
generic_shutdown_super() while the reference is held.  So we can
replace lock_super(s) with get_fs_excl() in generic_shutdown_super()
(and corresponding change for unlock_super(), of course).

Since ext4 expects s_lock held for its put_super, take lock_super()
into it.  The rest of filesystems do not care at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro
62c6943b4b Trim a bit of crap from fs.h
do_remount_sb() is fs/internal.h fodder, fsync_no_super() is long gone.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro
443b94baaa Make sure that all callers of remount hold s_umount exclusive
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Christoph Hellwig
5af7926ff3 enforce ->sync_fs is only called for rw superblock
Make sure a superblock really is writeable by checking MS_RDONLY
under s_umount.  sync_filesystems needed some re-arragement for
that, but all but one sync_filesystem caller had the correct locking
already so that we could add that check there.  cachefiles grew
s_umount locking.

I've also added a WARN_ON to sync_filesystem to assert this for
future callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Christoph Hellwig
e500475338 cleanup sync_supers
Merge the write_super helper into sync_super and move the check for
->write_super earlier so that we can avoid grabbing a reference to
a superblock that doesn't have it.

While we're at it also add a little comment documenting sync_supers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Alexey Dobriyan
f3da392e9f dcache: extrace and use d_unlinked()
d_unlinked() will be used in middle-term to ban checkpointing when opened
but unlinked file is detected, and in long term, to detect such situation
and special case on it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Christoph Hellwig
8c85e12512 remove ->write_super call in generic_shutdown_super
We just did a full fs writeout using sync_filesystem before, and if
that's not enough for the filesystem it can perform it's own writeout
in ->put_super, which many filesystems already do.

Move a call to foofs_write_super into every foofs_put_super for now to
guarantee identical behaviour until it's cleaned up by the individual
filesystem maintainers.

Exceptions:

 - affs already has identical copy & pasted code at the beginning of
   affs_put_super so no need to do it twice.
 - xfs does the right thing without it and I have changes pending for
   the xfs tree touching this are so I don't really need conflicts
   here..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Christoph Hellwig
517bfae283 qnx4: remove ->write_super
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Christoph Hellwig
94cb993f2e ocfs2: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Christoph Hellwig
b7d245de25 gfs2: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Christoph Hellwig
ca41f7b918 ext3: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Christoph Hellwig
59d697b702 btrfs: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Jan Kara
c3f8a40c1c quota: Introduce writeout_quota_sb() (version 4)
Introduce this function which just writes all the quota structures but
avoids all the syncing and cache pruning work to expose quota structures
to userspace. Use this function from __sync_filesystem when wait == 0.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:04 -04:00
Christoph Hellwig
850b201b08 quota: cleanup dquota sync functions (version 4)
Currently the VFS calls vfs_dq_sync to sync out disk quotas for a given
superblock.  This is a small wrapper around sync_dquots which for the
case of a non-NULL superblock is a small wrapper around quota_sync_sb.

Just make quota_sync_sb global (rename it to sync_quota_sb) and call it
directly.  Also call it directly for those cases in quota.c that have a
superblock and leave sync_dquots purely an iterator over sync_quota_sb and
remove it's superblock argument.

To make this nicer move the check for the lack of a quota_sync method
from the callers into sync_quota_sb.

[folded build fix from Alexander Beregalov <a.beregalov@gmail.com>]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:04 -04:00
Jan Kara
60b0680fa2 vfs: Rename fsync_super() to sync_filesystem() (version 4)
Rename the function so that it better describe what it really does. Also
remove the unnecessary include of buffer_head.h.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:04 -04:00
Jan Kara
c15c54f5f0 vfs: Move syncing code from super.c to sync.c (version 4)
Move sync_filesystems(), __fsync_super(), fsync_super() from
super.c to sync.c where it fits better.

[build fixes folded]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:04 -04:00
Jan Kara
5cee5815d1 vfs: Make sys_sync() use fsync_super() (version 4)
It is unnecessarily fragile to have two places (fsync_super() and do_sync())
doing data integrity sync of the filesystem. Alter __fsync_super() to
accommodate needs of both callers and use it. So after this patch
__fsync_super() is the only place where we gather all the calls needed to
properly send all data on a filesystem to disk.

Nice bonus is that we get a complete livelock avoidance and write_supers()
is now only used for periodic writeback of superblocks.

sync_blockdevs() introduced a couple of patches ago is gone now.

[build fixes folded]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:03 -04:00
Jan Kara
429479f031 vfs: Make __fsync_super() a static function (version 4)
__fsync_super() does the same thing as fsync_super(). So change the only
caller to use fsync_super() and make __fsync_super() static. This removes
unnecessarily duplicated call to sync_blockdev() and prepares ground
for the changes to __fsync_super() in the following patches.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:03 -04:00
Jan Kara
bfe881255c vfs: Call ->sync_fs() even if s_dirt is 0 (version 4)
sync_filesystems() has a condition that if wait == 0 and s_dirt == 0, then
->sync_fs() isn't called. This does not really make much sence since s_dirt is
generally used by a filesystem to mean that ->write_super() needs to be called.
But ->sync_fs() does different things. I even suspect that some filesystems
(btrfs?) sets s_dirt just to fool this logic.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:03 -04:00
Jan Kara
5a3e5cb8e0 vfs: Fix sys_sync() and fsync_super() reliability (version 4)
So far, do_sync() called:
  sync_inodes(0);
  sync_supers();
  sync_filesystems(0);
  sync_filesystems(1);
  sync_inodes(1);

This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
and others are hit by this) when racing e.g. with background writeback. A
similar problem hits also other filesystems (e.g. ext2) because of
write_supers() being called before the sync_inodes(1).

Change the ordering of calls in do_sync() - this requires a new function
sync_blockdevs() to preserve the property that block devices are always synced
after write_super() / sync_fs() call.

The same issue is fixed in __fsync_super() function used on umount /
remount read-only.

[AV: build fixes]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:03 -04:00
Christoph Hellwig
876a9f76ab remove s_async_list
Remove the unused s_async_list in the superblock, a leftover of the
broken async inode deletion code that leaked into mainline.  Having this
in the middle of the sync/unmount path is not helpful for the following
cleanups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
npiggin@suse.de
864d7c4c06 fs: move mark_files_ro into file_table.c
This function walks the s_files lock, and operates primarily on the
files in a superblock, so it better belongs here (eg. see also
fs_may_remount_ro).

[AV: ... and it shouldn't be static after that move]

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
npiggin@suse.de
96029c4e09 fs: introduce mnt_clone_write
This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.

Before:
 avg = 462.286
 std = 5.46106

After:
 avg = 453.12
 std = 9.58257

(50 runs of each, stddev gives a reasonable confidence)

It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.

After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).

[AV: mnt_want_write_file() should take file alone and derive mnt from it;
not only all callers have that form, but that's the only mnt about which
we know that it's already held for write if file is opened for write]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
npiggin@suse.de
d3ef3d7351 fs: mnt_want_write speedup
This patch speeds up lmbench lat_mmap test by about 8%. lat_mmap is set up
basically to mmap a 64MB file on tmpfs, fault in its pages, then unmap it.
A microbenchmark yes, but it exercises some important paths in the mm.

Before:
 avg = 501.9
 std = 14.7773

After:
 avg = 462.286
 std = 5.46106

(50 runs of each, stddev gives a reasonable confidence, but there is quite
a bit of variation there still)

It does this by removing the complex per-cpu locking and counter-cache and
replaces it with a percpu counter in struct vfsmount. This makes the code
much simpler, and avoids spinlocks (although the msync is still pretty
costly, unfortunately). It results in about 900 bytes smaller code too. It
does increase the size of a vfsmount, however.

It should also give a speedup on large systems if CPUs are frequently operating
on different mounts (because the existing scheme has to operate on an atomic in
the struct vfsmount when switching between mounts). But I'm most interested in
the single threaded path performance for the moment.

[AV: minor cleanup]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
Al Viro
3174c21b74 Move junk from proc_fs.h to fs/proc/internal.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
1c755af4df switch lookup_mnt()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
79ed022619 switch follow_mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
9393bd07cf switch follow_down()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
589ff870ed Switch collect_mounts() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
bab77ebf51 switch follow_up() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:00 -04:00
Al Viro
e64c390ca0 switch rqst_exp_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:00 -04:00
Al Viro
91c9fa8f75 switch rqst_exp_get_by_name()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:00 -04:00
Al Viro
5bf3bd2b5c switch exp_parent() to struct path
... and lose the always-NULL last argument (non-NULL case had been
split off a while ago).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:00 -04:00
Al Viro
55430e2ece nfsd struct path use: exp_get_by_name()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:59 -04:00
Al Viro
dd5cae6e97 Don't bother with check_mnt() in do_add_mount() on shrinkable ones
These guys are what we add as submounts; checks for "is that attached in
our namespace" are simply irrelevant for those and counterproductive for
use of private vfsmount trees a-la what NFS folks want.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:59 -04:00
Al Viro
5b85711953 Make vfs_path_lookup() use starting point as root
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:59 -04:00
Al Viro
2a73787110 Cache root in nameidata
New field: nd->root.  When pathname resolution wants to know the root,
check if nd->root.mnt is non-NULL; use nd->root if it is, otherwise
copy current->fs->root there.  After path_walk() is finished, we check
if we'd got a cached value in nd->root and drop it.  Before calling
path_walk() we should either set nd->root.mnt to NULL *or* copy (and
pin down) some path to nd->root.  In the latter case we won't be
looking at current->fs->root at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:59 -04:00
Al Viro
9b4a9b14a7 Preparations to caching root in path_walk()
Split do_path_lookup(), opencode the call from do_filp_open()
do_filp_open() is the only caller of do_path_lookup() that
cares about root afterwards (it keeps resolving symlinks on
O_CREAT path after it'd done LOOKUP_PARENT walk).  So when
we start caching fs->root in path_walk(), it'll need a different
treatment.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:58 -04:00
Al Viro
4e44b6852e Get rid of path_lookup in autofs4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:58 -04:00
Jeff Mahoney
73422811d2 reiserfs: allow exposing privroot w/ xattrs enabled
This patch adds an -oexpose_privroot option to allow access to the privroot.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:58 -04:00
Felix Blyakher
35fd035968 Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs 2009-06-11 16:56:49 -05:00
Linus Torvalds
a525890cb6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (23 commits)
  Btrfs: fix extent_buffer leak during tree log replay
  Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
  Btrfs: fix -o nodatasum printk spelling
  Btrfs: check duplicate backrefs for both data and metadata
  Btrfs: init worker struct fields before kthread-run
  Btrfs: pin buffers during write_dev_supers
  Btrfs: avoid races between super writeout and device list updates
  Fix btrfs when ACLs are configured out
  Btrfs: fdatasync should skip metadata writeout
  Btrfs: remove crc32c.h and use libcrc32c directly.
  Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
  Btrfs: autodetect SSD devices
  Btrfs: add mount -o ssd_spread to spread allocations out
  Btrfs: avoid allocation clusters that are too spread out
  Btrfs: Add mount -o nossd
  Btrfs: avoid IO stalls behind congested devices in a multi-device FS
  Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
  Btrfs: fix metadata dirty throttling limits
  Btrfs: reduce mount -o ssd CPU usage
  Btrfs: balance btree more often
  ...
2009-06-11 14:23:12 -07:00
Linus Torvalds
3bb66d7f8c Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  fsnotify: allow groups to set freeing_mark to null
  inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD
  dnotify: do not bother to lock entry->lock when reading mask
  dnotify: do not use ?true:false when assigning to a bool
  fsnotify: move events should indicate the event was on a child
  inotify: reimplement inotify using fsnotify
  fsnotify: handle filesystem unmounts with fsnotify marks
  fsnotify: fsnotify marks on inodes pin them in core
  fsnotify: allow groups to add private data to events
  fsnotify: add correlations between events
  fsnotify: include pathnames with entries when possible
  fsnotify: generic notification queue and waitq
  dnotify: reimplement dnotify using fsnotify
  fsnotify: parent event notification
  fsnotify: add marks to inodes so groups can interpret how to handle those inodes
  fsnotify: unified filesystem notification backend
2009-06-11 14:22:55 -07:00
Linus Torvalds
512626a04e Merge branch 'for-linus' of git://linux-arm.org/linux-2.6
* 'for-linus' of git://linux-arm.org/linux-2.6:
  kmemleak: Add the corresponding MAINTAINERS entry
  kmemleak: Simple testing module for kmemleak
  kmemleak: Enable the building of the memory leak detector
  kmemleak: Remove some of the kmemleak false positives
  kmemleak: Add modules support
  kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
  kmemleak: Add the vmalloc memory allocation/freeing hooks
  kmemleak: Add the slub memory allocation/freeing hooks
  kmemleak: Add the slob memory allocation/freeing hooks
  kmemleak: Add the slab memory allocation/freeing hooks
  kmemleak: Add documentation on the memory leak detector
  kmemleak: Add the base support

Manual conflict resolution (with the slab/earlyboot changes) in:
	drivers/char/vt.c
	init/main.c
	mm/slab.c
2009-06-11 14:15:57 -07:00
Linus Torvalds
8a1ca8cedd Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
  perf_counter: Turn off by default
  perf_counter: Add counter->id to the throttle event
  perf_counter: Better align code
  perf_counter: Rename L2 to LL cache
  perf_counter: Standardize event names
  perf_counter: Rename enums
  perf_counter tools: Clean up u64 usage
  perf_counter: Rename perf_counter_limit sysctl
  perf_counter: More paranoia settings
  perf_counter: powerpc: Implement generalized cache events for POWER processors
  perf_counters: powerpc: Add support for POWER7 processors
  perf_counter: Accurate period data
  perf_counter: Introduce struct for sample data
  perf_counter tools: Normalize data using per sample period data
  perf_counter: Annotate exit ctx recursion
  perf_counter tools: Propagate signals properly
  perf_counter tools: Small frequency related fixes
  perf_counter: More aggressive frequency adjustment
  perf_counter/x86: Fix the model number of Intel Core2 processors
  perf_counter, x86: Correct some event and umask values for Intel processors
  ...
2009-06-11 14:01:07 -07:00
Eric Paris
a092ee20fd fsnotify: allow groups to set freeing_mark to null
Most fsnotify listeners (all but inotify) do not care about marks being
freed.  Allow groups to set freeing_mark to null and do not call any
function if it is set that way.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11 14:57:55 -04:00
Eric Paris
e42e27736d inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD
inotify and dnotify will both indicate that they want any event which came
from a child inode.  The fix is to mask off FS_EVENT_ON_CHILD when deciding
if inotify or dnotify is interested in a given event.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11 14:57:54 -04:00
Eric Paris
ce61856bd2 dnotify: do not bother to lock entry->lock when reading mask
entry->lock is needed to make sure entry->mask does not change while
manipulating it.  In dnotify_should_send_event() we don't care if we get an
old or a new mask value out of this entry so there is no point it taking
the lock.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11 14:57:54 -04:00
Eric Paris
5ac697b793 dnotify: do not use ?true:false when assigning to a bool
dnotify_should send event assigned a bool using ?true:false when computing
a bit operation.  This is poitless and the bool type does this for us.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11 14:57:54 -04:00
Eric Paris
63c882a054 inotify: reimplement inotify using fsnotify
Reimplement inotify_user using fsnotify.  This should be feature for feature
exactly the same as the original inotify_user.  This does not make any changes
to the in kernel inotify feature used by audit.  Those patches (and the eventual
removal of in kernel inotify) will come after the new inotify_user proves to be
working correctly.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
164bc61951 fsnotify: handle filesystem unmounts with fsnotify marks
When an fs is unmounted with an fsnotify mark entry attached to one of its
inodes we need to destroy that mark entry and we also (like inotify) send
an unmount event.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
1ef5f13c6c fsnotify: fsnotify marks on inodes pin them in core
This patch pins any inodes with an fsnotify mark in core.  The idea is that
as soon as the mark is removed from the inode->fsnotify_mark_entries list
the inode will be iput.  In reality is doesn't quite work exactly this way.
The igrab will happen when the mark is added to an inode, but the iput will
happen when the inode pointer is NULL'd inside the mark.

It's possible that 2 racing things will try to remove the mark from
different directions.  One may try to remove the mark because of an
explicit request and one might try to remove it because the inode was
deleted.  It's possible that the removal because of inode deletion will
remove the mark from the inode's list, but the removal by explicit request
will actually set entry->inode == NULL; and call the iput.  This is safe.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
e4aff11736 fsnotify: allow groups to add private data to events
inotify needs per group information attached to events.  This patch allows
groups to attach private information and implements a callback so that
information can be freed when an event is being destroyed.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
47882c6f51 fsnotify: add correlations between events
As part of the standard inotify events it includes a correlation cookie
between two dentry move operations.  This patch includes the same behaviour
in fsnotify events.  It is needed so that inotify userspace can be
implemented on top of fsnotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
62ffe5dfba fsnotify: include pathnames with entries when possible
When inotify wants to send events to a directory about a child it includes
the name of the original file.  This patch collects that filename and makes
it available for notification.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
a2d8bc6cb4 fsnotify: generic notification queue and waitq
inotify needs to do asyc notification in which event information is stored
on a queue until the listener is ready to receive it.  This patch
implements a generic notification queue for inotify (and later fanotify) to
store events to be sent at a later time.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
3c5119c05d dnotify: reimplement dnotify using fsnotify
Reimplement dnotify using fsnotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
c28f7e56e9 fsnotify: parent event notification
inotify and dnotify both use a similar parent notification mechanism.  We
add a generic parent notification mechanism to fsnotify for both of these
to use.  This new machanism also adds the dentry flag optimization which
exists for inotify to dnotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
3be25f49b9 fsnotify: add marks to inodes so groups can interpret how to handle those inodes
This patch creates a way for fsnotify groups to attach marks to inodes.
These marks have little meaning to the generic fsnotify infrastructure
and thus their meaning should be interpreted by the group that attached
them to the inode's list.

dnotify and inotify  will make use of these markings to indicate which
inodes are of interest to their respective groups.  But this implementation
has the useful property that in the future other listeners could actually
use the marks for the exact opposite reason, aka to indicate which inodes
it had NO interest in.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
90586523eb fsnotify: unified filesystem notification backend
fsnotify is a backend for filesystem notification.  fsnotify does
not provide any userspace interface but does provide the basis
needed for other notification schemes such as dnotify.  fsnotify
can be extended to be the backend for inotify or the upcoming
fanotify.  fsnotify provides a mechanism for "groups" to register for
some set of filesystem events and to then deliver those events to
those groups for processing.

fsnotify has a number of benefits, the first being actually shrinking the size
of an inode.  Before fsnotify to support both dnotify and inotify an inode had

        unsigned long           i_dnotify_mask; /* Directory notify events */
        struct dnotify_struct   *i_dnotify; /* for directory notifications */
        struct list_head        inotify_watches; /* watches on this inode */
        struct mutex            inotify_mutex;  /* protects the watches list

But with fsnotify this same functionallity (and more) is done with just

        __u32                   i_fsnotify_mask; /* all events for this inode */
        struct hlist_head       i_fsnotify_mark_entries; /* marks on this inode */

That's right, inotify, dnotify, and fanotify all in 64 bits.  We used that
much space just in inotify_watches alone, before this patch set.

fsnotify object lifetime and locking is MUCH better than what we have today.
inotify locking is incredibly complex.  See 8f7b0ba1c8 as an example of
what's been busted since inception.  inotify needs to know internal semantics
of superblock destruction and unmounting to function.  The inode pinning and
vfs contortions are horrible.

no fsnotify implementers do allocation under locks.  This means things like
f04b30de3 which (due to an overabundance of caution) changes GFP_KERNEL to
GFP_NOFS can be reverted.  There are no longer any allocation rules when using
or implementing your own fsnotify listener.

fsnotify paves the way for fanotify.  In brief fanotify is a notification
mechanism that delivers the lisener both an 'event' and an open file descriptor
to the object in question.  This means that fanotify is pathname agnostic.
Some on lkml may not care for the original companies or users that pushed for
TALPA, but fanotify was designed with flexibility and input for other users in
mind.  The readahead group expressed interest in fanotify as it could be used
to profile disk access on boot without breaking the audit system.  The desktop
search groups have also expressed interest in fanotify as it solves a number
of the race conditions and problems present with managing inotify when more
than a limited number of specific files are of interest.  fanotify can provide
for a userspace access control system which makes it a clean interface for AV
vendors to hook without trying to do binary patching on the syscall table,
LSM, and everywhere else they do their things today.  With this patch series
fanotify can be implemented in less than 1200 lines of easy to review code.
Almost all of which is the socket based user interface.

This patch series builds fsnotify to the point that it can implement
dnotify and inotify_user.  Patches exist and will be sent soon after
acceptance to finish the in kernel inotify conversion (audit) and implement
fanotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:52 -04:00
Linus Torvalds
871fa90791 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
  jfs: Add missing mutex_unlock call to error path
  missing unlock in jfs_quota_write()
2009-06-11 11:27:09 -07:00
Linus Torvalds
c9059598ea Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
  block: add request clone interface (v2)
  floppy: fix hibernation
  ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
  fs/bio.c: add missing __user annotation
  block: prevent possible io_context->refcount overflow
  Add serial number support for virtio_blk, V4a
  block: Add missing bounce_pfn stacking and fix comments
  Revert "block: Fix bounce limit setting in DM"
  cciss: decode unit attention in SCSI error handling code
  cciss: Remove no longer needed sendcmd reject processing code
  cciss: change SCSI error handling routines to work with interrupts enabled.
  cciss: separate error processing and command retrying code in sendcmd_withirq_core()
  cciss: factor out fix target status processing code from sendcmd functions
  cciss: simplify interface of sendcmd() and sendcmd_withirq()
  cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
  cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
  block: needs to set the residual length of a bidi request
  Revert "block: implement blkdev_readpages"
  block: Fix bounce limit setting in DM
  Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
  ...

Manually fix conflicts with tracing updates in:
	block/blk-sysfs.c
	drivers/ide/ide-atapi.c
	drivers/ide/ide-cd.c
	drivers/ide/ide-floppy.c
	drivers/ide/ide-tape.c
	include/trace/events/block.h
	kernel/trace/blktrace.c
2009-06-11 11:10:35 -07:00
Linus Torvalds
0a33f80a83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (25 commits)
  GFS2: Merge gfs2_get_sb into gfs2_get_sb_meta
  GFS2: Fix cache coherency between truncate and O_DIRECT read
  GFS2: Fix locking issue mounting gfs2meta fs
  GFS2: Remove unused variable
  GFS2: smbd proccess hangs with flock() call.
  GFS2: Remove args subdir from gfs2 sysfs files
  GFS2: Remove lockstruct subdir from gfs2 sysfs files
  GFS2: Move gfs2_unlink_ok into ops_inode.c
  GFS2: Move gfs2_readlinki into ops_inode.c
  GFS2: Move gfs2_rmdiri into ops_inode.c
  GFS2: Merge mount.c and ops_super.c into super.c
  GFS2: Clean up some file names
  GFS2: Be more aggressive in reclaiming unlinked inodes
  GFS2: Add a rgrp bitmap full flag
  GFS2: Improve resource group error handling
  GFS2: Don't warn when delete inode fails on ro filesystem
  GFS2: Update docs
  GFS2: Umount recovery race fix
  GFS2: Remove a couple of unused sysfs entries
  GFS2: Add commit= mount option
  ...
2009-06-11 10:36:12 -07:00
Linus Torvalds
ddbb868493 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: remove never-used in6_addr option
  cifs: add addr= mount option alias for ip=
  [CIFS] Add mention of new mount parm (forceuid) to cifs readme
  cifs: make overriding of ownership conditional on new mount options
  cifs: fix IPv6 address length check
  cifs: clean up set_cifs_acl interfaces
  cifs: reorganize get_cifs_acl
  [CIFS] Update readme to indicate change to default mount (serverino)
  cifs: make serverino the default when mounting
  cifs: rename cifs_iget to cifs_root_iget
  cifs: make cnvrtDosUnixTm take a little-endian args and an offset
  cifs: have cifs_NTtimeToUnix take a little-endian arg
  cifs: tighten up default file_mode/dir_mode
  cifs: fix artificial limit on reading symlinks
2009-06-11 10:02:46 -07:00
Linus Torvalds
3296ca27f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (44 commits)
  nommu: Provide mmap_min_addr definition.
  TOMOYO: Add description of lists and structures.
  TOMOYO: Remove unused field.
  integrity: ima audit dentry_open failure
  TOMOYO: Remove unused parameter.
  security: use mmap_min_addr indepedently of security models
  TOMOYO: Simplify policy reader.
  TOMOYO: Remove redundant markers.
  SELinux: define audit permissions for audit tree netlink messages
  TOMOYO: Remove unused mutex.
  tomoyo: avoid get+put of task_struct
  smack: Remove redundant initialization.
  integrity: nfsd imbalance bug fix
  rootplug: Remove redundant initialization.
  smack: do not beyond ARRAY_SIZE of data
  integrity: move ima_counts_get
  integrity: path_check update
  IMA: Add __init notation to ima functions
  IMA: Minimal IMA policy and boot param for TCB IMA policy
  selinux: remove obsolete read buffer limit from sel_read_bool
  ...
2009-06-11 10:01:41 -07:00
Linus Torvalds
e893123c73 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (49 commits)
  ext4: Avoid corrupting the uninitialized bit in the extent during truncate
  ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
  ext4: fix dx_map_entry to support 256k directory blocks
  ext4: truncate the file properly if we fail to copy data from userspace
  ext4: Avoid leaking blocks after a block allocation failure
  ext4: Change all super.c messages to print the device
  ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
  ext4: super.c whitespace cleanup
  jbd2: Fix minor typos in comments in fs/jbd2/journal.c
  ext4: Clean up calls to ext4_get_group_desc()
  ext4: remove unused function __ext4_write_dirty_metadata
  ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
  ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
  ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
  ext4: down i_data_sem only for read when walking tree for fiemap
  ext4: Add a comprehensive block validity check to ext4_get_blocks()
  ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
  ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
  ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
  ext4: Add documentation to the ext4_*get_block* functions
  ...
2009-06-11 10:00:50 -07:00
Catalin Marinas
2e1483c995 kmemleak: Remove some of the kmemleak false positives
There are allocations for which the main pointer cannot be found but
they are not memory leaks. This patch fixes some of them. For more
information on false positives, see Documentation/kmemleak.txt.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-11 17:04:18 +01:00
Linus Torvalds
49c355617f Merge branch 'serial-from-alan'
* serial-from-alan: (79 commits)
  moxa: prevent opening unavailable ports
  imx: serial: use tty_encode_baud_rate to set true rate
  imx: serial: add IrDA support to serial driver
  imx: serial: use rational library function
  lib: isolate rational fractions helper function
  imx: serial: handle initialisation failure correctly
  imx: serial: be sure to stop xmit upon shutdown
  imx: serial: notify higher layers in case xmit IRQ was not called
  imx: serial: fix one bit field type
  imx: serial: fix whitespaces (no changes in functionality)
  tty: use prepare/finish_wait
  tty: remove sleep_on
  sierra: driver interface blacklisting
  sierra: driver urb handling improvements
  tty: resolve some sierra breakage
  timbuart: Fix the termios logic
  serial: Added Timberdale UART driver
  tty: Add URL for ttydev queue
  devpts: unregister the file system on error
  tty: Untangle termios and mm mutex dependencies
  ...
2009-06-11 08:57:47 -07:00
Ingo Molnar
940010c5a3 Merge branch 'linus' into perfcounters/core
Conflicts:
	arch/x86/kernel/irqinit.c
	arch/x86/kernel/irqinit_64.c
	arch/x86/kernel/traps.c
	arch/x86/mm/fault.c
	include/linux/sched.h
	kernel/exit.c
2009-06-11 17:55:42 +02:00
Alan Cox
93d5581e20 devpts: unregister the file system on error
Closes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=13429

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-11 08:51:06 -07:00
Chris Mason
b263c2c8bf Btrfs: fix extent_buffer leak during tree log replay
During tree log replay, we read in the tree log roots,
process them and then free them.  A recent change
takes an extra reference on the root node of the tree
when the root is read in, and stores that reference
in root->commit_root.

This reference was not being freed, leaving us with
one buffer pinned in ram for each subvol with
a tree log root after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:24:47 -04:00
Chris Mason
0b4dcea579 Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
This happens during subvol creation.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:13:35 -04:00
Chris Mason
067c28adc5 Btrfs: fix -o nodatasum printk spelling
It was printing nodatacsum, which was not the correct option name.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 09:30:13 -04:00
Yan Zheng
85d4198e40 Btrfs: check duplicate backrefs for both data and metadata
lookup_inline_extent_backref only checks for duplicate backref for data
extents. It assumes backrefs for tree block never conflict.

This patch makes lookup_inline_extent_backref check for duplicate backrefs
for both data and tree block, so that we can detect potential bug earlier.
This is a safety check, strictly speaking it is not required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 08:51:34 -04:00
Linus Torvalds
8623661180 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
  Revert "x86, bts: reenable ptrace branch trace support"
  tracing: do not translate event helper macros in print format
  ftrace/documentation: fix typo in function grapher name
  tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
  tracing: add protection around module events unload
  tracing: add trace_seq_vprint interface
  tracing: fix the block trace points print size
  tracing/events: convert block trace points to TRACE_EVENT()
  ring-buffer: fix ret in rb_add_time_stamp
  ring-buffer: pass in lockdep class key for reader_lock
  tracing: add annotation to what type of stack trace is recorded
  tracing: fix multiple use of __print_flags and __print_symbolic
  tracing/events: fix output format of user stack
  tracing/events: fix output format of kernel stack
  tracing/trace_stack: fix the number of entries in the header
  ring-buffer: discard timestamps that are at the start of the buffer
  ring-buffer: try to discard unneeded timestamps
  ring-buffer: fix bug in ring_buffer_discard_commit
  ftrace: do not profile functions when disabled
  tracing: make trace pipe recognize latency format flag
  ...
2009-06-10 19:53:40 -07:00
James Morris
73fbad283c Merge branch 'next' into for-linus 2009-06-11 11:03:14 +10:00
Shin Hong
fd0fb038d5 Btrfs: init worker struct fields before kthread-run
This patch fixes a bug which may result race condition
between btrfs_start_workers() and worker_loop().

btrfs_start_workers() executed in a parent thread writes
on workers->worker and worker_loop() in a child thread
reads workers->worker. However, there is no synchronization
enforcing the order of two operations.

This patch makes btrfs_start_workers() fill workers->worker
before it starts a child thread with worker_loop()

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 20:11:29 -04:00
Linus Torvalds
99e97b860e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: fix typo in sched-rt-group.txt file
  ftrace: fix typo about map of kernel priority in ftrace.txt file.
  sched: properly define the sched_group::cpumask and sched_domain::span fields
  sched, timers: cleanup avenrun users
  sched, timers: move calc_load() to scheduler
  sched: Don't export sched_mc_power_savings on multi-socket single core system
  sched: emit thread info flags with stack trace
  sched: rt: document the risk of small values in the bandwidth settings
  sched: Replace first_cpu() with cpumask_first() in ILB nomination code
  sched: remove extra call overhead for schedule()
  sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus())
  wait: don't use __wake_up_common()
  sched: Nominate a power-efficient ilb in select_nohz_balancer()
  sched: Nominate idle load balancer from a semi-idle package.
  sched: remove redundant hierarchy walk in check_preempt_wakeup
2009-06-10 15:32:59 -07:00
Michal Simek
0e0c62123b fs/bio.c: add missing __user annotation
As reported by sparse:

fs/bio.c:720:13: warning: incorrect type in assignment (different address spaces)
fs/bio.c:720:13:    expected char *iov_addr
fs/bio.c:720:13:    got void [noderef] <asn:1>*
fs/bio.c:724:36: warning: incorrect type in argument 2 (different address spaces)
fs/bio.c:724:36:    expected void const [noderef] <asn:1>*from
fs/bio.c:724:36:    got char *iov_addr

Signed-off-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-10 23:07:15 +02:00
Hisashi Hifumi
4eedeb75e7 Btrfs: pin buffers during write_dev_supers
write_dev_supers is called in sequence.  First is it called with wait == 0,
which starts IO on all of the super blocks for a given device.  Then it is
called with wait == 1 to make sure they all reach the disk.

It doesn't currently pin the buffers between the two calls, and it also
assumes the buffers won't go away between the two calls, leading to
an oops if the VM manages to free the buffers in the middle of the sync.

This fixes that assumption and updates the code to return an error if things
are not up to date when the wait == 1 run is done.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 16:49:25 -04:00
Chris Mason
e5e9a5206a Btrfs: avoid races between super writeout and device list updates
On multi-device filesystems, btrfs writes supers to all of the devices
before considering a sync complete.  There wasn't any additional
locking between super writeout and the device list management code
because device management was done inside a transaction and
super writeout only happened  with no transation writers running.

With the btrfs fsync log and other async transaction updates, this
has been racey for some time.  This adds a mutex to protect
the device list.  The existing volume mutex could not be reused due to
transaction lock ordering requirements.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 15:17:02 -04:00
Jeff Layton
61b6bc525a cifs: remove never-used in6_addr option
This option was never used to my knowledge. Remove it before someone
does...

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-10 18:34:35 +00:00
Aneesh Kumar K.V
a41f207169 ext4: Avoid corrupting the uninitialized bit in the extent during truncate
The unitialized bit was not properly getting preserved in in an extent
which is partially truncated because the it was geting set to the
value of the first extent to be removed or truncated as part of the
truncate operation, and if there are multiple extents are getting
removed or modified as part of the truncate operation, it is only the
last extent which will might be partially truncated, and its
uninitalized bit is not necessarily the same as the first extent to be
truncated.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-10 14:22:55 -04:00
Jeff Layton
58f7f68f22 cifs: add addr= mount option alias for ip=
When you look in /proc/mounts, the address of the server gets displayed
as "addr=". That's really a better option to use anyway since it's more
generic. What if we eventually want to support non-IP transports? It
also makes CIFS option consistent with the NFS option of the same name.

Begin the migration to that option name by adding an alias for ip=
called addr=.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-10 15:39:14 +00:00
Al Viro
7df336ec12 Fix btrfs when ACLs are configured out
... otherwise generic_permission() will allow *anything* for all
files you don't own and that have some group permissions.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:36:43 -04:00
Hisashi Hifumi
524724ed1f Btrfs: fdatasync should skip metadata writeout
In btrfs, fdatasync and fsync are identical, but
fdatasync should skip committing transaction when
inode->i_state is set just I_DIRTY_SYNC and this indicates
only atime or/and mtime updates.
Following patch improves fdatasync throughput.

--file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
--file-fsync-mode=fdatasync run

Results:
-2.6.30-rc8
Test execution summary:
    total time:                          1980.6540s
    total number of events:              10001
    total time taken by event execution: 1192.9804
    per-request statistics:
         min:                            0.0000s
         avg:                            0.1193s
         max:                            15.3720s
         approx.  95 percentile:         0.7257s

Threads fairness:
    events (avg/stddev):           625.0625/151.32
    execution time (avg/stddev):   74.5613/9.46

-2.6.30-rc8-patched
Test execution summary:
    total time:                          1695.9118s
    total number of events:              10000
    total time taken by event execution: 871.3214
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0871s
         max:                            10.4644s
         approx.  95 percentile:         0.4787s

Threads fairness:
    events (avg/stddev):           625.0000/131.86
    execution time (avg/stddev):   54.4576/8.98

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
David Woodhouse
163e783e6a Btrfs: remove crc32c.h and use libcrc32c directly.
There's no need to preserve this abstraction; it used to let us use
hardware crc32c support directly, but libcrc32c is already doing that for us
through the crypto API -- so we're already using the Intel crc32c
acceleration where appropriate.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
Christoph Hellwig
6cbff00f46 Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Add support for the standard attributes set via chattr and read via
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
c289811cc0 Btrfs: autodetect SSD devices
During mount, btrfs will check the queue nonrot flag
for all the devices found in the FS.  If they are all
non-rotating, SSD mode is enabled by default.

If the FS was mounted with -o nossd, the non-rotating
flag is ignored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
451d7585a8 Btrfs: add mount -o ssd_spread to spread allocations out
Some SSDs perform best when reusing block numbers often, while
others perform much better when clustering strictly allocates
big chunks of unused space.

The default mount -o ssd will find rough groupings of blocks
where there are a bunch of free blocks that might have some
allocated blocks mixed in.

mount -o ssd_spread will make sure there are no allocated blocks
mixed in.  It should perform better on lower end SSDs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
c604480171 Btrfs: avoid allocation clusters that are too spread out
In SSD mode for data, and all the time for metadata the allocator
will try to find a cluster of nearby blocks for allocations.  This
commit adds extra checks to make sure that each free block in the
cluster is close to the last one.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:51 -04:00
Chris Mason
3b30c22f64 Btrfs: Add mount -o nossd
This allows you to turn off the ssd mode via remount.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:50 -04:00
Chris Mason
d644d8a1e3 Btrfs: avoid IO stalls behind congested devices in a multi-device FS
The btrfs IO submission threads try to service a bunch of devices with a small
number of threads.  They do a congestion check to try and avoid waiting
on requests for a busy device.

The checks make sure we've sent a few requests down to a given device just so
that we aren't bouncing between busy devices without actually sending down
any IO.  The counter used to decide if we can switch to the next device
is somewhat overloaded.  It is also being used to decide if we've done
a good batch of requests between the WRITE_SYNC or regular priority lists.
It may get reset to zero often, leaving us hammering on a busy device
instead of moving on to another disk.

This commit adds a new counter for the number of bios sent while
servicing a device.  It doesn't get reset or fiddled with.  On
multi-device filesystems, this fixes IO stalls in streaming
write workloads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:49 -04:00
Chris Mason
d84275c938 Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
Btrfs uses dedicated threads to submit bios when checksumming is on,
which allows us to make sure the threads dedicated to checksumming don't get
stuck waiting for requests.  For each btrfs device, there are
two lists of bios.  One list is for WRITE_SYNC bios and the other
is for regular priority bios.

The IO submission threads used to process all of the WRITE_SYNC bios first and
then switch to the regular bios.  This commit makes sure we don't completely
starve the regular bios by rotating between the two lists.

WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries
to run in batches to avoid seeking.  Benchmarking shows this eliminates
stalls during streaming buffered writes on both multi-device and
single device filesystems.

If the regular bios starve, the system can end up with a large amount of ram
pinned down in writeback pages.  If we are a little more fair between the two
classes, we're able to keep throughput up and make progress on the bulk of
our dirty ram.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:49 -04:00
Chris Mason
585ad2c379 Btrfs: fix metadata dirty throttling limits
Once a metadata block has been written, it must be recowed, so the
btrfs dirty balancing call has a check to make sure a fair amount of metadata
was actually dirty before it started writing it back to disk.

A previous commit had changed the dirty tracking for metadata without
updating the btrfs dirty balancing checks.  This commit switches it
to use the correct counter.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:48 -04:00
Chris Mason
2c943de6ad Btrfs: reduce mount -o ssd CPU usage
The block allocator in SSD mode will try to find groups of free blocks
that are close together.  This commit makes it loop less on a given
group size before bumping it.

The end result is that we are less likely to fill small holes in the
available free space, but we don't waste as much CPU building the
large cluster used by ssd mode.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:48 -04:00
Chris Mason
cfbb930846 Btrfs: balance btree more often
With the new back reference code, the cost of a balance has gone down
in terms of the number of back reference updates done.  This commit
makes us more aggressively balance leaves and nodes as they become
less full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Chris Mason
b361242102 Btrfs: stop avoiding balancing at the end of the transaction.
When the delayed reference code was added, some checks were added
to avoid extra balancing while the delayed references were being flushed.
This made for less efficient btrees, but it reduced the chances of
loops where no forward progress was made because the balances made
more delayed ref updates.

With the new dead root removal code and the mixed back references,
the extent allocation tree is no longer using precise back refs, and
the delayed reference updates don't carry the risk of looping forever
anymore.  So, the balance avoidance is no longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Yan Zheng
5d4f98a28c Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Yan Zheng
5c939df56c btrfs: Fix set/clear_extent_bit for 'end == (u64)-1'
There are some 'start = state->end + 1;' like code in set_extent_bit
and clear_extent_bit. They overflow when end == (u64)-1.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Christoph Hellwig
ef14f0c157 xfs: use generic Posix ACL code
This patch rips out the XFS ACL handling code and uses the generic
fs/posix_acl.c code instead.  The ondisk format is of course left
unchanged.

This also introduces the same ACL caching all other Linux filesystems do
by adding pointers to the acl and default acl in struct xfs_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-10 17:07:47 +02:00
Ryusuke Konishi
c3a7abf06c nilfs2: support contiguous lookup of blocks
Although get_block() callback function can return extent of contiguous
blocks with bh->b_size, nilfs_get_block() function did not support
this feature.

This adds contiguous lookup feature to the block mapping codes of
nilfs, and allows the nilfs_get_blocks() function to return the extent
information by applying the feature.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:12 +09:00
Ryusuke Konishi
fa032744ad nilfs2: add sync_page method to page caches of meta data
This applies block_sync_page() function to the sync_page method of
page caches for meta data files, gc page caches, and btree node
buffers.  This is a companion patch of ("nilfs2: enable sync_page
mothod") which applied the function for data pages.

This allows lock_page() for those meta data to unplug pending bio
requests.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:12 +09:00
Ryusuke Konishi
a53b4751ae nilfs2: use device's backing_dev_info for btree node caches
Previously, default_backing_dev_info was used for the mapping of btree
node caches.  This uses device dependent backing_dev_info to allow
detailed control of the device for the btree node pages.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:12 +09:00
Ryusuke Konishi
30c25be71f nilfs2: return EBUSY against delete request on snapshot
This helps userland programs like the rmcp command to distinguish
error codes returned against a checkpoint removal request.

Previously -EPERM was returned, and not discriminable from real
permission errors.  This also allows removal of the latest checkpoint
because the deletion leads to create a new checkpoint, and thus it's
harmless for the filesystem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:12 +09:00
Ryusuke Konishi
e85dc1d529 nilfs2: enable sync_page method
This adds a missing sync_page method which unplugs bio requests when
waiting for page locks. This will improve read performance of nilfs.

Here is a measurement result using dd command.

Without this patch:

 # mount -t nilfs2 /dev/sde1 /test
 # dd if=/test/aaa of=/dev/null bs=512k
 1024+0 records in
 1024+0 records out
 536870912 bytes (537 MB) copied, 6.00688 seconds, 89.4 MB/s

With this patch:

 # mount -t nilfs2 /dev/sde1 /test
 # dd if=/test/aaa of=/dev/null bs=512k
 1024+0 records in
 1024+0 records out
 536870912 bytes (537 MB) copied, 3.54998 seconds, 151 MB/s

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
30bda0b8ae nilfs2: set bio unplug flag for the last bio in segment
This sets BIO_RW_UNPLUG flag on the last bio of each segment during
write.  The last bio should be unplugged immediately because the
caller waits for the completion after the submission.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
003ff182fd nilfs2: allow future expansion of metadata read out via get info ioctl
Nilfs has some ioctl commands to read out metadata from meta data
files:

 - NILFS_IOCTL_GET_CPINFO for checkpoint file,
 - NILFS_IOCTL_GET_SUINFO for segment usage file, and
 - NILFS_IOCTL_GET_VINFO for Disk Address Transalation (DAT) file,
   respectively.

Every routine on these metadata files is implemented so that it allows
future expansion of on-disk format.  But, the above ioctl commands do
not support expansion even though nilfs_argv structure can handle
arbitrary size for data exchanged via ioctl.

This allows future expansion of the following structures which give
basic format of the "get information" ioctls:

 - struct nilfs_cpinfo
 - struct nilfs_suinfo
 - struct nilfs_vinfo

So, this introduces forward compatility of such ioctl commands.

In this patch, a sanity check in nilfs_ioctl_get_info() function is
changed to accept larger data structure [1], and metadata read
routines are rewritten so that they become compatible for larger
structures; the routines will just ignore the remaining fields which
the current version of nilfs doesn't know.

[1] The ioctl function already has another upper limit (PAGE_SIZE
    against a structure, which appears in nilfs_ioctl_wrap_copy
    function), and this will not cause security problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Hisashi Hifumi
258ef67e24 NILFS2: Pagecache usage optimization on NILFS2
Hi,

I introduced "is_partially_uptodate" aops for NILFS2.

A page can have multiple buffers and even if a page is not uptodate, some buffers
can be uptodate on pagesize != blocksize environment.
This aops checks that all buffers which correspond to a part of a file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to HDD even if a page is not uptodate because the portion we
want to read are uptodate.
"block_is_partially_uptodate" function is already used by ext2/3/4.
With the following patch random read/write mixed workloads or random read after
random write workloads can be optimized and we can get performance improvement.

I did a performance test using the sysbench.

1 --file-block-size=8K --file-total-size=2G --file-test-mode=rndrw --file-fsync-freq=0 --fil
e-rw-ratio=1 run

-2.6.30-rc5

Test execution summary:
    total time:                          151.2907s
    total number of events:              200000
    total time taken by event execution: 2409.8387
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0120s
         max:                            0.9306s
         approx.  95 percentile:         0.0439s

Threads fairness:
    events (avg/stddev):           12500.0000/238.52
    execution time (avg/stddev):   150.6149/0.01

-2.6.30-rc5-patched

Test execution summary:
    total time:                          140.8828s
    total number of events:              200000
    total time taken by event execution: 2240.8577
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0112s
         max:                            0.8750s
         approx.  95 percentile:         0.0418s

Threads fairness:
    events (avg/stddev):           12500.0000/218.43
    execution time (avg/stddev):   140.0536/0.01

arch: ia64
pagesize: 16k

Thanks.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
7cde31d7d6 nilfs2: remove nilfs_btree_operations from btree mapping
will remove indirect function calls using nilfs_btree_operations
table.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
355c6b6103 nilfs2: remove nilfs_direct_operations from direct mapping
will remove indirect function calls using nilfs_direct_operations
table.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
d4b961576d nilfs2: remove bmap pointer operations
Previously, the bmap codes of nilfs used three types of function
tables.  The abuse of indirect function calls decreased source
readability and suffered many indirect jumps which would confuse
branch prediction of processors.

This eliminates one type of the function tables,
nilfs_bmap_ptr_operations, which was used to dispatch low level
pointer operations of the nilfs bmap.

This adds a new integer variable "b_ptr_type" to nilfs_bmap struct,
and uses the value to select the pointer operations.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
3033342a0b nilfs2: remove useless b_low and b_high fields from nilfs_bmap struct
This will cut off 16 bytes from the nilfs_bmap struct which is
embedded in the on-memory inode of nilfs.

The b_high field was never used, and the b_low field stores a constant
value which can be determined by whether the inode uses btree for
block mapping or not.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
e473c1f265 nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr function
This indirect function is set to NULL only for gc cache inodes, but
the gc cache inodes never call this function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
f198dbb9cf nilfs2: move get block functions in bmap.c into btree codes
Two get block function for btree nodes, nilfs_bmap_get_block() and
nilfs_bmap_get_new_block(), are called only from the btree codes.
This relocation will increase opportunities of compiler optimization.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
9f098900ad nilfs2: remove nilfs_bmap_delete_block
nilfs_bmap_delete_block() is a wrapper function calling
nilfs_btnode_delete().  This removes it for simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
087d01b425 nilfs2: remove nilfs_bmap_put_block
nilfs_bmap_put_block() is a wrapper function calling brelse().  This
eliminates the wrapper for simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
654137dd46 nilfs2: remove header file for segment list operations
This will eliminate obsolete list operations of nilfs_segment_entry
structure which has been used to handle mutiple segment numbers.

The patch ("nilfs2: remove list of freeing segments") removed use of
the structure from the segment constructor code, and this patch
simplifies the remaining code by integrating it into recovery.c.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Ryusuke Konishi
071cb4b819 nilfs2: eliminate removal list of segments
This will clean up the removal list of segments and the related
functions from segment.c and ioctl.c, which have hurt code
readability.

This elimination is applied by using nilfs_sufile_updatev() previously
introduced in the patch ("nilfs2: add sufile function that can modify
multiple segment usages").

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Ryusuke Konishi
dda54f4b87 nilfs2: add sufile function that can modify multiple segment usages
This is a preparation for the later cleanup patch ("nilfs2: remove
list of freeing segments").

This adds nilfs_sufile_updatev() to sufile, which can modify multiple
segment usages at a time.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Ryusuke Konishi
d97a51a7e3 nilfs2: unify bmap operations starting use of indirect block address
This simplifies some low level functions of bmap.

Three bmap pointer operations, nilfs_bmap_start_v(),
nilfs_bmap_commit_v(), and nilfs_bmap_abort_v(), are unified into one
nilfs_bmap_start_v() function. And the related indirect function calls
are replaced with it.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Ryusuke Konishi
6582207064 nilfs2: remove nilfs_dat_prepare_free function
This function is unused.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Boaz Harrosh
fc2fac5b5f [SCSI] libosd: Define an osd_dev wrapper to retrieve the request_queue
libosd users that need to work with bios, must sometime use
the request_queue associated with the osd_dev. Make a wrapper for
that, and convert all in-tree users.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10 09:00:13 -05:00
Boaz Harrosh
62f469b596 [SCSI] libosd: osd_req_{read,write} takes a length parameter
For supporting of chained-bios we can not inspect the first
bio only, as before. Caller shall pass the total length of the
request, ie. sum_bytes(bio-chain).

Also since the bio might be a chain we don't set it's direction
on behalf of it's callers. The bio direction should be properly
set prior to this call. So fix a couple of write users that now
need to set the bio direction properly

[In this patch I change both library code and user sites at
 exofs, to make it easy on integration. It should be submitted
 via James's scsi-misc tree.]

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10 08:59:52 -05:00
Boaz Harrosh
0e35afbc8b [SCSI] libosd: osd_req_{read,write}_kern new API
By popular demand, define usefull wrappers for osd_req_read/write
that recieve kernel pointers. All users had their own.

Also remove these from exofs

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10 08:57:07 -05:00
Steven Whitehouse
003dec8913 GFS2: Merge gfs2_get_sb into gfs2_get_sb_meta
These don't need to be separate functions.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-10 10:31:45 +01:00
Steven Whitehouse
40bc9a27e0 GFS2: Fix cache coherency between truncate and O_DIRECT read
If a page was partially zeroed as the result of a truncate, then it was
not being correctly marked dirty. This resulted in the deleted data
reappearing if the file was read back via direct I/O.

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-10 09:09:40 +01:00
Jan Kara
a61d90d75d jbd: fix race in buffer processing in commit code
In commit code, we scan buffers attached to a transaction.  During this
scan, we sometimes have to drop j_list_lock and then we recheck whether
the journal buffer head didn't get freed by journal_try_to_free_buffers().
 But checking for buffer_jbd(bh) isn't enough because a new journal head
could get attached to our buffer head.  So add a check whether the journal
head remained the same and whether it's still at the same transaction and
list.

This is a nasty bug and can cause problems like memory corruption (use after
free) or trigger various assertions in JBD code (observed).

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-09 16:59:03 -07:00
Ian Kent
463aea1a1c autofs4: remove hashed check in validate_wait()
The recent ->lookup() deadlock correction required the directory inode
mutex to be dropped while waiting for expire completion.  We were
concerned about side effects from this change and one has been identified.

I saw several error messages.

They cause autofs to become quite confused and don't really point to the
actual problem.

Things like:

handle_packet_missing_direct:1376: can't find map entry for (43,1827932)

which is usually totally fatal (although in this case it wouldn't be
except that I treat is as such because it normally is).

do_mount_direct: direct trigger not valid or already mounted
/test/nested/g3c/s1/ss1

which is recoverable, however if this problem is at play it can cause
autofs to become quite confused as to the dependencies in the mount tree
because mount triggers end up mounted multiple times.  It's hard to
accurately check for this over mounting case and automount shouldn't need
to if the kernel module is doing its job.

There was one other message, similar in consequence of this last one but I
can't locate a log example just now.

When checking if a mount has already completed prior to adding a new mount
request to the wait queue we check if the dentry is hashed and, if so, if
it is a mount point.  But, if a mount successfully completed while we
slept on the wait queue mutex the dentry must exist for the mount to have
completed so the test is not really needed.

Mounts can also be done on top of a global root dentry, so for the above
case, where a mount request completes and the wait queue entry has already
been removed, the hashed test returning false can cause an incorrect
callback to the daemon.  Also, d_mountpoint() is not sufficient to check
if a mount has completed for the multi-mount case when we don't have a
real mount at the base of the tree.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-09 16:59:03 -07:00
Hisashi Hifumi
e04cc15f52 ocfs2: fdatasync should skip unimportant metadata writeout
In ocfs2, fdatasync and fsync are identical.
I think fdatasync should skip committing transaction when
inode->i_state is set just I_DIRTY_SYNC and this indicates
only atime or/and mtime updates.
Following patch improves fdatasync throughput.

#sysbench --num-threads=16 --max-requests=300000 --test=fileio
--file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
--file-fsync-mode=fdatasync run

Results:
-2.6.30-rc8
Test execution summary:
    total time:                          107.1445s
    total number of events:              119559
    total time taken by event execution: 116.1050
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0010s
         max:                            0.1220s
         approx.  95 percentile:         0.0016s

Threads fairness:
    events (avg/stddev):           7472.4375/303.60
    execution time (avg/stddev):   7.2566/0.64

-2.6.30-rc8-patched
Test execution summary:
    total time:                          86.8529s
    total number of events:              300016
    total time taken by event execution: 24.3077
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0001s
         max:                            0.0336s
         approx.  95 percentile:         0.0001s

Threads fairness:
    events (avg/stddev):           18751.0000/718.75
    execution time (avg/stddev):   1.5192/0.05

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-09 10:45:47 -07:00
Li Zefan
55782138e4 tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

  - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

  - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

      dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

 # ls -l -h
 -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066  [001]  1072.953770:   8,0    G   N [bash]
  bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
  to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 12:34:23 -04:00
Theodore Ts'o
0eab928221 ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
If a non-existent file is opened via O_WRONLY|O_CREAT|O_TRUNC, there's
no need to treat this as a true file truncation, so we shouldn't
activate the replace-via-truncate hueristic.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09 09:54:40 -04:00
Tejun Heo
151060ac13 CUSE: implement CUSE - Character device in Userspace
CUSE enables implementing character devices in userspace.  With recent
additions of ioctl and poll support, FUSE already has most of what's
necessary to implement character devices.  All CUSE has to do is
bonding all those components - FUSE, chardev and the driver model -
nicely.

When client opens /dev/cuse, kernel starts conversation with
CUSE_INIT.  The client tells CUSE which device it wants to create.  As
the previous patch made fuse_file usable without associated
fuse_inode, CUSE doesn't create super block or inodes.  It attaches
fuse_file to cdev file->private_data during open and set ff->fi to
NULL.  The rest of the operation is almost identical to FUSE direct IO
case.

Each CUSE device has a corresponding directory /sys/class/cuse/DEVNAME
(which is symlink to /sys/devices/virtual/class/DEVNAME if
SYSFS_DEPRECATED is turned off) which hosts "waiting" and "abort"
among other things.  Those two files have the same meaning as the FUSE
control files.

The only notable lacking feature compared to in-kernel implementation
is mmap support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-06-09 11:24:11 +02:00
James Morris
0b4ec6e4e0 Merge branch 'master' into next 2009-06-09 09:27:53 +10:00
Toshiyuki Okajima
9aee228607 ext4: fix dx_map_entry to support 256k directory blocks
The dx_map_entry structure doesn't support over 64KB block size by
current usage of its member("offs"). Because "offs" treats an offset
of copies of the ext4_dir_entry_2 structure as is. This member size is
16 bits. But real offset for over 64KB(256KB) block size needs 18
bits. However, real offset keeps 4 byte boundary, so lower 2 bits is
not used.

Therefore, we do the following to fix this limitation:
For "store": 
	we divide the real offset by 4 and then store this result to "offs" 
	member.
For "use":
	we multiply "offs" member by 4 and then use this result 
	as real offset.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-08 12:41:35 -04:00
Christoph Hellwig
8b5403a6d7 xfs: remove SYNC_BDFLUSH
SYNC_BDFLUSH is a leftover from IRIX and rather misnamed for todays
code.  Make xfs_sync_fsdata and xfs_dq_sync use the SYNC_TRYLOCK flag
for not blocking on logs just as the inode sync code already does.

For xfs_sync_fsdata it's a trivial 1:1 replacement, but for xfs_qm_sync
I use the opportunity to decouple the non-blocking lock case from the
different flushing modes, similar to the inode sync code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:37:16 +02:00
Christoph Hellwig
b0710ccc6d xfs: remove SYNC_IOWAIT
We want to wait for all I/O to finish when we do data integrity syncs.  So
there is no reason to keep SYNC_WAIT separate from SYNC_IOWAIT.  This
causes a little change in behaviour for the ENOSPC flushing code which now
does a second submission and wait of buffered I/O, but that should finish
ASAP as we already did an asynchronous writeout earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:37:11 +02:00
Christoph Hellwig
075fe10286 xfs: split xfs_sync_inodes
xfs_sync_inodes is used to write back either file data or inode metadata.
In general we always do these separately, except for one fishy case in
xfs_fs_put_super that does both.  So separate xfs_sync_inodes into
separate xfs_sync_data and xfs_sync_attr functions.  In xfs_fs_put_super
we first call the data sync and then the attr sync as that was the previous
order.  The moved log force in that path doesn't make a difference because
we will force the log again as part of the real unmount process.

The filesystem readonly checks are not performed by the new function but
instead moved into the callers, given that most callers alredy have it
further up in the stack.  Also add debug checks that we do not pass in
incorrect flags in the new xfs_sync_data and xfs_sync_attr function and
fix the one place that did pass in a wrong flag.

Also remove a comment mentioning xfs_sync_inodes that has been incorrect
for a while because we always take either the iolock or ilock in the
sync path these days.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:48 +02:00
Christoph Hellwig
fe588ed328 xfs: use generic inode iterator in xfs_qm_dqrele_all_inodes
Use xfs_inode_ag_iterator instead of opencoding the inode walk in the
quota code.  Mark xfs_inode_ag_iterator and xfs_sync_inode_valid non-static
to allow using them from the quota code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:27 +02:00
Dave Chinner
75f3cb1393 xfs: introduce a per-ag inode iterator
Given that we walk across the per-ag inode lists so often, it makes sense to
introduce an iterator for this.

Convert the sync and reclaim code to use this new iterator, quota code will
follow in the next patch.

Also change xfs_reclaim_inode to return -EGAIN instead of 1 for an inode
already under reclaim.  This simplifies the AG iterator and doesn't
matter for the only other caller.

[hch: merged the lookup and execute callbacks back into one to get the
 pag_ici_lock locking correct and simplify the code flow]

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:14 +02:00
Dave Chinner
abc1064742 xfs: remove unused parameter from xfs_reclaim_inodes
The noblock parameter of xfs_reclaim_inodes is only ever set to zero. Remove
it and all the conditional code that is never executed.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:12 +02:00
Dave Chinner
1da8eecab5 xfs: factor out inode validation for sync
Separate the validation of inodes found by the radix
tree walk from the radix tree lookup.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:07 +02:00
Christoph Hellwig
845b6d0cbb xfs: split inode flushing from xfs_sync_inodes_ag
In many cases we only want to sync inode metadata. Split out the inode
flushing into a separate helper to prepare factoring the inode sync code.

Based on a patch from Dave Chinner, but redone to keep the current behaviour
exactly and leave changes to the flushing logic to another patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:05 +02:00
Dave Chinner
5a34d5cd09 xfs: split inode data writeback from xfs_sync_inodes_ag
In many cases we only want to sync inode data. Start spliting the inode sync
into data sync and inode sync by factoring out the inode data flush.

[hch: minor cleanups]

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:03 +02:00
Christoph Hellwig
7d095257e3 xfs: kill xfs_qmops
Kill the quota ops function vector and replace it with direct calls or
stubs in the CONFIG_XFS_QUOTA=n case.

Make sure we check XFS_IS_QUOTA_RUNNING in the right spots.  We can remove
the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set
otherwise.

This brings us back closer to the way this code worked in IRIX and earlier
Linux versions, but we keep a lot of the more useful factoring of common
code.

Eventually we should also kill xfs_qm_bhv.c, but that's left for a later
patch.

Reduces the size of the source code by about 250 lines and the size of
XFS module by about 1.5 kilobytes with quotas enabled:

   text	   data	    bss	    dec	    hex	filename
 615957	   2960	   3848	 622765	  980ad	fs/xfs/xfs.o
 617231	   3152	   3848	 624231	  98667	fs/xfs/xfs.o.old

Fallout:

 - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects
   the inode locked and xfs_qm_dqattach which does the locking around it,
   thus removing XFS_QMOPT_ILOCKED.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:33:32 +02:00
Christoph Hellwig
0c5e1ce89f xfs: validate quota log items during log recovery
Arkadiusz has seen really strange crashes in xfs_qm_dqcheck that
I can only explain by a log item being too smal to actually fit the
xfs_dqblk_t we're dereferencing all over xfs_qm_dqcheck.  So add
graceful checks for NULL or too small quota items to the log recovery
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:33:21 +02:00
Christoph Hellwig
e1696834e8 xfs: update max log size
Commit a6634fba3dec4a92f0a2c4e30c80b634c0576ad5 in xfsprogs increased the
maximum log size supported by mkfs.  Merged back the changes to xfs_fs.h
so the growfs enforced the same limit and the headers are in sync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:32:59 +02:00
Christoph Hellwig
21bea49594 fat: split fat_generic_ioctl
Split up fat_generic_ioctl and add separate functions for the two
implemented ioctls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-06-08 21:39:50 +09:00
David Woodhouse
e635a01ea0 Merge branch 'next-mtd' of git://aeryn.fluff.org.uk/bjdooks/linux 2009-06-08 12:21:27 +01:00
Artem Bityutskiy
f2c5dbd7b7 UBIFS: start using hrtimers
UBIFS uses timers for write-buffer write-back. It is not
crucial for us to write-back exactly on time. We are fine
to write-back a little earlier or later. And this means
we may optimize UBIFS timer so that it could be groped
with a close timer event, so that the CPU would not be
waken up just to do the write back. This is optimization
to lessen power consumption, which is important in
embedded devices UBIFS is used for.

hrtimers have a nice feature: they are effectively range
timers, and we may defind the soft and hard limits for
it. Standard timers do not have these feature. They may
only be made deferrable, but this means there is effectively
no hard limit. So, we will better use hrtimers.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-08 11:14:58 +03:00
Artem Bityutskiy
3f36406f26 UBIFS: do not forget to register BDI device
Reviewed-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-08 11:14:21 +03:00
Bartlomiej Zolnierkiewicz
db429e9ec0 partitions: add ->set_capacity block device method
* Add ->set_capacity block device method and use it in rescan_partitions()
  to attempt enabling native capacity of the device upon detecting the
  partition which exceeds device capacity.

* Add GENHD_FL_NATIVE_CAPACITY flag to try limit attempts of enabling
  native capacity during partition scan.

Together with the consecutive patch implementing ->set_capacity method in
ide-gd device driver this allows automatic disabling of Host Protected Area
(HPA) if any partitions overlapping HPA are detected.

Cc: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07 13:52:52 +02:00
Bartlomiej Zolnierkiewicz
02c33b123e partitions: warn about the partition exceeding device capacity
The current warning message says only about the kernel's action taken
without mentioning the underlying reason behind it.

Noticed-by: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07 13:52:51 +02:00
Hugh Dickins
f07502dae2 integrity: fix IMA inode leak
CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects
until the system runs out of memory.  Nowhere is calling ima_inode_free()
a.k.a. ima_iint_delete().  Fix that by calling it from destroy_inode().

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-06 14:33:41 -07:00
Steve French
f0472d0ec8 [CIFS] Add mention of new mount parm (forceuid) to cifs readme
Also update fs/cifs/CHANGES

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-06 21:09:39 +00:00
Jeff Layton
4ae1507f6d cifs: make overriding of ownership conditional on new mount options
We have a bit of a problem with the uid= option. The basic issue is that
it means too many things and has too many side-effects.

It's possible to allow an unprivileged user to mount a filesystem if the
user owns the mountpoint, /bin/mount is setuid root, and the mount is
set up in /etc/fstab with the "user" option.

When doing this though, /bin/mount automatically adds the "uid=" and
"gid=" options to the share. This is fortunate since the correct uid=
option is needed in order to tell the upcall what user's credcache to
use when generating the SPNEGO blob.

On a mount without unix extensions this is fine -- you generally will
want the files to be owned by the "owner" of the mount. The problem
comes in on a mount with unix extensions. With those enabled, the
uid/gid options cause the ownership of files to be overriden even though
the server is sending along the ownership info.

This means that it's not possible to have a mount by an unprivileged
user that shows the server's file ownership info. The result is also
inode permissions that have no reflection at all on the server. You
simply cannot separate ownership from the mode in this fashion.

This behavior also makes MultiuserMount option less usable. Once you
pass in the uid= option for a mount, then you can't use unix ownership
info and allow someone to share the mount.

While I'm not thrilled with it, the only solution I can see is to stop
making uid=/gid= force the overriding of ownership on mounts, and to add
new mount options that turn this behavior on.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-06 21:03:27 +00:00
Ingo Molnar
75b5032212 Merge branch 'linus' into perfcounters/core
Merge reason: Pick up the latest fixes before the -v8 perfcounters
	      release.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-06 20:21:28 +02:00
Al Viro
72a43d63cb ext3/4 with synchronous writes gets wedged by Postfix
OK, that's probably the easiest way to do that, as much as I don't like it...
Since iget() et.al. will not accept I_FREEING (will wait to go away
and restart), and since we'd better have serialization between new/free
on fs data structures anyway, we can afford simply skipping I_FREEING
et.al. in insert_inode_locked().

We do that from new_inode, so it won't race with free_inode in any interesting
ways and it won't race with iget (of any origin; nfsd or in case of fs
corruption a lookup) since both still will wait for I_LOCK.

Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Tested-by: David Watson <dbwatson@ukfsn.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-06 06:17:26 -04:00
Theodore Ts'o
460bcf57b1 Fix nobh_truncate_page() to not pass stack garbage to get_block()
The nobh_truncate_page() function is used by ext2, exofs, and jfs.  Of
these three, only ext2 and jfs's get_block() function pays attention
to bh->b_size --- which is normally always the filesystem blocksize
except when the get_block() function is called by either
mpage_readpage(), mpage_readpages(), or the direct I/O routines in
fs/direct_io.c.

Unfortunately, nobh_truncate_page() does not initialize map_bh before
calling the filesystem-supplied get_block() function.  So ext2 and jfs
will try to calculate the number of blocks to map by taking stack
garbage and shifting it left by inode->i_blkbits.  This should be
*mostly* harmless (except the filesystem will do some unnneeded work)
unless the stack garbage is less than filesystem's blocksize, in which
case maxblocks will be zero, and the attempt to find out whether or
not the filesystem has a hole at a given logical block will fail, and
the page cache entry might not get zero'ed out.

Also if the stack garbage in in map_bh->state happens to have the
BH_Mapped bit set, there could be an attempt to call readpage() on a
non-existent page, which could cause nobh_truncate_page() to return an
error when it should not.

Fix this by initializing map_bh->state and map_bh->size.

Fortunately, it's probably fairly unlikely that ext2 and jfs users
mount with nobh these days.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-06 06:17:25 -04:00
Linus Torvalds
064e38aade Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Fix oops and use after free during space balancing
  Btrfs: set device->total_disk_bytes when adding new device
2009-06-05 11:54:28 -07:00
Steven Whitehouse
f6d03139d7 GFS2: Fix locking issue mounting gfs2meta fs
This patch uses sget() to get a reference to the
existing gfs2 sb when mouting the gfs2meta filesystem
(in fact thats just another mount of the gfs2
filesystem with a different root and this interface
is for backward compatibility).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Tested-by: Benjamin Marzinski <bmarzins@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
2009-06-05 08:35:15 +01:00
Aneesh Kumar K.V
f8514083cd ext4: truncate the file properly if we fail to copy data from userspace
In generic_perform_write if we fail to copy the user data we don't
update the inode->i_size.  We should truncate the file in the above
case so that we don't have blocks allocated outside inode->i_size.  Add
the inode to orphan list in the same transaction as block allocation
This ensures that if we crash in between the recovery would do the
truncate.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05 00:56:49 -04:00
Aneesh Kumar K.V
1938a150c2 ext4: Avoid leaking blocks after a block allocation failure
We should add inode to the orphan list in the same transaction
as block allocation.  This ensures that if we crash after a failed
block allocation and before we do a vmtruncate we don't leak block
(ie block marked as used in bitmap but not claimed by the inode).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05 01:00:26 -04:00
Eric Sandeen
b31e15527a ext4: Change all super.c messages to print the device
This patch changes ext4 super.c to include the device name with all 
warning/error messages, by using a new utility function ext4_msg. 
It's a rather large patch, but very mechanic. I left debug printks
alone.

This is a straightforward port of a patch which Andi Kleen did for
ext3.

Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-04 17:36:36 -04:00
Jan Kara
03f5d8bcf0 ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This
seems to be a relict from some old days and setting disksize in this
function does not make much sense.  Currently it was set only by
ext4_getblk().  Since the parameter has some effect only if create ==
1, it is easy to check by grepping through the sources that the three
callers which end up calling ext4_getblk() with create == 1
(ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set
disksize themselves.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09 00:17:05 -04:00
Jens Axboe
172124e220 Revert "block: implement blkdev_readpages"
This reverts commit db2dbb12dc.

It apparently causes problems with partition table read-ahead
on archs with large page sizes. Until that problem is diagnosed
further, just drop the readpages support on block devices.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-04 22:34:44 +02:00
Chris Mason
44fb551163 Btrfs: Fix oops and use after free during space balancing
The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks.  It starts off with a hint
to help find the best block group for a given allocation.

The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing.  If it is being
freed, the list pointers in it are bogus and can't be trusted.  But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.

The fix used here is to check to make sure the block group we find really
is on the list before we use it.  list_del_init is used when removing
it from the list, so we can do a proper check.

The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster.  If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.

The fix used here is to check the current cluster against the
current allocation flags.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 15:41:27 -04:00
Yan Zheng
2cc3c559fb Btrfs: set device->total_disk_bytes when adding new device
It was not being properly initialized, and so the size saved to
disk was not correct.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 09:23:57 -04:00
Tao Ma
06c59bb896 ocfs2: Remove redundant gotos in ocfs2_mount_volume()
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:20:15 -07:00
Joel Becker
73be192b17 ocfs2: Add statistics for the checksum and ecc operations.
It would be nice to know how often we get checksum failures.  Even
better, how many of them we can fix with the single bit ecc.  So, we add
a statistics structure.  The structure can be installed into debugfs
wherever the user wants.

For ocfs2, we'll put it in the superblock-specific debugfs directory and
pass it down from our higher-level functions.  The stats are only
registered with debugfs when the filesystem supports metadata ecc.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:15:36 -07:00
Srinivas Eeda
15633a220f ocfs2 patch to track delayed orphan scan timer statistics
Patch to track delayed orphan scan timer statistics.

Modifies ocfs2_osb_dump to print the following:
  Orphan Scan=> Local: 10  Global: 21  Last Scan: 67 seconds ago

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:31 -07:00
Srinivas Eeda
83273932fb ocfs2: timer to queue scan of all orphan slots
When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
before moving the dentry to the orphan directory. Other nodes that have
this dentry in cache have a PR on the same dentry lock.  When the EX is
requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
during downconvert.  The inode is finally deleted when the last node to iput
the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.

A problem arises if a node is forced to free dentry locks because of memory
pressure. If this happens, the node will no longer get downconvert
notifications for the dentries that have been unlinked on another node.
If it also happens that node is actively using the corresponding inode and
happens to be the one performing the last iput on that inode, it will fail
to delete the inode as it will not have the MAYBE_ORPHANED flag set.

This patch fixes this shortcoming by introducing a periodic scan of the
orphan directories to delete such inodes. Care has been taken to distribute
the workload across the cluster so that no one node has to perform the task
all the time.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:31 -07:00
Jan Kara
edd45c0849 ocfs2: Correct ordering of ip_alloc_sem and localloc locks for directories
We use ordering ip_alloc_sem -> local alloc locks in ocfs2_write_begin().
So change lock ordering in ocfs2_extend_dir() and ocfs2_expand_inline_dir()
to also use this lock ordering.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:30 -07:00
Jan Kara
80d73f15d1 ocfs2: Fix possible deadlock in quota recovery
In ocfs2_finish_quota_recovery() we acquired global quota file lock and started
recovering local quota file. During this process we need to get quota
structures, which calls ocfs2_dquot_acquire() which gets global quota file lock
again. This second lock can block in case some other node has requested the
quota file lock in the mean time. Fix the problem by moving quota file locking
down into the function where it is really needed.  Then dqget() or dqput()
won't be called with the lock held.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:30 -07:00
Jan Kara
65bac575e3 ocfs2: Fix possible deadlock with quotas in ocfs2_setattr()
We called vfs_dq_transfer() with global quota file lock held. This can lead
to deadlocks as if vfs_dq_transfer() has to allocate new quota structure,
it calls ocfs2_dquot_acquire() which tries to get quota file lock again and
this can block if another node requested the lock in the mean time.

Since we have to call vfs_dq_transfer() with transaction already started
and quota file lock ranks above the transaction start, we cannot just rely
on ocfs2_dquot_acquire() or ocfs2_dquot_release() on getting the lock
if they need it. We fix the problem by acquiring pointers to all quota
structures needed by vfs_dq_transfer() already before calling the function.
By this we are sure that all quota structures are properly allocated and
they can be freed only after we drop references to them. Thus we don't need
quota file lock anywhere inside vfs_dq_transfer().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:29 -07:00
Jan Kara
b4c30de39a ocfs2: Fix lock inversion in ocfs2_local_read_info()
This function is called with dqio_mutex held but it has to acquire lock
from global quota file which ranks above this lock. This is not deadlockable
lock inversion since this code path is take only during mount when noone
else can race with us but let's clean this up to silence lockdep.

We just drop the dqio_mutex in the beginning of the function and reacquire
it in the end since we don't need it - noone can race with us at this moment.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:29 -07:00
Jan Kara
4e8a301929 ocfs2: Fix possible deadlock in ocfs2_global_read_dquot()
It is not possible to get a read lock and then try to get the same write lock
in one thread as that can block on downconvert being requested by other node
leading to deadlock. So first drop the quota lock for reading and only after
that get it for writing.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:28 -07:00
Andreas Dilger
0b8e58a140 ext4: super.c whitespace cleanup
Cleanup of whitespace and formatting.  Initially driven by confusing indents
for the ext4_{block,inode}_bitmap() et. al. helper routines, but figured I'd
cleanup some other 80-column wrapping and other indenting problems at the
same time.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-03 17:59:28 -04:00
Alberto Bertogli
bfcd3555af jbd2: Fix minor typos in comments in fs/jbd2/journal.c
Signed-off-by: Alberto Bertogli <albertito@blitiri.com.ar>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2009-06-09 00:06:20 -04:00
Denis Karpov
85c7859190 FAT: add 'errors' mount option
On severe errors FAT remounts itself in read-only mode. Allow to
specify FAT fs desired behavior through 'errors' mount option:
panic, continue or remount read-only.

`mount -t [fat|vfat] -o errors=[panic,remount-ro,continue] \
	<bdev> <mount point>`

This is analog to ext2 fs 'errors' mount option.

Signed-off-by: Denis Karpov <ext-denis.2.karpov@nokia.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-06-04 02:34:51 +09:00
Steven Whitehouse
e09f9446b9 GFS2: Remove unused variable
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-03 10:07:44 +01:00
Linus Torvalds
4157fd85fc Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: prevent deadlock in xfs_qm_shake()
  xfs: fix overflow in xfs_growfs_data_private
  xfs: fix double unlock in xfs_swap_extents()
2009-06-02 09:47:21 -07:00
Jeff Layton
50b64e3b77 cifs: fix IPv6 address length check
For IPv6 the userspace mount helper sends an address in the "ip="
option.  This check fails if the length is > 35 characters. I have no
idea where the magic 35 character limit came from, but it's clearly not
enough for IPv6. Fix it by making it use the INET6_ADDRSTRLEN #define.

While we're at it, use the same #define for the address length in SPNEGO
upcalls.

Reported-by: Charles R. Anderson <cra@wpi.edu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-02 15:45:40 +00:00
Artem Bityutskiy
8379ea31e9 UBIFS: allow sync option in rootflags
When passing UBIFS parameters via kernel command line, the
sync option will be passed to UBIFS as a string, not as an
MS_SYNCHRONOUS flag. Teach UBIFS interpreting this flag.

Reported-by: Aurélien GÉRÔME <ag@debian.org>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-02 11:08:07 +03:00
Abhijith Das
a12af1ebe6 GFS2: smbd proccess hangs with flock() call.
GFS2 currently does not support mandatory flocks. An flock() call with
LOCK_MAND triggers unexpected behavior because gfs2 is not checking for
this lock type. This patch corrects that.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-02 08:01:12 +01:00
Felix Blyakher
1b17d76646 xfs: prevent deadlock in xfs_qm_shake()
It's possible to recurse into filesystem from the memory
allocation, which deadlocks in xfs_qm_shake(). Add check
for __GFP_FS, and bail out if it is not set.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:45 -05:00
Eric Sandeen
e6da7c9fed xfs: fix overflow in xfs_growfs_data_private
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number.  If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:

 # xfs_io -f -c "truncate 19998630180864" fsfile
 # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
 # mount -o loop fsfile /mnt
 # xfs_growfs /mnt

meta-data=/dev/loop0             isize=256    agcount=52,
agsize=76288719 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3905982455, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument

Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:38 -05:00
Felix Blyakher
1f23920dbf xfs: fix double unlock in xfs_swap_extents()
Regreesion from commit ef8f7fc, which rearranged the code in
xfs_swap_extents() leading to double unlock of xfs inode ilock.
That resulted in xfs_fsr deadlocking itself on platforms, which
don't handle double unlock of rw_semaphore nicely. It caused the
count go negative, which represents the write holder, without
really having one. ia64 is one of the platforms where deadlock
was easily reproduced and the fix was tested.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:29 -05:00
Yu Zhiguo
0a93a47f04 NFSv4: kill off complicated macro 'PROC'
J. Bruce Fields wrote:
...
> (This is extremely confusing code to track down: note that
> proc->pc_decode is set to nfs4svc_decode_compoundargs() by the PROC()
> macro at the end of fs/nfsd/nfs4proc.c.  Which means, for example, that
> grepping for nfs4svc_decode_compoundargs() gets you nowhere.  Patches to
> kill off that macro would be welcomed....)

the macro 'PROC' is complicated and obscure, it had better
be killed off in order to make the code more clear.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-01 18:09:20 -04:00
Yu Zhiguo
3c8e03166a NFSv4: do exact check about attribute specified
Server should return NFS4ERR_ATTRNOTSUPP if an attribute specified is
not supported in current environment.
Operations CREATE, NVERIFY, OPEN, SETATTR and VERIFY should do this check.

This bug is found when do newpynfs tests. The names of the tests that failed
are following:
  CR12 NVF7a NVF7b NVF7c NVF7d NVF7f NVF7r NVF7s
  OPEN15 VF7a VF7b VF7c VF7d VF7f VF7r VF7s

Add function do_check_fattr() to do exact check:
1, Check attribute specified is supported by the NFSv4 server or not.
2, Check FATTR4_WORD0_ACL & FATTR4_WORD0_FS_LOCATIONS are supported
   in current environment or not.
3, Check attribute specified is writable or not.

step 1 and 3 are done in function nfsd4_decode_fattr() but removed
to this function now.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-01 18:01:54 -04:00
Felix Blyakher
4156e735d3 xfs: prevent deadlock in xfs_qm_shake()
It's possible to recurse into filesystem from the memory
allocation, which deadlocks in xfs_qm_shake(). Add check
for __GFP_FS, and bail out if it is not set.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 13:13:24 -05:00
Ingo Molnar
23db9f430b Merge branch 'linus' into perfcounters/core
Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6
              based - to pick up the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-01 10:01:39 +02:00
Linus Torvalds
b4566ac524 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix bh leak in nilfs_cpfile_delete_checkpoints function
2009-05-30 08:04:15 -07:00
Ryusuke Konishi
62013ab5d5 nilfs2: fix bh leak in nilfs_cpfile_delete_checkpoints function
The nilfs_cpfile_delete_checkpoints() wrongly skips brelse() for the
header block of checkpoint file in case of errors.  This fixes the
leak bug.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-30 22:07:50 +09:00
Linus Torvalds
3218911f83 Merge git://git.infradead.org/~dwmw2/mtd-2.6.30
* git://git.infradead.org/~dwmw2/mtd-2.6.30:
  jffs2: Fix corruption when flash erase/write failure
  mtd: MXC NAND driver fixes (v5)
2009-05-29 08:52:13 -07:00
Linus Torvalds
deeb103412 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  Driver Core: do not oops when driver_unregister() is called for unregistered drivers
  sysfs: file.c: use create_singlethread_workqueue()
2009-05-29 08:49:52 -07:00
Linus Torvalds
c8bce3d3bd Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
  svcrdma: dma unmap the correct length for the RPCRDMA header page.
  nfsd: Revert "svcrpc: take advantage of tcp autotuning"
  nfsd: fix hung up of nfs client while sync write data to nfs server
2009-05-29 08:49:09 -07:00
Oskar Schirmer
c3dc5bec05 flat: fix data sections alignment
The flat loader uses an architecture's flat_stack_align() to align the
stack but assumes word-alignment is enough for the data sections.

However, on the Xtensa S6000 we have registers up to 128bit width
which can be used from userspace and therefor need userspace stack and
data-section alignment of at least this size.

This patch drops flat_stack_align() and uses the same alignment that
is required for slab caches, ARCH_SLAB_MINALIGN, or wordsize if it's
not defined by the architecture.

It also fixes m32r which was obviously kaput, aligning an
uninitialized stack entry instead of the stack pointer.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oskar Schirmer <os@emlix.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <cooloney@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Johannes Weiner <jw@emlix.com>
Acked-by: Mike Frysinger <vapier.adi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29 08:40:02 -07:00
KOSAKI Motohiro
bd6daba909 procfs: make errno values consistent when open pident vs exit(2) race occurs
proc_pident_instantiate() has following call flow.

proc_pident_lookup()
  proc_pident_instantiate()
    proc_pid_make_inode()

And, proc_pident_lookup() has following error handling.

	const struct pid_entry *p, *last;
	error = ERR_PTR(-ENOENT);
	if (!task)
		goto out_no_task;

Then, proc_pident_instantiate should return ENOENT too when racing against
exit(2) occur.

EINAL has two bad reason.
  - it implies caller is wrong. bad the race isn't caller's mistake.
  - man 2 open don't explain EINVAL. user often don't handle it.

Note: Other proc_pid_make_inode() caller already use ENOENT properly.

Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29 08:40:02 -07:00
Kevin Cernekee
668ff9ab45 mtd: Handle compat ioctls directly; remove all trace from compat_ioctl.c
Remove all references to MTD ioctls from fs/compat_ioctl.c and let
them all be handled by mtd_compat_ioctl().

Signed-off-by: Kevin Cernekee <kpc.mtd@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 15:58:25 +01:00
Kevin Cernekee
aea7cea9fa mtd: add OOB ioctls for >4GiB devices
Signed-off-by: Kevin Cernekee <kpc.mtd@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 15:27:07 +01:00
Kevin Cernekee
9771854040 mtd: compat_ioctl cleanup
1) Move the MEMREADOOB/MEMWRITEOOB compat_ioctl wrappers from
fs/compat_ioctl.c into mtdchar.c .  Original request was here:

http://lkml.org/lkml/2009/4/1/295

2) Add missing COMPATIBLE_IOCTL lines, so that mtd-utils does not error
out when running in 64/32 compatibility mode.

LKML-Reference: <200904011650.22928.arnd@arndb.de>
Signed-off-by: Kevin Cernekee <kpc.mtd@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 15:24:48 +01:00
Kevin Cernekee
0dc54e9f33 mtd: add MEMERASE64 ioctl for >4GiB devices
New MEMERASE/MEMREADOOB/MEMWRITEOOB ioctls are needed in order to support
64-bit offsets into large NAND flash devices.

Signed-off-by: Kevin Cernekee <kpc.mtd@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 15:13:47 +01:00
Artem Bityutskiy
428ff9d2e3 UBIFS: remove dead code
UBIFS assumes that @c->min_io_size is 8 in case of NOR flash. This
is because UBIFS alignes all nodes to 8-byte boundary, and maintaining
@c->min_io_size introduced unnecessary complications.

This patch removes senseless constructs like:

if (c->min_io_size == 1)
	NOR-specific code

Also, few commentaries amendments.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-29 14:38:37 +03:00
Joakim Tjernlund
81e2962801 jffs2: Fix corruption when flash erase/write failure
Erase errors such as:
"Newly-erased block contained word 0xa4ef223e at offset 0x0296a014"
and failure to write the clean marker,
moves the offending erase block to erasing list before calling
jffs2_erase_failed(). This is bad as jffs2_erase_failed() will
also move the block to the bad_list, but is now moving the
wrong block, causing FS corruption.

Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 10:44:46 +01:00
Andrew Morton
086a377edc sysfs: file.c: use create_singlethread_workqueue()
We don't need a kernel thread per CPU for this application.

Acked-by: Alex Chiang <achiang@hp.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-28 14:24:07 -07:00
Christoph Hellwig
b96d31a62f cifs: clean up set_cifs_acl interfaces
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 18:41:56 +00:00
Christoph Hellwig
1bf4072da6 cifs: reorganize get_cifs_acl
Thus spake Christoph:

"But this whole set_cifs_acl function is a real mess anyway and needs
some splitting up."

With this change too, it's possible to call acl_to_uid_mode() with a
NULL inode pointer. That (or something close to it) will eventually be
necessary when cifs_get_inode_info is reorganized.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 17:08:02 +00:00
Steve French
c5077ec423 [CIFS] Update readme to indicate change to default mount (serverino)
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 15:09:04 +00:00
Jeff Layton
a0c9217f64 cifs: make serverino the default when mounting
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 15:04:17 +00:00
Jeff Layton
bd433d4cf4 cifs: rename cifs_iget to cifs_root_iget
The current cifs_iget isn't suitable for anything but the root inode.
Rename it with a more appropriate name.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:57:29 +00:00
Jeff Layton
c4a2c08db7 cifs: make cnvrtDosUnixTm take a little-endian args and an offset
The callers primarily end up converting the args from le anyway. Also,
most of the callers end up needing to add an offset to the result. The
exception to these rules is cnvrtDosCifsTm, but there are no callers of
that function, so we might as well remove it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:57:20 +00:00
Jeff Layton
07119a4df8 cifs: have cifs_NTtimeToUnix take a little-endian arg
...and just have the function call le64_to_cpu.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:32:31 +00:00
Mimi Zohar
14dba5331b integrity: nfsd imbalance bug fix
An nfsd exported file is opened/closed by the kernel causing the
integrity imbalance message.

Before a file is opened, there normally is permission checking, which
is done in inode_permission().  However, as integrity checking requires
a dentry and mount point, which is not available in inode_permission(),
the integrity (permission) checking must be called separately.

In order to detect any missing integrity checking calls, we keep track
of file open/closes.  ima_path_check() increments these counts and
does the integrity (permission) checking. As a result, the number of
calls to ima_path_check()/ima_file_free() should be balanced.  An extra
call to fput(), indicates the file could have been accessed without first
calling ima_path_check().

In nfsv3 permission checking is done once, followed by multiple reads,
which do an open/close for each read.  The integrity (permission) checking
call should be in nfsd_permission() after the inode_permission() call, but
as there is no correlation between the number of permission checking and
open calls, the integrity checking call should not increment the counters,
but defer it to when the file is actually opened.

This patch adds:
- integrity (permission) checking for nfsd exported files in nfsd_permission().
- a call to increment counts for files opened by nfsd.

This patch has been updated to return the nfs error types.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-28 09:32:43 +10:00
Wei Yongjun
a0d24b295a nfsd: fix hung up of nfs client while sync write data to nfs server
Commit 'Short write in nfsd becomes a full write to the client'
(31dec2538e) broken the sync write.
With the following commands to reproduce:

  $ mount -t nfs -o sync 192.168.0.21:/nfsroot /mnt
  $ cd /mnt
  $ echo aaaa > temp.txt

Then nfs client is hung up.

In SYNC mode the server alaways return the write count 0 to the
client. This is because the value of host_err in nfsd_vfs_write()
will be overwrite in SYNC mode by 'host_err=nfsd_sync(file);',
and then we return host_err(which is now 0) as write count.

This patch fixed the problem.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 17:40:06 -04:00
Greg Banks
1dbd0d53f3 knfsd: remove unreported filehandle stats counters
The file nfsfh.c contains two static variables nfsd_nr_verified and
nfsd_nr_put.  These are counters which are incremented as a side
effect of the fh_verify() fh_compose() and fh_put() operations,
i.e. at least twice per NFS call for any non-trivial workload.
Needless to say this makes the cacheline that contains them (and any
other innocent victims) a very hot contention point indeed under high
call-rate workloads on multiprocessor NFS server.  It also turns out
that these counters are not used anywhere.  They're not reported to
userspace, they're not used in logic, they're not even exported from
the object file (let alone the module).  All they do is waste CPU time.

So this patch removes them.

Tests on a 16 CPU Altix A4700 with 2 10gige Myricom cards, configured
separately (no bonding).  Workload is 640 client threads doing directory
traverals with random small reads, from server RAM.

Before
======

Kernel profile:

  %   cumulative   self              self     total
 time   samples   samples    calls   1/call   1/call  name
  6.05   2716.00  2716.00    30406     0.09     1.02  svc_process
  4.44   4706.00  1990.00     1975     1.01     1.01  spin_unlock_irqrestore
  3.72   6376.00  1670.00     1666     1.00     1.00  svc_export_put
  3.41   7907.00  1531.00     1786     0.86     1.02  nfsd_ofcache_lookup
  3.25   9363.00  1456.00    10965     0.13     1.01  nfsd_dispatch
  3.10  10752.00  1389.00     1376     1.01     1.01  nfsd_cache_lookup
  2.57  11907.00  1155.00     4517     0.26     1.03  svc_tcp_recvfrom
  ...
  2.21  15352.00  1003.00     1081     0.93     1.00  nfsd_choose_ofc  <----
  ^^^^

Here the function nfsd_choose_ofc() reads a global variable
which by accident happened to be located in the same cacheline as
nfsd_nr_verified.

Call rate:

nullarbor:~ # pmdumptext nfs3.server.calls
...
Thu Dec 13 00:15:27     184780.663
Thu Dec 13 00:15:28     184885.881
Thu Dec 13 00:15:29     184449.215
Thu Dec 13 00:15:30     184971.058
Thu Dec 13 00:15:31     185036.052
Thu Dec 13 00:15:32     185250.475
Thu Dec 13 00:15:33     184481.319
Thu Dec 13 00:15:34     185225.737
Thu Dec 13 00:15:35     185408.018
Thu Dec 13 00:15:36     185335.764

After
=====

kernel profile:

  %   cumulative   self              self     total
 time   samples   samples    calls   1/call   1/call  name
  6.33   2813.00  2813.00    29979     0.09     1.01  svc_process
  4.66   4883.00  2070.00     2065     1.00     1.00  spin_unlock_irqrestore
  4.06   6687.00  1804.00     2182     0.83     1.00  nfsd_ofcache_lookup
  3.20   8110.00  1423.00    10932     0.13     1.00  nfsd_dispatch
  3.03   9456.00  1346.00     1343     1.00     1.00  nfsd_cache_lookup
  2.62  10622.00  1166.00     4645     0.25     1.01  svc_tcp_recvfrom
[...]
  0.10  42586.00    44.00       74     0.59     1.00  nfsd_choose_ofc  <--- HA!!
  ^^^^

Call rate:

nullarbor:~ # pmdumptext nfs3.server.calls
...
Thu Dec 13 01:45:28     194677.118
Thu Dec 13 01:45:29     193932.692
Thu Dec 13 01:45:30     194294.364
Thu Dec 13 01:45:31     194971.276
Thu Dec 13 01:45:32     194111.207
Thu Dec 13 01:45:33     194999.635
Thu Dec 13 01:45:34     195312.594
Thu Dec 13 01:45:35     195707.293
Thu Dec 13 01:45:36     194610.353
Thu Dec 13 01:45:37     195913.662
Thu Dec 13 01:45:38     194808.675

i.e. about a 5.3% improvement in call rate.

Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Reviewed-by: David Chinner <dgc@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 14:14:03 -04:00
Greg Banks
cf0a586cf4 knfsd: fix reply cache memory corruption
Fix a regression in the reply cache introduced when the code was
converted to use proper Linux lists.  When a new entry needs to be
inserted, the case where all the entries are currently being used
by threads is not correctly detected.  This can result in memory
corruption and a crash.  In the current code this is an extremely
unlikely corner case; it would require the machine to have 1024
nfsd threads and all of them to be busy at the same time.  However,
upcoming reply cache changes make this more likely; a crash due to
this problem was actually observed in field.

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 14:14:02 -04:00
Greg Banks
fca4217c5b knfsd: reply cache cleanups
Make REQHASH() an inline function.  Rename hash_list to cache_hash.
Fix an obsolete comment.

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 14:14:02 -04:00
David Howells
911e690e70 CacheFiles: Fixup renamed filenames in comments in internal.h
Fix up renamed filenames in comments in fs/cachefiles/internal.h.

Originally, the files were all called cf-xxx.c, but they got renamed to
just xxx.c.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-27 10:20:13 -07:00
David Howells
348ca1029e FS-Cache: Fixup renamed filenames in comments in internal.h
Fix up renamed filenames in comments in fs/fscache/internal.h.

Originally, the files were all called fsc-xxx.c, but they got renamed to
just xxx.c.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-27 10:20:13 -07:00
Eric Sandeen
096324873f xfs: fix overflow in xfs_growfs_data_private
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number.  If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:

 # xfs_io -f -c "truncate 19998630180864" fsfile
 # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
 # mount -o loop fsfile /mnt
 # xfs_growfs /mnt

meta-data=/dev/loop0             isize=256    agcount=52,
agsize=76288719 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3905982455, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument

Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-05-26 17:46:37 -05:00
Jeff Layton
f55ed1a83d cifs: tighten up default file_mode/dir_mode
The current default file mode is 02767 and dir mode is 0777. This is
extremely "loose". Given that CIFS is a single-user protocol, these
permissions allow anyone to use the mount -- in effect, giving anyone on
the machine access to the credentials used to mount the share.

Change this by making the default permissions restrict write access to
the default owner of the mount. Give read and execute permissions to
everyone else. These are the same permissions that VFAT mounts get by
default so there is some precedent here.

Note that this patch also removes the mandatory locking flags from the
default file_mode. After having looked at how these flags are used by
the kernel, I don't think that keeping them as the default offers any
real benefit. That flag combination makes it so that the kernel enforces
mandatory locking.

Since the server is going to do that for us anyway, I don't think we
want the client to enforce this by default on applications that just
want advisory locks. Anyone that does want this behavior can always
enable it by setting the file_mode appropriately.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-26 21:10:55 +00:00
Jeff Layton
46a7574caf cifs: fix artificial limit on reading symlinks
There's no reason to limit the size of a symlink that we can read to
4000 bytes. That may be nowhere near PATH_MAX if the server is sending
UCS2 strings. CIFS should be able to read in a symlink up to the size of
the buffer. The size of the header has already been accounted for when
creating the slabcache, so CIFSMaxBufSize should be the correct size to
pass in.

Fixes samba bug #6384.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-26 21:09:14 +00:00
Trond Myklebust
95baa25c73 NFSv4: Fix the case where NFSv4 renewal fails
If the asynchronous lease renewal fails (usually due to a soft timeout),
then we _must_ schedule state recovery in order to ensure that we don't
lose the lease unnecessarily or, if the lease is already lost, that we
recover the locking state promptly...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-05-26 14:51:00 -04:00
Sam Ravnborg
d0367a508a nfs: fix build error in nfsroot with initconst
fix build error with latest kbuild adjustments to initconst.

The commit a447c09324 ("vfs: Use
const for kernel parser table") changed:

    static match_table_t __initdata tokens = {
to
    static match_table_t __initconst tokens = {

But the missing const causes popwerpc to fail with latest
updates to __initconst like this:

fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict
fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict

The bug is only present with kbuild-next.
Following patch has been build tested.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-05-26 14:51:00 -04:00
Steven Whitehouse
f6eb53498e GFS2: Remove args subdir from gfs2 sysfs files
Since we can cat /proc/mounts there is no need to have this
subdirectory in the gfs2 sysfs files. In fact this does not
reflect the full range of possible mount argumenmts, where
as /proc/mounts does.

There was only one userland user of this set of sysfs files
and it will function perfectly well without these files
being present (in fact that subcommand of gfs2_tool is
obsolete anyway).

The tune/* subdirectory is also considered mostly obsolete,
but there are a few uses of this until mount arguments can
be added for the last few functions for which there are no
equivalents currently. However the tune/* directory is still
in my sights and new code should avoid using it. Only the gfs2_quota
and gfs2_tool programs are know to use tune/* at the moment.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-26 15:50:25 +01:00
Steven Whitehouse
e1b28aab58 GFS2: Remove lockstruct subdir from gfs2 sysfs files
The lockstruct sub directory contained two entries, both of
which are duplicated elsewhere in the gfs2 sysfs files as
well as being available via /proc/mounts. There is no userland program
using either of them, so this patch removes them.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-26 15:41:27 +01:00
Artem Bityutskiy
7c83f5cb55 UBIFS: use anonymous device
UBIFS has erroneuosly set 'sb->s_dev' to the UBI volume
character device major/minor. This may lead to clashes
if there is another FS mounted to a block device with
the same major/minor numbers. User-space programs which
use 'stat->st_dev' may get confused because of this.

This problem was found by Al Viro. He also pointed the
way to fix the problem - use 'set_anon_super()' and
'kill_anon_super()' VFS helpers.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-26 12:37:11 +03:00
Theodore Ts'o
88b6edd17c ext4: Clean up calls to ext4_get_group_desc()
If the caller isn't planning on modifying the block group descriptors,
there's no need to pass in a pointer to a struct buffer_head.  Nuking
this saves a tiny amount of CPU time and stack space usage.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25 11:50:39 -04:00
Theodore Ts'o
759d427aa5 ext4: remove unused function __ext4_write_dirty_metadata
The __ext4_write_dirty_metadata() function was introduced by commit
0390131b, "ext4: Allow ext4 to run without a journal", but nothing
ever used the function, either then or since.  So let's remove it and
save a bit of space.

Cc: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25 11:51:00 -04:00
Corentin Chary
8eec2f36fb UBIFS: return proper error code if the compr is not present
If the compressor is not present, mount_ubifs need
to return an error code. This way ubifs_fill_super
will stop and handle the error.

Signed-off-by: Corentin Chary <corentincj@iksaif.net>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-25 12:28:27 +03:00
Dave Kleikamp
79f52b77b8 jfs: Add missing mutex_unlock call to error path
Jan Kucera found an missing call to mutex_unlock() with his static code
checker.  It's an unlikely error path to hit in the real world, but it
should be fixed.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Reported-by: Jan Kucera <kucera.jan.cz@gmail.com>
2009-05-23 20:28:41 -05:00
Linus Torvalds
3eb9c8be0c Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] Avoid open on possible directories since Samba now rejects them
2009-05-23 13:42:53 -07:00
Steve French
8db14ca125 [CIFS] Avoid open on possible directories since Samba now rejects them
Small change (mostly formatting) to limit lookup based open calls to
file create only.

After discussion yesteday on samba-technical about the posix lookup
regression,  and looking at a problem with cifs posix open to one
particular Samba version, Jeff and JRA realized that Samba server's
behavior changed in this area (posix open behavior on files vs.
directories).   To make this behavior consistent, JRA just made a
fix to Samba server to alter how it handles open of directories (now
returning the equivalent of EISDIR instead of success). Since we don't
know at lookup time whether the inode is a directory or file (and
thus whether posix open will succeed with most current Samba server),
this change avoids the posix open code on lookup open (just issues
posix open on creates).    This gets the semantic benefits we want
(atomicity, posix byte range locks, improved write semantics on newly
created files) and file create still is fast, and we avoid the problem
that Jeff noticed yesterday with "openat" (and some open directory
calls) of non-cached directories to one version of Samba server, and
will work with future Samba versions (which include the fix jra just
pushed into Samba server).  I confirmed this approach with jra
yesterday and with Shirish today.

Posix open is only called (at lookup time) for file create now.
For opens (rather than creates), because we do not know if it
is a file or directory yet, and current Samba no longer allows
us to do posix open on dirs, we could end up wasting an open call
on what turns out to be a dir. For file opens, we wait to call posix
open till cifs_open.  It could be added here (lookup) in the future
but the performance tradeoff of the extra network request when EISDIR
or EACCES is returned would have to be weighed against the 50%
reduction in network traffic in the other paths.

Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
CC: Jeremy Allison <jra@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-23 18:57:25 +00:00
Martin K. Petersen
c72758f337 block: Export I/O topology for block devices and partitions
To support devices with physical block sizes bigger than 512 bytes we
need to ensure proper alignment.  This patch adds support for exposing
I/O topology characteristics as devices are stacked.

  logical_block_size is the smallest unit the device can address.

  physical_block_size indicates the smallest I/O the device can write
  without incurring a read-modify-write penalty.

  The io_min parameter is the smallest preferred I/O size reported by
  the device.  In many cases this is the same as the physical block
  size.  However, the io_min parameter can be scaled up when stacking
  (RAID5 chunk size > physical block size).

  The io_opt characteristic indicates the optimal I/O size reported by
  the device.  This is usually the stripe width for arrays.

  The alignment_offset parameter indicates the number of bytes the start
  of the device/partition is offset from the device's natural alignment.
  Partition tools and MD/DM utilities can use this to pad their offsets
  so filesystems start on proper boundaries.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:55 +02:00
Martin K. Petersen
ae03bf639a block: Use accessor functions for queue limits
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Martin K. Petersen
e1defc4ff0 block: Do away with the notion of hardsect_size
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case.  The
sector size will be 4KB but the logical block size will remain
512-bytes.  Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Jens Axboe
9bd7de51ee Merge branch 'master' into for-2.6.31
Conflicts:
	drivers/ide/ide-io.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 20:28:35 +02:00
Jens Axboe
e4b636366c Merge branch 'master' into for-2.6.31
Conflicts:
	drivers/block/hd.c
	drivers/block/mg_disk.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 20:25:34 +02:00
Linus Torvalds
6a44587ee7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix memory leak in nilfs_ioctl_clean_segments
2009-05-22 08:41:13 -07:00
Ryusuke Konishi
d504685363 nilfs2: fix memory leak in nilfs_ioctl_clean_segments
This fixes a new memory leak problem in garbage collection.  The
problem was brought by the bugfix patch ("nilfs2: fix lock order
reversal in nilfs_clean_segments ioctl").

Thanks to Kentaro Suzuki for finding this problem.

Reported-by: Kentaro Suzuki <k_suzuki@ms.sylc.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-22 20:49:04 +09:00
Steven Whitehouse
87ec217411 GFS2: Move gfs2_unlink_ok into ops_inode.c
Another function which is only called from one ops_inode.c so
we move it and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:54:50 +01:00
Steven Whitehouse
536baf02f6 GFS2: Move gfs2_readlinki into ops_inode.c
Move gfs2_readlinki into ops_inode.c and make it static

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:48:59 +01:00
Steven Whitehouse
2286dbfad1 GFS2: Move gfs2_rmdiri into ops_inode.c
Move gfs2_rmdiri() into ops_inode.c and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:45:09 +01:00
Steven Whitehouse
9e6e0a128b GFS2: Merge mount.c and ops_super.c into super.c
mount.c only contained a single function, so is not really
worth retaining on its own. All of the super related code
is now either in super.c or ops_fstype.c

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:36:01 +01:00
Steven Whitehouse
b1e71b0622 GFS2: Clean up some file names
This patch renames the ops_*.c files which have no counterpart
without the ops_ prefix in order to shorten the name and make
it more readable. In addition, ops_address.h (which was very
small) is moved into inode.h and inode.h is cleaned up by
adding extern where required.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:01:55 +01:00
James Morris
2c9e703c61 Merge branch 'master' into next
Conflicts:
	fs/exec.c

Removed IMA changes (the IMA checks are now performed via may_open()).

Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 18:40:59 +10:00
Mimi Zohar
c9d9ac525a integrity: move ima_counts_get
Based on discussion on lkml (Andrew Morton and Eric Paris),
move ima_counts_get down a layer into shmem/hugetlb__file_setup().
Resolves drm shmem_file_setup() usage case as well.

HD comment:
  I still think you're doing this at the wrong level, but recognize
  that you probably won't be persuaded until a few more users of
  alloc_file() emerge, all wanting your ima_counts_get().

  Resolving GEM's shmem_file_setup() is an improvement, so I'll say

Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 09:45:33 +10:00
Mimi Zohar
b9fc745db8 integrity: path_check update
- Add support in ima_path_check() for integrity checking without
incrementing the counts. (Required for nfsd.)
- rename and export opencount_get to ima_counts_get
- replace ima_shm_check calls with ima_counts_get
- export ima_path_check

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 09:43:41 +10:00
Steve French
703a3b8e5c [CIFS] fix posix open regression
Posix open code was not properly adding the file to the
list of open files.  Fix  allocating cifsFileInfo
more than once, and adding twice to flist and tlist.
Also fix mode setting to be done in one place in these
paths.

Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luca Tettamanti <kronos.it@gmail.com>
2009-05-21 22:38:08 +00:00
Steven Whitehouse
1ce97e564b GFS2: Be more aggressive in reclaiming unlinked inodes
This patch increases the frequency with which gfs2 looks
for unlinked, but still allocated inodes. Its the equivalent
operation to ext3's orphan list, but done with bitmaps in
the resource groups.

This also fixes a bug where a field in the rgrp was too small.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-21 15:18:19 +01:00
Steven Whitehouse
60a0b8f936 GFS2: Add a rgrp bitmap full flag
During block allocation, it is useful to know if sections of disk
are full on a finer grained basis than a single resource group.
This can make a performance difference when resource groups have
larger numbers of bitmap blocks, since we no longer have to search
them all block by block in each individual bitmap.

The full flag is set on a per-bitmap basis when it has been
searched and found to have no free space. It is then skipped in
subsequent searches until the flag is reset. The resetting
occurs if we have to drop the glock on the resource group for any
reason, or if we deallocate some blocks within that resource
group and thus free up some space.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-21 12:23:12 +01:00
Linus Torvalds
929a8651f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix pointer initialization and checks in cifs_follow_symlink (try #4)
2009-05-20 08:36:53 -07:00
Steven Whitehouse
0901097834 GFS2: Improve resource group error handling
This patch improves the error handling in the case where we
discover that the summary information in the resource group
doesn't match the bitmap information while in the process of
allocating blocks. Originally this resulted in a kernel bug,
but this patch changes that so that we return -EIO and print
some messages explaining what went wrong, and how to fix it.

We also remember locally not to try and allocate from the
same rgrp again, so that a subsequent allocation in a
different rgrp should succeed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-20 10:48:47 +01:00
Jeff Layton
8b6427a2a8 cifs: fix pointer initialization and checks in cifs_follow_symlink (try #4)
This is the third respin of the patch posted yesterday to fix the error
handling in cifs_follow_symlink. It also includes a fix for a bogus NULL
pointer check in CIFSSMBQueryUnixSymLink that Jeff Moyer spotted.

It's possible for CIFSSMBQueryUnixSymLink to return without setting
target_path to a valid pointer. If that happens then the current value
to which we're initializing this pointer could cause an oops when it's
kfree'd.

This patch is a little more comprehensive than the last patches. It
reorganizes cifs_follow_link a bit for (hopefully) better readability.
It should also eliminate the uneeded allocation of full_path on servers
without unix extensions (assuming they can get to this point anyway, of
which I'm not convinced).

On a side note, I'm not sure I agree with the logic of enabling this
query even when unix extensions are disabled on the client. It seems
like that should disable this as well. But, changing that is outside the
scope of this fix, so I've left it alone for now.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@inraded.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-19 15:31:20 +00:00
Steven Whitehouse
ef9e8b14a5 GFS2: Don't warn when delete inode fails on ro filesystem
If the filesystem is read-only, then we expect that delete inode
will fail, so there is no need to warn about it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-19 14:25:16 +01:00
Miklos Szeredi
b2858d7d16 splice: fix kmaps in default_file_splice_write()
Unfortunately multiple kmap() within a single thread are deadlockable,
so writing out multiple buffers with writev() isn't possible.

Change the implementation so that it does a separate write() for each
buffer.  This actually simplifies the code a lot since the
splice_from_pipe() helper can be used.

This limitation is caused by HIGHMEM pages, and so only affects a
subset of architectures and configurations.  In the future it may be
worth to implement default_file_splice_write() in a more efficient way
on configs that allow it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-19 11:37:46 +02:00
Tejun Heo
4fc981ef9e bio: always copy back data for copied kernel requests
When a read bio_copy_kern() request fails, the content of the bounce
buffer is not copied back.  However, as request failure doesn't
necessarily mean complete failure, the buffer state can be useful.
This behavior is also inconsistent with the user map counterpart and
causes the subtle difference between bounced and unbounced IO causes
confusion.

This patch makes bio_copy_kern_endio() ignore @err and always copy
back data on request completion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-19 11:36:08 +02:00
Steven Whitehouse
fe64d517df GFS2: Umount recovery race fix
This patch fixes a race condition where we can receive recovery
requests part way through processing a umount. This was causing
problems since the recovery thread had already gone away.

Looking in more detail at the recovery code, it was really trying
to implement a slight variation on a work queue, and that happens to
align nicely with the recently introduced slow-work subsystem. As a
result I've updated the code to use slow-work, rather than its own home
grown variety of work queue.

When using the wait_on_bit() function, I noticed that the wait function
that was supplied as an argument was appearing in the WCHAN field, so
I've updated the function names in order to produce more meaningful
output.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-19 10:01:18 +01:00
Hunter Adrian
8b3884a841 UBIFS: return error if link and unlink race
Consider a scenario when 'vfs_link(dirA/fileA)' and
'vfs_unlink(dirA/fileA, dirB/fileB)' race. 'vfs_link()' does not
lock 'dirA->i_mutex', so this is possible. Both of the functions
lock 'fileA->i_mutex' though. Suppose 'vfs_unlink()' wins, and takes
'fileA->i_mutex' mutex first. Suppose 'fileA->i_nlink' is 1. In this
case 'ubifs_unlink()' will drop the last reference, and put 'inodeA'
to the list of orphans. After this, 'vfs_link()' will link
'dirB/fileB' to 'inodeA'. Thir is a problem because, for example,
the subsequent 'vfs_unlink(dirB/fileB)' will add the same inode
to the list of orphans.

This problem was reported by J. R. Okajima <hooanon05@yahoo.co.jp>

[Artem: add more comments, amended commit message]

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-19 11:01:31 +03:00
Frank Filz
7ee2cb7f32 nfs: Fix NFS v4 client handling of MAY_EXEC in nfs_permission.
The problem is that permission checking is skipped if atomic open is
possible, but when exec opens a file, it just opens it O_READONLY which
means EXEC permission will not be checked at that time.

This problem is observed by the following sequence (executed as root):

  mount -t nfs4 server:/ /mnt4
  echo "ls" >/mnt4/foo
  chmod 744 /mnt4/foo
  su guest -c "mnt4/foo"

Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Tested-by: Eugene Teo <eugeneteo@kernel.sg>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-18 20:11:12 -07:00
Ingo Molnar
dc3f81b129 Merge commit 'v2.6.30-rc6' into perfcounters/core
Merge reason: this branch was on an -rc4 base, merge it up to -rc6
              to get the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-18 07:37:49 +02:00
Manish Katiyar
0f7ee7c172 ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:51 -04:00
Manish Katiyar
de5ce03730 ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:47 -04:00
Manish Katiyar
f68301656b ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:44 -04:00
Theodore Ts'o
0568c51893 ext4: down i_data_sem only for read when walking tree for fiemap
Not sure why I put this in as down_write originally; all we are
doing is walking the tree, nothing will change under us and
concurrent reads should be no problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:31:23 -04:00
Theodore Ts'o
6fd058f779 ext4: Add a comprehensive block validity check to ext4_get_blocks()
To catch filesystem bugs or corruption which could lead to the
filesystem getting severly damaged, this patch adds a facility for
tracking all of the filesystem metadata blocks by contiguous regions
in a red-black tree.  This allows quick searching of the tree to
locate extents which might overlap with filesystem metadata blocks.

This facility is also used by the multi-block allocator to assure that
it is not allocating blocks out of the system zone, as well as by the
routines used when reading indirect blocks and extents information
from disk to make sure their contents are valid.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 15:38:01 -04:00
Jeff Mahoney
b83674c0da reiserfs: fixup perms when xattrs are disabled
This adds CONFIG_REISERFS_FS_XATTR protection from reiserfs_permission.

This is needed to avoid warnings during file deletions and chowns with
xattrs disabled.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
Jeff Mahoney
ceb5edc457 reiserfs: deal with NULL xattr root w/ xattrs disabled
This avoids an Oops in open_xa_root that can occur when deleting a file
with xattrs disabled.  It assumes that the xattr root will be there, and
that is not guaranteed.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
Jeff Mahoney
12abb35a03 reiserfs: clean up ifdefs
With xattr cleanup even with xattrs disabled, much of the initial setup
is still performed.  Some #ifdefs are just not needed since the options
they protect wouldn't be available anyway.

This cleans those up.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
David Teigland
748285ccf7 dlm: use more NOFS allocation
Change some GFP_KERNEL allocations to use either GFP_NOFS or
ls_allocation (when available) which the fs sets to GFP_NOFS.
The point is to prevent allocations from going back into the
cluster fs in places where that might lead to deadlock.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-15 11:24:59 -05:00
Linus Torvalds
5d41343ac8 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix race in ext4_inode_info.i_cached_extent
  ext4: Clear the unwritten buffer_head flag after the extent is initialized
  ext4: Use a fake block number for delayed new buffer_head
  ext4: Fix sub-block zeroing for writes into preallocated extents
2009-05-15 08:07:25 -07:00
Sukadev Bhattiprolu
1f71ebedb3 devpts: correctly set default options
devpts_get_sb() calls memset(0) to clear mount options and calls
parse_mount_options() if user specified any mount options.

The memset(0) is bogus since the 'mode' and 'ptmxmode' options are
non-zero by default.  parse_mount_options() restores options to default
anyway and can properly deal with NULL mount options.

So in devpts_get_sb() remove memset(0) and call parse_mount_options() even
for NULL mount options.

Bug reported by Eric Paris: http://lkml.org/lkml/2009/5/7/448.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Reported-by: Eric Paris <eparis@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Reviewed-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-15 08:03:23 -07:00
Christine Caulfield
391fbdc5d5 dlm: connect to nodes earlier
Make network connections to other nodes earlier, in the context of
dlm_recoverd.  This avoids connecting to nodes from dlm_send where we
try to avoid allocations which could possibly deadlock if memory reclaim
goes into the cluster fs which may try to do a dlm operation.

Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-15 09:34:12 -05:00
Thomas Gleixner
2d02494f5a sched, timers: cleanup avenrun users
avenrun is an rough estimate so we don't have to worry about
consistency of the three avenrun values. Remove the xtime lock
dependency and provide a function to scale the values. Cleanup the
users.

[ Impact: cleanup ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2009-05-15 15:32:45 +02:00
Theodore Ts'o
2ec0ae3ace ext4: Fix race in ext4_inode_info.i_cached_extent
If two CPU's simultaneously call ext4_ext_get_blocks() at the same
time, there is nothing protecting the i_cached_extent structure from
being used and updated at the same time.  This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.

This bug has been in the ext4 code since almost the very beginning of
ext4's development.  Fortunately once the data is stored in the page
cache cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult.  Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
2009-05-15 09:07:28 -04:00
Linus Torvalds
bd67ce0f66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix error handling in parse_DFS_referrals
2009-05-14 19:20:04 -07:00
Linus Torvalds
5732c46849 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
  Btrfs: make show_options result match actual option names
  Btrfs: remove outdated comment in btrfs_ioctl_resize()
  Btrfs: remove some WARN_ONs in the IO failure path
  Btrfs: Don't loop forever on metadata IO failures
  Btrfs: init inode ordered_data_close flag properly
2009-05-14 19:18:44 -07:00
Aneesh Kumar K.V
2a8964d63d ext4: Clear the unwritten buffer_head flag after the extent is initialized
The BH_Unwritten flag indicates that the buffer is allocated on disk
but has not been written; that is, the disk was part of a persistent
preallocation area.  That flag should only be set when a get_blocks()
function is looking up a inode's logical to physical block mapping.

When ext4_get_blocks_wrap() is called with create=1, the uninitialized
extent is converted into an initialized one, so the BH_Unwritten flag
is no longer appropriate.  Hence, we need to make sure the
BH_Unwritten is not left set, since the combination of BH_Mapped and
BH_Unwritten is not allowed; among other things, it will result ext4's
get_block() to be called over and over again during the write_begin
phase of write(2).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 17:05:39 -04:00
Sankar P
9f55684c2d Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
Signed-off-by: Sankar P <sankar.curiosity@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Sage Weil
6b65c5c61b Btrfs: make show_options result match actual option names
The notreelog and flushoncommit mount options were being printed slightly
differently.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Li Hong
5d847a8ed9 Btrfs: remove outdated comment in btrfs_ioctl_resize()
In Li Zefan's commit dae7b665cf,
a combination call of kmalloc() and copy_from_user() is replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason
cc7b0c9b70 Btrfs: remove some WARN_ONs in the IO failure path
These debugging WARN_ONs make too much console noise during regular
IO failures.  An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason
76a05b35a3 Btrfs: Don't loop forever on metadata IO failures
When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block.  If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.

The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads.  The end result was looping forever on buffers that were never
going to become up to date.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:32 -04:00
Chris Mason
2757495c90 Btrfs: init inode ordered_data_close flag properly
This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits.  It was not being properly set to zero when the inode was
being setup.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:31 -04:00
Theodore Ts'o
2ac3b6e00a ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
The ext4_get_blocks() function was depending on the value of
bh_result->b_state as an input parameter to decide whether or not
update the delalloc accounting statistics by calling
ext4_da_update_reserve_space().  We now use a separate flag,
EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE, to requests this update, so that
all callers of ext4_get_blocks() can clear map_bh.b_state before
calling ext4_get_blocks() without worrying about any consistency
issues.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 13:57:08 -04:00
Jeff Layton
d8e2f53ac9 cifs: fix error handling in parse_DFS_referrals
cifs_strndup_from_ucs returns NULL on error, not an ERR_PTR

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-14 13:55:32 +00:00
Theodore Ts'o
2fa3cdfb31 ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
The static function ext4_da_get_block_write() was only used by
mpage_da_map_blocks().  So to simplify the code, merge that function
into mpage_da_map_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 09:29:45 -04:00
Andrew Morton
77f6bf57ba splice: fix error return code
fs/splice.c: In function 'default_file_splice_read':
fs/splice.c:566: warning: 'error' may be used uninitialized in this function

which is sort-of true.  The code will in fact return -ENOMEM instead of the
kernel_readv() return value.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-14 09:49:44 +02:00
Linus Torvalds
210af919c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: cody tidying, remove commented out line in Makefile
  Squashfs: check page size is not larger than the filesystem block size
  Squashfs: fix breakage when page size > metadata block size
2009-05-13 16:33:25 -07:00
Linus Torvalds
a6aeeebf51 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: destroy bdi on error
2009-05-13 16:32:57 -07:00
Linus Torvalds
014049a1c7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: check size of array structured data exchanged via ioctls
  nilfs2: fix lock order reversal in nilfs_clean_segments ioctl
  nilfs2: fix possible circular locking for get information ioctls
  nilfs2: ensure to clear dirty state when deleting metadata file block
  nilfs2: fix circular locking dependency of writer mutex
  nilfs2: fix possible recovery failure due to block creation without writer
2009-05-13 16:32:16 -07:00
Randy Dunlap
dd4dc82d4c lockd: fix FILE_LOCKING=n build error
lockd/svclock.c is missing a header file <linux/fs.h>.

<linux/fs.h> is missing a definition of locks_release_private()
for the config case of FILE_LOCKING=n, causing a build error:

fs/lockd/svclock.c:330: error: implicit declaration of function 'locks_release_private'

lockd without FILE_LOCKING doesn't make sense, so make LOCKD and LOCKD_V4
depend on FILE_LOCKING, and make NFS depend on FILE_LOCKING.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-13 15:59:10 -04:00
Mel Gorman
f2deae9d4e Remove implementation of readpage from the hugetlbfs_aops
The core VM assumes the page size used by the address_space in
inode->i_mapping is PAGE_SIZE but hugetlbfs breaks this assumption by
inserting pages into the page cache at offsets the core VM considers
unexpected.

This would not be a problem except that hugetlbfs also provide a
->readpage implementation.  As it exists, the core VM can assume the
base page size is being used, allocate pages on behalf of the
filesystem, insert them into the page cache and call ->readpage to
populate them.  These pages are the wrong size and at the wrong offset
for hugetlbfs causing confusion.

This patch deletes the ->readpage implementation for hugetlbfs on the
grounds the core VM should not be allocating and populating pages on
behalf of hugetlbfs.  There should be no existing users of the
->readpage implementation so it should not cause a regression.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-13 08:04:45 -07:00
Steven Whitehouse
9582d41135 GFS2: Remove a couple of unused sysfs entries
These two tunables are pointless and would never need to be
changed anyway. There is also a race between them and umount
as the deamons which they refer to might have gone away. The
easiest way to fix the race is to remove the interface.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 15:06:25 +01:00
Steven Whitehouse
48c2b61361 GFS2: Add commit= mount option
It has always been possible to adjust the gfs2 log commit
interval, but only from the sysfs interface. This adds a
mount option, commit=<nn>, which will be familar to ext3
users.

The sysfs interface continues to be available as well, although
this might be removed in the future.

Also this patch cleans up some duplicated structures in the GFS2
sysfs code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 14:49:48 +01:00
Steven Whitehouse
a1c0643ff9 GFS2: Move journal live test at transaction start
There seems little point grabbing the transaction glock
only to have to release it again if the journal isn't
live. This moves the test earlier to avoid grabbing the lock
when we don't need it in the first place.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 10:56:52 +01:00
Jens Axboe
4f23122858 splice: fix repeated kmap()'s in default_file_splice_read()
We cannot reliably map more than one page at the time, or we risk
deadlocking. Just allocate the pages from low mem instead.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-13 08:35:35 +02:00
Phillip Lougher
e5d287539d Squashfs: cody tidying, remove commented out line in Makefile
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 03:25:20 +01:00
Phillip Lougher
fffb47b80e Squashfs: check page size is not larger than the filesystem block size
Normally the block size (by default 128K) will be larger than the
page size, unless a non-standard block size has been specified in
Mksquashfs, and the page size is larger than 4K.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 02:59:26 +01:00
Doug Chapman
a37b06d589 Squashfs: fix breakage when page size > metadata block size
Squashfs is broken on any system where the page size is larger than
the metadata size (8192).  This is easily fixed by ensuring cache->pages
is always > 0.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Doug Chapman <doug.chapman@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 02:56:39 +01:00
Linus Torvalds
2ea3f86848 Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
  nfsd: silence lockdep warning
  lockd: fix list corruption on lockd restart
  nfsd4: check for negative dentry before use in nfsv4 readdir
  nfsd41: slots are freed with session
  svcrdma: clean up error paths.
  svcrdma: Fix dma map direction for rdma read targets
2009-05-12 17:11:56 -07:00
Davide Libenzi
bfe3891a5f epoll: fix size check in epoll_create()
Fix a size check WRT the manual pages.  This was inadvertently broken by
commit 9fe5ad9c8c ("flag parameters
add-on: remove epoll_create size param").

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <Hiroyuki.Mach@gmail.com>
Cc: rohit verma <rohit.170309@gmail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-12 14:11:35 -07:00
Aneesh Kumar K.V
33b9817e2a ext4: Use a fake block number for delayed new buffer_head
Use a very large unsigned number (~0xffff) as as the fake block number
for the delayed new buffer. The VFS should never try to write out this
number, but if it does, this will make it obvious.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 14:40:37 -04:00
Aneesh Kumar K.V
9c1ee184a3 ext4: Fix sub-block zeroing for writes into preallocated extents
We need to mark the buffer_head mapping preallocated space as new
during write_begin. Otherwise we don't zero out the page cache content
properly for a partial write. This will cause file corruption with
preallocation.

Now that we mark the buffer_head new we also need to have a valid
buffer_head blocknr so that unmap_underlying_metadata() unmaps the
correct block.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13 18:36:58 -04:00
Theodore Ts'o
a2dc52b5d1 ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
Enforce that noalloc_get_block_write() is only called to map one block
at a time, and that it always is successful in finding a mapping for
given an inode's logical block block number if it is called with
create == 1.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 13:51:29 -04:00
Theodore Ts'o
b920c75502 ext4: Add documentation to the ext4_*get_block* functions
This adds more documentation to various internal functions in
fs/ext4/inode.c, most notably ext4_ind_get_blocks(),
ext4_da_get_block_write(), ext4_da_get_block_prep(),
ext4_normal_get_block_write().

In addition, the static function ext4_normal_get_block_write() has
been renamed noalloc_get_block_write(), since it is used in many
places far beyond ext4_normal_writepage().

Plenty of warnings have been added to the noalloc_get_block_write()
function, since the way it is used is amazingly fragile.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:54:29 -04:00
Theodore Ts'o
c217705733 ext4: Define a new set of flags for ext4_get_blocks()
The functions ext4_get_blocks(), ext4_ext_get_blocks(), and
ext4_ind_get_blocks() used an ad-hoc set of integer variables used as
boolean flags passed in as arguments.  Use a single flags parameter
and a setandard set of bitfield flags instead.  This saves space on
the call stack, and it also makes the code a bit more understandable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:58:52 -04:00
Theodore Ts'o
12b7ac1768 ext4: Rename ext4_get_blocks_wrap() to be ext4_get_blocks()
Another function rename for clarity's sake.  The _wrap prefix simply
confuses people, and didn't add much people trying to follow the code
paths.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:57:44 -04:00
Abhijith Das
7537d81aa7 GFS2: Fix timestamps on write
This patch copies the timestamps from the vfs inode into gfs2 and syncs
it to the disk inode during writes.

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-12 16:14:05 +01:00
Theodore Ts'o
e4d996ca80 ext4: Rename ext4_get_blocks_handle() to be ext4_ind_get_blocks()
The static function ext4_get_blocks_handle() is badly named.  Of
*course* it takes a handle.  Since its counterpart for extent-based
file is ext4_ext_get_blocks(), rename it to be ext4_ind_get_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 00:25:28 -04:00
Theodore Ts'o
f888e652d7 ext4: Simplify function signature for ext4_da_get_block_write()
The function ext4_da_get_block_write() is called in exactly one write,
and the last argument, create, is always 1.  Remove it to simplify the
code slightly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 00:21:29 -04:00
Vincent Minet
bc8e67409c ext4: Fix spinlock assertions on UP systems
On UP systems without DEBUG_SPINLOCK, ext4_is_group_locked always fails
which triggers a BUG_ON() call.
This patch fixes it by using assert_spin_locked instead.

Signed-off-by: Vincent Minet <vincent@vincent-minet.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-15 08:33:18 -04:00
J. Bruce Fields
8daed1e549 nfsd: silence lockdep warning
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-11 17:23:14 -04:00
Jeff Mahoney
2b79bc4f7e dup2: Fix return value with oldfd == newfd and invalid fd
The return value of dup2 when oldfd == newfd and the fd isn't valid is
not getting properly sign extended.  We end up with 4294967287 instead
of -EBADF.

I've reproduced this on SLE11 (2.6.27.21), openSUSE Factory
(2.6.29-rc5), and Ubuntu 9.04 (2.6.28).

This patch uses a signed int for the error value so it is properly
extended.

Commit 6c5d0512a0 introduced this
regression.

Reported-by: Jiri Dluhos <jdluhos@novell.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-11 12:18:06 -07:00
Ryusuke Konishi
83aca8f480 nilfs2: check size of array structured data exchanged via ioctls
Although some ioctls of nilfs2 exchange data in the form of indirectly
referenced array, some of them lack size check on the array elements.

This inserts the missing checks and rejects requests if data of ioctl
does not have a valid format.

We usually don't have to check size of structures that we associated
with ioctl commands because the size is tested implicitly for
identifying ioctl command; the checks this patch adds are for the
cases where the implicit check is not applied.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-12 01:48:54 +09:00
Miklos Szeredi
0b0a47f5c4 splice: implement default splice_write method
If f_op->splice_write() is not implemented, fall back to a plain write.
Use vfs_writev() to write from the pipe buffers.

This will allow splice on all filesystems and file types.  This
includes "direct_io" files in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:10 +02:00
Miklos Szeredi
6818173bd6 splice: implement default splice_read method
If f_op->splice_read() is not implemented, fall back to a plain read.
Use vfs_readv() to read into previously allocated pages.

This will allow splice and functions using splice, such as the loop
device, to work on all filesystems.  This includes "direct_io" files
in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:10 +02:00
Miklos Szeredi
7c77f0b3f9 splice: implement pipe to pipe splicing
Allow splice(2) to work when both the input and the output is a pipe.

Based on the impementation of the tee(2) syscall, but instead of
duplicating the buffer references move the buffers from the input pipe
to the output pipe.

Moving the whole buffer only succeeds if the full length of the buffer
is spliced.  Otherwise duplicate the buffer, just like tee(2), set the
length of the output buffer and advance the offset on the input
buffer.

Since splice is operating on two pipes, special care needs to be taken
with locking to prevent AN ABBA deadlock.  Again this is done
similarly to the tee(2) syscall, first preparing the input and output
pipes so there's data to consume and space for that data, and then
doing the move operation while holding both locks.

If other processes are doing I/O on the same pipes parallel to the
splice, then by the time both inodes are locked there might be no
buffers left to move, or no space to move them to.  In this case retry
the whole operation, including the preparation phase.  This could lead
to starvation, but I'm not sure if that's serious enough to worry
about.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:09 +02:00
Steven Whitehouse
48bf2b1711 GFS2: Something nonlinear this way comes!
For some reason GFS2 has been missing support for non-linear
mappings. This patch fixes that, and also avoids taking any
locks for mmap in the O_NOATIME case. In fact we don't actually need
to take the lock here at all - just doing file_accessed() would be
enough, but we have to take the lock eventually and this helps
it hit disk (and thus be seen by other nodes) faster.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:44 +01:00
Steven Whitehouse
4a0f9a321a GFS2: Optimise writepage for metadata
This adds a GFS2 specific writepage for metadata, rather than
continuing to use the VFS function. As a result we now tag all
our metadata I/O with the correct flag so that blktraces will
now be less confusing.

Also, the generic function was checking for a number of corner
cases which cannot happen on the metadata address spaces so that
this should be faster too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:43 +01:00
Steven Whitehouse
c969f58ca4 GFS2: Update the rw flags
After Jens recent updates:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a1f242524c3c1f5d40f1c9c343427e34d1aadd6e
et al. this is a patch to bring gfs2 uptodate with the core
code. Also I've managed to squash another call to ll_rw_block()
along the way.

There is still one part of the GFS2 I/O paths which are not correctly
annotated and that is due to the sharing of the writeback code between
the data and metadata address spaces. I would like to change that too,
but this patch is still worth doing on its own, I think.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:41 +01:00
Tejun Heo
c3a4d78c58 block: add rq->resid_len
rq->data_len served two purposes - the length of data buffer on issue
and the residual count on completion.  This duality creates some
headaches.

First of all, block layer and low level drivers can't really determine
what rq->data_len contains while a request is executing.  It could be
the total request length or it coulde be anything else one of the
lower layers is using to keep track of residual count.  This
complicates things because blk_rq_bytes() and thus
[__]blk_end_request_all() relies on rq->data_len for PC commands.
Drivers which want to report residual count should first cache the
total request length, update rq->data_len and then complete the
request with the cached data length.

Secondly, it makes requests default to reporting full residual count,
ie. reporting that no data transfer occurred.  The residual count is
an exception not the norm; however, the driver should clear
rq->data_len to zero to signify the normal cases while leaving it
alone means no data transfer occurred at all.  This reverse default
behavior complicates code unnecessarily and renders block PC on some
drivers (ide-tape/floppy) unuseable.

This patch adds rq->resid_len which is used only for residual count.

While at it, remove now unnecessasry blk_rq_bytes() caching in
ide_pc_intr() as rq->data_len is not changed anymore.

Boaz	: spotted missing conversion in osd
Sergei	: spotted too early conversion to blk_rq_bytes() in ide-tape

[ Impact: cleanup residual count handling, report 0 resid by default ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Darrick J. Wong <djwong@us.ibm.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:53 +02:00
Ryusuke Konishi
4f6b828837 nilfs2: fix lock order reversal in nilfs_clean_segments ioctl
This is a companion patch to ("nilfs2: fix possible circular locking
for get information ioctls").

This corrects lock order reversal between mm->mmap_sem and
nilfs->ns_segctor_sem in nilfs_clean_segments() which was detected by
lockdep check:

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc3-nilfs-00003-g360bdc1 #7
 -------------------------------------------------------
 mmap/5294 is trying to acquire lock:
  (&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]

 but task is already holding lock:
  (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&mm->mmap_sem){++++++}:
        [<c01470a5>] __lock_acquire+0x1066/0x13b0
        [<c01474a9>] lock_acquire+0xba/0xdd
        [<c01836bc>] might_fault+0x68/0x88
        [<c023c61d>] copy_from_user+0x2a/0x111
        [<d0d120d0>] nilfs_ioctl_prepare_clean_segments+0x1d/0xf1 [nilfs2]
        [<d0d0e2aa>] nilfs_clean_segments+0x6d/0x1b9 [nilfs2]
        [<d0d11f68>] nilfs_ioctl+0x2ad/0x318 [nilfs2]
        [<c01a3be7>] vfs_ioctl+0x22/0x69
        [<c01a408e>] do_vfs_ioctl+0x460/0x499
        [<c01a4107>] sys_ioctl+0x40/0x5a
        [<c01031a4>] sysenter_do_call+0x12/0x38
        [<ffffffff>] 0xffffffff

 -> #0 (&nilfs->ns_segctor_sem){++++.+}:
        [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
        [<c01474a9>] lock_acquire+0xba/0xdd
        [<c0433f1d>] down_read+0x2a/0x3e
        [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
        [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
        [<c0183b0b>] __do_fault+0x165/0x376
        [<c01855cd>] handle_mm_fault+0x287/0x5d1
        [<c043712d>] do_page_fault+0x2fb/0x30a
        [<c0435462>] error_code+0x72/0x78
        [<ffffffff>] 0xffffffff

where nilfs_clean_segments() holds:

  nilfs->ns_segctor_sem -> copy_from_user()
                             --> page fault -> mm->mmap_sem

And, page fault path may hold:

  page fault -> mm->mmap_sem
         --> nilfs_page_mkwrite() -> nilfs->ns_segctor_sem

Even though nilfs_clean_segments() does not perform write access on
given user pages, it may cause deadlock because nilfs->ns_segctor_sem
is shared per device and mm->mmap_sem can be shared with other tasks.

To avoid this problem, this patch moves all calls of copy_from_user()
outside the nilfs->ns_segctor_sem lock in the ioctl.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-11 14:54:41 +09:00
Ryusuke Konishi
47eb6b9c8f nilfs2: fix possible circular locking for get information ioctls
This is one of two patches which are to correct possible circular
locking between mm->mmap_sem and nilfs->ns_segctor_sem.

The problem was detected by lockdep check as follows:

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc3-nilfs-00002-g3552613 #6
 -------------------------------------------------------
 mmap/5418 is trying to acquire lock:
 (&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]

 but task is already holding lock:
 (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&mm->mmap_sem){++++++}:
 [<c01470a5>] __lock_acquire+0x1066/0x13b0
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<c01836bc>] might_fault+0x68/0x88
 [<c023c730>] copy_to_user+0x2c/0xfc
 [<d0d11b4f>] nilfs_ioctl_wrap_copy+0x103/0x160 [nilfs2]
 [<d0d11fa9>] nilfs_ioctl+0x30a/0x3b0 [nilfs2]
 [<c01a3be7>] vfs_ioctl+0x22/0x69
 [<c01a408e>] do_vfs_ioctl+0x460/0x499
 [<c01a4107>] sys_ioctl+0x40/0x5a
 [<c01031a4>] sysenter_do_call+0x12/0x38
 [<ffffffff>] 0xffffffff

 -> #0 (&nilfs->ns_segctor_sem){++++.+}:
 [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<c0433f1d>] down_read+0x2a/0x3e
 [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
 [<c0183b0b>] __do_fault+0x165/0x376
 [<c01855cd>] handle_mm_fault+0x287/0x5d1
 [<c043712d>] do_page_fault+0x2fb/0x30a
 [<c0435462>] error_code+0x72/0x78
 [<ffffffff>] 0xffffffff

 other info that might help us debug this:

 1 lock held by mmap/5418:
 #0:  (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 stack backtrace:
 Pid: 5418, comm: mmap Not tainted 2.6.30-rc3-nilfs-00002-g3552613 #6
 Call Trace:
 [<c0432145>] ? printk+0xf/0x12
 [<c0145c48>] print_circular_bug_tail+0xaa/0xb5
 [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
 [<d0d10149>] ? nilfs_sufile_get_stat+0x1e/0x105 [nilfs2]
 [<c013b59a>] ? up_read+0x16/0x2c
 [<d0d10225>] ? nilfs_sufile_get_stat+0xfa/0x105 [nilfs2]
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<c0433f1d>] down_read+0x2a/0x3e
 [<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
 [<c0183b0b>] __do_fault+0x165/0x376
 [<c01855cd>] handle_mm_fault+0x287/0x5d1
 [<c043700a>] ? do_page_fault+0x1d8/0x30a
 [<c013b54f>] ? down_read_trylock+0x39/0x43
 [<c043712d>] do_page_fault+0x2fb/0x30a
 [<c0436e32>] ? do_page_fault+0x0/0x30a
 [<c0435462>] error_code+0x72/0x78
 [<c0436e32>] ? do_page_fault+0x0/0x30a

This makes the lock granularity of nilfs->ns_segctor_sem finer than
that of the mmap semaphore for ioctl commands except
nilfs_clean_segments().

The successive patch ("nilfs2: fix lock order reversal in
nilfs_clean_segments ioctl") is required to fully resolve the problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-11 12:57:46 +09:00