Commit Graph

15761 Commits

Author SHA1 Message Date
Andy Adamson
49557cc74c nfsd41: Use separate DRC for setclientid
Instead of trying to share the generic 4.1 reply cache code for the
CREATE_SESSION reply cache, it's simpler to handle CREATE_SESSION
separately.

The nfs41 single slot clientid DRC holds the results of create session
processing.  CREATE_SESSION can be preceeded by a SEQUENCE operation
(an embedded CREATE_SESSION) and the create session single slot cache must be
maintained.  nfsd4_replay_cache_entry() and nfsd4_store_cache_entry() do not
implement the replay of an embedded CREATE_SESSION.

The clientid DRC slot does not need the inuse, cachethis or other fields that
the multiple slot session cache uses.  Replace the clientid DRC cache struct
nfs4_slot cache with a new nfsd4_clid_slot cache.  Save the xdr struct
nfsd4_create_session into the cache at the end of processing, and on a replay,
replace the struct for the replay request with the cached version all while
under the state lock.

nfsd4_proc_compound will handle both the solo and embedded CREATE_SESSION case
via the normal use of encode_operation.

Errors that do not change the create session cache:
A create session NFS4ERR_STALE_CLIENTID error means that a client record
(and associated create session slot) could not be found and therefore can't
be changed.  NFSERR_SEQ_MISORDERED errors do not change the slot cache.

All other errors get cached.

Remove the clientid DRC specific check in nfs4svc_encode_compoundres to
put the session only if cstate.session is set which will now always be true.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:30:29 -04:00
Andy Adamson
88e588d56a nfsd41: change check_slot_seqid parameters
For separation of session slot and clientid slot processing.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:30:23 -04:00
Andy Adamson
5261dcf8eb nfsd41: remove redundant forechannel max requests check
This check is done in set_forechannel_maxreqs.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:30:15 -04:00
Andy Adamson
0c193054a4 nfsd41: hange from page to memory based drc limits
NFSD_SLOT_CACHE_SIZE is the size of all encoded operation responses
(excluding the sequence operation) that we want to cache.

For now, keep NFSD_SLOT_CACHE_SIZE at PAGE_SIZE. It will be reduced
when the DRC is changed from page based to memory based.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:30:05 -04:00
Andy Adamson
6a14dd1a4f nfsd41: reserve less memory for DRC
Also remove a slightly misleading comment.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:29:59 -04:00
Andy Adamson
b101ebbc39 nfsd41: minor set_forechannel_maxreqs cleanup
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:29:54 -04:00
Andy Adamson
be98d1bbd1 nfsd41: reclaim DRC memory on session free
This fixes a leak which would eventually lock out new clients.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:29:48 -04:00
J. Bruce Fields
413d63d710 nfsd: minor write_pool_threads exit cleanup
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:29:41 -04:00
Eric Sesterhenn
2522a776c1 Fix memory leak in write_pool_threads
kmemleak produces the following warning

unreferenced object 0xc9ec02a0 (size 8):
  comm "cat", pid 19048, jiffies 730243
  backtrace:
    [<c01bf970>] create_object+0x100/0x240
    [<c01bfadb>] kmemleak_alloc+0x2b/0x60
    [<c01bcd4b>] __kmalloc+0x14b/0x270
    [<c02fd027>] write_pool_threads+0x87/0x1d0
    [<c02fcc08>] nfsctl_transaction_write+0x58/0x70
    [<c02fcc6f>] nfsctl_transaction_read+0x4f/0x60
    [<c01c2574>] vfs_read+0x94/0x150
    [<c01c297d>] sys_read+0x3d/0x70
    [<c0102d6b>] sysenter_do_call+0x12/0x32
    [<ffffffff>] 0xffffffff

write_pool_threads() only frees nthreads on error paths, in the success case
we leak it.

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-28 14:29:34 -04:00
Yan Zheng
f25784b35f Btrfs: Fix async caching interaction with unmount
- don't stop the caching thread until btrfs_commit_super return.

- if caching is interrupted by umount, set last to (u64)-1.
  otherwise the un-scanned range of block group will be considered
  as free extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 08:41:57 -04:00
Peng Tao
785b4b3a5a ext4: fix build warning when EXT4FS_DEBUG is on
When compiling with EXT4FS_DEBUG on, gcc will complain with following warnings:

linux-2.6/fs/ext4/ialloc.c: In function ‘ext4_count_free_inodes’:
linux-2.6/fs/ext4/ialloc.c:1192: warning: format ‘%lu’ expects type
‘long unsigned int’, but argument 2 has type ‘ext4_group_t’

So add a type cast to suppress it. 

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-27 21:44:40 -04:00
Jeff Layton
7b91e2661a cifs: fix error handling in mount-time DFS referral chasing code
If the referral is malformed or the hostname can't be resolved, then
the current code generates an oops. Fix it to handle these errors
gracefully.

Reported-by: Sandro Mathys <sm@sandro-mathys.ch>
Acked-by: Igor Mammedov <niallain@gmail.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-28 00:51:59 +00:00
Linus Torvalds
fc013a5885 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  inotify: use GFP_NOFS under potential memory pressure
  fsnotify: fix inotify tail drop check with path entries
  inotify: check filename before dropping repeat events
  fsnotify: use def_bool in kconfig instead of letting the user choose
  inotify: fix error paths in inotify_update_watch
  inotify: do not leak inode marks in inotify_add_watch
  inotify: drop user watch count when a watch is removed
2009-07-27 15:54:10 -07:00
Linus Torvalds
a9355cf8e6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
  jfs: Fix early release of acl in jfs_get_acl
2009-07-27 12:15:56 -07:00
Linus Torvalds
2bc20d09b0 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  jbd: fix race between write_metadata_buffer and get_write_access
  ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle()
  jbd: Fix a race between checkpointing code and journal_get_write_access()
  ext3: Fix truncation of symlinks after failed write
  jbd: Fail to load a journal if it is too short
2009-07-27 12:12:10 -07:00
Linus Torvalds
c7425eb481 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] fix sparse warning
  cifs: fix sb->s_maxbytes so that it casts properly to a signed value
  cifs: disable serverino if server doesn't support it
2009-07-27 12:11:43 -07:00
Josef Bacik
68b38550dd Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents.  This patch makes
things much less complicated by only unpinning the extent if the block group is
cached.  We check the block_group->cached var under the block_group->lock spin
lock.  If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back.  If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent.  This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.

One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory.  So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-27 13:57:01 -04:00
Julia Lawall
631c07c8d1 Btrfs: Correct redundant test in add_inode_ref
dir has already been tested.  It seems that this test should be on the
recently returned value inode.

A simplified version of the semantic match that finds this problem is as
follows: (http://www.emn.fr/x-info/coccinelle/)

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-27 13:57:00 -04:00
Chris Mason
9779b72f05 Btrfs: find smallest available device extent during chunk allocation
Allocating new block group is easy when the disk has plenty of space.
But things get difficult as the disk fills up, especially if
the FS has been run through btrfs-vol -b.  The balance operation
is likely to make the total bytes available on the device greater
than the largest extent we'll actually be able to allocate.

But the device extent allocation code incorrectly assumes that a device
with 5G free will be able to allocate a 5G extent.  It isn't normally a
problem because device extents don't get freed unless btrfs-vol -b
is run.

This fixes the device extent allocator to remember the largest free
extent it can find, and then uses that value as a fallback.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 16:41:41 -04:00
Chris Mason
283bb1979f Btrfs: clear all space_info->full after removing a block group
Btrfs allocates individual extents from block groups, and each
block group has a specific type.  It may hold metadata, data
mirrored or striped etc.

When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
we free block groups.  Once a block group is freed, the space it was
using on the device may be available for use by new block groups.

btrfs_remove_block_group was clearing the flag that said
'our devices are full, don't even try to allocate new block groups',
but it was only clearing that flag for a specific type of block group.

This commit clears the full flag for all of the types of block groups,
making it much more likely that we'll be able to balance space when
the drive is close to full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 16:30:55 -04:00
Sage Weil
ebecd3d9d2 Btrfs: make flushoncommit mount option correctly wait on ordered_extents
The commit_transaction call to wait_ordered_extents when snap_pending
passes nocow_only=1 to process only NOCOW or PREALLOC extents.  This isn't
correct for the 'flushoncommit' mode, as it skips extents we just started
IO on in start_delalloc_inodes.

So, in the flushoncommit case, wait on all ordered extents.  Otherwise,
only pass the nocow_only flag to wait_ordered_extents if snap_pending.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 13:17:44 -04:00
Yan Zheng
d717aa1d31 Btrfs: Avoid delayed reference update looping
btrfs_split_leaf and btrfs_del_items can end up in a loop
where one is constantly spliting a given leaf and the other
is constantly merging it back with the adjacent nodes.

There is a better fix for this, but in the interest of something
small, this patch just changes btrfs_del_items back to balancing less
often.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 12:42:46 -04:00
Yan Zheng
0a4eefbb74 Btrfs: Fix ordering of key field checks in btrfs_previous_item
Check objectid of item before checking the item type, otherwise we may return
zero for a key that is actually too low.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:47 -04:00
Yan Zheng
1fcbac581b Btrfs: find_free_dev_extent doesn't handle holes at the start of the device
find_free_dev_extent does not properly handle the case where
the device is not complete free, and there is a free extent
at the beginning of the device.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:47 -04:00
Diego Calleja
20736abaa3 Btrfs: Remove code duplication in comp_keys
comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
call it.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:46 -04:00
Josef Bacik
817d52f8db Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner.  Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation.  If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching.  This is how I tested
the speedup from this

mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo

Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.

Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group.  This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.

I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached.  This drastically reduces the amount of time it takes to
cache the block groups.  Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.

This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 09:23:39 -04:00
Josef Bacik
9630308170 Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space.  As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area.  Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.

This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache.  Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram.  The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.

Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space.  This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap.  This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.

I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree.  I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree.  This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.

I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff.  I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues.  I've not seen
any performance regressions in any of my tests.

Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems.  Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 09:23:30 -04:00
Artem Bityutskiy
887ee17117 UBIFS: remove unneeded call from ubifs_sync_fs
Nowadays VFS always synchronizes all dirty inodes and pages before
calling '->sync_fs()', so remove unneeded 'generic_sync_sb_inodes()'
from 'ubifs_sync_fs()'. It used to be needed, but not any longer.

Pointed-out-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-24 15:14:55 +03:00
Artem Bityutskiy
e9d6bbc428 UBIFS: kill BKL
The BKL was pushed down from VFS to the file-systems. It used
to serialize mount/unmount/remount and prevented more than one
instance of the same file-system from doing
mount/umount/remount at the same time. But it is OK for UBIFS
and it does not need any additional locking for these cases.
Thus, kick the BKL out of UBIFS.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-24 15:14:54 +03:00
Subrata Modak
b5148da40c UBIFS: remove unused functions
Remove 'xent_key_init_hash()' and 'data_key_init_flash()' functions,
as they are unot used anywhere.

Signed-off-by: Subrata Modak <subrata@linux.vnet.ibm.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-24 15:14:54 +03:00
Subrata Modak
83ef2ecdbb UBIFS: suppress compilation warning
Fix "using uninitialized variable" compilation warning by using
the "unititialized_var()" helper.

Signed-off-by: Subrata Modak<subrata@linux.vnet.ibm.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-24 15:14:54 +03:00
Jan Kara
0584974a77 ocfs2: Define credit counts for quota operations
Numbers of needed credits for some quota operations were written
as raw numbers. Create appropriate defines instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:31 -07:00
Jan Kara
4539f1df25 ocfs2: Remove syncjiff field from quota info
syncjiff is just a converted value of syncms. Some places which
are updating syncms forgot to update syncjiff as well. Since the
conversion is just a simple division / multiplication and it does
not happen frequently, just remove the syncjiff field to avoid
forgotten conversions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:27 -07:00
Jan Kara
1c1d9793ff ocfs2: Fix initialization of blockcheck stats
We just set blockcheck stats to zeros but we should also
properly initialize the spinlock there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:24 -07:00
Jan Kara
7669f54c55 ocfs2: Zero out padding of on disk dquot structure
Padding fields of on-disk dquot structure were not zeroed. Zero them
so that it's easier to use them later.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:17 -07:00
Jan Kara
0e7f387bf3 ocfs2: Initialize blocks allocated to local quota file
When we extend local quota file, we should initialize data
in newly allocated block. Firstly because on recovery we could
parse bogus data, secondly so that block checksums are properly
computed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:11 -07:00
Jan Kara
4b3fa1904c ocfs2: Mark buffer uptodate before calling ocfs2_journal_access_dq()
In a code path extending local quota files we marked new header
buffer uptodate only after calling ocfs2_journal_access_dq() which
triggers a bug. Fix it and also call ocfs2 variant of the function
marking buffer uptodate.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:05 -07:00
Jan Kara
b57ac2c43e ocfs2: Make global quota files blocksize aligned
Change i_size of global quota files so that it always remains aligned to block
size. This is mainly because the end of quota block may contain checksum (if
checksumming is enabled) and it's a bit awkward for it to be "outside" of quota
file (and it makes life harder for ocfs2-tools).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:58:59 -07:00
Tao Ma
82e12644cf ocfs2: Use ocfs2_rec_clusters in ocfs2_adjust_adjacent_records.
In ocfs2_adjust_adjacent_records, we will adjust adjacent records
according to the extent_list in the lower level. But actually
the lower level tree will either be a leaf or a branch. If we only
use ocfs2_is_empty_extent we will meet with some problem if the lower
tree is a branch (tree_depth > 1). So use !ocfs2_rec_clusters instead.
And actually only the leaf record can have holes. So add a BUG_ON
for non-leaf branch.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:58:46 -07:00
Stefan Bader
4a19fb11a9 jfs: Fix early release of acl in jfs_get_acl
BugLink: http://bugs.launchpad.net/ubuntu/+bug/396780

Commit 073aaa1b14 "helpers for acl
caching + switch to those" introduced new helper functions for
acl handling but seems to have introduced a regression for jfs as
the acl is released before returning it to the caller, instead of
leaving this for the caller to do.
This causes the acl object to be used after freeing it, leading
to kernel panics in completely different places.

Thanks to Christophe Dumez for reporting and bisecting into this.

Reported-by: Christophe Dumez <dchris@gmail.com>
Tested-by: Christophe Dumez <dchris@gmail.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
2009-07-23 11:08:36 -05:00
Linus Torvalds
81cbf6d055 Merge branch 'lockdep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep
* 'lockdep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep:
  lockdep: Fix lockdep annotation for pipe_double_lock()
2009-07-22 16:44:18 -07:00
Steve French
f1230c9797 [CIFS] fix sparse warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-22 23:13:01 +00:00
Jeff Layton
03aa3a49ad cifs: fix sb->s_maxbytes so that it casts properly to a signed value
This off-by-one bug causes sendfile() to not work properly. When a task
calls sendfile() on a file on a CIFS filesystem, the syscall returns -1
and sets errno to EOVERFLOW.

do_sendfile uses s_maxbytes to verify the returned offset of the file.
The problem there is that this value is cast to a signed value (loff_t).
When this is done on the s_maxbytes value that cifs uses, it becomes
negative and the comparisons against it fail.

Even though s_maxbytes is an unsigned value, it seems that it's not OK
to set it in such a way that it'll end up negative when it's cast to a
signed value. These casts happen in other codepaths besides sendfile
too, but the VFS is a little hard to follow in this area and I can't
be sure if there are other bugs that this will fix.

It's not clear to me why s_maxbytes isn't just declared as loff_t in the
first place, but either way we still need to fix these values to make
sendfile work properly. This is also an opportunity to replace the magic
bit-shift values here with the standard #defines for this.

This fixes the reproducer program I have that does a sendfile and
will probably also fix the situation where apache is serving from a
CIFS share.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-22 21:08:00 +00:00
Jeff Layton
ce6e7fcd43 cifs: disable serverino if server doesn't support it
A recent regression when dealing with older servers. This bug was
introduced when we made serverino the default...

When the server can't provide inode numbers, disable it for the mount.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-22 21:07:51 +00:00
David Woodhouse
83121942b2 Btrfs: Fix crash on read failures at mount
If the tree roots hit read errors during mount, btrfs is not properly
erroring out.  We need to check the uptodate bits after
reading in the tree root node.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:52:13 -04:00
Daniel Cadete
c271b49241 Btrfs: remove of redundant btrfs_header_level
This removes the continues call's of btrfs_header_level. One call of
btrfs_header_level(c) its enough.

Signed-off-by Daniel Cadete <danielncadete10@gmail.com>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:52:13 -04:00
Julia Lawall
33c17ad571 Btrfs: adjust NULL test
Move the call to BUG_ON to before the dereference of the tested value.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:49:01 -04:00
David Woodhouse
3acada49c2 Btrfs: Remove broken sanity check from btrfs_rmap_block()
It was never actually doing anything anyway (see the loop condition),
and it would be difficult to make it work for RAID[56].

Even if it was actually working, it's checking for the wrong thing
anyway. Instead of checking whether we list a block which _doesn't_ land
at the relevant physical location, it should be checking that we _have_
listed all the logical blocks which refer to the required physical
location on all devices.

This function is only called from remove_sb_from_cache() to ensure that
we reserve the logical blocks which would reside at the same physical
location as the superblock copies. So listing more blocks than we need
is actually OK.

With RAID[56] we're going to throw away an entire stripe for each block
we have to ignore, so we _are_ going to list blocks other than the
ones which actually contain the superblock.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:49:01 -04:00
Julia Lawall
29c5e8ce01 Btrfs: convert nested spin_lock_irqsave to spin_lock
If spin_lock_irqsave is called twice in a row with the same second
argument, the interrupt state at the point of the second call overwrites
the value saved by the first call.  Indeed, the second call does not need
to save the interrupt state, so it is changed to a simple spin_lock.

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 16:49:00 -04:00
Peter Zijlstra
023d43c7b5 lockdep: Fix lockdep annotation for pipe_double_lock()
The presumed use of the pipe_double_lock() routine is to lock 2 locks in
a deadlock free way by ordering the locks by their address. However it
fails to keep the specified lock classes in order and explicitly
annotates a deadlock.

Rectify this.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
LKML-Reference: <1248163763.15751.11098.camel@twins>
2009-07-22 21:14:14 +02:00
Linus Torvalds
1f9758d4e7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  fs/Kconfig: move nilfs2 out
2009-07-22 10:05:00 -07:00
Yan Zheng
4a8c9a62d7 Btrfs: make sure all dirty blocks are written at commit time
Write dirty block groups may allocate new block, and so may add new delayed
back ref. btrfs_run_delayed_refs may make some block groups dirty.

commit_cowonly_roots does not handle the recursion properly, and some dirty
blocks can be left unwritten at commit time. This patch moves
btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
the code not break out of the loop until there are no dirty block groups or
delayed back refs.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 10:07:05 -04:00
Yan Zheng
33c66f430b Btrfs: fix locking issue in btrfs_find_next_key
When walking up the tree, btrfs_find_next_key assumes the upper level tree
block is properly locked. This isn't always true even path->keep_locks is 1.
This is because btrfs_find_next_key may advance path->slots[] several times
instead of only once.

When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
we can't guarantee the original value of 'path->slots[level]' is
'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
'level + 1' isn't locked.

This patch fixes the issue by explicitly checking the locking state,
re-searching the tree if it's not locked.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Yan Zheng
e457afec60 Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf
if 1 is returned by btrfs_search_slot, the path already points to the
first item with 'key > searching key'. So increasing path->slots[0] by
one is superfluous in that case.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Yan Zheng
bf1fb512a5 Btrfs: properly update space information after shrinking device.
Change 'goto done' to 'break' for the case of all device extents have
been freed, so that the code updates space information will be execute.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Yan Zheng
1bec1aed1e Btrfs: fix definition of struct btrfs_extent_inline_ref
use __le64 instead of u64 in on-disk structure definition.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-22 09:59:00 -04:00
Trond Myklebust
d953126a28 NFSv4: Fix a problem whereby a buggy server can oops the kernel
We just had a case in which a buggy server occasionally returns the wrong
attributes during an OPEN call. While the client does catch this sort of
condition in nfs4_open_done(), and causes the nfs4_atomic_open() to return
-EISDIR, the logic in nfs_atomic_lookup() is broken, since it causes a
fallback to an ordinary lookup instead of just returning the error.

When the buggy server then returns a regular file for the fallback lookup,
the VFS allows the open, and bad things start to happen, since the open
file doesn't have any associated NFSv4 state.

The fix is firstly to return the EISDIR/ENOTDIR errors immediately, and
secondly to ensure that we are always careful when dereferencing the
nfs_open_context state pointer.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-07-21 19:22:38 -04:00
Jan Kara
f7b1aa69be ocfs2: Fix deadlock on umount
In commit ea455f8ab6, we moved the dentry lock
put process into ocfs2_wq. This causes problems during umount because ocfs2_wq
can drop references to inodes while they are being invalidated by
invalidate_inodes() causing all sorts of nasty things (invalidate_inodes()
ending in an infinite loop, "Busy inodes after umount" messages etc.).

We fix the problem by stopping ocfs2_wq from doing any further releasing of
inode references on the superblock being unmounted, wait until it finishes
the current round of releasing and finally cleaning up all the references in
dentry_lock_list from ocfs2_put_super().

The issue was tracked down by Tao Ma <tao.ma@oracle.com>.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-21 15:47:55 -07:00
Tao Ma
3c5e10683e ocfs2: Add extra credits and access the modified bh in update_edge_lengths.
In normal tree rotation left process, we will never touch the tree
branch above subtree_index and ocfs2_extend_rotate_transaction doesn't
reserve the credits for them either.

But when we want to delete the rightmost extent block, we have to update
the rightmost records for all the rightmost branch(See
ocfs2_update_edge_lengths), so we have to allocate extra credits for them.
What's more, we have to access them also.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-21 14:41:54 -07:00
Trond Myklebust
fccba80455 NFSv4: Fix an NFSv4 mount regression
Commit 008f55d0e0 (nfs41: recover lease in
_nfs4_lookup_root) forces the state manager to always run on mount. This is
a bug in the case of NFSv4.0, which doesn't require us to send a
setclientid until we want to grab file state.

In any case, this is completely the wrong place to be doing state
management. Moving that code into nfs4_init_session...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-07-21 16:48:07 -04:00
Trond Myklebust
b64aec8d1e NFSv4: Fix an Oops in nfs4_free_lock_state
The oops http://www.kerneloops.org/raw.php?rawid=537858&msgid= appears to
be due to the nfs4_lock_state->ls_state field being uninitialised. This
happens if the call to nfs4_free_lock_state() is triggered at the end of
nfs4_get_lock_state().

The fix is to move the initialisation of ls_state into the allocator.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-07-21 16:47:46 -04:00
Eric Paris
f44aebcc56 inotify: use GFP_NOFS under potential memory pressure
inotify can have a watchs removed under filesystem reclaim.

=================================
[ INFO: inconsistent lock state ]
2.6.31-rc2 #16
---------------------------------
inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
khubd/217 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (iprune_mutex){+.+.?.}, at: [<c10ba899>] invalidate_inodes+0x20/0xe3
{IN-RECLAIM_FS-W} state was registered at:
  [<c10536ab>] __lock_acquire+0x2c9/0xac4
  [<c1053f45>] lock_acquire+0x9f/0xc2
  [<c1308872>] __mutex_lock_common+0x2d/0x323
  [<c1308c00>] mutex_lock_nested+0x2e/0x36
  [<c10ba6ff>] shrink_icache_memory+0x38/0x1b2
  [<c108bfb6>] shrink_slab+0xe2/0x13c
  [<c108c3e1>] kswapd+0x3d1/0x55d
  [<c10449b5>] kthread+0x66/0x6b
  [<c1003fdf>] kernel_thread_helper+0x7/0x10
  [<ffffffff>] 0xffffffff

Two things are needed to fix this.  First we need a method to tell
fsnotify_create_event() to use GFP_NOFS and second we need to stop using
one global IN_IGNORED event and allocate them one at a time.  This solves
current issues with multiple IN_IGNORED on a queue having tail drop
problems and simplifies the allocations since we don't have to worry about
two tasks opperating on the IGNORED event concurrently.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21 15:26:27 -04:00
Eric Paris
c05594b621 fsnotify: fix inotify tail drop check with path entries
fsnotify drops new events when they are the same as the tail event on the
queue to be sent to userspace.  The problem is that if the event comes with
a path we forget to break out of the switch statement and fall into the
code path which matches on events that do not have any type of file backed
information (things like IN_UNMOUNT and IN_Q_OVERFLOW).  The problem is
that this code thinks all such events should be dropped.  Fix is to add a
break.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21 15:26:26 -04:00
Eric Paris
4a148ba988 inotify: check filename before dropping repeat events
inotify drops events if the last event on the queue is the same as the
current event.  But it does 2 things wrong.  First it is comparing old->inode
with new->inode.  But after an event if put on the queue the ->inode is no
longer allowed to be used.  It's possible between the last event and this new
event the inode could be reused and we would falsely match the inode's memory
address between two differing events.

The second problem is that when a file is removed fsnotify is passed the
negative dentry for the removed object rather than the postive dentry from
immediately before the removal.  This mean the (broken) inotify tail drop code
was matching the NULL ->inode of differing events.

The fix is to check the file name which is stored with events when doing the
tail drop instead of wrongly checking the address of the stored ->inode.

Reported-by: Scott James Remnant <scott@ubuntu.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21 15:26:26 -04:00
Eric Paris
520dc2a526 fsnotify: use def_bool in kconfig instead of letting the user choose
fsnotify doens't give the user anything.  If someone chooses inotify or
dnotify it should build fsnotify, if they don't select one it shouldn't be
built.  This patch changes fsnotify to be a def_bool=n and makes everything
else select it.  Also fixes the issue people complained about on lwn where
gdm hung because they didn't have inotify and they didn't get the inotify
build option.....

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21 15:26:26 -04:00
Eric Paris
7e790dd5fc inotify: fix error paths in inotify_update_watch
inotify_update_watch could leave things in a horrid state on a number of
error paths.  We could try to remove idr entries that didn't exist, we
could send an IN_IGNORED to userspace for watches that don't exist, and a
bit of other stupidity.  Clean these up by doing the idr addition before we
put the mark on the inode since we can clean that up on error and getting
off the inode's mark list is hard.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21 15:26:26 -04:00
Eric Paris
75fe2b2639 inotify: do not leak inode marks in inotify_add_watch
inotify_add_watch had a couple of problems.  The biggest being that if
inotify_add_watch was called on the same inode twice (to update or change the
event mask) a refence was taken on the original inode mark by
fsnotify_find_mark_entry but was not being dropped at the end of the
inotify_add_watch call.  Thus if inotify_rm_watch was called although the mark
was removed from the inode, the refcnt wouldn't hit zero and we would leak
memory.

Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21 15:26:26 -04:00
Eric Paris
5549f7cdf8 inotify: drop user watch count when a watch is removed
The inotify rewrite forgot to drop the inotify watch use cound when a watch
was removed.  This means that a single inotify fd can only ever register a
maximum of /proc/sys/fs/max_user_watches even if some of those had been
freed.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-21 15:26:26 -04:00
dingdinghua
f1015c4477 jbd: fix race between write_metadata_buffer and get_write_access
The function journal_write_metadata_buffer() calls jbd_unlock_bh_state(bh_in)
too early; this could potentially allow another thread to call get_write_access
on the buffer head, modify the data, and dirty it, and allowing the wrong data
to be written into the journal.  Fortunately, if we lose this race, the only
time this will actually cause filesystem corruption is if there is a system
crash or other unclean shutdown of the system before the next commit can take
place.

Signed-off-by: dingdinghua <dingdinghua85@gmail.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-21 11:54:42 +02:00
Wengang Wang
1f4cea3790 ocfs2: Fail ocfs2_get_block() immediately when a block needs allocation
ocfs2_get_block() does no allocation.  Hole filling for writes should
have happened farther up in the call chain.  We detect this case and
print an error, but we then continue with the function.  We should be
exiting immediately.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-20 17:36:07 -07:00
Wengang Wang
cbfa9639ae ocfs2: Fix error return in ocfs2_write_cluster()
A typo caused ocfs2_write_cluster() to return 0 in some error cases.
Fix it.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-20 17:32:55 -07:00
Linus Torvalds
457f82bac6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: Fix incorrect parameters to v9fs_file_readn.
  9p: Possible regression in p9_client_stat
  9p: default 9p transport module fix
2009-07-20 16:48:31 -07:00
Subrata Modak
44d8e4e1e9 ocfs2: Fix compilation warning for fs/ocfs2/xattr.c
gcc 4.4.1 generates the following build warning on i386:

CC [M]  fs/ocfs2/xattr.o
fs/ocfs2/xattr.c: In function ‘ocfs2_xattr_block_get’:
fs/ocfs2/xattr.c:1055: warning: ‘block_off’ may be used uninitialized in this function

The following fix is based on a similar approach by David Howells
few days back: http://lkml.org/lkml/2009/7/9/109,

Signed-off-by: Subrata Modak<subrata@linux.vnet.ibm.com>,
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-20 15:52:57 -07:00
Goldwyn Rodrigues
cefcb800fa ocfs2: Initialize count in aio_write before generic_write_checks
generic_write_checks() expects count to be initialized to the size of
the write.  Writes to files open with O_DIRECT|O_LARGEFILE write 0 bytes
because count is uninitialized.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-20 15:47:58 -07:00
Jeff Layton
90a98b2f3f cifs: free nativeFileSystem field before allocating a new one
...otherwise, we'll leak this memory if we have to reconnect (e.g. after
network failure).

Signed-off-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-20 18:24:37 +00:00
Frederic Weisbecker
def01bc53d sched: Convert the only user of cond_resched_bkl to use cond_resched()
fs/locks.c:flock_lock_file() is the only user of
cond_resched_bkl()

This helper doesn't do anything more than cond_resched(). The
latter naming is enough to explain that we are rescheduling if
needed.

The bkl suffix suggests another semantics but it's actually a
synonym of cond_resched().

Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-7-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-07-18 15:51:45 +02:00
Frederic Weisbecker
613afbf832 sched: Pull up the might_sleep() check into cond_resched()
might_sleep() is called late-ish in cond_resched(), after the
need_resched()/preempt enabled/system running tests are
checked.

It's better to check the sleeps while atomic earlier and not
depend on some environment datas that reduce the chances to
detect a problem.

Also define cond_resched_*() helpers as macros, so that the
FILE/LINE reported in the sleeping while atomic warning
displays the real origin and not sched.h

Changes in v2:

 - Call __might_sleep() directly instead of might_sleep() which
   may call cond_resched()

 - Turn cond_resched() into a macro so that the file:line
   couple reported refers to the caller of cond_resched() and
   not __cond_resched() itself.

Changes in v3:

 - Also propagate this __might_sleep() pull up to
   cond_resched_lock() and cond_resched_softirq()

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-6-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-07-18 15:51:44 +02:00
Sten Spans
713c0ecdb8 security: fix security_file_lock cmd argument
Pass posix-translated lock operations to security_file_lock
when invoked via sys_flock.

Signed-off-by: Sten Spans <Sten_Spans@genua.de>
Signed-off-by: James Morris <jmorris@namei.org>
2009-07-17 07:41:23 +10:00
Steve French
f6c4338543 Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-07-16 04:21:39 +00:00
Jan Kara
43237b5490 ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle()
Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
be a relict from some old days and setting disksize in this function does not
make much sence. Currently it was set only by ext3_getblk().  Since the
parameter has some effect only if create == 1, it is easy to check that the
three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-15 21:30:46 +02:00
Jan Kara
1e9fd53b78 jbd: Fix a race between checkpointing code and journal_get_write_access()
The following race can happen:

  CPU1                          CPU2
                                checkpointing code checks the buffer, adds
                                  it to an array for writeback
do_get_write_access()
  ...
  lock_buffer()
  unlock_buffer()
                                  flush_batch() submits the buffer for IO
  __jbd_journal_file_buffer()

  So a buffer under writeout is returned from do_get_write_access(). Since
the filesystem code relies on the fact that journaled buffers cannot be
written out, it does not take the buffer lock and so it can modify buffer
while it is under writeout. That can lead to a filesystem corruption
if we crash at the right moment. The similar problem can happen with
the journal_get_create_access() path.
  We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on BJ_None list. Actually, we clear the dirty bit
regardless the list the buffer is in and warn about the fact if
the buffer is already journalled.

Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.

Reported-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-15 21:30:07 +02:00
Jan Kara
9eaaa2d575 ext3: Fix truncation of symlinks after failed write
Contents of long symlinks is written via standard write methods. So when the
write fails, we add inode to orphan list. But symlinks don't have .truncate
method defined so nobody properly removes them from the orphan list (both on
disk and in memory).

Fix this by calling ext3_truncate() directly instead of calling vmtruncate()
(which is saner anyway since we don't need anything vmtruncate() does except
from calling .truncate in these paths).  We also add inode to orphan list only
if ext3_can_truncate() is true (currently, it can be false for symlinks when
there are no blocks allocated) - otherwise orphan list processing will complain
and ext3_truncate() will not remove inode from on-disk orphan list.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-15 21:28:07 +02:00
Jan Kara
7447a668a3 jbd: Fail to load a journal if it is too short
Due to on disk corruption, it can happen that journal is too short. Fail
to load it in such case so that we don't oops somewhere later.

Reported-by: Nageswara R Sastry <rnsastry@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-15 21:26:23 +02:00
Linus Torvalds
8aa651e23e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: free socket in error exit path
  dlm: fix plock use-after-free
  dlm: Fix uninitialised variable warning in lock.c
2009-07-14 18:37:24 -07:00
Linus Torvalds
c0c50b541a Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  tracing/function-profiler: do not free per cpu variable stat
  tracing/events: Move TRACE_SYSTEM outside of include guard
2009-07-14 18:34:32 -07:00
Andy Adamson
4bd9b0f4af nfsd41: use globals for DRC limits
The version 4.1 DRC memory limit and tracking variables are server wide and
session specific. Replace struct svc_serv fields with globals.
Stop using the svc_serv sv_lock.

Add a spinlock to serialize access to the DRC limit management variables which
change on session creation and deletion (usage counter) or (future)
administrative action to adjust the total DRC memory limit.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-07-14 17:52:40 -04:00
Abhishek Kulkarni
9c9ad6162e 9p: Fix incorrect parameters to v9fs_file_readn.
Fix v9fs_vfs_readpage. The offset and size parameters to v9fs_file_readn
were interchanged and hence passed incorrectly.

Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2009-07-14 15:54:42 -05:00
Casey Dahlin
a89d63a159 dlm: free socket in error exit path
In the tcp_connect_to_sock() error exit path, the socket
allocated at the top of the function was not being freed.

Signed-off-by: Casey Dahlin <cdahlin@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-07-14 12:28:43 -05:00
Yu Zhiguo
9208faf297 NFSv4: ACL in operations 'open' and 'create' should be used
ACL in operations 'open' and 'create' is decoded but never be used.
It should be set as the initial ACL for the object according to RFC3530.
If error occurs when setting the ACL, just clear the ACL bit in the
returned attr bitmap.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-14 12:16:47 -04:00
Ryusuke Konishi
4fed598a49 fs/Kconfig: move nilfs2 out
fs/Kconfig file was split into individual fs/*/Kconfig files before
nilfs was merged.  I've found the current config entry of nilfs is
tainting the work.  Sorry, I didn't notice.  This fixes the violation.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
2009-07-14 12:34:17 +09:00
Linus Torvalds
1cf29683f4 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix race between write_metadata_buffer and get_write_access
  ext4: Fix ext4_mb_initialize_context() to initialize all fields
  ext4: fix null handler of ioctls in no journal mode
  ext4: Fix buffer head reference leak in no-journal mode
  ext4: Move __ext4_journalled_writepage() to avoid forward declaration
  ext4: Fix mmap/truncate race when blocksize < pagesize && !nodellaoc
  ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation
  ext4: Don't look at buffer_heads outside i_size.
  ext4: Fix goal inum check in the inode allocator
  ext4: fix no journal corruption with locale-gen
  ext4: Calculate required journal credits for inserting an extent properly
  ext4: Fix truncation of symlinks after failed write
  jbd2: Fix a race between checkpointing code and journal_get_write_access()
  ext4: Use rcu_barrier() on module unload.
  ext4: naturally align struct ext4_allocation_request
  ext4: mark several more functions in mballoc.c as noinline
  ext4: Fix potential reclaim deadlock when truncating partial block
  jbd2: Remove GFP_ATOMIC kmalloc from inside spinlock critical region
  ext4: Fix type warning on 64-bit platforms in tracing events header
2009-07-13 16:39:25 -07:00
dingdinghua
96577c4382 jbd2: fix race between write_metadata_buffer and get_write_access
The function jbd2_journal_write_metadata_buffer() calls
jbd_unlock_bh_state(bh_in) too early; this could potentially allow
another thread to call get_write_access on the buffer head, modify the
data, and dirty it, and allowing the wrong data to be written into the
journal.  Fortunately, if we lose this race, the only time this will
actually cause filesystem corruption is if there is a system crash or
other unclean shutdown of the system before the next commit can take
place.

Signed-off-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 17:55:35 -04:00
Linus Torvalds
a4dc32374e Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  wm97xx_batery: replace driver_data with dev_get_drvdata()
  omap: video: remove direct access of driver_data
  Sound: remove direct access of driver_data
  driver model: fix show/store prototypes in doc.
  Firmware: firmware_class, fix lock imbalance
  Driver Core: remove BUS_ID_SIZE
  sparc: remove driver-core BUS_ID_SIZE
  partitions: fix broken uevent_suppress conversion
  devres: WARN() and return, don't crash on device_del() of uninitialized device
2009-07-13 10:24:08 -07:00
James Morris
7d45ecafb6 Merge branch 'master' into next
Conflicts:
	include/linux/personality.h

Use Linus' version.

Signed-off-by: James Morris <jmorris@namei.org>
2009-07-14 00:30:40 +10:00
Theodore Ts'o
833576b362 ext4: Fix ext4_mb_initialize_context() to initialize all fields
Pavel Roskin pointed out that kmemcheck indicated that
ext4_mb_store_history() was accessing uninitialized values of
ac->ac_tail and ac->ac_buddy leading to garbage in the mballoc
history.  Fix this by initializing the entire structure to all zeros
first.

Also, two fields were getting doubly initialized by the caller of
ext4_mb_initialize_context, so remove them for efficiency's sake.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 09:45:52 -04:00
Peng Tao
ac046f1d61 ext4: fix null handler of ioctls in no journal mode
The EXT4_IOC_GROUP_ADD and EXT4_IOC_GROUP_EXTEND ioctls should not
flush the journal in no_journal mode.  Otherwise, running resize2fs on
a mounted no_journal partition triggers the following error messages:

BUG: unable to handle kernel NULL pointer dereference at 00000014
IP: [<c039d282>] _spin_lock+0x8/0x19
*pde = 00000000 
Oops: 0002 [#1] SMP

Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 09:30:17 -04:00
Curt Wohlgemuth
e6b5d30104 ext4: Fix buffer head reference leak in no-journal mode
We found a problem with buffer head reference leaks when using an ext4
partition without a journal.  In particular, calls to ext4_forget() would
not to a brelse() on the input buffer head, which will cause pages they
belong to to not be reclaimable.

Further investigation showed that all places where ext4_journal_forget() and
ext4_journal_revoke() are called are subject to the same problem.  The patch
below changes __ext4_journal_forget/__ext4_journal_revoke to do an explicit
release of the buffer head when the journal handle isn't valid.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 09:07:20 -04:00
Li Zefan
d0b6e04a4c tracing/events: Move TRACE_SYSTEM outside of include guard
If TRACE_INCLDUE_FILE is defined, <trace/events/TRACE_INCLUDE_FILE.h>
will be included and compiled, otherwise it will be
<trace/events/TRACE_SYSTEM.h>

So TRACE_SYSTEM should be defined outside of #if proctection,
just like TRACE_INCLUDE_FILE.

Imaging this scenario:

 #include <trace/events/foo.h>
    -> TRACE_SYSTEM == foo
 ...
 #include <trace/events/bar.h>
    -> TRACE_SYSTEM == bar
 ...
 #define CREATE_TRACE_POINTS
 #include <trace/events/foo.h>
    -> TRACE_SYSTEM == bar !!!

and then bar.h will be included and compiled.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A5A9CF1.2010007@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-07-13 10:59:55 +02:00
Johannes Berg
134e63756d genetlink: make netns aware
This makes generic netlink network namespace aware. No
generic netlink families except for the controller family
are made namespace aware, they need to be checked one by
one and then set the family->netnsok member to true.

A new function genlmsg_multicast_netns() is introduced to
allow sending a multicast message in a given namespace,
for example when it applies to an object that lives in
that namespace, a new function genlmsg_multicast_allns()
to send a message to all network namespaces (for objects
that do not have an associated netns).

The function genlmsg_multicast() is changed to multicast
the message in just init_net, which is currently correct
for all generic netlink families since they only work in
init_net right now. Some will later want to work in all
net namespaces because they do not care about the netns
at all -- those will have to be converted to use one of
the new functions genlmsg_multicast_allns() or
genlmsg_multicast_netns() whenever they are made netns
aware in some way.

After this patch families can easily decide whether or
not they should be available in all net namespaces. Many
genl families us it for objects not related to networking
and should therefore be available in all namespaces, but
that will have to be done on a per family basis.

Note that this doesn't touch on the checkpoint/restart
problem where network namespaces could be used, genl
families and multicast groups are numbered globally and
I see no easy way of changing that, especially since it
must be possible to multicast to all network namespaces
for those families that do not care about netns.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-07-12 14:03:27 -07:00
Heiko Carstens
f8c73c790c partitions: fix broken uevent_suppress conversion
git commit f67f129e "Driver core: implement uevent suppress in kobject"
contains this chunk for fs/partitions/check.c:

 	/* suppress uevent if the disk supresses it */
-	if (!ddev->uevent_suppress)
+	if (!dev_get_uevent_suppress(pdev))
 		kobject_uevent(&pdev->kobj, KOBJ_ADD);

However that should have been

-	if (!ddev->uevent_suppress)
+	if (!dev_get_uevent_suppress(ddev))

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Ming Lei <tom.leiming@gmail.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-12 13:02:09 -07:00
Artem Bityutskiy
dd0d9a46f5 AFS: Fix compilation warning
Fix the following warning:

  fs/afs/dir.c: In function 'afs_d_revalidate':
  fs/afs/dir.c:567: warning: 'fid.vnode' may be used uninitialized in this function
  fs/afs/dir.c:567: warning: 'fid.unique' may be used uninitialized in this function

by marking the 'fid' variable as an uninitialized_var.  The problem is
that gcc doesn't always manage to work out that fid is always set on the
path through the function that uses it.

Cc: linux-afs@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-12 12:24:07 -07:00
Alexey Dobriyan
405f55712d headers: smp_lock.h redux
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
  It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT

  This will make hardirq.h inclusion cheaper for every PREEMPT=n config
  (which includes allmodconfig/allyesconfig, BTW)

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-12 12:22:34 -07:00
Linus Torvalds
81e4e1ba7e Revert "fuse: Fix build error" as unnecessary
This reverts commit 097041e576.

Trond had a better fix, which is the parent of this one ("Fix compile
error due to congestion_wait() changes")

Requested-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-11 11:22:34 -07:00
Bartlomiej Zolnierkiewicz
8711c67bee isofs: fix Joliet regression
commit 5404ac8e44 ("isofs: cleanup mount
option processing") missed conversion of joliet option flag resulting
in non-working Joliet support.

CC: walt <w41ter@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-10 19:18:59 -07:00
Linus Torvalds
44c695b13b Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix corruption dump
  UBIFS: clean up free space checking
  UBIFS: small amendments in the LEB scanning code
  UBIFS: dump a little more in case of corruptions
  MAINTAINERS: update ahunter's e-mail address
  UBIFS: allow more than one volume to be mounted
  UBIFS: fix assertion warning
  UBIFS: minor spelling and grammar fixes
  UBIFS: fix 64-bit divisions in debug print
  UBIFS: few spelling fixes
  UBIFS: set write-buffer timout to 3-5 seconds
  UBIFS: slightly optimize write-buffer timer usage
  UBIFS: improve debugging messaged
  UBIFS: fix integer overflow warning
2009-07-10 19:14:48 -07:00
Linus Torvalds
04eef90c2e Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  osdblk: Adjust queue limits to lower device's limits
  osdblk: a Linux block device for OSD objects
  MAINTAINERS: Add osd maintained files (F:)
  exofs: Avoid using file_fsync()
  exofs: Remove IBM copyrights
  exofs: Fix bio leak in error handling path (sync read)
2009-07-10 19:12:24 -07:00
Larry Finger
097041e576 fuse: Fix build error
When building v2.6.31-rc2-344-g69ca06c, the following build errors are
found due to missing includes:

 CC [M]  fs/fuse/dev.o
fs/fuse/dev.c: In function ‘request_end’:
fs/fuse/dev.c:289: error: ‘BLK_RW_SYNC’ undeclared (first use in this function)
...
fs/nfs/write.c: In function ‘nfs_set_page_writeback’:
fs/nfs/write.c:207: error: ‘BLK_RW_ASYNC’ undeclared (first use in this function)

Signed-off-by: Larry Finger@lwfinger.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-10 19:09:46 -07:00
Wengang Wang
812e7a6a43 ocfs2: log the actual return value of ocfs2_file_aio_write()
in ocfs2_file_aio_write(), log_exit() could don't log the value
which is really returned. this patch fixes it.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-10 16:53:52 -07:00
Linus Torvalds
69ca06c945 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: reset oom_cfqq in cfq_set_request()
  block: fix sg SG_DXFER_TO_FROM_DEV regression
  block: call blk_scsi_ioctl_init()
  Fix congestion_wait() sync/async vs read/write confusion
2009-07-10 14:29:58 -07:00
Linus Torvalds
9f2d8be426 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix disorder in cp count on error during deleting checkpoints
  nilfs2: fix lockdep warning between regular file and inode file
  nilfs2: fix incorrect KERN_CRIT messages in case of write failures
  nilfs2: fix hang problem of log writer which occurs after write failures
  nilfs2: remove unlikely directive causing mis-conversion of error code
2009-07-10 14:27:21 -07:00
FUJITA Tomonori
ecb554a846 block: fix sg SG_DXFER_TO_FROM_DEV regression
I overlooked SG_DXFER_TO_FROM_DEV support when I converted sg to use
the block layer mapping API (2.6.28).

Douglas Gilbert explained SG_DXFER_TO_FROM_DEV:

http://www.spinics.net/lists/linux-scsi/msg37135.html

=
The semantics of SG_DXFER_TO_FROM_DEV were:
   - copy user space buffer to kernel (LLD) buffer
   - do SCSI command which is assumed to be of the DATA_IN
     (data from device) variety. This would overwrite
     some or all of the kernel buffer
   - copy kernel (LLD) buffer back to the user space.

The idea was to detect short reads by filling the original
user space buffer with some marker bytes ("0xec" it would
seem in this report). The "resid" value is a better way
of detecting short reads but that was only added this century
and requires co-operation from the LLD.
=

This patch changes the block layer mapping API to support this
semantics. This simply adds another field to struct rq_map_data and
enables __bio_copy_iov() to copy data from user space even with READ
requests.

It's better to add the flags field and kills null_mapped and the new
from_user fields in struct rq_map_data but that approach makes it
difficult to send this patch to stable trees because st and osst
drivers use struct rq_map_data (they were converted to use the block
layer in 2.6.29 and 2.6.30). Well, I should clean up the block layer
mapping API.

zhou sf reported this regiression and tested this patch:

http://www.spinics.net/lists/linux-scsi/msg37128.html
http://www.spinics.net/lists/linux-scsi/msg37168.html

Reported-by: zhou sf <sxzzsf@gmail.com>
Tested-by: zhou sf <sxzzsf@gmail.com>
Cc: stable@kernel.org
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:53 +02:00
Jens Axboe
8aa7e847d8 Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:53 +02:00
Steve French
65bc98b005 [CIFS] Distinguish posix opens and mkdirs from legacy mkdirs in stats
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-10 15:27:25 +00:00
Linus Torvalds
c2cc49a2f8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: when ATTR_READONLY is set, only clear write bits on non-directories
  cifs: remove cifsInodeInfo->inUse counter
  cifs: convert cifs_get_inode_info and non-posix readdir to use cifs_iget
  [CIFS] update cifs version number
  cifs: add and use CIFSSMBUnixSetFileInfo for setattr calls
  cifs: make a separate function for filling out FILE_UNIX_BASIC_INFO
  cifs: rename CIFSSMBUnixSetInfo to CIFSSMBUnixSetPathInfo
  cifs: add pid of initiating process to spnego upcall info
  cifs: fix regression with O_EXCL creates and optimize away lookup
  cifs: add new cifs_iget function and convert unix codepath to use it
2009-07-09 20:40:58 -07:00
Jeff Layton
d0c280d26d cifs: when ATTR_READONLY is set, only clear write bits on non-directories
cifs: when ATTR_READONLY is set, only clear write bits on non-directories

On windows servers, ATTR_READONLY apparently either has no meaning or
serves as some sort of queue to certain applications for unrelated
behavior. This MS kbase article has details:

http://support.microsoft.com/kb/326549/

Don't clear the write bits directory mode when ATTR_READONLY is set.

Reported-by: pouchat@peewiki.net
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-09 23:06:04 +00:00
Jeff Layton
aeaaf253c4 cifs: remove cifsInodeInfo->inUse counter
cifs: remove cifsInodeInfo->inUse counter

It was purported to be a refcounter of some sort, but was never
used that way. It never served any purpose that wasn't served equally well
by the I_NEW flag.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-09 23:06:00 +00:00
Jeff Layton
0b8f18e358 cifs: convert cifs_get_inode_info and non-posix readdir to use cifs_iget
cifs: convert cifs_get_inode_info and non-posix readdir to use cifs_iget

Rather than allocating an inode and filling it out, have
cifs_get_inode_info fill out a cifs_fattr and call cifs_iget. This means
a pretty hefty reorganization of cifs_get_inode_info.

For the readdir codepath, add a couple of new functions for filling out
cifs_fattr's from different FindFile response infolevels.

Finally, remove cifs_new_inode since there are no more callers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-09 23:05:48 +00:00
Steve French
b77863bfa1 [CIFS] update cifs version number
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-09 22:51:38 +00:00
Jeff Layton
3bbeeb3c93 cifs: add and use CIFSSMBUnixSetFileInfo for setattr calls
cifs: add and use CIFSSMBUnixSetFileInfo for setattr calls

When there's an open filehandle, SET_FILE_INFO is apparently preferred
over SET_PATH_INFO. Add a new variant that sets a FILE_UNIX_INFO_BASIC
infolevel via SET_FILE_INFO and switch cifs_setattr_unix to use the
new call when there's an open filehandle available.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-09 21:15:10 +00:00
Jeff Layton
654cf14ac0 cifs: make a separate function for filling out FILE_UNIX_BASIC_INFO
cifs: make a separate function for filling out FILE_UNIX_BASIC_INFO

The SET_FILE_INFO variant will need to do the same thing here. Break
this code out into a separate function that both variants can call.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-09 21:15:06 +00:00
Jeff Layton
01ea95e3b6 cifs: rename CIFSSMBUnixSetInfo to CIFSSMBUnixSetPathInfo
cifs: rename CIFSSMBUnixSetInfo to CIFSSMBUnixSetPathInfo

...in preparation of adding a SET_FILE_INFO variant.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-09 21:15:02 +00:00
Jeff Layton
c4c1bff64d cifs: add pid of initiating process to spnego upcall info
cifs: add pid of initiating process to spnego upcall info

This will allow the upcall to poke in /proc/<pid>/environ and get
the value of the $KRB5CCNAME env var for the process.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-09 21:14:58 +00:00
Artem Bityutskiy
0611254760 UBIFS: fix corruption dump
In the 'ubifs_recover_leb()' function, when we find corrupted
empty space, we dump 8K starting from the offset where the last
node ends. This is OK if the corrupted empty space is somewhere
near that offset. But if the corruption is far at the end of the
LEB, we will dump all 0xFF bytes and complitely ignore the
interesting data. This is observed on a PPC ("kilauea") with
NOR flash.

This patch changes the behavior and teaches UBIFS to print only
interesting data. I.e., now we find where corruption starts and
start dumping from that offset.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
2009-07-09 09:19:39 +03:00
Artem Bityutskiy
431102fed3 UBIFS: clean up free space checking
recovery.c has 'is_empty()' helper and it is better to use
this helper instead of re-implementing it in several places.
This patch does this and removes some amount of unneeded code.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
2009-07-09 09:19:38 +03:00
Artem Bityutskiy
ed43f2f06c UBIFS: small amendments in the LEB scanning code
This patch fixes few minor things I've spotted while going through
code:

1. Better document return codes
2. If 'ubifs_scan_a_node()' returns some thing we do not expect,
   treat this as an error.
3. Try to do recovery only when 'ubifs_scan()' returns %-EUCLEAN,
   not on any error.
4. If empty space starts at a non-aligned address, print a message.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
2009-07-09 09:19:38 +03:00
Artem Bityutskiy
086b3640c1 UBIFS: dump a little more in case of corruptions
In case of corruptions, dump 8192 bytes instead of 4096. The
largest node is 4096+ bytes, so it is better to see a node
boundary, which is not always possible when only 4096 bytes
are printed.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
2009-07-09 09:19:38 +03:00
Jeff Liu
17ae26b669 ocfs2: trivial fix for s/migrate/migration/ in dlmrecovery.c logging
in dlmrecovery.c:1121, replace 'migrate' to 'migration' to keep the consistency
by comparing to other lines with the similar log info in the same file.

Signed-off-by: Jeff Liu <jeff.liu@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-08 15:34:40 -07:00
Jeff Mahoney
8b712cd58a ocfs2: Fixup orphan scan cleanup after failed mount
If the mount fails for any reason, ocfs2_dismount_volume calls
ocfs2_orphan_scan_stop. It requires that ocfs2_orphan_scan_init
be called to setup the mutex and work queues, but that doesn't
happen if the mount has failed and we oops accessing an uninitialized
work queue.

This patch splits the init and startup of the orphan scan, eliminating
the oops.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-08 15:34:02 -07:00
Jeff Layton
5ddf1e0ff0 cifs: fix regression with O_EXCL creates and optimize away lookup
cifs: fix regression with O_EXCL creates and optimize away lookup

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Shirish Pargaonkar <shirishp@gmail.com>
CC: Stable Kernel <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-08 21:55:45 +00:00
Joe Perches
ad361c9884 Remove multiple KERN_ prefixes from printk formats
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics.  printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.

<level> is now included in the output on each additional use.

Remove all uses of multiple KERN_<level>s in formats.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-08 10:30:03 -07:00
Linus Torvalds
728b690fd5 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6:
  quota: Fix possible deadlock during parallel quotaon and quotaoff
2009-07-08 09:35:50 -07:00
Catalin Marinas
d5ce5b40bc Free the memory allocated by memdup_user() in fs/sysfs/bin.c
Commit 1c8542c7bb replaced kmalloc() with memdup_user() in the write()
function but also dropped the kfree(temp). The memdup_user() function
allocates memory which is never freed.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Parag Warudkar <parag.warudkar@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-08 09:34:07 -07:00
Alexey Dobriyan
b43f3cbd21 headers: mnt_namespace.h redux
Fix various silly problems wrt mnt_namespace.h:

 - exit_mnt_ns() isn't used, remove it
 - done that, sched.h and nsproxy.h inclusions aren't needed
 - mount.h inclusion was need for vfsmount_lock, but no longer
 - remove mnt_namespace.h inclusion from files which don't use anything
   from mnt_namespace.h

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-08 09:31:56 -07:00
Jiaying Zhang
d01730d74d quota: Fix possible deadlock during parallel quotaon and quotaoff
The following test script triggers a deadlock on ext2 filesystem:
while true; do quotaon /dev/hda >&/dev/null; usleep $RANDOM; done &
while true; do quotaoff /dev/hda >&/dev/null; usleep $RANDOM; done &

I found there is a potential deadlock between quotaon and quotaoff (or
quotasync). Basically, all of quotactl operations need to be protected by
dqonoff_mutex. vfs_quota_off and vfs_quota_sync also call sb->s_op->quota_write
that needs to grab the i_mutex of the quota file.  But in vfs_quota_on_inode
(called from quotaon operation), the current code tries to grab  the i_mutex of
the quota file first before getting quonoff_mutex.

Reverse the order in which we take locks in vfs_quota_on_inode().

Jan Kara: Changed changelog to be more readable, made lockdep happy with
  I_MUTEX_QUOTA.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-07 18:15:21 +02:00
Csaba Henk
7a6d3c8b30 fuse: make the number of max background requests and congestion threshold tunable
The practical values for these limits depend on the design of the
filesystem server so let userspace set them at initialization time.

Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-07-07 17:28:52 +02:00
Oleg Nesterov
793285fcaf cred_guard_mutex: do not return -EINTR to user-space
do_execve() and ptrace_attach() return -EINTR if
mutex_lock_interruptible(->cred_guard_mutex) fails.

This is not right, change the code to return ERESTARTNOINTR.

Perhaps we should also change proc_pid_attr_write().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-06 13:57:04 -07:00
Zhang, Yanmin
3beab0b424 sys_sync(): fix 16% performance regression in ffsb create_4k test
I run many ffsb test cases on JBODs (typically 13/12 disks).  Comparing
with kernel 2.6.30, 2.6.31-rc1 has about 16% regression with
ffsb_create_4k.  The sub test case creates files continuously for 10
minitues and every file is 1MB.

Bisect located below patch.

5cee5815d1 is first bad commit
commit 5cee5815d1
Author: Jan Kara <jack@suse.cz>
Date:   Mon Apr 27 16:43:51 2009 +0200

    vfs: Make sys_sync() use fsync_super() (version 4)

    It is unnecessarily fragile to have two places (fsync_super() and do_sync())
    doing data integrity sync of the filesystem. Alter __fsync_super() to
    accommodate needs of both callers and use it. So after this patch
    __fsync_super() is the only place where we gather all the calls needed to
    properly send all data on a filesystem to disk.

As a matter of fact, ffsb calls sys_sync in the end to make sure all data
is flushed to disks and the flushing is counted into the result.  vmstat
shows ffsb is blocked when syncing for a long time.  With 2.6.30, ffsb is
blocked for a short time.

I checked the patch and did experiments to recover the original methods.
Eventually, the root cause is the patch deletes the calling to
wakeup_pdflush when syncing, so only ffsb is blocked on disk I/O.
wakeup_pdflush could ask pdflush to write back pages with ffsb at the
same time.

[akpm@linux-foundation.org: restore comment too]
Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-06 13:57:03 -07:00
Akira Fujita
1c71850517 ext4: Fix compile warnings with MB_DEBUG
When MB_DEBUG is enabled, we get some compile warnings because
ext4_group_t is unsigned int.  This patch fixes them.

Signed-off-by Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 23:04:36 -04:00
Joe Perches
5a4a798937 ext4: Remove unnecessary semicolons in mballoc.c
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 22:33:08 -04:00
Curt Wohlgemuth
6487a9d3b5 ext4: More buffer head reference leaks
After the patch I posted last week regarding buffer head ref leaks in
no-journal mode, I looked at all the code that uses buffer heads and
searched for more potential leaks.

The patch below fixes the issues I found; these can occur even when a
journal is present.

The change to inode.c fixes a double release if
ext4_journal_get_create_access() fails.

The changes to namei.c are more complicated.  add_dirent_to_buf() will
release the input buffer head EXCEPT when it returns -ENOSPC.  There are
some callers of this routine that don't always do the brelse() in the event
that -ENOSPC is returned.  Unfortunately, to put this fix into ext4_add_entry()
required capturing the return value of make_indexed_dir() and
add_dirent_to_buf().

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-17 10:54:08 -04:00
Jan Kara
f6f50e28f0 jbd2: Fail to load a journal if it is too short
Due to on disk corruption, it can happen that journal is too short. Fail
to load it in such case so that we don't oops somewhere later.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-17 10:40:01 -04:00
Theodore Ts'o
78f1ddbb49 ext4: Avoid null pointer dereference when decoding EROFS w/o a journal
We need to check to make sure a journal is present before checking the
journal flags in ext4_decode_error().

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-27 23:09:47 -04:00
Manish Katiyar
43b3852029 ext4: Fix typo in ext4/Kconfig
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-27 21:38:17 -04:00
Aneesh Kumar K.V
024eab4d5b ext4: Fix memory leak fix when mounting an ext4 filesystem
The allocation of the ext4_group_info array was moved to a new
function ext4_mb_add_group_info() in commit 5f21b0e6 so that online
resize would use a common (and correct) codepath.  Unfortunately, the
call to the new ext4_mb_add_group_info() function was added without
removing the code which originally allocated the array.  This caused a
memory leak each time an ext4 filesystem was mounted.

The fix is simple; remove the code that did the original allocation,
since it is no longer needed.

Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-17 09:01:04 -04:00
Daniel Mack
7fcd9c3ecb UBIFS: allow more than one volume to be mounted
UBIFS uses a bdi device per volume, but does not care to hand out unique
names to each of them. This causes an error when trying to mount more
than one volumes. Append the UBI volume and device ID to avoid that.

[Amended a bit by Artem Bityutskiy]

Signed-off-by: Daniel Mack <daniel@caiaq.de>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Adrian Hunter <ext-adrian.hunter@nokia.com>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-05 18:45:19 +03:00
Artem Bityutskiy
1fb8bd01ed UBIFS: fix assertion warning
When debugging is enabled and an unclean file-system is mounter,
the following assertion is triggered:

UBIFS assert failed in ubifs_tnc_start_commit at 805 (pid 1081)
Call Trace:
[cfaffbd0] [c0006cf8] show_stack+0x44/0x16c (unreliable)
[cfaffc10] [c011b738] ubifs_tnc_start_commit+0xbb8/0xd18
[cfaffc90] [c0112670] do_commit+0x150/0xa44
[cfaffd10] [c0125234] ubifs_rcvry_gc_commit+0xd8/0x544
[cfaffd60] [c0100e9c] ubifs_fill_super+0xe78/0x15f8
[cfaffdf0] [c0102118] ubifs_get_sb+0x20c/0x320
[cfaffe70] [c007f764] vfs_kern_mount+0x58/0xe0
[cfaffe90] [c007f83c] do_kern_mount+0x40/0xf8
[cfaffeb0] [c0095c24] do_mount+0x550/0x758
[cfafff10] [c0095ebc] sys_mount+0x90/0xe0
[cfafff40] [c000ed4c] ret_from_syscall+0x0/0x3c

The reason is that we initialize 'c->min_leb_idx' early, and do
not re-calculate it after journal replay.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-05 18:45:19 +03:00
Adrian Hunter
681947d2fa UBIFS: minor spelling and grammar fixes
Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
2009-07-05 18:45:18 +03:00
Adrian Hunter
4473758944 UBIFS: fix 64-bit divisions in debug print
Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
2009-07-05 18:45:18 +03:00
Artem Bityutskiy
cb54ef8b13 UBIFS: few spelling fixes
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-05 18:45:18 +03:00
Artem Bityutskiy
2a35a3a8ab UBIFS: set write-buffer timout to 3-5 seconds
This patch cleans up write-buffer timeout initialization and
sets it to 3-5 interval.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-05 18:45:17 +03:00
Artem Bityutskiy
0b335b9d7d UBIFS: slightly optimize write-buffer timer usage
This patch adds the following minor optimization:

1. If write-buffer does not use the timer, indicate it with the
   wbuf->no_timer variable, instead of using the wbuf->softlimit
   variable. This is better because wbuf->softlimit is of ktime_t
   type, and the ktime_to_ns function contains 64-bit multiplication.

2. Do not call the 'hrtimer_cancel()' function for write-buffers
   which do not use timers.

3. Do not cancel the timer in 'ubifs_put_super()' because the
   synchronization function does this.

This patch also removes a confusing comment.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-05 18:45:16 +03:00
Artem Bityutskiy
70aee2f153 UBIFS: improve debugging messaged
1. Make the I/O debugging message print the journal head number.
2. Add prints to timer functions.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-05 18:45:16 +03:00
Adrian Hunter
e3dc5a665d UBIFS: fix integer overflow warning
Fix the following warning:

fs/ubifs/io.c: In function 'ubifs_wbuf_init':
fs/ubifs/io.c:860: warning: integer overflow in expression

And limit maximum hrtimer delta to ULONG_MAX because the
argument is 'unsigned long'.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-07-05 18:45:15 +03:00
Jiro SEKIBA
d9a0a345ab nilfs2: fix disorder in cp count on error during deleting checkpoints
This fixes a bug that checkpoint count gets wrong on errors when
deleting a series of checkpoints.

The count error is persistent since the checkpoint count is stored on
disk.  Some userland programs refer to the count via ioctl, and this
bugfix is needed to prevent malfunction of such programs.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable@kernel.org
2009-07-05 10:44:20 +09:00
Ryusuke Konishi
ff54de363a nilfs2: fix lockdep warning between regular file and inode file
This will fix the following false positive of recursive locking which
lockdep has detected:

=============================================
[ INFO: possible recursive locking detected ]
2.6.30-nilfs #42
---------------------------------------------
nilfs_cleanerd/10607 is trying to acquire lock:
 (&bmap->b_sem){++++-.}, at: [<e0d025b7>] nilfs_bmap_lookup_at_level+0x1a/0x74 [nilfs2]

but task is already holding lock:
 (&bmap->b_sem){++++-.}, at: [<e0d024e0>] nilfs_bmap_truncate+0x19/0x6a [nilfs2]
other info that might help us debug this:
2 locks held by nilfs_cleanerd/10607:
 #0:  (&nilfs->ns_segctor_sem){++++.+}, at: [<e0d0d75a>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 #1:  (&bmap->b_sem){++++-.}, at: [<e0d024e0>] nilfs_bmap_truncate+0x19/0x6a [nilfs2]

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-07-05 10:44:20 +09:00
Ryusuke Konishi
4a52df7797 nilfs2: fix incorrect KERN_CRIT messages in case of write failures
In case of write-failure retries, the following KERN_CRIT level
messages are mistakenly output by nilfs_dat_commit_start() function:

nilfs_dat_commit_start: vbn = 408463, start = 12506, end = 18446744073709551615, pbn = 530210
nilfs_dat_commit_start: vbn = 408515, start = 12506, end = 18446744073709551615, pbn = 530211
nilfs_dat_commit_start: vbn = 408464, start = 12506, end = 18446744073709551615, pbn = 530212
...

This suppresses these messages.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable@kernel.org
2009-07-05 10:44:20 +09:00
Ryusuke Konishi
8227b29722 nilfs2: fix hang problem of log writer which occurs after write failures
Leandro Lucarella gave me a report that nilfs gets stuck after its
write function fails.

The problem turned out to be caused by bugs which leave writeback flag
on pages.  This fixes the problem by ensuring to clear the writeback
flag in error path.

Reported-by: Leandro Lucarella <llucax@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable@kernel.org
2009-07-05 10:44:20 +09:00
Ryusuke Konishi
0cfae3d879 nilfs2: remove unlikely directive causing mis-conversion of error code
The following error code handling in nilfs_segctor_write() function
wrongly converted negative error codes to a truth value (i.e. 1):

   err = unlikely(err) ? : res;

which originaly meant to be

   err = err ? : res;

This mis-conversion caused that write or sync functions receive the
unexpected error code.  This fixes the bug by removing the unlikely
directive.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable@kernel.org
2009-07-05 10:44:19 +09:00
Linus Torvalds
14c1b7c212 Merge branch 'for-2.6.31' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.31' of git://linux-nfs.org/~bfields/linux:
  NFSD: Don't hold unrefcounted creds over call to nfsd_setuser()
2009-07-04 10:11:38 -07:00
David Howells
033a666ccb NFSD: Don't hold unrefcounted creds over call to nfsd_setuser()
nfsd_open() gets an unrefcounted pointer to the current process's effective
credentials at the top of the function, then calls nfsd_setuser() via
fh_verify() - which may replace and destroy the current process's effective
credentials - and then passes the unrefcounted pointer to dentry_open() - but
the credentials may have been destroyed by this point.

Instead, the value from current_cred() should be passed directly to
dentry_open() as one of its arguments, rather than being cached in a variable.

Possibly fh_verify() should return the creds to use.

This is a regression introduced by
745ca2475a "CRED: Pass credentials through
dentry_open()".

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-and-Verified-By: Steve Dickson <steved@redhat.com>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-07-03 10:21:10 -04:00
Linus Torvalds
746a99a5af Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  fs/notify/inotify: decrement user inotify count on close
2009-07-02 16:54:07 -07:00
Linus Torvalds
5291a12f05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix error message formatting
  Btrfs: fix use after free in btrfs_start_workers fail path
  Btrfs: honor nodatacow/sum mount options for new files
  Btrfs: update backrefs while dropping snapshot
  Btrfs: account for space we may use in fallocate
  Btrfs: fix the file clone ioctl for preallocated extents
  Btrfs: don't log the inode in file_write while growing the file
2009-07-02 16:52:38 -07:00
Hu Tao
68f5a38c3e Btrfs: fix error message formatting
Make an error msg look nicer by inserting a space between number and word.

Signed-off-by: Hu Tao <hu.taoo@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:55:45 -04:00
Jiri Slaby
9b627e9bf4 Btrfs: fix use after free in btrfs_start_workers fail path
worker memory is already freed on one fail path in btrfs_start_workers,
but is still dereferenced. Switch the dereference and kfree.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:50:58 -04:00
Chris Mason
9427216476 Btrfs: honor nodatacow/sum mount options for new files
The btrfs attr patches unconditionally inherited the inode flags field
without honoring nodatacow and nodatasum.  This fix makes sure
we properly record the nodatacow/sum mount options in new inodes.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:41:17 -04:00
Yan Zheng
2c47e605a9 Btrfs: update backrefs while dropping snapshot
The new backref format has restriction on type of backref item.  If a tree
block isn't referenced by its owner tree, full backrefs must be used for the
pointers in it. When a tree block loses its owner tree's reference, backrefs
for the pointers in it should be updated to full backrefs. Current
btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
general use.

This patch adds backrefs update code to btrfs_drop_snapshot.  It isn't a
problem in the restricted form btrfs_drop_snapshot is used today, but for
general snapshot deletion this update is required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:41:17 -04:00
Josef Bacik
a970b0a16c Btrfs: account for space we may use in fallocate
Using Eric Sandeen's xfstest for fallocate, you can easily trigger a ENOSPC
panic on btrfs.  This is because we do not account for data we may use when
doing the fallocate.  This patch fixes the problem by properly reserving space,
and then just freeing it when we are done.  The reservation stuff was made with
delalloc in mind, so its a little crude for this case, but it keeps the box
from panicing.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-02 13:41:16 -04:00
Chris Mason
c8a894d77d Btrfs: fix the file clone ioctl for preallocated extents 2009-07-02 13:41:16 -04:00
Chris Mason
f597bb19cc Btrfs: don't log the inode in file_write while growing the file 2009-07-02 13:41:16 -04:00
Keith Packard
bdae997f44 fs/notify/inotify: decrement user inotify count on close
The per-user inotify_devs value is incremented each time a new file is
allocated, but never decremented. This led to inotify_init failing after a
limited number of calls.

Signed-off-by: Keith Packard <keithp@keithp.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2009-07-02 08:23:00 -04:00
Jeff Layton
cc0bad7552 cifs: add new cifs_iget function and convert unix codepath to use it
cifs: add new cifs_iget function and convert unix codepath to use it

In order to unify some codepaths, introduce a common cifs_fattr struct
for storing inode attributes. The different codepaths (unix, legacy,
normal, etc...) can fill out this struct with inode info. It can then be
passed as an arg to a common set of routines to get and update inodes.

Add a new cifs_iget function that uses iget5_locked to identify inodes.
This will compare inodes based on the uniqueid value in a cifs_fattr
struct.

Rather than filling out an already-created inode, have
cifs_get_inode_info_unix instead fill out cifs_fattr and hand that off
to cifs_iget. cifs_iget can then properly look for hardlinked inodes.

On the readdir side, add a new cifs_readdir_lookup function that spawns
populated dentries. Redefine FILE_UNIX_INFO so that it's basically a
FILE_UNIX_BASIC_INFO that has a few fields wrapped around it. This
allows us to more easily use the same function for filling out the fattr
as the non-readdir codepath.

With this, we should then have proper hardlink detection and can
eventually get rid of some nasty CIFS-specific hacks for handing them.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-01 21:26:42 +00:00
Linus Torvalds
5c5d4e8eaf Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6:
  mtd: nand: fix build failure and incorrect return from omap_wait()
  mtd: Use BLOCK_NIL consistently in NFTL/INFTL
  mtd: m25p80 timeout too short for worst-case m25p16 devices
  mtd: atmel_nand: Fix typo s/parititions/partitions/
  mtd: cmdlineparts: Use 64-bit format when printing a debug message.
  mtd: maps: Remove BUS_ID_SIZE from integrator_flash
  jffs2: fix another potential leak on error path in scan.c
2009-07-01 11:25:46 -07:00
Linus Torvalds
fa172f4006 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: invalidation reverse calls
  fuse: allow umask processing in userspace
  fuse: fix bad return value in fuse_file_poll()
  fuse: fix return value of fuse_dev_write()
2009-07-01 11:20:46 -07:00
Amerigo Wang
e2dbe12557 elf: fix one check-after-use
Check before use it.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-01 11:14:28 -07:00
Martin K. Petersen
7878cba9f0 block: Create bip slabs with embedded integrity vectors
This patch restores stacking ability to the block layer integrity
infrastructure by creating a set of dedicated bip slabs.  Each bip slab
has an embedded bio_vec array at the end.  This cuts down on memory
allocations and also simplifies the code compared to the original bvec
version.  Only the largest bip slab is backed by a mempool.  The pool is
contained in the bio_set so stacking drivers can ensure forward
progress.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@carl.(none)>
2009-07-01 10:56:25 +02:00
Wolfgang Illmeyer
752fa51e4c hostfs: set maximum filesize in superblock for proper LFS support
Maximum file size for hostfs mounts defaults to 2GB, so bigger files cannot be
read/written through hostfs. This patch initializes the maximum file size to
MAX_LFS_SIZE.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13531

Signed-off-by: Wolfgang Illmeyer <wolfgang@illmeyer.com>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:56:03 -07:00
Bryan Donlan
4d6c13f87d ext2: return -EIO not -ESTALE on directory traversal through deleted inode
ext2_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly.  However, in ext[234]_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted such
that a directory entry references a deleted inode.  This leads to a
misleading error message - "Stale NFS file handle" - and confusion on the
part of the admin.

The bug can be easily reproduced by creating a new filesystem, making a
link to an unused inode using debugfs, then mounting and attempting to ls
-l said link.

This patch thus changes ext2_lookup to return -EIO if it receives -ESTALE
from ext2_iget(), as ext2 does for other filesystem metadata corruption;
and also invokes the appropriate ext*_error functions when this case is
detected.

Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:56:00 -07:00
KAMEZAWA Hiroyuki
341c87bf34 elf: limit max map count to safe value
With ELF, at generating coredump, some more headers other than used
vmas are added.

When max_map_count == 65536, a core generated by following kinds of
code can be unreadable because the number of ELF's program header is
written in 16bit in Ehdr (please see elf.h) and the number overflows.

==
	... = mmap(); (munmap, mprotect, etc...)
	if (failed)
		abort();
==

This can happen in mmap/munmap/mprotect/etc...which calls split_vma().

I think 65536 is not safe as _default_ and reduce it to 65530 is good
for avoiding unexpected corrupted core.

Anyway, max_map_count can be enlarged by sysctl if a user is brave..

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jakub Jelinek <jakub@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:55:59 -07:00
Davide Libenzi
133890103b eventfd: revised interface and cleanups
Change the eventfd interface to de-couple the eventfd memory context, from
the file pointer instance.

Without such change, there is no clean way to racely free handle the
POLLHUP event sent when the last instance of the file* goes away.  Also,
now the internal eventfd APIs are using the eventfd context instead of the
file*.

This patch is required by KVM's IRQfd code, which is still under
development.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 18:55:58 -07:00
Jiri Slaby
f7c2df9b55 AFS: Fix lock imbalance
Don't unlock on vfs_rejected_lock path in afs_do_setlk, since the lock
is unlocked after abort_attempt label.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-30 13:30:44 -07:00
John Muir
3b463ae0c6 fuse: invalidation reverse calls
Add notification messages that allow the filesystem to invalidate VFS
caches.

Two notifications are added:

 1) inode invalidation

   - invalidate cached attributes
   - invalidate a range of pages in the page cache (this is optional)

 2) dentry invalidation

   - try to invalidate a subtree in the dentry cache

Care must be taken while accessing the 'struct super_block' for the
mount, as it can go away while an invalidation is in progress.  To
prevent this, introduce a rw-semaphore, that is taken for read during
the invalidation and taken for write in the ->kill_sb callback.

Cc: Csaba Henk <csaba@gluster.com>
Cc: Anand Avati <avati@zresearch.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-06-30 20:12:24 +02:00
Miklos Szeredi
e0a43ddcc0 fuse: allow umask processing in userspace
This patch lets filesystems handle masking the file mode on creation.
This is needed if filesystem is using ACLs.

 - The CREATE, MKDIR and MKNOD requests are extended with a "umask"
   parameter.

 - A new FUSE_DONT_MASK flag is added to the INIT request/reply.  With
   this the filesystem may request that the create mode is not masked.

CC: Jean-Pierre André <jean-pierre.andre@wanadoo.fr>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-06-30 20:12:23 +02:00
Miklos Szeredi
201fa69a28 fuse: fix bad return value in fuse_file_poll()
Fix fuse_file_poll() which returned a -errno value instead of a poll
mask.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-06-30 20:06:24 +02:00
Csaba Henk
b4c458b3a2 fuse: fix return value of fuse_dev_write()
On 64 bit systems -- where sizeof(ssize_t) > sizeof(int) -- the following test
exposes a bug due to a non-careful return of an int or unsigned value:

implement a FUSE filesystem which sends an unsolicited notification to
the kernel with invalid opcode. The respective write to /dev/fuse
will return (1 << 32) - EINVAL with errno == 0 instead of -1 with
errno == EINVAL.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-06-30 20:06:23 +02:00
James Morris
ac7242142b Merge branch 'master' into next 2009-06-30 09:10:35 +10:00
Mimi Zohar
94e5d714f6 integrity: add ima_counts_put (updated)
This patch fixes an imbalance message as reported by J.R. Okajima.
The IMA file counters are incremented in ima_path_check. If the
actual open fails, such as ETXTBSY, decrement the counters to
prevent unnecessary imbalance messages.

Reported-by: J.R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-06-29 08:59:10 +10:00
Jeff Layton
f0a71eb820 cifs: fix fh_mutex locking in cifs_reopen_file
Fixes a regression caused by commit a6ce4932fb

When this lock was converted to a mutex, the locks were turned into
unlocks and vice-versa.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Cc: Stable Tree <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-27 23:46:43 +00:00
Linus Torvalds
aada1bc927 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] remove unknown mount option warning message
  [CIFS] remove bkl usage from umount begin
  cifs: Fix incorrect return code being printed in cFYI messages
  [CIFS] cleanup asn handling for ntlmssp
  [CIFS] Copy struct *after* setting the port, instead of before.
  cifs: remove rw/ro options
  cifs: fix problems with earlier patches
  cifs: have cifs parse scope_id out of IPv6 addresses and use it
  [CIFS] Do not send tree disconnect if session is already disconnected
  [CIFS] Fix build break
  cifs: display scopeid in /proc/mounts
  cifs: add new routine for converting AF_INET and AF_INET6 addrs
  cifs: have cifs_show_options show forceuid/forcegid options
  cifs: remove unneeded NULL checks from cifs_show_options
2009-06-26 09:37:19 -07:00
Steve French
71a394faaa [CIFS] remove unknown mount option warning message
Jeff's previous patch which removed the unneeded rw/ro
parsing can cause a minor warning in dmesg (about the
unknown rw or ro mount option) at mount time. This
patch makes cifs ignore them in kernel to remove the warning
(they are already handled in the mount helper and VFS).

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-26 04:07:18 +00:00
Steve French
ad8034f197 [CIFS] remove bkl usage from umount begin
The lock_kernel call moved into the fs for umount_begin
is not needed.  This adds a check to make sure we don't
call umount_begin twice on the same fs.

umount_begin for cifs is probably not needed and
may eventually be able to be removed, but in
the meantime this smaller patch is safe and
gets rid of the bkl from this path which provides
some benefit.

Acked-by: Jeff Layton <redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-26 03:25:49 +00:00
Suresh Jayaraman
0f3bc09ee1 cifs: Fix incorrect return code being printed in cFYI messages
FreeXid() along with freeing Xid does add a cifsFYI debug message that
prints rc (return code) as well. In some code paths where we set/return
error code after calling FreeXid(), incorrect error code is being
printed when cifsFYI is enabled.

This could be misleading in few cases. For eg.
In cifs_open() if cifs_fill_filedata() returns a valid pointer to
cifsFileInfo, FreeXid() prints rc=-13 whereas 0 is actually being
returned. Fix this by setting rc before calling FreeXid().

Basically convert

FreeXid(xid);			rc = -ERR;
return -ERR;		=>	FreeXid(xid);
				return rc;

[Note that Christoph would like to replace the GetXid/FreeXid
calls, which are primarily used for debugging.  This seems
like a good longer term goal, but although there is an
alternative tracing facility, there are no examples yet
available that I know of that we can use (yet) to
convert this cifs function entry/exit logging, and for
creating an identifier that we can use to correlate
all dmesg log entries for a particular vfs operation
(ie identify all log entries for a particular vfs
request to cifs: e.g. a particular close or read or write
or byte range lock call ... and just using the thread id
is harder).  Eventually when a replacement
for this is available (e.g. when NFS switches over and various
samples to look at in other file systems) we can remove the
GetXid/FreeXid macro but in the meantime multiple people
use this run time configurable logging all the time
for debugging, and Suresh's patch fixes a problem
which made it harder to notice some low
memory problems in the log so it is worthwhile
to fix this problem until a better logging
approach is able to be used]

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-25 19:12:57 +00:00
Steve French
f46c7234e4 [CIFS] cleanup asn handling for ntlmssp
Also removes obsolete distinction between rawntlmssp and ntlmssp (in asn/SPNEGO)
since as jra noted we can always send raw ntlmssp in session setup now.

remove check for experimental runtime flag (/proc/fs/cifs/Experimental) in
ntlmssp path.

Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-25 03:07:48 +00:00
Simo Leone
6debdbc0ba [CIFS] Copy struct *after* setting the port, instead of before.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Simo Leone <simo@archlinux.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-25 02:44:43 +00:00
Jeff Layton
6459340cfc cifs: remove rw/ro options
cifs: remove rw/ro options

These options are handled at the VFS layer. They only ever set the
option in the smb_vol struct. Nothing was ever done with them afterward
anyway.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-25 02:33:01 +00:00
Jeff Layton
b48a485884 cifs: fix problems with earlier patches
cifs: fix problems with earlier patches

cifs_show_address hasn't been introduced yet, and fix a typo that was
silently fixed by a later patch in the series.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-25 02:32:57 +00:00
Jeff Layton
681bf72e48 cifs: have cifs parse scope_id out of IPv6 addresses and use it
This patch has CIFS look for a '%' in an IPv6 address. If one is
present then it will try to treat that value as a numeric interface
index suitable for stuffing into the sin6_scope_id field.

This should allow people to mount servers on IPv6 link-local addresses.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: David Holder <david@erion.co.uk>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-25 01:14:36 +00:00
Steve French
268875b9d1 [CIFS] Do not send tree disconnect if session is already disconnected
Noticed this when tree connect timed out (due to Samba server crash) -
we try to send a tree disconnect for a tid that does not exist
since we don't have a valid tree id yet. This checks that the
session is valid before sending the tree disconnect to handle
this case.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-25 00:29:21 +00:00
Al Viro
d5bb68adda another race fix in jfs_check_acl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 17:02:42 -04:00
Al Viro
72c04902d1 Get "no acls for this inode" right, fix shmem breakage
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 16:58:48 -04:00
Linus Torvalds
936940a9c7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (23 commits)
  switch xfs to generic acl caching helpers
  helpers for acl caching + switch to those
  switch shmem to inode->i_acl
  switch reiserfs to inode->i_acl
  switch reiserfs to usual conventions for caching ACLs
  reiserfs: minimal fix for ACL caching
  switch nilfs2 to inode->i_acl
  switch btrfs to inode->i_acl
  switch jffs2 to inode->i_acl
  switch jfs to inode->i_acl
  switch ext4 to inode->i_acl
  switch ext3 to inode->i_acl
  switch ext2 to inode->i_acl
  add caching of ACLs in struct inode
  fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls
  cleanup __writeback_single_inode
  ... and the same for vfsmount id/mount group id
  Make allocation of anon devices cheaper
  update Documentation/filesystems/Locking
  devpts: remove module-related code
  ...
2009-06-24 10:03:12 -07:00
Linus Torvalds
d7ed9c05eb Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  udf: remove redundant tests on unsigned
  udf: Use device size when drive reported bogus number of written blocks
2009-06-24 09:57:10 -07:00
Oleg Nesterov
a893a84e87 mm_for_maps: simplify, use ptrace_may_access()
It would be nice to kill __ptrace_may_access(). It requires task_lock(),
but this lock is only needed to read mm->flags in the middle.

Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
the code a little bit.

Also, we do not need to take ->mmap_sem in advance. In fact I think
mm_for_maps() should not play with ->mmap_sem at all, the caller should
take this lock.

With or without this patch, without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-06-25 00:20:58 +10:00
Al Viro
1cbd20d820 switch xfs to generic acl caching helpers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:07 -04:00
Al Viro
073aaa1b14 helpers for acl caching + switch to those
helpers: get_cached_acl(inode, type), set_cached_acl(inode, type, acl),
forget_cached_acl(inode, type).

ubifs/xattr.c needed includes reordered, the rest is a plain switchover.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:07 -04:00
Al Viro
281eede032 switch reiserfs to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:06 -04:00
Al Viro
7a77b15d92 switch reiserfs to usual conventions for caching ACLs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:06 -04:00
Al Viro
e68888bcb6 reiserfs: minimal fix for ACL caching
reiserfs uses NULL as "unknown" and ERR_PTR(-ENODATA) as "no ACL";
several codepaths store the former instead of the latter.

All those codepaths go through iset_acl() and all cases when it's
called with NULL acl are for the second variety, so the minimal
fix is to teach iset_acl() to deal with that.

Proper fix is to switch to more usual conventions and avoid back
and forth between internally used ERR_PTR(-ENODATA) and NULL
expected by the rest of the kernel.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:05 -04:00
Al Viro
d441b1c293 switch nilfs2 to inode->i_acl
Actually, get rid of private analog, since nothing in there is
using ACLs at all so far.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:05 -04:00
Al Viro
5affd88a10 switch btrfs to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:05 -04:00
Al Viro
290c263bf8 switch jffs2 to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:05 -04:00
Al Viro
05fc0790b6 switch jfs to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:04 -04:00
Al Viro
d4bfe2f76d switch ext4 to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:04 -04:00
Al Viro
6582a0e6f6 switch ext3 to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:17:04 -04:00
Al Viro
5e78b43568 switch ext2 to inode->i_acl
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:28 -04:00
Al Viro
f19d4a8fa6 add caching of ACLs in struct inode
No helpers, no conversions yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:27 -04:00
Ankit Jain
3e63cbb1ef fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls
This patch adds ioctls to vfs for compatibility with legacy XFS
pre-allocation ioctls (XFS_IOC_*RESVP*). The implementation
effectively invokes sys_fallocate for the new ioctls.
Also handles the compat_ioctl case.
Note: These legacy ioctls are also implemented by OCFS2.

[AV: folded fixes from hch]

Signed-off-by: Ankit Jain <me@ankitjain.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:27 -04:00
Christoph Hellwig
01c031945f cleanup __writeback_single_inode
There is no reason to for the split between __writeback_single_inode and
__sync_single_inode, the former just does a couple of checks before
tail-calling the latter.  So merge the two, and while we're at it split
out the I_SYNC waiting case for data integrity writers, as it's
logically separate function.  Finally rename __writeback_single_inode to
writeback_single_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:26 -04:00
Al Viro
f21f62208a ... and the same for vfsmount id/mount group id
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:26 -04:00
Al Viro
c63e09eccc Make allocation of anon devices cheaper
Standard trick - add a new variable (start) such that
for each n < start n is known to be busy.  Allocation can
skip checking everything in [0..start) and if it returns
n, we can set start to n + 1.  Freeing below start sets
start to what we'd just freed.

Of course, it still sucks if we do something like
	free 0
	allocate
	allocate
in a loop - still O(n^2) time.  However, on saner loads it
improves the things a lot and the entire thing is not worth
the trouble of switching to something with better worst-case
behaviour.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:25 -04:00
H. Peter Anvin
f6cc746bbb devpts: remove module-related code
These days, the devpts filesystem is closely integrated with the pty
memory management, and cannot be built as a module, even less removed
from the kernel.  Accordingly, remove all module-related stuff from
this filesystem.

[ v2: only remove code that's actually dead ]

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:24 -04:00
Trond Myklebust
3b22edc573 VFS: Switch init_mount_tree() to use the new create_mnt_ns() helper
Eliminates some duplicated code...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:24 -04:00
J. R. Okajima
654f562c52 vfs: fix nd->root leak in do_filp_open()
commit 2a73787110 "Cache root in nameidata"
introduced a new member nd->root, but forgot to put it in do_filp_open().

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:24 -04:00
Christoph Hellwig
b5450d9c84 reiserfs: remove stray unlock_super in reiserfs_resize
Reiserfs doesn't use lock_super anywhere internally, and ->remount_fs
which calls reiserfs_resize does have it currently but also expects it
to be held on return, so there's no business for the unlock_super here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked by Edward Shishkin <edward.shishkin@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-24 08:15:24 -04:00
Roel Kluin
3391faa4f1 udf: remove redundant tests on unsigned
first_block and goal are unsigned. When negative they are wrapped and caught by
the other test.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-06-24 13:48:28 +02:00
Linus Torvalds
cf5434e894 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2/trivial: Wrap ocfs2_sysfile_cluster_lock_key within define.
  ocfs2: Add lockdep annotations
  vfs: Set special lockdep map for dirs only if not set by fs
  ocfs2: Disable orphan scanning for local and hard-ro mounts
  ocfs2: Do not initialize lvb in ocfs2_orphan_scan_lock_res_init()
  ocfs2: Stop orphan scan as early as possible during umount
  ocfs2: Fix ocfs2_osb_dump()
  ocfs2: Pin journal head before accessing jh->b_committed_data
  ocfs2: Update atime in splice read if necessary.
  ocfs2: Provide the ocfs2_dlm_lvb_valid() stack API.
2009-06-23 19:36:02 -07:00
Trond Myklebust
b88f8a546f NFS: Correct the NFS mount path when following a referral
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-22 21:28:25 -07:00
Trond Myklebust
0b75b35c7c NFS: Fix nfs_path() to always return a '/' at the beginning of the path
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-22 21:28:25 -07:00
Trond Myklebust
c02d7adf8c NFSv4: Replace nfs4_path_walk() with VFS path lookup in a private namespace
As noted in the previous patch, the NFSv4 client mount code currently
has several limitations. If the mount path contains symlinks, or
referrals, or even if it just contains a '..', then the client code in
nfs4_path_walk() will fail with an error.

This patch replaces the nfs4_path_walk()-based lookup with a helper
function that sets up a private namespace to represent the namespace on the
server, then uses the ordinary VFS and NFS path lookup code to walk down the
mount path in that namespace.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-22 21:28:25 -07:00
Trond Myklebust
cf8d2c11cb VFS: Add VFS helper functions for setting up private namespaces
The purpose of this patch is to improve the remote mount path lookup
support for distributed filesystems such as the NFSv4 client.

When given a mount command of the form "mount server:/foo/bar /mnt", the
NFSv4 client is required to look up the filehandle for "server:/", and
then look up each component of the remote mount path "foo/bar" in order
to find the directory that is actually going to be mounted on /mnt.
Following that remote mount path may involve following symlinks,
crossing server-side mount points and even following referrals to
filesystem volumes on other servers.

Since the standard VFS path lookup code already supports walking paths
that contain all these features (using in-kernel automounts for
following referrals) we would like to be able to reuse that rather than
duplicate the full path traversal functionality in the NFSv4 client code.

This patch therefore defines a VFS helper function create_mnt_ns(), that
sets up a temporary filesystem namespace and attaches a root filesystem to
it. It exports the create_mnt_ns() and put_mnt_ns() function for use by
filesystem modules.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-22 21:28:25 -07:00
Trond Myklebust
616511d039 VFS: Uninline the function put_mnt_ns()
In order to allow modules to use it without having to export vfsmount_lock.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-22 21:28:25 -07:00
David Woodhouse
4839641333 jffs2: fix another potential leak on error path in scan.c
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-06-23 01:34:19 +01:00
Linus Torvalds
ac1b7c378e Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (63 commits)
  mtd: OneNAND: Allow setting of boundary information when built as module
  jffs2: leaking jffs2_summary in function jffs2_scan_medium
  mtd: nand: Fix memory leak on txx9ndfmc probe failure.
  mtd: orion_nand: use burst reads with double word accesses
  mtd/nand: s3c6400 support for s3c2410 driver
  [MTD] [NAND] S3C2410: Use DIV_ROUND_UP
  [MTD] [NAND] S3C2410: Deal with unaligned lengths in S3C2440 buffer read/write
  [MTD] [NAND] S3C2410: Allow the machine code to get the BBT table from NAND
  [MTD] [NAND] S3C2410: Added a kerneldoc for s3c2410_nand_set
  mtd: physmap_of: Add multiple regions and concatenation support
  mtd: nand: max_retries off by one in mxc_nand
  mtd: nand: s3c2410_nand_setrate(): use correct macros for 2412/2440
  mtd: onenand: add bbt_wait & unlock_all as replaceable for some platform
  mtd: Flex-OneNAND support
  mtd: nand: add OMAP2/OMAP3 NAND driver
  mtd: maps: Blackfin async: fix memory leaks in probe/remove funcs
  mtd: uclinux: mark local stuff static
  mtd: uclinux: do not allow to be built as a module
  mtd: uclinux: allow systems to override map addr/size
  mtd: blackfin NFC: fix hang when using NAND on BF527-EZKITs
  ...
2009-06-22 16:56:22 -07:00
Tao Ma
d246ab307d ocfs2/trivial: Wrap ocfs2_sysfile_cluster_lock_key within define.
Actually ocfs2_sysfile_cluster_lock_key is only used if we enable
CONFIG_DEBUG_LOCK_ALLOC. Wrap it so that we can avoid a building
warning.
fs/ocfs2/sysfile.c:53: warning: ‘ocfs2_sysfile_cluster_lock_key’
defined but not used

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:34:29 -07:00
Jan Kara
cb25797d45 ocfs2: Add lockdep annotations
Add lockdep support to OCFS2. The support also covers all of the cluster
locks except for open locks, journal locks, and local quotafile locks. These
are special because they are acquired for a node, not for a particular process
and lockdep cannot deal with such type of locking.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:34:26 -07:00
Jan Kara
9a7aa12f39 vfs: Set special lockdep map for dirs only if not set by fs
Some filesystems need to set lockdep map for i_mutex differently for
different directories. For example OCFS2 has system directories (for
orphan inode tracking and for gathering all system files like journal
or quota files into a single place) which have different locking
locking rules than standard directories. For a filesystem setting
lockdep map is naturaly done when the inode is read but we have to
modify unlock_new_inode() not to overwrite the lockdep map the filesystem
has set.

Acked-by: peterz@infradead.org
CC: mingo@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:34:22 -07:00
Sunil Mushran
df152c241d ocfs2: Disable orphan scanning for local and hard-ro mounts
Local and Hard-RO mounts do not need orphan scanning.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:24:55 -07:00
Sunil Mushran
3211949f89 ocfs2: Do not initialize lvb in ocfs2_orphan_scan_lock_res_init()
We don't access the LVB in our ocfs2_*_lock_res_init() functions.

Since the LVB can become invalid during some cluster recovery
operations, the dlmglue must be able to handle an uninitialized
LVB.

For the orphan scan lock, we initialized an uninitialzed LVB with our
scan sequence number plus one.  This starts a normal orphan scan
cycle.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:24:53 -07:00
Sunil Mushran
692684e19e ocfs2: Stop orphan scan as early as possible during umount
Currently if the orphan scan fires a tick before the user issues the umount,
the umount will wait for the queued orphan scan tasks to complete.

This patch makes the umount stop the orphan scan as early as possible so as
to reduce the probability of the queued tasks slowing down the umount.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:24:51 -07:00
Sunil Mushran
c3d38840ab ocfs2: Fix ocfs2_osb_dump()
Skip printing information that is not valid for local mounts.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:24:49 -07:00
Sunil Mushran
94e41ecfe0 ocfs2: Pin journal head before accessing jh->b_committed_data
This patch adds jbd_lock_bh_state() and jbd_unlock_bh_state() around accessses
to jh->b_committed_data.

Fixes oss bugzilla#1131
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1131

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:24:47 -07:00
Tao Ma
1962f39abb ocfs2: Update atime in splice read if necessary.
We should call ocfs2_inode_lock_atime instead of ocfs2_inode_lock
in ocfs2_file_splice_read like we do in ocfs2_file_aio_read so
that we can update atime in splice read if necessary.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:24:45 -07:00
Joel Becker
1c520dfbf3 ocfs2: Provide the ocfs2_dlm_lvb_valid() stack API.
The Lock Value Block (LVB) of a DLM lock can be lost when nodes die and
the DLM cannot reconstruct its state.  Clients of the DLM need to know
this.

ocfs2's internal DLM, o2dlm, explicitly zeroes out the LVB when it loses
track of the state.  This is not a standard behavior, but ocfs2 has
always relied on it.  Thus, an o2dlm LVB is always "valid".

ocfs2 now supports both o2dlm and fs/dlm via the stack glue.  When
fs/dlm loses track of an LVBs state, it sets a flag
(DLM_SBF_VALNOTVALID) on the Lock Status Block (LKSB).  The contents of
the LVB may be garbage or merely stale.

ocfs2 doesn't want to try to guess at the validity of the stale LVB.
Instead, it should be checking the VALNOTVALID flag.  As this is the
'standard' way of treating LVBs, we will promote this behavior.

We add a stack glue API ocfs2_dlm_lvb_valid().  It returns non-zero when
the LVB is valid.  o2dlm will always return valid, while fs/dlm will
check VALNOTVALID.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
2009-06-22 14:24:30 -07:00
Linus Torvalds
7e0338c0de Merge branch 'for-2.6.31' of git://fieldses.org/git/linux-nfsd
* 'for-2.6.31' of git://fieldses.org/git/linux-nfsd: (60 commits)
  SUNRPC: Fix the TCP server's send buffer accounting
  nfsd41: Backchannel: minorversion support for the back channel
  nfsd41: Backchannel: cleanup nfs4.0 callback encode routines
  nfsd41: Remove ip address collision detection case
  nfsd: optimise the starting of zero threads when none are running.
  nfsd: don't take nfsd_mutex twice when setting number of threads.
  nfsd41: sanity check client drc maxreqs
  nfsd41: move channel attributes from nfsd4_session to a nfsd4_channel_attr struct
  NFS: kill off complicated macro 'PROC'
  sunrpc: potential memory leak in function rdma_read_xdr
  nfsd: minor nfsd_vfs_write cleanup
  nfsd: Pull write-gathering code out of nfsd_vfs_write
  nfsd: track last inode only in use_wgather case
  sunrpc: align cache_clean work's timer
  nfsd: Use write gathering only with NFSv2
  NFSv4: kill off complicated macro 'PROC'
  NFSv4: do exact check about attribute specified
  knfsd: remove unreported filehandle stats counters
  knfsd: fix reply cache memory corruption
  knfsd: reply cache cleanups
  ...
2009-06-22 12:55:50 -07:00
Linus Torvalds
df36b439c5 Merge branch 'for-2.6.31' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'for-2.6.31' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (128 commits)
  nfs41: sunrpc: xprt_alloc_bc_request() should not use spin_lock_bh()
  nfs41: Move initialization of nfs4_opendata seq_res to nfs4_init_opendata_res
  nfs: remove unnecessary NFS_INO_INVALID_ACL checks
  NFS: More "sloppy" parsing problems
  NFS: Invalid mount option values should always fail, even with "sloppy"
  NFS: Remove unused XDR decoder functions
  NFS: Update MNT and MNT3 reply decoding functions
  NFS: add XDR decoder for mountd version 3 auth-flavor lists
  NFS: add new file handle decoders to in-kernel mountd client
  NFS: Add separate mountd status code decoders for each mountd version
  NFS: remove unused function in fs/nfs/mount_clnt.c
  NFS: Use xdr_stream-based XDR encoder for MNT's dirpath argument
  NFS: Clean up MNT program definitions
  lockd: Don't bother with RPC ping for NSM upcalls
  lockd: Update NSM state from SM_MON replies
  NFS: Fix false error return from nfs_callback_up() if ipv6.ko is not available
  NFS: Return error code from nfs_callback_up() to user space
  NFS: Do not display the setting of the "intr" mount option
  NFS: add support for splice writes
  nfs41: Backchannel: CB_SEQUENCE validation
  ...
2009-06-22 12:53:06 -07:00
Hitoshi Mitake
e38be994b9 Making fs/minix/minix.h double including safe
I happened to find that fs/minix/minix.h doesn't guard double include.

Yes, I know this never cause something destructive because this is
self-evidence that no source file includes minix.h twice, but I think
fixing this is better than disregarding it.

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-22 11:34:42 -07:00
Boaz Harrosh
baaf94cdc7 exofs: Avoid using file_fsync()
The use of file_fsync() in exofs_file_sync() is not necessary since it
does some extra stuff not used by exofs. Open code just the parts that
are currently needed.

TODO: Farther optimization can be done to sync the sb only on inode
update of new files, Usually the sb update is not needed in exofs.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-06-21 17:53:47 +03:00
Boaz Harrosh
27d2e14919 exofs: Remove IBM copyrights
Boaz,
Congrats on getting all the OSD stuff into 2.6.30!
I just pulled the git, and saw that the IBM copyrights are still there.
Please remove them from all files:
 * Copyright (C) 2005, 2006
 * International Business Machines

IBM has revoked all rights on the code - they gave it to me.

Thanks!
Avishay

Signed-off-by: Avishay Traeger <avishay@gmail.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-06-21 17:53:47 +03:00
Boaz Harrosh
b76a3f93d0 exofs: Fix bio leak in error handling path (sync read)
When failing a read request in the sync path, called from
write_begin, I forgot to free the allocated bio, fix it.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2009-06-21 17:53:44 +03:00
Benny Halevy
578e458568 nfs41: Move initialization of nfs4_opendata seq_res to nfs4_init_opendata_res
nfs4_open_recover_helper clears opendata->o_res
before calling nfs4_init_opendata_res, thus causing
NFSv4.0 OPEN operations to be sent rather than nfsv4.1.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-20 14:55:12 -04:00
Linus Torvalds
12e24f34cb Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
  perfcounter: Handle some IO return values
  perf_counter: Push perf_sample_data through the swcounter code
  perf_counter tools: Define and use our own u64, s64 etc. definitions
  perf_counter: Close race in perf_lock_task_context()
  perf_counter, x86: Improve interactions with fast-gup
  perf_counter: Simplify and fix task migration counting
  perf_counter tools: Add a data file header
  perf_counter: Update userspace callchain sampling uses
  perf_counter: Make callchain samples extensible
  perf report: Filter to parent set by default
  perf_counter tools: Handle lost events
  perf_counter: Add event overlow handling
  fs: Provide empty .set_page_dirty() aop for anon inodes
  perf_counter: tools: Makefile tweaks for 64-bit powerpc
  perf_counter: powerpc: Add processor back-end for MPC7450 family
  perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
  perf_counter: powerpc: Change how processor-specific back-ends get selected
  perf_counter: powerpc: Use unsigned long for register and constraint values
  perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
  perf_counter tools: Add and use isprint()
  ...
2009-06-20 11:29:32 -07:00
OGAWA Hirofumi
3e107603ae fat: Fix the removal of opts->fs_dmask
(ce3b0f8d5c: New helper - current_umask())
is removing the opts->fs_dmask, probably it's a cut-and-paste
miss or something.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-06-20 21:50:47 +09:00
Linus Torvalds
bee89ab228 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  inotify: inotify_destroy_mark_entry could get called twice
2009-06-19 17:46:44 -07:00
Linus Torvalds
31583d6acf Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  Fix kernel-doc parameter name typo in blk-settings.c:
  block: rename CONFIG_LBD to CONFIG_LBDAF
  block: Fix bounce_pfn setting
  hd: stop defining MAJOR_NR
2009-06-19 17:43:04 -07:00
Eric Paris
528da3e9e2 inotify: inotify_destroy_mark_entry could get called twice
inotify_destroy_mark_entry could get called twice for the same mark since it
is called directly in inotify_rm_watch and when the mark is being destroyed for
another reason.  As an example assume that the file being watched was just
deleted so inotify_destroy_mark_entry would get called from the path
fsnotify_inoderemove() -> fsnotify_destroy_marks_by_inode() ->
fsnotify_destroy_mark_entry() -> inotify_destroy_mark_entry().  If this
happened at the same time as userspace tried to remove a watch via
inotify_rm_watch we could attempt to remove the mark from the idr twice and
could thus double dec the ref cnt and potentially could be in a use after
free/double free situation.  The fix is to have inotify_rm_watch use the
generic recursive safe fsnotify_destroy_mark_by_entry() so we are sure the
inotify_destroy_mark_entry() function can only be called one.

This patch also renames the function to inotify_ingored_remove_idr() so it is
clear what is actually going on in the function.

Hopefully this fixes:
[   20.342058] idr_remove called for id=20 which is not allocated.
[   20.348000] Pid: 1860, comm: udevd Not tainted 2.6.30-tip #1077
[   20.353933] Call Trace:
[   20.356410]  [<ffffffff811a82b7>] idr_remove+0x115/0x18f
[   20.361737]  [<ffffffff8134259d>] ? _spin_lock+0x6d/0x75
[   20.367061]  [<ffffffff8111640a>] ? inotify_destroy_mark_entry+0xa3/0xcf
[   20.373771]  [<ffffffff8111641e>] inotify_destroy_mark_entry+0xb7/0xcf
[   20.380306]  [<ffffffff81115913>] inotify_freeing_mark+0xe/0x10
[   20.386238]  [<ffffffff8111410d>] fsnotify_destroy_mark_by_entry+0x143/0x170
[   20.393293]  [<ffffffff811163a3>] inotify_destroy_mark_entry+0x3c/0xcf
[   20.399829]  [<ffffffff811164d1>] sys_inotify_rm_watch+0x9b/0xc6
[   20.405850]  [<ffffffff8100bcdb>] system_call_fastpath+0x16/0x1b

Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
Tested-by: Peter Ziljlstra <peterz@infradead.org>
2009-06-19 12:42:48 -04:00
Bartlomiej Zolnierkiewicz
90c699a9ee block: rename CONFIG_LBD to CONFIG_LBDAF
Follow-up to "block: enable by default support for large devices
and files on 32-bit archs".

Rename CONFIG_LBD to CONFIG_LBDAF to:
- allow update of existing [def]configs for "default y" change
- reflect that it is used also for large files support nowadays

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-19 08:08:50 +02:00
Andy Adamson
ab52ae6db0 nfsd41: Backchannel: minorversion support for the back channel
Prepare to share backchannel code with NFSv4.1.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[nfsd41: use nfsd4_cb_sequence for callback minorversion]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-18 18:33:57 -07:00
Andy Adamson
ef52bff840 nfsd41: Backchannel: cleanup nfs4.0 callback encode routines
Mimic the client and prepare to share the back channel xdr with NFSv4.1.
Bump the number of operations in each encode routine, then backfill the
number of operations.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-18 18:33:57 -07:00
Trond Myklebust
1f84603c09 Merge branch 'devel-for-2.6.31' into for-2.6.31
Conflicts:
	fs/nfs/client.c
	fs/nfs/super.c
2009-06-18 18:13:44 -07:00
Mike Sager
6ddbbbfe52 nfsd41: Remove ip address collision detection case
Verified that cthon and pynfs exchange id tests pass (except for the
two expected fails: EID8 and EID50)

Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-18 17:43:53 -07:00
Linus Torvalds
0732f87761 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: clean up jbd2_journal_try_to_free_buffers()
  ext4: Don't update ctime for non-extent-mapped inodes
  ext4: Fix up whitespace issues in fs/ext4/inode.c
  ext4: Fix 64-bit block type problem on 32-bit platforms
  ext4: teach the inode allocator to use a goal inode number
  ext4: Use a hash of the topdir directory name for the Orlov parent group
  ext4: document the "abort" mount option
  ext4: move the abort flag from s_mount_opts to s_mount_flags
  ext4: update the s_last_mounted field in the superblock
  ext4: change s_mount_opt to be an unsigned int
  ext4: online defrag -- Add EXT4_IOC_MOVE_EXT ioctl
  ext4: avoid unnecessary spinlock in critical POSIX ACL path
  ext3: avoid unnecessary spinlock in critical POSIX ACL path
  ext4: convert instrumentation from markers to tracepoints
  jbd2: convert instrumentation from markers to tracepoints
2009-06-18 14:07:46 -07:00
Peter Oberparleiter
0b923606e7 seq_file: add function to write binary data
seq_write() can be used to construct seq_files containing arbitrary data.
Required by the gcov-profiling interface to synthesize binary profiling
data files.

Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Li Wei <W.Li@Sun.COM>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:57 -07:00
Oleg Nesterov
3b34fc5880 elf_core_dump: use rcu_read_lock() to access ->real_parent
In theory it is not safe to dereference ->parent/real_parent without
tasklist or rcu lock, we can race with re-parenting.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:52 -07:00
Jeff Mahoney
1d965fe0eb reiserfs: fix warnings with gcc 4.4
Several code paths in reiserfs have a construct like:

 if (is_direntry_le_ih(ih = B_N_PITEM_HEAD(src, item_num))) ...

which, in addition to being ugly, end up causing compiler warnings with
gcc 4.4.0.  Previous compilers didn't issue a warning.

fs/reiserfs/do_balan.c:1273: warning: operation on `aux_ih' may be undefined
fs/reiserfs/lbalance.c:393: warning: operation on `ih' may be undefined
fs/reiserfs/lbalance.c:421: warning: operation on `ih' may be undefined
fs/reiserfs/lbalance.c:777: warning: operation on `ih' may be undefined

I believe this is due to the ih being passed to macros which evaluate the
argument more than once.  This is old code and we haven't seen any
problems with it, but this patch eliminates the warnings.

It converts the multiple evaluation macros to static inlines and does a
preassignment for the cases that were causing the warnings because that
code is just ugly.

Reported-by: Chris Mason <mason@oracle.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:46 -07:00
Roel Kluin
37044c86ba ufs: sector_t cannot be negative
unsigned i_block,fragment cannot be negative.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:46 -07:00
Jan Kara
5404ac8e44 isofs: cleanup mount option processing
Remove unused variables from isofs_sb_info (used to be some mount
options), unify variables for option to use 0/1 (some options used
'y'/'n'), use bit fields for option flags in superblock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:45 -07:00
Jan Kara
5c4a656b7e isofs: fix setting of uid and gid to 0
isofs allows setting of default uid and gid of files but value 0 was used
to indicate that user did not specify any uid/gid mount option.  Since
this option also overrides uid/gid set in Rock Ridge extension, it makes
sense to allow forcing uid/gid 0.  Fix option processing to allow this.

Cc: <Hans-Joachim.Baader@cjt.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:45 -07:00
Jan Kara
52b680c812 isofs: let mode and dmode mount options override rock ridge mode setting
So far, permissions set via 'mode' and/or 'dmode' mount options were
effective only if the medium had no rock ridge extensions (or was mounted
without them).  Add 'overriderockmode' mount option to indicate that these
options should override permissions set in rock ridge extensions.  Maybe
this should be default but the current behavior is there since mount
options were created so I think we should not change how they behave.

Cc: <Hans-Joachim.Baader@cjt.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:45 -07:00
Jan Kara
ef43618a47 ext3: make sure inode is deleted from orphan list after truncate
As Ted pointed out, it can happen that ext3_truncate() returns without
removing inode from orphan list.  This way we could in some rare cases
(like when we get ENOMEM from an allocation in ext3_truncate called
because of failed ext3_write_begin) leave the inode on orphan list and
that triggers assertion failure on umount.

So make ext3_truncate() always remove inode from in-memory orphan list.

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:45 -07:00
Hisashi Hifumi
6f3f1cb21f jbd: clean up journal_try_to_free_buffers()
I delete the following patch
"commit 3f31fddfa2
Author: Mingming Cao <cmm@us.ibm.com>
Date:   Fri Jul 25 01:46:22 2008 -0700

    jbd: fix race between free buffer and commit transaction

This patch is no longer needed because if race between freeing buffer and
committing transaction functionality occurs and dio gets error, currently
dio falls back to buffered IO by the following patch.

	commit 6ccfa806a9
	Author: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
	Date:   Tue Sep 2 14:35:40 2008 -0700

   	VFS: fix dio write returning EIO when try_to_release_page fails

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:45 -07:00
Jan Kara
e8ef7aaea7 ext3: fix chain verification in ext3_get_blocks()
Chain verification in ext3_get_blocks() has been hosed since it called
verify_chain(chain, NULL) which always returns success.  As a result
readers could in theory race with truncate.  On the other hand the race
probably cannot happen with the current locking scheme, since by the
time ext3_truncate() is called all the pages are already removed and
hence get_block() shouldn't be called on such pages...

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:45 -07:00
Jan Kara
39fe7557b4 ext2: Do not update mtime of a moved directory
One of our users is complaining that his backup tool is upset on ext2
(while it's happy on ext3, xfs, ...) because of the mtime change.

The problem is:

    mkdir foo
    mkdir bar
    mkdir foo/a

Now under ext2:
    mv foo/a foo/b

changes mtime of 'foo/a' (foo/b after the move).  That does not really
make sense and it does not happen under any other filesystem I've seen.

More complicated is:
    mv foo/a bar/a

This changes mtime of foo/a (bar/a after the move) and it makes some
sense since we had to update parent directory pointer of foo/a.  But
again, no other filesystem does this.  So after some thoughts I'd vote
for consistency and change ext2 to behave the same as other filesystems.

Do not update mtime of a moved directory.  Specs don't say anything
about it (neither that it should, nor that it should not be updated) and
other common filesystems (ext3, ext4, xfs, reiserfs, fat, ...) don't do
it.  So let's become more consistent.

Spotted by ronny.pretzsch@dfs.de, initial fix by Jörn Engel.

Reported-by: <ronny.pretzsch@dfs.de>
Cc: <hare@suse.de>
Cc: Jörn Engel <joern@logfs.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:44 -07:00
Cyrill Gorcunov
2f6d311080 proc: vmcore - use kzalloc in get_new_element()
Instead of kmalloc+memset better use straight kzalloc

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:41 -07:00
Michal Simek
bcac2b1b7d procfs: remove sparse errors in proc_devtree.c
CHECK   fs/proc/proc_devtree.c
fs/proc/proc_devtree.c:197:14: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:203:34: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:210:14: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:223:26: warning: Using plain integer as NULL pointer
fs/proc/proc_devtree.c:226:14: warning: Using plain integer as NULL pointer

Signed-off-by: Michal Simek <monstr@monstr.eu>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:41 -07:00
Davide Libenzi
3fe4a975d6 epoll: fix nested calls support
This fixes a regression in 2.6.30.

I unfortunately accepted a patch time ago, to drop the "current" usage
from possible IRQ context, w/out proper thought over it.  The patch
switched to using the CPU id by bounding the nested call callback with a
get_cpu()/put_cpu().

Unfortunately the ep_call_nested() function can be called with a callback
that grabs sleepy locks (from own f_op->poll()), that results in epic
fails.  The following patch uses the proper "context" depending on the
path where it is called, and on the kind of callback.

This has been reported by Stefan Richter, that has also verified the patch
is his previously failing environment.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Reported-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:41 -07:00
Keika Kobayashi
d3d64df21d proc: export statistics for softirq to /proc
Export statistics for softirq in /proc/softirqs and /proc/stat.

1. /proc/softirqs
Implement /proc/softirqs which shows the number of softirq
for each CPU like /proc/interrupts.

2. /proc/stat
Add the "softirq" line to /proc/stat.
This line shows the number of softirq for all cpu.
The first column is the total of all softirqs and
each subsequent column is the total for particular softirq.

[kosaki.motohiro@jp.fujitsu.com: remove redundant for_each_possible_cpu() loop]
Signed-off-by: Keika Kobayashi <kobayashi.kk@ncos.nec.co.jp>
Reviewed-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:41 -07:00
David Teigland
c78a87d0a1 dlm: fix plock use-after-free
Fix a regression from the original addition of nfs lock support
586759f03e.  When a synchronous
(non-nfs) plock completes, the waiting thread will wake up and
free the op struct.  This races with the user thread in
dev_write() which goes on to read the op's callback field to
check if the lock is async and needs a callback.  This check
can happen on the freed op.  The fix is to note the callback
value before the op can be freed.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-06-18 13:42:42 -05:00
NeilBrown
671e1fcf63 nfsd: optimise the starting of zero threads when none are running.
Currently, if we ask to set then number of nfsd threads to zero when
there are none running, we set up all the sockets and register the
service, and then tear it all down again.
This is pointless.

So detect that case and exit promptly.
(also remove an assignment to 'error' which was never used.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
2009-06-18 09:42:41 -07:00
NeilBrown
82e12fe924 nfsd: don't take nfsd_mutex twice when setting number of threads.
Currently when we write a number to 'threads' in nfsdfs,
we take the nfsd_mutex, update the number of threads, then take the
mutex again to read the number of threads.

Mostly this isn't a big deal.  However if we are write '0', and
portmap happens to be dead, then we can get unpredictable behaviour.
If the nfsd threads all got killed quickly and the last thread is
waiting for portmap to respond, then the second time we take the mutex
we will block waiting for the last thread.
However if the nfsd threads didn't die quite that fast, then there
will be no contention when we try to take the mutex again.

Unpredictability isn't fun, and waiting for the last thread to exit is
pointless, so avoid taking the lock twice.
To achieve this, get nfsd_svc return a non-negative number of active
threads when not returning a negative error.

Signed-off-by: NeilBrown <neilb@suse.de>
2009-06-18 09:40:31 -07:00
Peter Zijlstra
d3a9262e59 fs: Provide empty .set_page_dirty() aop for anon inodes
.set_page_dirty() is one of those a_ops that defaults to the
buffer implementation when not set. Therefore provide a dummy
function to make it do nothing.

(Uncovered by perfcounters fd's which can now be writable-mmap-ed.)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-18 14:46:10 +02:00
Jan Kara
24a5d59f34 udf: Use device size when drive reported bogus number of written blocks
Some drives report 0 as the number of written blocks when there are some blocks
recorded. Use device size in such case so that we can automagically mount such
media.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-06-18 12:33:16 +02:00
James Morris
4bf259e3ae nfs: remove unnecessary NFS_INO_INVALID_ACL checks
Unless I'm mistaken, NFS_INO_INVALID_ACL is being checked twice during
getacl calls (i.e. first via nfs_revalidate_inode() and then by each all
site).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:14 -07:00
Chuck Lever
a5a16bae70 NFS: More "sloppy" parsing problems
Specifying "port=-5" with the kernel's current mount option parser
generates "unrecognized mount option".  If "sloppy" is set, this
causes the mount to succeed and use the default values; the desired
behavior is that, since this is a valid option with an invalid value,
the mount should fail, even with "sloppy."

To properly handle "sloppy" parsing, we need to distinguish between
correct options with invalid values, and incorrect options.  We will
need to parse integer values by hand, therefore, and not rely on
match_token().

For instance, these must all fail with "invalid value":

	port=12345678
	port=-5
	port=samuel

and not with "unrecognized option," as they do currently.

Thus, for the sake of match_token() we need to treat the values for
these options as strings, and do the conversion to integers using
strict_strtol().

This is basically the same solution we used for the earlier "retry="
fix (commit ecbb3845), except in this case the kernel actually has to
parse the value, rather than ignore it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:14 -07:00
Chuck Lever
d23c45fd84 NFS: Invalid mount option values should always fail, even with "sloppy"
Ian Kent reports:

"I've noticed a couple of other regressions with the options vers
and proto option of mount.nfs(8).

The commands:

mount -t nfs -o vers=<invalid version> <server>:/<path> /<mountpoint>
mount -t nfs -o proto=<invalid proto> <server>:/<path> /<mountpoint>

both immediately fail.

But if the "-s" option is also used they both succeed with the
mount falling back to defaults (by the look of it).

In the past these failed even when the sloppy option was given, as
I think they should. I believe the sloppy option is meant to allow
the mount command to still function for mount options (for example
in shared autofs maps) that exist on other Unix implementations but
aren't present in the Linux mount.nfs(8). So, an invalid value
specified for a known mount option is different to an unknown mount
option and should fail appropriately."

See RH bugzilla 486266.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:13 -07:00
Chuck Lever
065015e5ef NFS: Remove unused XDR decoder functions
Clean up: Remove xdr_decode_fhstatus() and xdr_decode_fhstatus3(), now
that they are unused.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:13 -07:00
Chuck Lever
8e02f6b9aa NFS: Update MNT and MNT3 reply decoding functions
Solder xdr_stream-based XDR decoding functions into the in-kernel mountd
client that are more careful about checking data types and watching for
buffer overflows.  The new MNT3 decoder includes support for auth-flavor
list decoding.

The "_sz" macro for MNT3 replies was missing the size of the file handle.
I've added this back, and included the size of the auth flavor array.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:13 -07:00
Chuck Lever
a14017db28 NFS: add XDR decoder for mountd version 3 auth-flavor lists
Introduce an xdr_stream-based XDR decoder that can unpack the auth-
flavor list returned in a MNT3 reply.

The nfs_mount() function's caller allocates an array, and passes the
size and a pointer to it.  The decoder decodes all the flavors it can
into the array, and returns the number of decoded flavors.

If the caller is not interested in the auth flavors, it can pass a
value of zero as the size of the pre-allocated array.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:12 -07:00
Chuck Lever
4fdcd9966d NFS: add new file handle decoders to in-kernel mountd client
Introduce xdr_stream-based XDR file handle decoders to the in-kernel
mountd client.  These are more careful than the existing decoder
functions about buffer overflows and data type and range checking.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:12 -07:00
Chuck Lever
fb12529577 NFS: Add separate mountd status code decoders for each mountd version
Introduce data structures and xdr_stream-based decoding functions for
unmarshalling mountd status codes properly.

Mountd version 3 uses specific standard error return codes that are
not errno values and not NFS3ERR_ values.  These have a well-defined
standard mapping to local errno values.  Introduce data structures
and a decoder function that map these status codes to local errno
values properly.  This is new functionality (but not used yet).

Version 1 mountd status values are defined by RFC 1094 as UNIX error
values (errno values).  Errno values on heterogeneous systems do not
necessarily match each other.  To avoid exposing possibly incorrect
errno values to upper layers, the current XDR decoder converts all
non-zero MNT version 1 status codes to -EACCES.

The OpenGroup XNFS standard provides a mapping similar to but smaller
than the version 3 error codes.  Implement a decoder that uses the XNFS
error codes, replacing the current decoder.

For both mountd protocol versions, map unrecognized errors to -EACCES.

Finally we introduce a replacement data structure for mnt_fhstatus
at this time, which is used by the new XDR decoders.  In addition to
documenting that the status value returned by the XDR decoders is
always an errno, this new structure will be expanded in subsequent
patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:12 -07:00
Chuck Lever
99835db430 NFS: remove unused function in fs/nfs/mount_clnt.c
Clean up: remove xdr_encode_dirpath() now that it has been replaced.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:11 -07:00
Chuck Lever
29a1bd6bf8 NFS: Use xdr_stream-based XDR encoder for MNT's dirpath argument
Check the length of the supplied dirpath, and see that it fits
properly in the RPC buffer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:11 -07:00
Chuck Lever
2ad780978b NFS: Clean up MNT program definitions
Clean up:  Relocate MNT program procedure number definitions to the
only file that uses them.  Relocate the version number definitions,
which are shared, to nfs.h.  Remove duplicate program number
definitions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:11 -07:00
Chuck Lever
0e5c2632e1 lockd: Don't bother with RPC ping for NSM upcalls
Cut NSM upcall RPC traffic in half -- don't do a NULL call first.
The cases where a ping would be helpful are rare.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:11 -07:00
Chuck Lever
6c9dc42551 lockd: Update NSM state from SM_MON replies
When rpc.statd starts up in user space at boot time, it attempts to
write the latest NSM local state number into
/proc/sys/fs/nfs/nsm_local_state.

If lockd.ko isn't loaded yet (as is the case in most configurations),
that file doesn't exist, thus the kernel's NSM state remains set to
its initial value of zero during lockd operation.

This is a problem because rpc.statd and lockd use the NSM state number
to prevent repeated lock recovery on rebooted hosts.  If lockd sends
a zero NSM state, but then a delayed SM_NOTIFY with a real NSM state
number is received, there is no way for lockd or rpc.statd to
distinguish that stale SM_NOTIFY from an actual reboot.  Thus lock
recovery could be performed after the rebooted host has already
started reclaiming locks, and those locks will be lost.

We could change /etc/init.d/nfslock so it always modprobes lockd.ko
before starting rpc.statd.  However, if lockd.ko is ever unloaded
and reloaded, we are back at square one, since the NSM state is not
preserved across an unload/reload cycle.  This may happen frequently
on clients that use automounter.  A period of NFS inactivity causes
lockd.ko to be unloaded, and the kernel loses its NSM state setting.

Instead, let's use the fact that rpc.statd plants the local system's
NSM state in every SM_MON (and SM_UNMON) reply.  lockd performs a
synchronous SM_MON upcall to the local rpc.statd _before_ sending its
first NLM request to a new remote.  This would permit rpc.statd to
provide the current NSM state to lockd, even after lockd.ko had been
unloaded and reloaded.

Note that NLMPROC_LOCK arguments are constructed before the
nsm_monitor() call, so we have to rearrange argument construction very
slightly to make this all work out.

And, the kernel appears to treat NSM state as a u32 (see struct
nlm_args and nsm_res).  Make nsm_local_state a u32 as well, to ensure
we don't get bogus comparison results.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:10 -07:00
Chuck Lever
18fc316419 NFS: Fix false error return from nfs_callback_up() if ipv6.ko is not available
Clear "ret" if the error return from svc_create_xprt(AF_INET6) was
-EAFNOSUPORT.  Otherwise, callback start-up will succeed, but
nfs_callback_up() will return -EAFNOSUPPORT anyway, and the first
NFSv4 mount attempt after a reboot will fail.

Bug introduced by commit f738f517 in 2.6.30-rc1.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:10 -07:00
Chuck Lever
a21bdd9b96 NFS: Return error code from nfs_callback_up() to user space
If the kernel cannot start the NFSv4 callback service during a mount
request, it returns -ENOMEM to user space, resulting in this message:

   mount.nfs4: Cannot allocate memory

Adjust nfs_alloc_client() and nfs_get_client() to pass NFSv4 callback
start-up errors back to user space so a less mysterious error message
can be displayed by the mount command.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:10 -07:00
Chuck Lever
c381ad2cf2 NFS: Do not display the setting of the "intr" mount option
The "intr" mount option has been deprecated for a while, but
/proc/mounts continues to display "nointr" whether "intr" or "nointr"
has been specified for a mount point.

Since these options do not have any effect, simply do not display
them.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:09 -07:00
Suresh Jayaraman
bf40d3435c NFS: add support for splice writes
Adds support for splice writes. It effectively calls
generic_file_splice_write() to do the writes.

We need not worry about O_APPEND case as the combination of splice()
writes and O_APPEND is disallowed. This patch propagates NFS write
errors back to the caller. The number of bytes written via splice are
being added to NFSIO_NORMALWRITTENBYTES as these are effectively
cached writes.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 18:02:09 -07:00
Trond Myklebust
301933a0ac Merge commit 'linux-pnfs/nfs41-for-2.6.31' into nfsv41-for-2.6.31 2009-06-17 17:59:58 -07:00
Hisashi Hifumi
536fc240e7 jbd2: clean up jbd2_journal_try_to_free_buffers()
This patch reverts 3f31fddf, which is no longer needed because if a
race between freeing buffer and committing transaction functionality
occurs and dio gets error, currently dio falls back to buffered IO due
to the commit 6ccfa806.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Mingming Cao <cmm@us.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 20:08:51 -04:00
Ricardo Labiaga
68f3f90133 nfs41: Backchannel: CB_SEQUENCE validation
Validates the callback's sessionID, the slot number, and the sequence ID.
Increments the slot's sequence.

Detects replays, but simply prints a debug message (if debugging is enabled
since we don't yet implement a duplicate request cache for the backchannel.
This should not present a problem, since only idempotent callbacks are
currently implemented.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Backchannel: Be more obvious about the return value]
[nfs41: Backchannel: dprink in host order]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:43 -07:00
Ricardo Labiaga
963891ac43 nfs41: Backchannel: New find_client_with_session()
Finds the 'struct nfs_client' that matches the server's address, major
version number, and session ID.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:43 -07:00
Ricardo Labiaga
f8625a6a4b nfs41: Backchannel: Add a backchannel slot table to the session
Defines a new 'struct nfs4_slot_table' in the 'struct nfs4_session'
for use by the backchannel.  Initializes, resets, and destroys the backchannel
slot table in the same manner the forechannel slot table is initialized,
reset, and destroyed.

The sequenceid for each slot in the backchannel slot table is initialized
to 0, whereas the forechannel slotid's sequenceid is set to 1.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:42 -07:00
Ricardo Labiaga
050047ce71 nfs41: Backchannel: Refactor nfs4_init_slot_table()
Generalize nfs4_init_slot_table() so it can be used to initialize the
backchannel slot table in addition to the forechannel slot table.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:42 -07:00
Ricardo Labiaga
b73dafa7ac nfs41: Backchannel: Refactor nfs4_reset_slot_table()
Generalize nfs4_reset_slot_table() so it can be used to reset the
backchannel slot table in addition to the forechannel slot table.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:41 -07:00
Ricardo Labiaga
65fc64e547 nfs41: Backchannel: update cb_sequence args and results
Change the type of cs_addr and csr_status to 'struct sockaddr' and
'__be32' since the cb_sequence processing function will use existing
functionality that expects these types.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:40 -07:00
Benny Halevy
281fe15dc1 nfs41: verify CB_SEQUENCE position in callback compound
CB_SEQUENCE must appear first in the callback compound RPC.
If it is not the first operation NFS4ERR_SEQUENCE_POS must be returned.
If the first operation ni the CB_COMPOUND is not CB_SEQUENCE then
NFS4ERR_OP_NOT_IN_SESSION must be returned.

Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: refactor op preprocessing out of process_op]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:39 -07:00
Benny Halevy
4aece6a19c nfs41: cb_sequence xdr implementation
[nfs41: get rid of READMEM and COPYMEM for callback_xdr.c]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: get rid of READ64 in callback_xdr.c]
See http://linux-nfs.org/pipermail/pnfs/2009-June/007846.html
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:38 -07:00
Benny Halevy
d49433e1e3 nfs41: cb_sequence proc implementation
Currently, just free up any referring calls information.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: fix csr_{,target}highestslotid]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:38 -07:00
Benny Halevy
2d9b9ec344 nfs41: cb_sequence protocol level data structures
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:37 -07:00
Benny Halevy
34bc47c941 nfs41: consider minorversion in callback_xdr:process_op
Note that this patch changes the nfsv4.0 behavior also when
CONFIG_NFS_V4_1 is not defined where NFS4ERR_MINOR_VERS_MISMATCH
will be returned if the client received a CB_COMPOUND
with minorversion != 0.  Previously, it would have
returned NFS4ERR_OP_ILLEGAL for CB_SEQUENCE.
(or if the server is broken and sent OP_CB_GETATTR or OP_CB_RECALL
with minorversion!=0, they would have been processed normally.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: refactor op preprocessing out of process_op]
See http://linux-nfs.org/pipermail/pnfs/2009-June/007845.html
[nfs41: define CB_NOTIFY_DEVICEID as not supported]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:37 -07:00
Benny Halevy
45377b94ed nfs41: callback numbers definitions
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:36 -07:00
Benny Halevy
48a9e2d228 nfs41: decode minorversion 1 cb_compound header
decode cb_compound header conforming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

Get rid of cb_compound_hdr_arg.callback_ident

callback_ident is not used anywhere so we shouldn't waste any memory to
store it.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: no need to break read_buf in decode_compound_hdr_arg]
See http://linux-nfs.org/pipermail/pnfs/2009-June/007844.html
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:35 -07:00
Benny Halevy
b8f2ef84b0 nfs41: store minorversion in cb_compound_hdr_arg
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:35 -07:00
Andy Adamson
5a0ffe544c nfs41: Release backchannel resources associated with session
Frees the preallocated backchannel resources that are associated with
this session when the session is destroyed.

A backchannel is currently created once per session. Destroy the backchannel
only when the session is destroyed.

Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:34 -07:00
Andy Adamson
0f91421e8e nfs41: Client indicates presence of NFSv4.1 callback channel.
Set the SESSION4_BACK_CHAN flag to indicate the client supports a backchannel.

Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:33 -07:00
Andy Adamson
0b5b7ae0a8 nfs41: Setup the backchannel
The NFS v4.1 callback service has already been setup, and
rpc_xprt->serv points to the svc_serv structure describing it.
Invoke the xprt_setup_backchannel() initialization to pre-
allocate the necessary backchannel structures.

Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: change nfs4_put_session(nfs4_session**) to nfs4_destroy_session(nfs_session*)]
Signed-off-by: Alexandros Batsakis <Alexandros.Batsakis@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[moved xprt_setup_backchannel from nfs4_init_session to nfs4_init_backchannel]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:32 -07:00
Andy Adamson
e82dc22dac nfs41: Allow NFSv4 and NFSv4.1 callback services to coexist
Tracks the nfs_callback_info for both versions, enabling the callback
service for v4 and v4.1 to run concurrently and be stopped independently
of each other.

Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:32 -07:00
Benny Halevy
8f97524235 nfs41: create a svc_xprt for nfs41 callback thread and use for incoming callbacks
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:31 -07:00
Ricardo Labiaga
a43cde94fe nfs41: Implement NFSv4.1 callback service process.
nfs41_callback_up() initializes the necessary queues and creates the new
nfs41_callback_svc thread.  This thread executes the callback service which
waits for requests to arrive on the svc_serv->sv_cb_list.

NFS41_BC_MIN_CALLBACKS is set to 1 because we expect callbacks to not
cause substantial latency.

The actual processing of the callback will be implemented as a separate patch.

There is only one NFSv4.1 callback service.  The first caller of
nfs4_callback_up() creates the service, subsequent callers increment a
reference count on the service.  The service is destroyed when the last
caller invokes nfs_callback_down().

The transport needs to hold a reference to the callback service in order
to invoke it during callback processing.  Currently this reference is only
obtained when the service is first created.  This is incorrect, since
subsequent registrations for other transports will leave the xprt->serv
pointer uninitialized, leading to an oops when a callback arrives on
the "unreferenced" transport.

This patch fixes the problem by ensuring that a reference to the service
is saved in xprt->serv, either because the service is created by this
invocation to nfs4_callback_up() or by a prior invocation.

Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Add a reference to svc_serv during callback service bring up]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Type check arguments of nfs_callback_up]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: save svc_serv in nfs_callback_info]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Removal of ugly #ifdefs]
[nfs41: Update to removal of ugly #ifdefs]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 14:11:29 -07:00
Trond Myklebust
5cd973c44a NFSv4/NLM: Push file locking BKL dependencies down into the NLM layer
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 13:23:01 -07:00
Trond Myklebust
3f09df70e3 NFS: Ensure we always hold the BKL when dereferencing inode->i_flock
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 13:23:00 -07:00
Trond Myklebust
965b5d6791 NFSv4: Handle more errors when recovering open file and locking state
It is possible for servers to return NFS4ERR_BAD_STATEID when
the state management code is recovering locks or is reclaiming state when
returning a delegation. Ensure that we handle that case.
While we're at it, add in handlers for NFS4ERR_STALE,
NFS4ERR_ADMIN_REVOKED, NFS4ERR_OPENMODE, NFS4ERR_DENIED and
NFS4ERR_STALE_STATEID, since the protocol appears to allow for them too.

Also handle ENOMEM...

Finally, rather than add new NFSv4.0-specific errors and error handling into
the generic delegation code, move that open file and locking state error
handling into the NFSv4 layer.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 13:22:59 -07:00
Trond Myklebust
d5122201a7 NFSv4: Move error handling out of the delegation generic code
The NFSv4 delegation recovery code is required by the protocol to handle
more errors. Rather than add NFSv4.0 specific errors into 'generic'
delegation code, we should move the error handling into the NFSv4 layer.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 13:22:58 -07:00
Trond Myklebust
01c3f05228 NFSv4: Fix the 'nolock' option regression
NFSv4 should just ignore the 'nolock' option. It is an NFSv2/v3 thing...
This fixes the Oops in http://bugzilla.kernel.org/show_bug.cgi?id=13330

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 13:22:58 -07:00
Benny Halevy
7146851376 nfs41: minorversion support for nfs4_{init,destroy}_callback
move nfs4_init_callback into nfs4_init_client_minor_version
and nfs4_destroy_callback into nfs4_clear_client_minor_version

as these need to happen also when auto-negotiating the minorversion
once the callback service for nfs41 becomes different than for nfs4.0

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Fix checkpatch warning]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Type check arguments of nfs_callback_up]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Backchannel: Remove FIXME comment]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 13:06:01 -07:00
Benny Halevy
9bdaa86d2a nfs41: Refactor nfs4_{init,destroy}_callback for nfs4.0
Refactor-out code to bring the callback service up and down.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 12:43:47 -07:00
Benny Halevy
34dc1ad752 nfs41: increment_{open,lock}_seqid
Unlike minorversion0, in nfsv4.1 the open and lock seqids need
not be incremented by the client and should always be set to zero.

This is implemented using a new nfs_rpc_ops methods -
increment_open_seqid and increment_lock_seqid

Signed-off-by: Rahul Iyer <iyer@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: check for session not minorversion]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:43:45 -07:00
Andy Adamson
78722e9c92 nfs41: only retry EXCHANGE_ID on recoverable errors
Stops an infinite loop of EXCHANGE_ID.

Signed-off-by: Andy Adamson <andros@netapp.com>
[fixed checkpatch warnings]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 12:43:44 -07:00
Benny Halevy
008f55d0e0 nfs41: recover lease in _nfs4_lookup_root
This creates the nfsv4.1 session on mount.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:13 -07:00
Andy Adamson
b4b82607ff nfs41: get_clid_cred for EXCHANGE_ID
Unlike SETCLIENTID, EXCHANGE_ID requires a machine credential. Do not search
for credentials other than the machine credential.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:13 -07:00
Andy Adamson
90a16617ee nfs41: add a get_clid_cred function to nfs4_state_recovery_ops
EXCHANGE_ID has different credential requirements than SETCLIENTID.
Prepare for a separate credential function.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:12 -07:00
Andy Adamson
591d71cbde nfs41: establish sessions-based clientid
nfsv4.1 clientid is established via EXCHANGE_ID rather than
SETCLIENTID{,_CONFIRM}

This is implemented using a new establish_clid method in
nfs4_state_recovery_ops.

nfs41: establish clientid via exchange id only if cred != NULL

>From 2.6.26 reclaimer() uses machine cred for setting up the client id
therefore it is never expected to be NULL.

Signed-off-by: Rahul Iyer <iyer@netapp.com>
[removed dprintk]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: lease renewal]
[revamped patch for new nfs4_state_manager design]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:11 -07:00
Andy Adamson
a7b721037f nfs41: introduce get_state_renewal_cred
Use the machine cred for sending SEQUENCE to renew
the client's lease.

[revamp patch for new state management design starting 2.6.29]
[nfs41: support minorversion 1 for nfs4_check_lease]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: get cred in exchange_id when cred arg is NULL]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: use cl_machined_cred instead of cl_ex_cred]
    Since EXCHANGE_ID insists on using the machine credential, cl_ex_cred is
    not needed. nfs4_proc_exchange_id() is only called if the machine credential
    is available. Remove the credential logic from nfs4_proc_exchange_id.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:11 -07:00
Benny Halevy
8e69514f29 nfs41: support minorversion 1 for nfs4_check_lease
[moved nfs4_get_renew_cred related changes to
 "nfs41: introduce get_state_renewal_cred"]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:10 -07:00
Benny Halevy
29fba38b79 nfs41: lease renewal
Send a NFSv4.1 SEQUENCE op rather than RENEW that was deprecated in
minorversion 1.
Use the nfs_client minorversion to select reboot_recover/
network_partition_recovery/state_renewal ops.

Note: we use reclaimer to create the nfs41 session before there are any
cl_superblocks for the nfs_client.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: check for session not minorversion]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[revamped patch for new nfs4_state_manager design]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: obliterate nfs4_state_recovery_ops.renew_lease method]
    moved to nfs4_state_maintenance_ops
[also undid per-minorversion nfs4_state_recovery_ops here]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:09 -07:00
Andy Adamson
b069d94af7 nfs41: schedule async session reset
Define a new session reset state which is set upon a sequence operation error
in both the sync and async error handlers.

Place all new requests and all but the last outstanding rpc on the
slot_tbl_waitq. Spawn the recovery thread when the last slot is free.
Call nfs4_proc_destroy_session, reinitialize the session, call
nfs4_proc_create_session, clear the session reset state, and wake up the next
task on the slot_tbl_waitq.

Return the nfs4_proc_destroy_session status to the session reclaimer and
check for NFS4ERR_BADSESSION and NFS4ERR_DEADSESSION. Other destroy session
errors should be handled in nfs4_proc_destroy_session where the call can
be retried with adjusted arguments.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
nfs41: make nfs4_wait_bit_killable public]
    nfs4_wait_bit_killable to be used by NFSv4.1 session recover logic.
Signed-off-by: Rahul Iyer <iyer@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: have create_session work on nfs_client]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: trigger the state manager for session reset]
    Replace the session reset state with the NFS4CLNT_SESSION_SETUP cl_state.
    Place all rpc tasks to sleep on the slot table waitqueue until the slot
    table is drained, then schedule state recovery and wait for it to complete.
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: remove nfs41_session_recovery [ch]
Replaced by using the nfs4_state_manager.
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: nfs4_wait_bit_killable only used locally]
[nfs41: keep nfs4_wait_bit_killable static]
[nfs41: keep const nfs_server in nfs4_handle_exception]
[nfs41: remove session parameter from nfs4_find_slot]
Signed-off-by: Andy Adamson <andros@netapp.com
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: resset the session from nfs41_setup_sequence]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:09 -07:00
Andy Adamson
4745e3154b nfs41: kick start nfs41 session recovery when handling errors
Remove checking for any errors that the SEQUENCE operation does not return.
-NFS4ERR_STALE_CLIENTID, NFS4ERR_EXPIRED, NFS4ERR_CB_PATH_DOWN, NFS4ERR_BACK_CHAN_BUSY, NFS4ERR_OP_NOT_IN_SESSION.

SEQUENCE operation error recovery is very primative, we only reset the session.

Remove checking for any errors that are returned by the SEQUENCE operation, but
that resetting the session won't address.
NFS4ERR_RETRY_UNCACHED_REP, NFS4ERR_SEQUENCE_POS,NFS4ERR_TOO_MANY_OPS.

Add error checking for missing SEQUENCE errors that a session reset will
address.
NFS4ERR_BAD_HIGH_SLOT, NFS4ERR_DEADSESSION, NFS4ERR_SEQ_FALSE_RETRY.

A reset of the session is currently our only response to a SEQUENCE operation
error. Don't reset the session on errors where a new session won't help.

Don't reset the session on errors where a new session won't help.

[nfs41: nfs4_async_handle_error update error checking]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: trigger the state manager for session reset]
    Replace session state bit with nfs_client state bit.  Set the
    NFS4CLNT_SESSION_SETUP bit upon a session related error in the sync/async
    error handlers.
[nfs41: _nfs4_async_handle_error fix session reset error list]
Sequence operation errors that session reset could help.
NFS4ERR_BADSESSION
NFS4ERR_BADSLOT
NFS4ERR_BAD_HIGH_SLOT
NFS4ERR_DEADSESSION
NFS4ERR_CONN_NOT_BOUND_TO_SESSION
NFS4ERR_SEQ_FALSE_RETRY
NFS4ERR_SEQ_MISORDERED

Sequence operation errors that a session reset would not help

NFS4ERR_BADXDR
NFS4ERR_DELAY
NFS4ERR_REP_TOO_BIG
NFS4ERR_REP_TOO_BIG_TO_CACHE
NFS4ERR_REQ_TOO_BIG
NFS4ERR_RETRY_UNCACHED_REP
NFS4ERR_SEQUENCE_POS
NFS4ERR_TOO_MANY_OPS

Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41 nfs4_handle_exception fix session reset error list]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[moved nfs41_sequece_call_done code to nfs41: sequence operation]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:08 -07:00
Andy Adamson
eedc020e71 nfs41: use rpc prepare call state for session reset
[nfs41: change nfs4_restart_rpc argument]
[nfs41: check for session not minorversion]
[nfs41: trigger the state manager for session reset]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[always define nfs4_restart_rpc]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:07 -07:00
Andy Adamson
c3fad1b1aa nfs41: add session reset to state manager
Move the code to reset a session from the session_reclaimer to the
nfs4_state_manager.  Destroy the session, and create a new one. Treat
NFS4ERR_BADSESSION and NFS4ERR_DEADSESSION as a successful
nfs4_proc_destroy_session. Signal nfs4_proc_create_session that this is a
session reset so that the session slot table is re-used.

If the clientid is stale, set both NFS4CLNT_LEASE_EXPIRED and
NFS4CLNT_SESSION_SETUP bits and retry.

Use a switch statement in nfs4_session_recovery_handle_error for future
patche which will add handling for other errors.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

[nfs41: session reset in nfs4_recovery_handle_error]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: reset session on nfs4_do_reclaim session reset error]
    If nfs4_do_reclaim gets a session reset error, nfs4_recovery_handle_error
    will set the NFS4CLNT_SESSION_SETUP bit, and the state manager should
    continue processing to reset the session.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[move nfs4_proc_destroy_session declaration here]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:06 -07:00
Andy Adamson
76db6d9500 nfs41: add session setup to the state manager
At mount, nfs_alloc_client sets the cl_state NFS4CLNT_LEASE_EXPIRED bit
and nfs4_alloc_session sets the NFS4CLNT_SESSION_SETUP bit, so both bits are
set when nfs4_lookup_root calls nfs4_recover_expired_lease which schedules
the nfs4_state_manager and waits for it to complete.

Place the session setup after the clientid establishment in nfs4_state_manager
so that the session is setup right after the clientid has been established
without rescheduling the state manager.

Unlike nfsv4.0, the nfs_client struct is not ready to use until the session
has been established.  Postpone marking the nfs_client struct to NFS_CS_READY
until after a successful CREATE_SESSION call so that other threads cannot use
the client until the session is established.

If the EXCHANGE_ID call fails and the session has not been setup (the
NFS4CLNT_SESSION_SETUP bit is set), mark the client with the error and return.

If the session setup CREATE_SESSION call fails with NFS4ERR_STALE_CLIENTID
which could occur due to server reboot or network partition inbetween the
EXCHANGE_ID and CREATE_SESSION call, reset the NFS4CLNT_LEASE_EXPIRED and
NFS4CLNT_SESSION_SETUP bits and try again.

If the CREATE_SESSION call fails with other errors, mark the client with
the error and return.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

[nfs41: NFS_CS_SESSION_SETUP cl_cons_state for back channel setup]
  On session setup, the CREATE_SESSION reply races with the server back channel
  probe which needs to succeed to setup the back channel. Set a new
  cl_cons_state NFS_CS_SESSION_SETUP just prior to the CREATE_SESSION call
  and add it as a valid state to nfs_find_client so that the client back channel
  can find the nfs_client struct and won't drop the server backchannel probe.
  Use a new cl_cons_state so that NFSv4.0 back channel behaviour which only
  sets NFS_CS_READY is unchanged.
  Adjust waiting on the nfs_client_active_wq accordingly.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>

[nfs41: rename NFS_CS_SESSION_SETUP to NFS_CS_SESSION_INITING]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: set NFS_CL_SESSION_INITING in alloc_session]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: move session setup into a function]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[moved nfs4_proc_create_session declaration here]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:06 -07:00
Andy Adamson
ac72b7b3b3 nfs41: reset the session slot table
Separated from nfs41: schedule async session reset

Do not kfree the session slot table upon session reset, just re-initialize it.
Add a boolean to nfs4_proc_create_session to inidicate if this is a
session reset or a session initialization.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:05 -07:00
Andy Adamson
fc01cea963 nfs41: sequence operation
Implement the sequence operation conforming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

Check returned sessionid, slotid and slot sequenceid in decode_sequence.

If the server returns different values for sessionID, slotID or slot sequence
number than what was sent, the server is looney tunes.

Pass the sequence operation status to nfs41_sequence_done in order to
determine when to increment the slot sequence ID.

Free slot is separated from sequence done.

Signed-off-by: Rahul Iyer <iyer@netapp.com>
Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Andy Adamson<andros@umich.edu>
[nfs41: sequence res use slotid]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: deref slot table in decode_sequence only for minorversion!=0]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_call_sync]
[nfs41: remove SEQ4_STATUS_USE_TK_STATUS]
[nfs41: return ESERVERFAULT in decode_sequence]
[no sr_session, no sr_flags]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: use nfs4_call_sync_sequence to renew session lease]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove nfs4_call_sync_sequence forward definition]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: use struct nfs_client for nfs41_proc_async_sequence]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: pass *session in seq_args and seq_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41 nfs41_sequence_call_done update error checking]
[nfs41 nfs41_sequence_done update error checking]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove switch on error from nfs41_sequence_call_done]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:04 -07:00
Andy Adamson
8328d59f38 nfs41: enable nfs_client only nfs4_async_handle_error
The session is per struct nfs_client, not per nfs_server. Allow the handler
to be called with no nfs_server which simplifies the nfs4_proc_async_sequence session renewal call and will let it be used by pnfs file layout data servers.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:25:04 -07:00
Andy Adamson
0f3e66c6a6 nfs41: destroy_session operation
Implement the destroy_session operation conforming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove extraneous rpc_clnt pointer]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41; NFS_CS_READY required for DESTROY_SESSION]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: pass *session in seq_args and seq_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[nfs41: fix encode_destroy_session's xdr Xcoding pointer type]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 12:24:55 -07:00
Andy Adamson
96b09e024f nfs41: use session attributes for rsize and wsize
Set the mount points rsize and wsize to the negotiated session fore channel
maximum response and requeset size. These values will be bound checked in
nfs_server_set_fsinfo.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[move nfs4_session_set_rwsize into CONFIG_NFS_V4]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:24:53 -07:00
Andy Adamson
8d35301d7d nfs41: verify session channel attribues
Invalidate the session if the server returns invalid fore or back channel
attributes.

Use a KERN_WARNING to report the fatal session estabishment error.

Signed-off-by: Andy Adamson <andros@netapp.com>
[refactor nfs4_verify_channel_attrs]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:24:52 -07:00
Andy Adamson
fc931582c2 nfs41: create_session operation
Implement the create_session operation conforming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

Set the real fore channel max operations to preserve server resources.
Note: If the server returns < NFS4_MAX_OPS, the client will very soon
get an NFS4ERR_TOO_MANY_OPS. A later patch will handle this.

Set the max_rqst_sz and max_resp_sz to PAGE_SIZE - we preallocate the buffers.

Set the back channel max_resp_sz_cached to zero to force the client to
always set csa_cachethis to FALSE because the current implementation
of the back channel DRC only supports caching the CB_SEQUENCE operation.

The client back channel server supports one slot, and desires 2 operations
per compound.

Signed-off-by: Ricardo Labiaga <ricardo.labiaga@netapp.com>
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove extraneous rpc_clnt pointer]
Use the struct nfs_client cl_rpcclient.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_init_channel_attrs, just use nfs41_create_session_args]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: use rsize and wsize for session channel attributes]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: set channel max operations]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: set back channel attributes]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: obliterate nfs4_adjust_channel_attrs]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: have create_session work on nfs_client]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: move CONFIG_NFS_V4_1 endif]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: pass *session in seq_args and seq_res]
[moved nfs4_init_slot_table definition here]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: use kcalloc to allocate slot table]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[nfs41: fix Xcode_create_session's xdr Xcoding pointer type]
[nfs41: refactor decoding of channel attributes]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 12:24:34 -07:00
Andy Adamson
2050f0cc07 nfs41: get_lease_time
get_lease_time uses the FSINFO rpc operation to
get the lease time attribute.

nfs4_get_lease_time() is only called from the state manager on session setup
so don't recover from clientid or sequence level errors.

We do need to recover from NFS4ERR_DELAY or NFS4ERR_GRACE.
Use NFS4_POLL_RETRY_MIN - the Linux server returns NFS4ERR_DELAY when an
upcall is needed to resolve an uncached export referenced by a file handle.

[nfs41: sequence res use slotid]
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove extraneous rpc_clnt pointer]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: have get_lease_time work on nfs_client]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: get_lease_time recover from NFS4ERR_DELAY]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: pass *session in seq_args and seq_res]
[define nfs4_get_lease_time_{args,res}]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 12:24:32 -07:00
Benny Halevy
99fe60d062 nfs41: exchange_id operation
Implement the exchange_id operation conforming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

Unlike NFSv4.0, NFSv4.1 requires machine credentials. RPC_AUTH_GSS machine
credentials will be passed into the kernel at mount time to be available for
the exchange_id operation.

RPC_AUTH_UNIX root mounts can use the UNIX root credential. Store the root
credential in the nfs_client struct.

Without a credential, NFSv4.1 state renewal fails.

[nfs41: establish clientid via exchange id only if cred != NULL]
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: move nfstime4 from under CONFIG_NFS_V4_1]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: do not wait a lease time in exchange id]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: pass *session in seq_args and seq_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[nfs41: Ignoring impid in decode_exchange_id is missing a READ_BUF]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: fix Xcode_exchange_id's xdr Xcoding pointer type]
[nfs41: get rid of unused struct nfs41_exchange_id_res members]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
2009-06-17 12:23:57 -07:00
Andy Adamson
938e101091 nfs41 delegreturn sequence setup done support
Separate delegreturn calls from nfs41: sequence setup/done support

Implement the delegreturn rpc_call_prepare method for
asynchronuos nfs rpcs, call nfs41_setup_sequence from
respective rpc_call_validate_args methods.

Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:51 -07:00
Andy Adamson
21d9a851aa nfs41 commit sequence setup done support
Separate commit calls from nfs41: sequence setup/done support

Implement the commit rpc_call_prepare method for
asynchronuos nfs rpcs, call nfs41_setup_sequence from
respective rpc_call_validate_args methods.

Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Support sessions with O_DIRECT.]
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: separate free slot from sequence done]
[nfs41: nfs4_sequence_free_slot use nfs_client for data server]
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:50 -07:00
Andy Adamson
def6ed7ef4 nfs41 write sequence setup done support
Separate write calls from nfs41: sequence setup/done support

Implement the write rpc_call_prepare method for
asynchronuos nfs rpcs, call nfs41_setup_sequence from
respective rpc_call_validate_args methods.

Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
Signed-off-by: Andy Adamson <andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[move the nfs4_sequence_free_slot call in nfs_readpage_retry from]
[nfs41: separate free slot from sequence done
Signed-off-by: Andy Adamson <andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Support sessions with O_DIRECT.]
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_sequence_free_slot use nfs_client for data server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:49 -07:00
Andy Adamson
f11c88af26 nfs41: read sequence setup/done support
Implement the read rpc_call_prepare method for
asynchronuos nfs rpcs, call nfs41_setup_sequence from
respective rpc_call_validate_args methods.

Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
Signed-off-by: Andy Adamson <andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[move the nfs4_sequence_free_slot call in nfs_readpage_retry from]
[nfs41: separate free slot from sequence done]
[remove nfs_readargs.nfs_server, use calldata->inode instead]
Signed-off-by: Andy Adamson <andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Support sessions with O_DIRECT]
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_sequence_free_slot use nfs_client for data server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:48 -07:00
Andy Adamson
472cfbd9b9 nfs41: unlink sequence setup/done support
Implement the rpc_call_prepare methods for
asynchronuos nfs rpcs, call nfs41_setup_sequence from
respective rpc_call_validate_args methods.

Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: separate free slot from sequence done]
[nfs41: sequence res use slotid]
[nfs41: remove SEQ4_STATUS_USE_TK_STATUS]
[nfs41: nfs4_sequence_free_slot use nfs_client for data server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:47 -07:00
Andy Adamson
a893693c15 nfs41: locku sequence setup/done support
Separate nfs4_locku calls from nfs41: sequence setup/done support
Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
Signed-off-by: Andy Adamson <andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_sequence_free_slot use nfs_client for data server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:46 -07:00
Andy Adamson
66179efee3 nfs41: lock sequence setup/done support
Separate nfs4_lock calls from nfs41: sequence setup/done support
Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
[use nfs4_sequence_done_free_slot]
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:45 -07:00
Andy Adamson
d898528cdb nfs41: open sequence setup/done support
Separate nfs4_open calls from nfs41: sequence setup/done support
Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
[use nfs4_sequence_done_free_slot]
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:44 -07:00
Andy Adamson
19ddab06ed nfs41: close sequence setup/done support
Separate nfs4_close calls from nfs41: sequence setup/done support
Call nfs4_sequence_done from respective rpc_call_done methods.

Note that we need to pass a pointer to the nfs_server in calls data
for passing on to nfs4_sequence_done.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pnfs: client data server write validate and release]
Signed-off-by: Andy Adamson<andros@umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: separate free slot from sequence done]
[nfs41: sequence res use slotid]
[nfs41: remove SEQ4_STATUS_USE_TK_STATUS]
[nfs41: nfs4_sequence_free_slot use nfs_client for data server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:43 -07:00
Andy Adamson
69ab40c4c3 nfs41: nfs41_call_sync_done
Implement nfs4.1 synchronous rpc_call_done method
that essentially just calls nfs4_sequence_done, that turns
around and calls nfs41_sequence_done for minorversion1 rpcs.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: check for session not minorversion]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[move adding nfs4_sequence_free_slot from nfs41-separate-free-slot-from-sequence-done]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs41_call_sync_data use nfs_client not nfs_server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:42 -07:00
Andy Adamson
b0df806c0f nfs41: nfs41_sequence_done
Handle session level errors, update slot sequence id and
sessions bookeeping, free slot.

[nfs41: sequence res use slotid]
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove SEQ4_STATUS_USE_TK_STATUS]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: check for session not minorversion]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: bail out early out of nfs41_sequence_done if !res->sr_session]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[move nfs4_sequence_done from nfs41: nfs41_call_sync_done]
Signed-off-by: Andy Adamson <andros@netapp.com>
[move nfs4_sequence_free_slot from nfs41: separate free slot from sequence done]
    Don't free the slot until after all rpc_restart_calls have completed.
    Session reset will require more work.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[moved reset sr_slotid to nfs41_sequence_free_slot]
[free slot also on unexpectecd error]
[remove seq_res.sr_session member, use nfs_client's instead]
[ditch seq_res.sr_flags until used]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[look at sr_slotid for bailing out early from nfs41_sequence_done]
[nfs41: rpc_wake_up_next if sessions slot was not consumed.]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_sequence_free_slot use nfs_client for data server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove unused error checking in nfs41_sequence_done]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: remove nfs4_has_session check in nfs41_sequence_done]
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: remove nfs_client pointer check]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:41 -07:00
Andy Adamson
13615871cd nfs41: nfs41_sequence_free_slot
[from nfs41: separate free slot from sequence done]

Don't free the slot until after all rpc_restart_calls have completed.
Session reset will require more work.

As noted by Trond, since we're using rpc_wake_up_next rather than
rpc_wake_up() we must always wake up the next task in the queue
either by going through nfs4_free_slot, or just calling
rpc_wake_up_next if no slot is to be freed.

[nfs41: sequence res use slotid]
[nfs41: remove SEQ4_STATUS_USE_TK_STATUS]
[got rid of nfs4_sequence_res.sr_session, use nfs_client.cl_session instead]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: rpc_wake_up_next if sessions slot was not consumed.]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_sequence_free_slot use nfs_client for data server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:41 -07:00
Andy Adamson
e2c4ab3ce2 nfs41: free slot
Free a slot in the slot table.

Mark the slot as free in the bitmap-based allocation table
by clearing a bit corresponding to the slotid.

Update lowest_free_slotid if freed slotid is lower than that.
Update highest_used_slotid.  In the case the freed slotid
equals the highest_used_slotid, scan downwards for the next
highest used slotid using the optimized fls* functions.

Finally, wake up thread waiting on slot_tbl_waitq for a free slot
to become available.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: free slot use slotid]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: use find_first_zero_bit for nfs4_find_slot]
    While at it, obliterate lowest_free_slotid and fix-up related comments.
    As per review comment 21/85.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: use __clear_bit for nfs4_free_slot]
    While at it, fix-up function comment.
    Part of review comment 22/85.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: use find_last_bit in nfs4_free_slot to determine highest used slot.]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: rpc_sleep_on slot_tbl_waitq must be called under slot_tbl_lock]
    Otherwise there's a race (we've hit) with nfs4_free_slot where
    nfs41_setup_sequence sees a full slot table, unlocks slot_tbl_lock,
    nfs4_free_slots happen concurrently and call rpc_wake_up_next
    where there's nobody to wake up yet, context goes back to
    nfs41_setup_sequence which goes to sleep when the slot table
    is actually empty now and there's no-one to wake it up anymore.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:39 -07:00
Andy Adamson
fbcd4abcb3 nfs41: setup_sequence method
Allocate a slot in the session slot table and set the sequence op arguments.

Called at the rpc prepare stage.

Add a status to nfs41_sequence_res, initialize it to one so that we catch
rpc level failures which do not go through decode_sequence which sets
the new status field.

Note that upon an rpc level failure, we don't know if the server processed the
sequence operation or not. Proceed as if the server did process the sequence
operation.

Signed-off-by: Rahul Iyer <iyer@netapp.com>
[nfs41: sequence args use slotid]
[nfs41: find slot return slotid]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove SEQ4_STATUS_USE_TK_STATUS]
As per 11-14-08 review
[move extern declaration from nfs41: sequence setup/done support]
[removed sa_session definition, changed sa_cache_this into a u8 to reduce footprint]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: rpc_sleep_on slot_tbl_waitq must be called under slot_tbl_lock]
    Otherwise there's a race (we've hit) with nfs4_free_slot where
    nfs41_setup_sequence sees a full slot table, unlocks slot_tbl_lock,
    nfs4_free_slots happen concurrently and call rpc_wake_up_next
    where there's nobody to wake up yet, context goes back to
    nfs41_setup_sequence which goes to sleep when the slot table
    is actually empty now and there's no-one to wake it up anymore.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:39 -07:00
Benny Halevy
510b81756f nfs41: find slot
Find a free slot using bitmap-based allocation.
Use the optimized ffz function to find a zero bit
in the bitmap that indicates a free slot, starting
the search from the 'lowest_free_slotid' position.

If found, mark the slot as used in the bitmap, get
the slot's slotid and seqid, and update max_slotid
to be used by the SEQUENCE operation.

Also, update lowest_free_slotid for next search.

If no free slot was found the caller has to wait
for a free slot (outside the scope of this function)

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: find slot return slotid]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[use find_first_zero_bit for nfs4_find_slot as per review comment 21/85.]
[use NFS4_MAX_SLOT_TABLE rather than NFS4_NO_SLOT]
[nfs41: rpc_sleep_on slot_tbl_waitq must be called under slot_tbl_lock]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:37 -07:00
Andy Adamson
ce5039c1be nfs41: nfs4_setup_sequence
Perform the nfs4_setup_sequence in the rpc_call_prepare state.

If a session slot is not available, we will rpc_sleep_on the
slot wait queue leaving the tk_action as rpc_call_prepare.

Once we have a session slot, hang on to it even through rpc_restart_calls.
Ensure the nfs41_sequence_res sr_slot pointer is NULL before rpc_run_task is
called as nfs41_setup_sequence will only find a new slot if it is NULL.

A future patch will call free slot after any rpc_restart_calls, and handle the
rpc restart that result from a sequence operation error.

Signed-off-by: Rahul Iyer <iyer@netapp.com>
[nfs41: sequence res use slotid]
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: simplify nfs4_call_sync]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_call_sync]
[nfs41: check for session not minorversion]
[nfs41: remove rpc_message from nfs41_call_sync_args]
[moved NFS4_MAX_SLOT_TABLE logic into nfs41_setup_sequence]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs41_call_sync_data use nfs_client not nfs_server]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: expose nfs4_call_sync_session for lease renewal]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: remove unnecessary return check]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:36 -07:00
Andy Adamson
9b7b9fcc9c nfs41: xdr {encode,decode}_sequence
Implement stubs for encode and decode sequence, defined as no-ops when
CONFIG_NFS_V4_1 is not defined.
Add the nfsv41 encode and decode sizes. Add encode_sequence to all
nfs4_enc_* routines and decode_sequence to all nfs4_dec_* routines as required
by v41.

[was nfs41: minorversion support for xdr]
[added nfs_client argument to encode_sequence so not to use sequence_args to pass sa_session]
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: pass *session in seq_args and seq_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:36 -07:00
Benny Halevy
66cc042970 nfs41: encode minorversion in compound header
Signed-off-by: Andy Adamdon <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: pass *session in seq_args and seq_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:35 -07:00
Benny Halevy
28f566942c NFS: use dynamically computed compound_hdr.replen for xdr_inline_pages offset
As Trond suggested, rather than passing a constant to xdr_inline_pages,
keep a running count of the expected reply bytes.  In preparation for
nfs41, where additional op sequence are expteced when talking to nfs41
servers.

[NFS: cb_compoundhdr.replen is in words not bytes]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: get fs_locations replen before encoding the GETATTR]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: get getacl replen before encoding the GETATTR]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:34 -07:00
Benny Halevy
dadf0c2767 NFS: update hdr->replen for every encode op
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:33 -07:00
Benny Halevy
0c4e8c1877 NFS: define and initialize compound_hdr.replen
replen holds the running count of expected reply bytes.
repl will then be used by encoding routines for xdr_inline_pages offset
after which data bytes are to be received directly into the xdr
buffer pages.

NOTE: According to the nfsv4 and v4.1 RFCs, the replied tag SHOULD be the same
is the one sent, but this is not required as a MUST for the server to do so.
The server may screw us if it replies a tag of a different length in the
compound result.

[NFS: cb_compoundhdr.replen is in words not bytes]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:32 -07:00
Benny Halevy
6ce183919b NFS: use decode_change_info_maxsz for xdr maxsz calculations
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:31 -07:00
Andy Adamson
5f7dbd5c75 nfs41: set up seq_res.sr_slotid
Initialize nfs4_sequence_res sr_slotid to NFS4_MAX_SLOT_TABLE.

[was nfs41: sequence res use slotid]
Signed-off-by: Andy Adamson <andros@netapp.com>
[pulled definition of struct nfs4_sequence_res.sr_slotid to here]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:30 -07:00
Benny Halevy
f3752975ca nfs41: nfs41: pass *session in seq_args and seq_res
To be used for getting the rpc's minorversion and for nfs41 xdr
{en,de}coding of the sequence operation.
Reset the seq session ptrs for minorversion=0 rpc calls.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:29 -07:00
Andy Adamson
cccef3b96a nfs41: introduce nfs4_call_sync
Use nfs4_call_sync rather than rpc_call_sync to provide
for a nfs41 sessions-enabled interface for sessions manipulation.

The nfs41 rpc logic uses the rpc_call_prepare method to
recover and create the session, as well as selecting a free slot id
and the rpc_call_done to free the slot and update slot table
related metadata.

In the coming patches we'll add rpc prepare and done routines
for setting up the sequence op and processing the sequence result.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: nfs4_call_sync]
As per 11-14-08 review.
Squash into "nfs41: introduce nfs4_call_sync" and "nfs41: nfs4_setup_sequence"
Define two functions one for v4 and one for v41
add a pointer to struct nfs4_client to the correct one.
Signed-off-by: Andy Adamson <andros@netapp.com>
[added BUG() in _nfs4_call_sync_session if !CONFIG_NFS_V4_1]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: check for session not minorversion]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[group minorversion specific stuff together]
Signed-off-by: Alexandros Batsakis <Alexandros.Batsakis@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfs41: fixup nfs4_clear_client_minor_version]
[introduce nfs4_init_client_minor_version() in this patch]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[cleaned-up patch: got rid of nfs_call_sync_t, dprintks, cosmetics, extra server defs]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:28 -07:00
Benny Halevy
22958463d5 nfs41: use nfs4_fs_locations_res
In preparation for nfs41 sequence processing.

Signed-off-by: Andy Admason <andros@netapp.com>
[find nfs4_fs_locations_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:27 -07:00
Benny Halevy
73c403a9a9 nfs41: use nfs4_setaclres
In preparation for nfs41 sequence processing.

Signed-off-by: Andy Admason <andros@netapp.com>
[define nfs_setaclres]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:26 -07:00
Benny Halevy
9e9ecc03d6 NFS: get rid of unused xdr decode_setattr(, res) argument
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:26 -07:00
Benny Halevy
663c79b3cd nfs41: use nfs4_getaclres
In preparation for nfs41 sequence processing.

Signed-off-by: Andy Admason <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: embed resp_len in nfs_getaclres]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:25 -07:00
Benny Halevy
d45b2989a7 nfs41: use nfs4_pathconf_res
In preparation for nfs41 sequence processing.

Signed-off-by: Andy Admason <andros@netapp.com>
[define nfs4_pathconf_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:24 -07:00
Benny Halevy
3dda5e4347 nfs41: use nfs4_fsinfo_res
In preparation for nfs41 sequence processing.

Signed-off-by: Andy Admason <andros@netapp.com>
[define nfs4_fsinfo_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:23 -07:00
Benny Halevy
24ad148a0f nfs41: use nfs4_statfs_res
In preparation for nfs41 sequence processing.

Signed-off-by: Andy Admason <andros@netapp.com>
[define nfs4_statfs_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:22 -07:00
Benny Halevy
f50c700081 nfs41: use nfs4_readlink_res
In preparation for nfs41 sequence processing.

Signed-off-by: Andy Admason <andros@netapp.com>
[define nfs4_readlink_res]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:21 -07:00
Benny Halevy
43652ad553 nfs41: use nfs4_server_caps_arg
In preparation for nfs41 sequence processing.

Signed-off-by: Andy Admason <andros@netapp.com>
[define nfs4_server_caps_arg]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:20 -07:00
Andy Adamson
557134a39c nfs41: sessions client infrastructure
NFSv4.1 Sessions basic data types, initialization, and destruction.

The session is always associated with a struct nfs_client that holds
the exchange_id results.

Signed-off-by: Rahul Iyer <iyer@netapp.com>
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[remove extraneous rpc_clnt pointer, use the struct nfs_client cl_rpcclient.
remove the rpc_clnt parameter from nfs4 nfs4_init_session]
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Use the presence of a session to determine behaviour instead of the
minorversion number.]
Signed-off-by: Andy Adamson <andros@netapp.com>
[constified nfs4_has_session's struct nfs_client parameter]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Rename nfs4_put_session() to nfs4_destroy_session() and call it from nfs4_free_client() not nfs4_free_server().
Also get rid of nfs4_get_session() and the ref_count in nfs4_session struct as keeping track of nfs_client should be sufficient]
Signed-off-by: Alexandros Batsakis <Alexandros.Batsakis@netapp.com>
[nfs41: pass rsize and wsize into nfs4_init_session]
Signed-off-by: Andy Adamson <andros@netapp.com>
[separated out removal of rpc_clnt parameter from nfs4_init_session ot a
 patch of its own]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Pass the nfs_client pointer into nfs4_alloc_session]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: don't assign to session->clp->cl_session in nfs4_destroy_session]
[nfs41: fixup nfs4_clear_client_minor_version]
[introduce nfs4_clear_client_minor_version() in this patch]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Refactor nfs4_init_session]
    Moved session allocation into nfs4_init_client_minor_version, called from
    nfs4_init_client.
    Leave rwise and wsize initialization in nfs4_init_session, called from
    nfs4_init_server.
    Reverted moving of nfs_fsid definition to nfs_fs_sb.h
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: Move NFS4_MAX_SLOT_TABLE define from under CONFIG_NFS_V4_1]
[Fix comile error when CONFIG_NFS_V4_1 is not set.]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[moved nfs4_init_slot_table definition to "create_session operation"]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfs41: alloc session with GFP_KERNEL]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:19 -07:00
Benny Halevy
c2e713dd83 nfs41: translate NFS4ERR_MINOR_VERS_MISMATCH to EPROTONOSUPPORT
To be returned to the mount command when trying to mount a v4 server
using minorversion 1.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:18 -07:00
Benny Halevy
5aae4a9ae0 nfs41: Use mount minorversion option
Use the mount minorversion option to initialize the nfs_client cl_minorversion
and match it in nfs_match_client() when looking up a nfs_client.

[nfs41: remove ifdefs around nfs_client_initdata.minorversion]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:17 -07:00
Benny Halevy
94a417f3d7 nfs41: nfs_client.cl_minorversion
This field is set to the nfsv4 minor version for this mount.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>

Note: This patch sets the referral to the same minorversion as the
current mount. Revisit in future patch.

Signed-off-by: Andy Adamson <andros@netapp.com>
[removed cl_minorversion assignment in nfs_set_client]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[always define nfs_client.cl_minorversion]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:16 -07:00
Mike Sager
3fd5be9e19 nfs41: add mount command option minorversion
mount -t nfs4 -o minorversion=[0|1] specifies whether to use 4.0 or 4.1.
By default, the minorversion is set to 0.

Signed-off-by: Mike Sager <sager@netapp.com>
[set default minorversion to 0 as per Trond and SteveD's request]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:15 -07:00
Ricardo Labiaga
1efae38140 nfs41: Add Kconfig symbols for NFSv4.1
Added CONFIG_NFS_V4_1 and made it depend upon CONFIG_NFS_V4 and EXPERIMENTAL.
Indicate that CONFIG_NFS_V4_1 is for NFS developers at the moment

At the moment we're expecting folks trying out nfs41 to
actively participate in the development process by helping us
debug issues and ideally send patches to fix problems.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-06-17 10:46:13 -07:00
Linus Torvalds
aa2638a210 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6:
  [SCSI] aic79xx: make driver respect nvram for IU and QAS settings
  [SCSI] don't attach ULD to Dell Universal Xport
  [SCSI] lpfc 8.3.3 : Update driver version to 8.3.3
  [SCSI] lpfc 8.3.3 : Add support for Target Reset handler entrypoint
  [SCSI] lpfc 8.3.3 : Fix a couple of spin_lock and memory issues and a crash
  [SCSI] lpfc 8.3.3 : FC/FCOE discovery fixes
  [SCSI] lpfc 8.3.3 : Fix various SLI-3 vs SLI-4 differences
  [SCSI] qla2xxx: Resolve a performance issue in interrupt
  [SCSI] cnic, bnx2i: Fix build failure when CONFIG_PCI is not set.
  [SCSI] nsp_cs: time_out reaches -1
  [SCSI] qla2xxx: fix printk format warnings
  [SCSI] ncr53c8xx: div reaches -1
  [SCSI] compat: don't perform unneeded copy in sg_io code
  [SCSI] zfcp: Update FC pass-through support
  [SCSI] zfcp: Add FC pass-through support
  [SCSI] FC Pass Thru support
2009-06-17 09:50:44 -07:00
Linus Torvalds
b7c142dbf1 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: start using hrtimers
  hrtimer: export ktime_add_safe
  UBIFS: do not forget to register BDI device
  UBIFS: allow sync option in rootflags
  UBIFS: remove dead code
  UBIFS: use anonymous device
  UBIFS: return proper error code if the compr is not present
  UBIFS: return error if link and unlink race
  UBIFS: reset no_space flag after inode deletion
2009-06-17 09:46:33 -07:00
Steven Whitehouse
a566a6b11c dlm: Fix uninitialised variable warning in lock.c
CC [M]  fs/dlm/lock.o
fs/dlm/lock.c: In function ‘find_rsb’:
fs/dlm/lock.c:438: warning: ‘r’ may be used uninitialized in this function

Since r is used on the error path to set r_ret, set it to NULL.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-06-17 11:31:32 -05:00
Linus Torvalds
9cb0fbf7f8 Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus: (47 commits)
  MIPS: Add hibernation support
  MIPS: Move Cavium CP0 hwrena impl bits to cpu-feature-overrides.h
  MIPS: Allow CPU specific overriding of CP0 hwrena impl bits.
  MIPS: Kconfig Add SYS_SUPPORTS_HUGETLBFS and enable it for some systems.
  Hugetlbfs: Enable hugetlbfs for more systems in Kconfig.
  MIPS: TLB support for hugetlbfs.
  MIPS: Add hugetlbfs page defines.
  MIPS: Add support files for hugetlbfs.
  MIPS: Remove unused parameters from iPTE_LW.
  Staging: Add octeon-ethernet driver files.
  MIPS: Export erratum function needed by octeon-ethernet driver.
  MIPS: Cavium-Octeon: Add more chip specific feature tests.
  MIPS: Cavium-Octeon: Add more board type constants.
  MIPS: Export cvmx_sysinfo_get needed by octeon-ethernet driver.
  MIPS: Add named alloc functions to OCTEON boot monitor memory allocator.
  MIPS: Alchemy: devboards: Convert to gpio calls.
  MIPS: Alchemy: xxs1500: use linux gpio api.
  MIPS: Alchemy: MTX-1: Use linux gpio api.
  MIPS: Alchemy: Rewrite GPIO support.
  MIPS: Alchemy: Remove unused au1000_gpio.h header
  ...
2009-06-17 09:13:52 -07:00
Linus Torvalds
feb72ce827 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  get rid of BKL in fs/sysv
  get rid of BKL in fs/minix
  get rid of BKL in fs/efs
  befs ->pust_super() doesn't need BKL
  Cleanup of adfs headers
  9P doesn't need BKL in ->umount_begin()
  fuse doesn't need BKL in ->umount_begin()
  No instance of ->bmap() needs BKL
  remove unlock_kernel() left accidentally
  ext4: avoid unnecessary spinlock in critical POSIX ACL path
  ext3: avoid unnecessary spinlock in critical POSIX ACL path
2009-06-17 08:46:57 -07:00
David Daney
852969b2d2 Hugetlbfs: Enable hugetlbfs for more systems in Kconfig.
As part of adding hugetlbfs support for MIPS, I am adding a new
kconfig variable 'SYS_SUPPORTS_HUGETLBFS'.  Since some mips cpu
varients don't yet support it, we can enable selection of HUGETLBFS on
a system by system basis from the arch/mips/Kconfig.

Signed-off-by: David Daney <ddaney@caviumnetworks.com>
CC: William Irwin <wli@holomorphy.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2009-06-17 11:06:31 +01:00
Al Viro
5ac3455a84 get rid of BKL in fs/sysv
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:37 -04:00
Al Viro
cc46759a8c get rid of BKL in fs/minix
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:37 -04:00
Al Viro
e7ec952f6a get rid of BKL in fs/efs
Only readdir() really needed it, and that's easily fixable by switch to
generic_file_llseek()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:36 -04:00
Al Viro
536c94901e befs ->pust_super() doesn't need BKL
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:36 -04:00
Al Viro
608ba50bd0 Cleanup of adfs headers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:36 -04:00
Al Viro
ee450f796f 9P doesn't need BKL in ->umount_begin()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:36 -04:00
Al Viro
66c6af2e8b fuse doesn't need BKL in ->umount_begin()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:36 -04:00
Al Viro
fe36adf47e No instance of ->bmap() needs BKL
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:35 -04:00
J. R. Okajima
b0895513f4 remove unlock_kernel() left accidentally
commit 337eb00a2c
Push BKL down into ->remount_fs()
and
commit 4aa98cf768
Push BKL down into do_remount_sb()

were uncorrectly merged.
The former removes one pair of lock/unlock_kernel(), but the latter adds
several unlock_kernel(). Finally a few unlock_kernel() calls left.

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:35 -04:00
Theodore Ts'o
210ad6aedb ext4: avoid unnecessary spinlock in critical POSIX ACL path
If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
to do POSIX ACL checks on any files not owned by the caller, and it does
this for every single pathname component that it looks up.

That obviously can be pretty expensive if the filesystem isn't careful
about it, especially with locking. That's doubly sad, since the common
case tends to be that there are no ACL's associated with the files in
question.

ext4 already caches the ACL data so that it doesn't have to look it up
over and over again, but it does so by taking the inode->i_lock spinlock
on every lookup. Which is a noticeable overhead even if it's a private
lock, especially on CPU's where the serialization is expensive (eg Intel
Netburst aka 'P4').

For the special case of not actually having any ACL's, all that locking is
unnecessary. Even if somebody else were to be changing the ACL's on
another CPU, we simply don't care - if we've seen a NULL ACL, we might as
well use it.

So just load the ACL speculatively without any locking, and if it was
NULL, just use it. If it's non-NULL (either because we had a cached
entry, or because the cache hasn't been filled in at all), it means that
we'll need to get the lock and re-load it properly.

(This commit was ported from a patch originally authored by Linus for
ext3.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:35 -04:00
Linus Torvalds
9c64daff9d ext3: avoid unnecessary spinlock in critical POSIX ACL path
If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
to do POSIX ACL checks on any files not owned by the caller, and it does
this for every single pathname component that it looks up.

That obviously can be pretty expensive if the filesystem isn't careful
about it, especially with locking. That's doubly sad, since the common
case tends to be that there are no ACL's associated with the files in
question.

ext3 already caches the ACL data so that it doesn't have to look it up
over and over again, but it does so by taking the inode->i_lock spinlock
on every lookup. Which is a noticeable overhead even if it's a private
lock, especially on CPU's where the serialization is expensive (eg Intel
Netburst aka 'P4').

For the special case of not actually having any ACL's, all that locking is
unnecessary. Even if somebody else were to be changing the ACL's on
another CPU, we simply don't care - if we've seen a NULL ACL, we might as
well use it.

So just load the ACL speculatively without any locking, and if it was
NULL, just use it. If it's non-NULL (either because we had a cached
entry, or because the cache hasn't been filled in at all), it means that
we'll need to get the lock and re-load it properly.

This is noticeable even on Nehalem, which does locking quite well (much
better than P4). From lmbench:

	Processor, Processes - times in microseconds - smaller is better
	--------------------------------------------------------------------
	Host                 OS  Mhz null null      open slct fork exec sh
	                             call  I/O stat clos TCP  proc proc proc
	--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ----
 - before:
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140
	nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141
 - after:
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148

where you can see what appears to be a roughly 3% improvement in stat
and open/close latencies from just the removal of the locking overhead.

Of course, this only matters for files you don't own (the owner never
needs to do the ACL checks), but that's the common case for libraries,
header files, and executables. As well as for the base components of any
absolute pathname, even if you are the owner of the final file.

[ At some point we probably want to move this ACL caching logic entirely
  into the VFS layer (and only call down to the filesystem when
  uncached), but in the meantime this improves ext3 a bit.

  A similar fix to btrfs makes a much bigger difference (15x improvement
  in lmbench) due to broken caching. ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-17 00:36:35 -04:00
David Howells
005411c3e9 AFS: Correctly translate auth error aborts and don't failover in such cases
Authentication error abort codes should be translated to appropriate
Linux error codes, rather than all being translated to EREMOTEIO - which
indicates that the server had internal problems.

Additionally, a server shouldn't be marked unavailable and the next
server tried if an authentication error occurs.  This will quickly make
all the servers unavailable to the client.  Instead the error should be
returned straight to the user.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 21:20:14 -07:00
Linus Torvalds
517d08699b Merge branch 'akpm'
* akpm: (182 commits)
  fbdev: bf54x-lq043fb: use kzalloc over kmalloc/memset
  fbdev: *bfin*: fix __dev{init,exit} markings
  fbdev: *bfin*: drop unnecessary calls to memset
  fbdev: bfin-t350mcqb-fb: drop unused local variables
  fbdev: blackfin has __raw I/O accessors, so use them in fb.h
  fbdev: s1d13xxxfb: add accelerated bitblt functions
  tcx: use standard fields for framebuffer physical address and length
  fbdev: add support for handoff from firmware to hw framebuffers
  intelfb: fix a bug when changing video timing
  fbdev: use framebuffer_release() for freeing fb_info structures
  radeon: P2G2CLK_ALWAYS_ONb tested twice, should 2nd be P2G2CLK_DAC_ALWAYS_ONb?
  s3c-fb: CPUFREQ frequency scaling support
  s3c-fb: fix resource releasing on error during probing
  carminefb: fix possible access beyond end of carmine_modedb[]
  acornfb: remove fb_mmap function
  mb862xxfb: use CONFIG_OF instead of CONFIG_PPC_OF
  mb862xxfb: restrict compliation of platform driver to PPC
  Samsung SoC Framebuffer driver: add Alpha Channel support
  atmel-lcdc: fix pixclock upper bound detection
  offb: use framebuffer_alloc() to allocate fb_info struct
  ...

Manually fix up conflicts due to kmemcheck in mm/slab.c
2009-06-16 19:50:13 -07:00
Tomas Szepe
69050eee8e CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK
CONFIG_FILE_LOCKING should not depend on CONFIG_BLOCK.

This makes it possible to run complete systems out of a CONFIG_BLOCK=n
initramfs on current kernels again (this last worked on 2.6.27.*).

Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:52 -07:00
Thomas Gleixner
8b0b1db013 remove put_cpu_no_resched()
put_cpu_no_resched() is an optimization of put_cpu() which unfortunately
can cause high latencies.

The nfs iostats code uses put_cpu_no_resched() in a code sequence where a
reschedule request caused by an interrupt between the get_cpu() and the
put_cpu_no_resched() can delay the reschedule for at least HZ.

The other users of put_cpu_no_resched() optimize correctly in interrupt
code, but there is no real harm in using the put_cpu() function which is
an alias for preempt_enable().  The extra check of the preemmpt count is
not as critical as the potential source of missing a reschedule.

Debugged in the preempt-rt tree and verified in mainline.

Impact: remove a high latency source

[akpm@linux-foundation.org: build fix]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:48 -07:00
Eric Dumazet
4938d7e023 poll: avoid extra wakeups in select/poll
After introduction of keyed wakeups Davide Libenzi did on epoll, we are
able to avoid spurious wakeups in poll()/select() code too.

For example, typical use of poll()/select() is to wait for incoming
network frames on many sockets.  But TX completion for UDP/TCP frames call
sock_wfree() which in turn schedules thread.

When scheduled, thread does a full scan of all polled fds and can sleep
again, because nothing is really available.  If number of fds is large,
this cause significant load.

This patch makes select()/poll() aware of keyed wakeups and useless
wakeups are avoided.  This reduces number of context switches by about 50%
on some setups, and work performed by sofirq handlers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:48 -07:00
Robert P. J. Day
02d5341ae5 ntfs: use is_power_of_2() function for clarity.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:48 -07:00
Wu Fengguang
84a8924560 writeback: skip new or to-be-freed inodes
1) I_FREEING tests should be coupled with I_CLEAR

The two I_FREEING tests are racy because clear_inode() can set i_state to
I_CLEAR between the clear of I_SYNC and the test of I_FREEING.

2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible
   races with generic_forget_inode()

generic_forget_inode() sets I_WILL_FREE call writeback on its own, so
generic_sync_sb_inodes() shall not try to step in and create possible races:

  generic_forget_inode
    inode->i_state |= I_WILL_FREE;
    spin_unlock(&inode_lock);
                                       generic_sync_sb_inodes()
                                         spin_lock(&inode_lock);
                                         __iget(inode);
                                         __writeback_single_inode
                                           // see non zero i_count
 may WARN here ==>                         WARN_ON(inode->i_state & I_WILL_FREE);
                                         spin_unlock(&inode_lock);
 may call generic_forget_inode again ==> iput(inode);

The above race and warning didn't turn up because writeback_inodes() holds
the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns
early.  But we are not sure the UBIFS calls and future callers will
guarantee that.  So skip I_WILL_FREE inodes for the sake of safety.

Cc: Eric Sandeen <sandeen@sandeen.net>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:45 -07:00
Mike Waychison
286973552f mm: remove __invalidate_mapping_pages variant
Remove __invalidate_mapping_pages atomic variant now that its sole caller
can sleep (fixed in eccb95cee4 ("vfs: fix
lock inversion in drop_pagecache_sb()")).

This fixes softlockups that can occur while in the drop_caches path.

Signed-off-by: Mike Waychison <mikew@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:43 -07:00
David Rientjes
2ff05b2b4e oom: move oom_adj value from task_struct to mm_struct
The per-task oom_adj value is a characteristic of its mm more than the
task itself since it's not possible to oom kill any thread that shares the
mm.  If a task were to be killed while attached to an mm that could not be
freed because another thread were set to OOM_DISABLE, it would have
needlessly been terminated since there is no potential for future memory
freeing.

This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct mm_struct.  This requires task_lock() on a
task to check its oom_adj value to protect against exec, but it's already
necessary to take the lock when dereferencing the mm to find the total VM
size for the badness heuristic.

This fixes a livelock if the oom killer chooses a task and another thread
sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
because oom_kill_task() repeatedly returns 1 and refuses to kill the
chosen task while select_bad_process() will repeatedly choose the same
task during the next retry.

Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
oom_kill_task() to check for threads sharing the same memory will be
removed in the next patch in this series where it will no longer be
necessary.

Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
these threads are immune from oom killing already.  They simply report an
oom_adj value of OOM_DISABLE.

Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:43 -07:00
KOSAKI Motohiro
6837765963 mm: remove CONFIG_UNEVICTABLE_LRU config option
Currently, nobody wants to turn UNEVICTABLE_LRU off.  Thus this
configurability is unnecessary.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Minchan Kim <minchan.kim@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:42 -07:00
Wu Fengguang
1779754959 proc: export more page flags in /proc/kpageflags
Export all page flags faithfully in /proc/kpageflags.

	11. KPF_MMAP		(pseudo flag) memory mapped page
	12. KPF_ANON		(pseudo flag) memory mapped page (anonymous)
	13. KPF_SWAPCACHE	page is in swap cache
	14. KPF_SWAPBACKED	page is swap/RAM backed
	15. KPF_COMPOUND_HEAD	(*)
	16. KPF_COMPOUND_TAIL	(*)
	17. KPF_HUGE		hugeTLB pages
	18. KPF_UNEVICTABLE	page is in the unevictable LRU list
	19. KPF_HWPOISON(TBD)	hardware detected corruption
	20. KPF_NOPAGE		(pseudo flag) no page frame at the address
	32-39.			more obscure flags for kernel developers

	(*) For compound pages, exporting _both_ head/tail info enables
	    users to tell where a compound page starts/ends, and its order.

The accompanying page-types tool will handle the details like decoupling
overloaded flags and hiding obscure flags to normal users.

Thanks to KOSAKI and Andi for their valuable recommendations!

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:38 -07:00
Wu Fengguang
ed7ce0f102 proc: kpagecount/kpageflags code cleanup
Move increments of pfn/out to bottom of the loop.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:36 -07:00
Wu Fengguang
20a0307c03 mm: introduce PageHuge() for testing huge/gigantic pages
A series of patches to enhance the /proc/pagemap interface and to add a
userspace executable which can be used to present the pagemap data.

Export 10 more flags to end users (and more for kernel developers):

        11. KPF_MMAP            (pseudo flag) memory mapped page
        12. KPF_ANON            (pseudo flag) memory mapped page (anonymous)
        13. KPF_SWAPCACHE       page is in swap cache
        14. KPF_SWAPBACKED      page is swap/RAM backed
        15. KPF_COMPOUND_HEAD   (*)
        16. KPF_COMPOUND_TAIL   (*)
        17. KPF_HUGE		hugeTLB pages
        18. KPF_UNEVICTABLE     page is in the unevictable LRU list
        19. KPF_HWPOISON        hardware detected corruption
        20. KPF_NOPAGE          (pseudo flag) no page frame at the address

        (*) For compound pages, exporting _both_ head/tail info enables
            users to tell where a compound page starts/ends, and its order.

a simple demo of the page-types tool

# ./page-types -h
page-types [options]
            -r|--raw                  Raw mode, for kernel developers
            -a|--addr    addr-spec    Walk a range of pages
            -b|--bits    bits-spec    Walk pages with specified bits
            -l|--list                 Show page details in ranges
            -L|--list-each            Show page details one by one
            -N|--no-summary           Don't show summay info
            -h|--help                 Show this usage message
addr-spec:
            N                         one page at offset N (unit: pages)
            N+M                       pages range from N to N+M-1
            N,M                       pages range from N to M-1
            N,                        pages range from N to end
            ,M                        pages range from 0 to M
bits-spec:
            bit1,bit2                 (flags & (bit1|bit2)) != 0
            bit1,bit2=bit1            (flags & (bit1|bit2)) == bit1
            bit1,~bit2                (flags & (bit1|bit2)) == bit1
            =bit1,bit2                flags == (bit1|bit2)
bit-names:
          locked              error         referenced           uptodate
           dirty                lru             active               slab
       writeback            reclaim              buddy               mmap
       anonymous          swapcache         swapbacked      compound_head
   compound_tail               huge        unevictable           hwpoison
          nopage           reserved(r)         mlocked(r)    mappedtodisk(r)
         private(r)       private_2(r)   owner_private(r)            arch(r)
        uncached(r)       readahead(o)       slob_free(o)     slub_frozen(o)
      slub_debug(o)
                                   (r) raw mode bits  (o) overloaded bits

# ./page-types
             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000000          487369     1903  _________________________________
0x0000000000000014               5        0  __R_D____________________________  referenced,dirty
0x0000000000000020               1        0  _____l___________________________  lru
0x0000000000000024              34        0  __R__l___________________________  referenced,lru
0x0000000000000028            3838       14  ___U_l___________________________  uptodate,lru
0x0001000000000028              48        0  ___U_l_______________________I___  uptodate,lru,readahead
0x000000000000002c            6478       25  __RU_l___________________________  referenced,uptodate,lru
0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
0x0000000000000040            8344       32  ______A__________________________  active
0x0000000000000060               1        0  _____lA__________________________  lru,active
0x0000000000000068             348        1  ___U_lA__________________________  uptodate,lru,active
0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
0x000000000000006c             988        3  __RU_lA__________________________  referenced,uptodate,lru,active
0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400             503        1  __________B______________________  buddy
0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000             492        1  ____________a____________________  anonymous
0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c              30        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total          513968     2007

# ./page-types -r
             flags      page-count       MB  symbolic-flags                     long-symbolic-flags
0x0000000000000000          468002     1828  _________________________________
0x0000000100000000           19102       74  _____________________r___________  reserved
0x0000000000008000              41        0  _______________H_________________  compound_head
0x0000000000010000             188        0  ________________T________________  compound_tail
0x0000000000008014               1        0  __R_D__________H_________________  referenced,dirty,compound_head
0x0000000000010014               4        0  __R_D___________T________________  referenced,dirty,compound_tail
0x0000000000000020               1        0  _____l___________________________  lru
0x0000000800000024              34        0  __R__l__________________P________  referenced,lru,private
0x0000000000000028            3794       14  ___U_l___________________________  uptodate,lru
0x0001000000000028              46        0  ___U_l_______________________I___  uptodate,lru,readahead
0x0000000400000028              44        0  ___U_l_________________d_________  uptodate,lru,mappedtodisk
0x0001000400000028               2        0  ___U_l_________________d_____I___  uptodate,lru,mappedtodisk,readahead
0x000000000000002c            6434       25  __RU_l___________________________  referenced,uptodate,lru
0x000100000000002c              47        0  __RU_l_______________________I___  referenced,uptodate,lru,readahead
0x000000040000002c              14        0  __RU_l_________________d_________  referenced,uptodate,lru,mappedtodisk
0x000000080000002c              30        0  __RU_l__________________P________  referenced,uptodate,lru,private
0x0000000800000040            8124       31  ______A_________________P________  active,private
0x0000000000000040             219        0  ______A__________________________  active
0x0000000800000060               1        0  _____lA_________________P________  lru,active,private
0x0000000000000068             322        1  ___U_lA__________________________  uptodate,lru,active
0x0001000000000068              12        0  ___U_lA______________________I___  uptodate,lru,active,readahead
0x0000000400000068              13        0  ___U_lA________________d_________  uptodate,lru,active,mappedtodisk
0x0000000800000068              12        0  ___U_lA_________________P________  uptodate,lru,active,private
0x000000000000006c             977        3  __RU_lA__________________________  referenced,uptodate,lru,active
0x000100000000006c              48        0  __RU_lA______________________I___  referenced,uptodate,lru,active,readahead
0x000000040000006c               5        0  __RU_lA________________d_________  referenced,uptodate,lru,active,mappedtodisk
0x000000080000006c               3        0  __RU_lA_________________P________  referenced,uptodate,lru,active,private
0x0000000c0000006c               3        0  __RU_lA________________dP________  referenced,uptodate,lru,active,mappedtodisk,private
0x0000000c00000068               1        0  ___U_lA________________dP________  uptodate,lru,active,mappedtodisk,private
0x0000000000004078               1        0  ___UDlA_______b__________________  uptodate,dirty,lru,active,swapbacked
0x000000000000407c              34        0  __RUDlA_______b__________________  referenced,uptodate,dirty,lru,active,swapbacked
0x0000000000000400             538        2  __________B______________________  buddy
0x0000000000000804               1        0  __R________M_____________________  referenced,mmap
0x0000000000000828            1029        4  ___U_l_____M_____________________  uptodate,lru,mmap
0x0001000000000828              43        0  ___U_l_____M_________________I___  uptodate,lru,mmap,readahead
0x000000000000082c             382        1  __RU_l_____M_____________________  referenced,uptodate,lru,mmap
0x000100000000082c              12        0  __RU_l_____M_________________I___  referenced,uptodate,lru,mmap,readahead
0x0000000000000868             192        0  ___U_lA____M_____________________  uptodate,lru,active,mmap
0x0001000000000868              12        0  ___U_lA____M_________________I___  uptodate,lru,active,mmap,readahead
0x000000000000086c             800        3  __RU_lA____M_____________________  referenced,uptodate,lru,active,mmap
0x000100000000086c              31        0  __RU_lA____M_________________I___  referenced,uptodate,lru,active,mmap,readahead
0x0000000000004878               2        0  ___UDlA____M__b__________________  uptodate,dirty,lru,active,mmap,swapbacked
0x0000000000001000             492        1  ____________a____________________  anonymous
0x0000000000005008               2        0  ___U________a_b__________________  uptodate,anonymous,swapbacked
0x0000000000005808               4        0  ___U_______Ma_b__________________  uptodate,mmap,anonymous,swapbacked
0x000000000000580c               1        0  __RU_______Ma_b__________________  referenced,uptodate,mmap,anonymous,swapbacked
0x0000000000005868            2839       11  ___U_lA____Ma_b__________________  uptodate,lru,active,mmap,anonymous,swapbacked
0x000000000000586c              29        0  __RU_lA____Ma_b__________________  referenced,uptodate,lru,active,mmap,anonymous,swapbacked
             total          513968     2007

# ./page-types --raw --list --no-summary --bits reserved
offset  count   flags
0       15      _____________________r___________
31      4       _____________________r___________
159     97      _____________________r___________
4096    2067    _____________________r___________
6752    2390    _____________________r___________
9355    3       _____________________r___________
9728    14526   _____________________r___________

This patch:

Introduce PageHuge(), which identifies huge/gigantic pages by their
dedicated compound destructor functions.

Also move prep_compound_gigantic_page() to hugetlb.c and make
__free_pages_ok() non-static.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:36 -07:00
Andy Adamson
5d77ddfbcb nfsd41: sanity check client drc maxreqs
Ensure the client requested maximum requests are between 1 and
NFSD_MAX_SLOTS_PER_SESSION

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-16 17:13:16 -07:00
Oleg Nesterov
8eeee4e2f0 send_sigio_to_task: sanitize the usage of fown->signum
send_sigio_to_task() reads fown->signum several times, we can race with
F_SETSIG which changes ->signum lockless.  In theory, this can fool
security checks or we can call group_send_sig_info() with the wrong
->si_signo which does not match "int sig".

Change the code to cache ->signum.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 15:36:17 -07:00
Oleg Nesterov
2f38d70fb4 shift current_cred() from __f_setown() to f_modown()
Shift current_cred() from __f_setown() to f_modown(). This reduces
the number of arguments and saves 48 bytes from fs/fcntl.o.

[ Note: this doesn't clear euid/uid when pid is set to NULL.  But if
  f_owner.pid == NULL we never use f_owner.uid/euid.  Otherwise we'd
  have a bug anyway: we must not send signals if pid was reset to NULL.  ]

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 14:19:00 -07:00
Linus Torvalds
e1f5b94fd0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6: (143 commits)
  USB: xhci depends on PCI.
  USB: xhci: Add Makefile, MAINTAINERS, and Kconfig entries.
  USB: xhci: Respect critical sections.
  USB: xHCI: Fix interrupt moderation.
  USB: xhci: Remove packed attribute from structures.
  usb; xhci: Fix TRB offset calculations.
  USB: xhci: replace if-elseif-else with switch-case
  USB: xhci: Make xhci-mem.c include linux/dmapool.h
  USB: xhci: drop spinlock in xhci_urb_enqueue() error path.
  USB: Change names of SuperSpeed ep companion descriptor structs.
  USB: xhci: Avoid compiler reordering in Link TRB giveback.
  USB: xhci: Clean up xhci_irq() function.
  USB: xhci: Avoid global namespace pollution.
  USB: xhci: Fix Link TRB handoff bit twiddling.
  USB: xhci: Fix register write order.
  USB: xhci: fix some compiler warnings in xhci.h
  USB: xhci: fix lots of compiler warnings.
  USB: xhci: use xhci_handle_event instead of handle_event
  USB: xhci: URB cancellation support.
  USB: xhci: Scatter gather list support for bulk transfers.
  ...
2009-06-16 13:06:10 -07:00
Linus Torvalds
6fd03301d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits)
  debugfs: use specified mode to possibly mark files read/write only
  debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem.
  xen: remove driver_data direct access of struct device from more drivers
  usb: gadget: at91_udc: remove driver_data direct access of struct device
  uml: remove driver_data direct access of struct device
  block/ps3: remove driver_data direct access of struct device
  s390: remove driver_data direct access of struct device
  parport: remove driver_data direct access of struct device
  parisc: remove driver_data direct access of struct device
  of_serial: remove driver_data direct access of struct device
  mips: remove driver_data direct access of struct device
  ipmi: remove driver_data direct access of struct device
  infiniband: ehca: remove driver_data direct access of struct device
  ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device
  hvcs: remove driver_data direct access of struct device
  xen block: remove driver_data direct access of struct device
  thermal: remove driver_data direct access of struct device
  scsi: remove driver_data direct access of struct device
  pcmcia: remove driver_data direct access of struct device
  PCIE: remove driver_data direct access of struct device
  ...

Manually fix up trivial conflicts due to different direct driver_data
direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}
2009-06-16 12:57:37 -07:00
Linus Torvalds
cd5232bd6b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
  jfs: fix regression preventing coalescing of extents
2009-06-16 12:23:52 -07:00
Linus Torvalds
300df7dc89 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2/net: Use wait_event() in o2net_send_message_vec()
  ocfs2: Adjust rightmost path in ocfs2_add_branch.
  ocfs2: fdatasync should skip unimportant metadata writeout
  ocfs2: Remove redundant gotos in ocfs2_mount_volume()
  ocfs2: Add statistics for the checksum and ecc operations.
  ocfs2 patch to track delayed orphan scan timer statistics
  ocfs2: timer to queue scan of all orphan slots
  ocfs2: Correct ordering of ip_alloc_sem and localloc locks for directories
  ocfs2: Fix possible deadlock in quota recovery
  ocfs2: Fix possible deadlock with quotas in ocfs2_setattr()
  ocfs2: Fix lock inversion in ocfs2_local_read_info()
  ocfs2: Fix possible deadlock in ocfs2_global_read_dquot()
  ocfs2: update comments in masklog.h
  ocfs2: Don't printk the error when listing too many xattrs.
2009-06-16 12:11:57 -07:00
Linus Torvalds
d613839ef9 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: remove some includings of blktrace_api.h
  mg_disk: seperate mg_disk.h again
  block: Introduce helper to reset queue limits to default values
  cfq: remove extraneous '\n' in blktrace output
  ubifs: register backing_dev_info
  btrfs: properly register fs backing device
  block: don't overwrite bdi->state after bdi_init() has been run
  cfq: cleanup for last_end_request in cfq_data
2009-06-16 11:46:45 -07:00
Dave Kleikamp
f7c52fd17a jfs: fix regression preventing coalescing of extents
Commit fec1878fe9 caused a regression in
which contiguous blocks being allocated to the end of an extent were
getting a new extent created.  This typically results in files entirely
made up of 1-block extents even though the blocks are contiguous on
disk.

Apparently grub doesn't handle a jfs file being fragmented into too many
extents, since it refuses to boot a kernel from jfs that was created by
the 2.6.30 kernel.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Reported-by: Alex <alevkovich@tut.by>
2009-06-16 13:43:22 -05:00
Linus Torvalds
69257cae20 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: always update root items for fs trees at commit time
2009-06-16 11:30:16 -07:00
Linus Torvalds
23059a0df5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6:
  fat: split fat_generic_ioctl
  FAT: add 'errors' mount option
2009-06-16 11:29:44 -07:00
Alexandros Batsakis
6c18ba9f5e nfsd41: move channel attributes from nfsd4_session to a nfsd4_channel_attr struct
the change is valid for both the forechannel and the backchannel (currently dummy)

Signed-off-by: Alexandros Batsakis <Alexandros.Batsakis@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-16 10:13:45 -07:00
Li Zefan
e212d6f250 block: remove some includings of blktrace_api.h
When porting blktrace to tracepoints, we changed to trace/block.h
for trace prober declarations.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 11:19:36 +02:00
Jens Axboe
a979eff181 ubifs: register backing_dev_info
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 08:21:04 +02:00
Jens Axboe
ad081f1430 btrfs: properly register fs backing device
btrfs assigns this bdi to all inodes on that file system, so make
sure it's registered. This isn't really important now, but will be
when we put dirty inodes there. Even now, we miss the stats when the
bdi isn't visible.

Also fixes failure to check bdi_init() return value, and bad inherit of
->capabilities flags from the default bdi.

Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-16 08:21:03 +02:00
Alan Stern
74675a5850 NLS: update handling of Unicode
This patch (as1239) updates the kernel's treatment of Unicode.  The
character-set conversion routines are well behind the current state of
the Unicode specification: They don't recognize the existence of code
points beyond plane 0 or of surrogate pairs in the UTF-16 encoding.

The old wchar_t 16-bit type is retained because it's still used in
lots of places.  This shouldn't cause any new problems; if a
conversion now results in an invalid 16-bit code then before it must
have yielded an undefined code.

Difficult-to-read names like "utf_mbstowcs" are replaced with more
transparent names like "utf8s_to_utf16s" and the ordering of the
parameters is rationalized (buffer lengths come immediate after the
pointers they refer to, and the inputs precede the outputs).
Fortunately the low-level conversion routines are used in only a few
places; the interfaces to the higher-level uni2char and char2uni
methods have been left unchanged.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:44:43 -07:00
Clemens Ladisch
905c02acbd nls: utf8_wcstombs: fix buffer overflow
utf8_wcstombs forgot to include one-byte UTF-8 characters when
calculating the output buffer size, i.e., theoretically, it was possible
to overflow the output buffer with an input string that contains enough
ASCII characters.

In practice, this was no problem because the only user so far (VFAT)
always uses a big enough output buffer.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:44:43 -07:00
Clemens Ladisch
e27ecdd94d nls: utf8_wcstombs: use correct buffer size in error case
When utf8_wcstombs encounters a character that cannot be encoded, we
must not decrease the remaining output buffer size because nothing has
been written to the output buffer.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:44:43 -07:00
Robin Getz
e4792aa30f debugfs: use specified mode to possibly mark files read/write only
In many SoC implementations there are hardware registers can be read or
write only.  This extends the debugfs to enforce the file permissions for
these types of registers by providing a set of fops which are read or
write only.  This assumes that the kernel developer knows more about the
hardware than the user (even root users) -- which is normally true.

Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Bryan Wu <cooloney@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:30:28 -07:00
Jonathan Corbet
400ced61fa debugfs: fix docbook error
Fix an error in debugfs_create_blob's docbook description

It cannot actually be used to write a binary blob.

Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2009-06-15 21:30:24 -07:00
Steven Rostedt
56a83cc929 debugfs: dont stop on first failed recursive delete
debugfs: dont stop on first failed recursive delete

While running a while loop of removing a module that removes a debugfs
directory with debugfs_remove_recursive, and at the same time doing a
while loop of cat of a file in that directory, I would hit a point where
somehow the cat of the file caused the remove to fail.

The result is that other files did not get removed when the module
was removed. I simple read of one of those file can oops the kernel
because the operations to the file no longer exist (removed by module).

The funny thing is that the file being cat'ed was removed. It was
the siblings that were not. I see in the code to debugfs_remove_recursive
there's a test that checks if the child fails to bail out of the loop
to prevent an infinite loop.

What this patch does is to still try any siblings in that directory.
If all the siblings fail, or there are no more siblings, then we exit
the loop.

This fixes the above symptom, but...

This is no full proof. It makes the debugfs_remove_recursive a bit more
robust, but it does not explain why the one file failed. There may
be some kind of delay deletion that makes the debugfs think it did
not succeed. So this patch is more of a fix for the symptom but not
the disease.

This patch still makes the debugfs_remove_recursive more robust and
until I can find out why the bug exists, this patch will keep
the kernel from oopsing in most cases.  Even after the cause is found
I think this change can stand on its own and should be kept.

[ Impact: prevent kernel oops on module unload and reading debugfs files ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:30:24 -07:00
Armin Kuster
557411eb2c Sysfs: fix possible memleak in sysfs_follow_link
There is the possiblity of a memory leak if a page is allocated and if
sysfs_getlink() fails in the sysfs_follow_link.

Signed-off-by: Armin Kuster <akuster@mvista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-15 21:30:23 -07:00
Yu Zhiguo
b9081d90f5 NFS: kill off complicated macro 'PROC'
kill off obscure macro 'PROC' of NFSv2&3 in order to make the code more clear.

Among other things, this makes it simpler to grep for callers of these
functions--something which has frequently caused confusion among nfs
developers.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-15 19:34:32 -07:00
J. Bruce Fields
e4636d535e nfsd: minor nfsd_vfs_write cleanup
There's no need to check host_err >= 0 every time here when we could
check host_err < 0 once, following the usual kernel style.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-15 19:18:34 -07:00
J. Bruce Fields
d911df7b8d nfsd: Pull write-gathering code out of nfsd_vfs_write
This is a relatively self-contained piece of code that handles a special
case--move it to its own function.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-15 18:54:05 -07:00
J. Bruce Fields
9d2a3f31d6 nfsd: track last inode only in use_wgather case
Updating last_ino and last_dev probably isn't useful in the !use_wgather
case.

Also remove some pointless ifdef'd-out code.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-15 18:52:47 -07:00
Trond Myklebust
48e03bc515 nfsd: Use write gathering only with NFSv2
NFSv3 and above can use unstable writes whenever they are sending more
than one write, rather than relying on the flaky write gathering
heuristics. More often than not, write gathering is currently getting it
wrong when the NFSv3 clients are sending a single write with FILE_SYNC
for efficiency reasons.

This patch turns off write gathering for NFSv3/v4, and ensures that
it only applies to the one case that can actually benefit: namely NFSv2.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-15 18:14:57 -07:00
J. Bruce Fields
7eef4091a6 Merge commit 'v2.6.30' into for-2.6.31 2009-06-15 18:08:07 -07:00
Yan Zheng
978d910d31 Btrfs: always update root items for fs trees at commit time
commit_fs_roots skips updating root items for fs trees that aren't modified.
This is unsafe now that relocation code modifies root item's last_snapshot
field without modifying corresponding fs tree.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-15 20:01:02 -04:00
Sunil Mushran
9af0b38ff3 ocfs2/net: Use wait_event() in o2net_send_message_vec()
Replace wait_event_interruptible() with wait_event() in o2net_send_message_vec().
This is because this function is called by the dlm that expects signals to be
blocked.

Fixes oss bugzilla#1126
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1126

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-15 14:50:14 -07:00
Tao Ma
6b791bcc8b ocfs2: Adjust rightmost path in ocfs2_add_branch.
In ocfs2_add_branch, we use the rightmost rec of the leaf extent block
to generate the e_cpos for the newly added branch. In the most case, it
is OK but if the parent extent block's rightmost rec covers more clusters
than the leaf does, it will cause kernel panic if we insert some clusters
in it. The message is something like:
(7445,1):ocfs2_insert_at_leaf:3775 ERROR: bug expression:
le16_to_cpu(el->l_next_free_rec) >= le16_to_cpu(el->l_count)
(7445,1):ocfs2_insert_at_leaf:3775 ERROR: inode 66053, depth 0, count 28,
next free 28, rec.cpos 270, rec.clusters 1, insert.cpos 275, insert.clusters 1
 [<fa7ad565>] ? ocfs2_do_insert_extent+0xb58/0xda0 [ocfs2]
 [<fa7b08f2>] ? ocfs2_insert_extent+0x5bd/0x6ba [ocfs2]
 [<fa7b1b8b>] ? ocfs2_add_clusters_in_btree+0x37f/0x564 [ocfs2]
...

The panic can be easily reproduced by the following small test case
(with bs=512, cs=4K, and I remove all the error handling so that it looks
clear enough for reading).

int main(int argc, char **argv)
{
	int fd, i;
	char buf[5] = "test";

	fd = open(argv[1], O_RDWR|O_CREAT);

	for (i = 0; i < 30; i++) {
		lseek(fd, 40960 * i, SEEK_SET);
		write(fd, buf, 5);
	}

	ftruncate(fd, 1146880);

	lseek(fd, 1126400, SEEK_SET);
	write(fd, buf, 5);

	close(fd);

	return 0;
}

The reason of the panic is that:
the 30 writes and the ftruncate makes the file's extent list looks like:

	Tree Depth: 1   Count: 19   Next Free Rec: 1
	## Offset        Clusters       Block#
	0  0             280            86183
	SubAlloc Bit: 7   SubAlloc Slot: 0
	Blknum: 86183   Next Leaf: 0
	CRC32: 00000000   ECC: 0000
	Tree Depth: 0   Count: 28   Next Free Rec: 28
	## Offset        Clusters       Block#          Flags
	0  0             1              143368          0x0
	1  10            1              143376          0x0
	...
	26 260           1              143576          0x0
	27 270           1              143584          0x0

Now another write at 1126400(275 cluster) whiich will write at the gap
between 271 and 280 will trigger ocfs2_add_branch, but the result after
the function looks like:
	Tree Depth: 1   Count: 19   Next Free Rec: 2
	## Offset        Clusters       Block#
	0  0             280            86183
	1  271           0             143592
So the extent record is intersected and make the following operation bug out.

This patch just try to remove the gap before we add the new branch, so that
the root(branch) rightmost rec will cover the same right position. So in the
above case, before adding branch the tree will be changed to
	Tree Depth: 1   Count: 19   Next Free Rec: 1
	## Offset        Clusters       Block#
	0  0             271            86183
	SubAlloc Bit: 7   SubAlloc Slot: 0
	Blknum: 86183   Next Leaf: 0
	CRC32: 00000000   ECC: 0000
	Tree Depth: 0   Count: 28   Next Free Rec: 28
	## Offset        Clusters       Block#          Flags
	0  0             1              143368          0x0
	1  10            1              143376          0x0
	...
	26 260           1              143576          0x0
	27 270           1              143584          0x0
And after branch add, the tree looks like
	Tree Depth: 1   Count: 19   Next Free Rec: 2
	## Offset        Clusters       Block#
	0  0             271            86183
	1  271           0             143592

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-15 14:49:43 -07:00
Linus Torvalds
9c7cb99a82 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (22 commits)
  nilfs2: support contiguous lookup of blocks
  nilfs2: add sync_page method to page caches of meta data
  nilfs2: use device's backing_dev_info for btree node caches
  nilfs2: return EBUSY against delete request on snapshot
  nilfs2: modify list of unsupported features in caveats
  nilfs2: enable sync_page method
  nilfs2: set bio unplug flag for the last bio in segment
  nilfs2: allow future expansion of metadata read out via get info ioctl
  NILFS2: Pagecache usage optimization on NILFS2
  nilfs2: remove nilfs_btree_operations from btree mapping
  nilfs2: remove nilfs_direct_operations from direct mapping
  nilfs2: remove bmap pointer operations
  nilfs2: remove useless b_low and b_high fields from nilfs_bmap struct
  nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr function
  nilfs2: move get block functions in bmap.c into btree codes
  nilfs2: remove nilfs_bmap_delete_block
  nilfs2: remove nilfs_bmap_put_block
  nilfs2: remove header file for segment list operations
  nilfs2: eliminate removal list of segments
  nilfs2: add sufile function that can modify multiple segment usages
  ...
2009-06-15 09:13:49 -07:00
Alexey Zaytsev
df59c0ad05 [SCSI] compat: don't perform unneeded copy in sg_io code
The members from 'status' in struct sg_io_hdr to the last are used to
transfer information from kernel to user space.  The values that user
space sets are just ignored.

Signed-off-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-15 10:09:30 -05:00
Steve French
361ea1ae54 [CIFS] Fix build break
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-15 13:46:12 +00:00
Jeff Layton
61f98ffd74 cifs: display scopeid in /proc/mounts
Move address display into a new function and display the scopeid as part
of the address in /proc/mounts.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-15 13:42:30 +00:00
Christian Engelmayer
a2ab0ce09e jffs2: leaking jffs2_summary in function jffs2_scan_medium
In case of an error returned by file_dirty() 's' is not freed as the cleanup
path is skipped.

Reported by Coverity.

Signed-off-by: Christian Engelmayer <christian.engelmayer@frequentis.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-06-15 11:17:46 +01:00
Theodore Ts'o
4159175058 ext4: Don't update ctime for non-extent-mapped inodes
The VFS handles updating ctime, so we don't need to update the inode's
ctime in ext4_splace_branch() to update the direct or indirect blocks.
This was harmless when we did this in ext3, but in ext4, thanks to
delayed allocation, updating the ctime in ext4_splice_branch() can
cause the ctime to mysteriously jump when the blocks are finally
allocated.

Thanks to Björn Steinbrink for pointing out this problem on the git
mailing list.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-15 03:41:23 -04:00
Mike Frysinger
0a8eba9b7f ramfs: ignore unknown mount options
On systems where CONFIG_SHMEM is disabled, mounting tmpfs filesystems can
fail when tmpfs options are used.  This is because tmpfs creates a small
wrapper around ramfs which rejects unknown options, and ramfs itself only
supports a tiny subset of what tmpfs supports.  This makes it pretty hard
to use the same userspace systems across different configuration systems.
As such, ramfs should ignore the tmpfs options when tmpfs is merely a
wrapper around ramfs.

This used to work before commit c3b1b1cbf0 as previously, ramfs would
ignore all options.  But now, we get:
ramfs: bad mount option: size=10M
mount: mounting mdev on /dev failed: Invalid argument

Another option might be to restore the previous behavior, where ramfs
simply ignored all unknown mount options ... which is what Hugh prefers.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-14 17:58:25 -07:00
Aneesh Kumar K.V
62e086be5d ext4: Move __ext4_journalled_writepage() to avoid forward declaration
In addition, fix two unused variable warnings.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-14 17:59:34 -04:00
Aneesh Kumar K.V
43ce1d23b4 ext4: Fix mmap/truncate race when blocksize < pagesize && !nodellaoc
This patch fixes the mmap/truncate race that was fixed for delayed
allocation by merging ext4_{journalled,normal,da}_writepage() into
ext4_writepage().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-14 17:58:45 -04:00
Aneesh Kumar K.V
c364b22c95 ext4: Fix mmap/truncate race when blocksize < pagesize && delayed allocation
It is possible to see buffer_heads which are not mapped in the
writepage callback in the following scneario (where the fs blocksize
is 1k and the page size is 4k):

1) truncate(f, 1024)
2) mmap(f, 0, 4096)
3) a[0] = 'a'
4) truncate(f, 4096)
5) writepage(...)

Now if we get a writepage callback immediately after (4) and before an
attempt to write at any other offset via mmap address (which implies we
are yet to get a pagefault and do a get_block) what we would have is the
page which is dirty have first block allocated and the other three
buffer_heads unmapped.

In the above case the writepage should go ahead and try to write the
first blocks and clear the page_dirty flag. Further attempts to write
to the page will again create a fault and result in allocating blocks
and marking page dirty.  If we don't write any other offset via mmap
address we would still have written the first block to the disk and
rest of the space will be considered as a hole.

So to address this, we change all of the places where we look for
delayed, unmapped, or unwritten buffer heads, and only check for
delayed or unwritten buffer heads instead.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-14 17:57:10 -04:00
Theodore Ts'o
de9a55b841 ext4: Fix up whitespace issues in fs/ext4/inode.c
This is a pure cleanup patch.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-14 17:45:34 -04:00
Theodore Ts'o
0610b6e999 ext4: Fix 64-bit block type problem on 32-bit platforms
The function ext4_mb_free_blocks() was using an "unsigned long" to
pass a block number; this will cause 64-bit block numbers to get
truncated on x86 and other 32-bit platforms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2009-06-15 03:45:05 -04:00
Linus Torvalds
489f7ab6c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (31 commits)
  trivial: remove the trivial patch monkey's name from SubmittingPatches
  trivial: Fix a typo in comment of addrconf_dad_start()
  trivial: usb: fix missing space typo in doc
  trivial: pci hotplug: adding __init/__exit macros to sgi_hotplug
  trivial: Remove the hyphen from git commands
  trivial: fix ETIMEOUT -> ETIMEDOUT typos
  trivial: Kconfig: .ko is normally not included in module names
  trivial: SubmittingPatches: fix typo
  trivial: Documentation/dell_rbu.txt: fix typos
  trivial: Fix Pavel's address in MAINTAINERS
  trivial: ftrace:fix description of trace directory
  trivial: unnecessary (void*) cast removal in sound/oss/msnd.c
  trivial: input/misc: Fix typo in Kconfig
  trivial: fix grammo in bus_for_each_dev() kerneldoc
  trivial: rbtree.txt: fix rb_entry() parameters in sample code
  trivial: spelling fix in ppc code comments
  trivial: fix typo in bio_alloc kernel doc
  trivial: Documentation/rbtree.txt: cleanup kerneldoc of rbtree.txt
  trivial: Miscellaneous documentation typo fixes
  trivial: fix typo milisecond/millisecond for documentation and source comments.
  ...
2009-06-14 13:46:25 -07:00
Steve French
b70b92e41d Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-06-14 13:34:46 +00:00
Andreas Dilger
11013911da ext4: teach the inode allocator to use a goal inode number
Enhance the inode allocator to take a goal inode number as a
paremeter; if it is specified, it takes precedence over Orlov or
parent directory inode allocation algorithms.

The extents migration function uses the goal inode number so that the
extent trees allocated the migration function use the correct flex_bg.
In the future, the goal inode functionality will also be used to
allocate an adjacent inode for the extended attributes.

Also, for testing purposes the goal inode number can be specified via
/sys/fs/{dev}/inode_goal.  This can be useful for testing inode
allocation beyond 2^32 blocks on very large filesystems.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 11:45:35 -04:00
Theodore Ts'o
f157a4aa98 ext4: Use a hash of the topdir directory name for the Orlov parent group
Instead of using a random number to determine the goal parent grop for
the Orlov top directories, use a hash of the directory name.  This
allows for repeatable results when trying to benchmark filesystem
layout algorithms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 11:09:42 -04:00
Theodore Ts'o
4ab2f15b7f ext4: move the abort flag from s_mount_opts to s_mount_flags
We're running out of space in the mount options word, and
EXT4_MOUNT_ABORT isn't really a mount option, but a run-time flag.  So
move it to become EXT4_MF_FS_ABORTED in s_mount_flags.

Also remove bogus ext2_fs.h / ext4.h simultaneous #include protection,
which can never happen.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:36 -04:00
Theodore Ts'o
bc0b0d6d69 ext4: update the s_last_mounted field in the superblock
This field can be very helpful when a system administrator is trying
to sort through large numbers of block devices or filesystem images.
What is stored in this field can be ambiguous if multiple filesystem
namespaces are in play; what we store in practice is the mountpoint
interpreted by the process's namespace which first opens a file in the
filesystem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:48 -04:00
Theodore Ts'o
7f4520cc62 ext4: change s_mount_opt to be an unsigned int
We can only fit 32 options in s_mount_opt because an unsigned long is
32-bits on a x86 machine.  So use an unsigned int to save space on
64-bit platforms.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-13 10:09:41 -04:00
Akira Fujita
748de6736c ext4: online defrag -- Add EXT4_IOC_MOVE_EXT ioctl
The EXT4_IOC_MOVE_EXT exchanges the blocks between orig_fd and donor_fd,
and then write the file data of orig_fd to donor_fd.
ext4_mext_move_extent() is the main fucntion of ext4 online defrag,
and this patch includes all functions related to ext4 online defrag.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: Takashi Sato <t-sato@yk.jp.nec.com>
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 19:24:03 -04:00
Jeff Layton
1e68b2b275 cifs: add new routine for converting AF_INET and AF_INET6 addrs
...to consolidate some logic used in more than one place.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-13 08:17:30 +00:00
Jeff Layton
340481a364 cifs: have cifs_show_options show forceuid/forcegid options
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-13 08:17:30 +00:00
Jeff Layton
8616e0fc1e cifs: remove unneeded NULL checks from cifs_show_options
show_options is always called with the namespace_sem held. Therefore we
don't need to worry about the vfsmount being NULL, or it vanishing while
the function is running. By the same token, there's no need to worry
about the superblock, tcon, smb or tcp sessions being NULL on entry.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-13 08:17:30 +00:00
Felix Blyakher
fd40261354 Merge branch 'master' of git://oss.sgi.com/xfs/xfs into for-linus 2009-06-12 21:28:59 -05:00
Christoph Hellwig
e83f1eb6bf xfs: fix small mismerge in xfs_vn_mknod
Identation got messed up when merging the current_umask changes with
the generic ACL support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-12 21:15:31 -05:00
Christoph Hellwig
493b87e5ed xfs: fix warnings with CONFIG_XFS_QUOTA disabled
Fix warnings about unitialized dquot variables by making sure
xfs_qm_vop_dqalloc touches it even when quotas are disabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-12 21:15:12 -05:00
Linus Torvalds
f3ad116588 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/configfs
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/configfs:
  configfs: Rework configfs_depend_item() locking and make lockdep happy
  configfs: Silence lockdep on mkdir() and rmdir()
2009-06-12 18:21:19 -07:00
Linus Torvalds
c53567ad45 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: use more NOFS allocation
  dlm: connect to nodes earlier
  dlm: fix use count with multiple joins
  dlm: Make name input parameter of {,dlm_}new_lockspace() const
2009-06-12 13:17:12 -07:00
Linus Torvalds
c9b8af00ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (154 commits)
  [SCSI] osd: Remove out-of-tree left overs
  [SCSI] libosd: Use REQ_QUIET requests.
  [SCSI] osduld: use filp_open() when looking up an osd-device
  [SCSI] libosd: Define an osd_dev wrapper to retrieve the request_queue
  [SCSI] libosd: osd_req_{read,write} takes a length parameter
  [SCSI] libosd: Let _osd_req_finalize_data_integrity receive number of out_bytes
  [SCSI] libosd: osd_req_{read,write}_kern new API
  [SCSI] libosd: Better printout of OSD target system information
  [SCSI] libosd: OSD2r05: Attribute definitions
  [SCSI] libosd: OSD2r05: Additional command enums
  [SCSI] mpt fusion: fix up doc book comments
  [SCSI] mpt fusion: Added support for Broadcast primitives Event handling
  [SCSI] mpt fusion: Queue full event handling
  [SCSI] mpt fusion: RAID device handling and Dual port Raid support is added
  [SCSI] mpt fusion: Put IOC into ready state if it not already in ready state
  [SCSI] mpt fusion: Code Cleanup patch
  [SCSI] mpt fusion: Rescan SAS topology added
  [SCSI] mpt fusion: SAS topology scan changes, expander events
  [SCSI] mpt fusion: Firmware event implementation using seperate WorkQueue
  [SCSI] mpt fusion: rewrite of ioctl_cmds internal generated function
  ...
2009-06-12 09:50:42 -07:00
Linus Torvalds
6cb8a91174 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
  GFS2: Remove lock_kernel from gfs2_put_super()
  GFS2: Add tracepoints
2009-06-12 09:43:44 -07:00
Linus Torvalds
7f3591cfac Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest: (31 commits)
  lguest: add support for indirect ring entries
  lguest: suppress notifications in example Launcher
  lguest: try to batch interrupts on network receive
  lguest: avoid sending interrupts to Guest when no activity occurs.
  lguest: implement deferred interrupts in example Launcher
  lguest: remove obsolete LHREQ_BREAK call
  lguest: have example Launcher service all devices in separate threads
  lguest: use eventfds for device notification
  eventfd: export eventfd_signal and eventfd_fget for lguest
  lguest: allow any process to send interrupts
  lguest: PAE fixes
  lguest: PAE support
  lguest: Add support for kvm_hypercall4()
  lguest: replace hypercall name LHCALL_SET_PMD with LHCALL_SET_PGD
  lguest: use native_set_* macros, which properly handle 64-bit entries when PAE is activated
  lguest: map switcher with executable page table entries
  lguest: fix writev returning short on console output
  lguest: clean up length-used value in example launcher
  lguest: Segment selectors are 16-bit long. Fix lg_cpu.ss1 definition.
  lguest: beyond ARRAY_SIZE of cpu->arch.gdt
  ...
2009-06-12 09:32:26 -07:00
Linus Torvalds
c34752bc8b Merge branch 'cuse' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'cuse' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  CUSE: implement CUSE - Character device in Userspace
  fuse: export symbols to be used by CUSE
  fuse: update fuse_conn_init() and separate out fuse_conn_kill()
  fuse: don't use inode in fuse_file_poll
  fuse: don't use inode in fuse_do_ioctl() helper
  fuse: don't use inode in fuse_sync_release()
  fuse: create fuse_do_open() helper for CUSE
  fuse: clean up args in fuse_finish_open() and fuse_release_fill()
  fuse: don't use inode in helpers called by fuse_direct_io()
  fuse: add members to struct fuse_file
  fuse: prepare fuse_direct_io() for CUSE
  fuse: clean up fuse_write_fill()
  fuse: use struct path in release structure
  fuse: misc cleanups
2009-06-12 09:31:20 -07:00
Linus Torvalds
d614aec475 Merge branch 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6
* 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (29 commits)
  ide: re-implement ide_pci_init_one() on top of ide_pci_init_two()
  ide: unexport ide_find_dma_mode()
  ide: fix PowerMac bootup oops
  ide: skip probe if there are no devices on the port (v2)
  sl82c105: add printk() logging facility
  ide-tape: fix proc warning
  ide: add IDE_DFLAG_NIEN_QUIRK device flag
  ide: respect quirk_drives[] list on all controllers
  hpt366: enable all quirks for devices on quirk_drives[] list
  hpt366: sync quirk_drives[] list with pdc202xx_{new,old}.c
  ide: remove superfluous SELECT_MASK() call from do_rw_taskfile()
  ide: remove superfluous SELECT_MASK() call from ide_driveid_update()
  icside: remove superfluous ->maskproc method
  ide-tape: fix IDE_AFLAG_* atomic accesses
  ide-tape: change IDE_AFLAG_IGNORE_DSC non-atomically
  pdc202xx_old: kill resetproc() method
  pdc202xx_old: don't call pdc202xx_reset() on IRQ timeout
  pdc202xx_old: use ide_dma_test_irq()
  ide: preserve Host Protected Area by default (v2)
  ide-gd: implement block device ->set_capacity method (v2)
  ...
2009-06-12 09:29:42 -07:00
Nikanth Karthikesan
76d93ff344 trivial: fix typo in bio_alloc kernel doc
Fix typo in bio_alloc kernel doc.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-06-12 18:01:47 +02:00
Thadeu Lima de Souza Cascardo
ab2274af05 trivial: fix typo compatiable/compatiability has extra 'a'.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-06-12 18:01:46 +02:00
Wolfram Sang
2eadfc0ed6 trivial: fs/inode: Fix typo in file_update_time nanodoc
The advertised flag for not updating the time was wrong.

Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-06-12 18:01:45 +02:00
Nikanth Karthikesan
ff677f8d10 trivial: fix comment typo in fs/compat.c
Fix a typo in fs/compat.c

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-06-12 18:01:44 +02:00
Ali Gholami Rudi
88164ff4fc trivial: ext2: fix a typo in comment in ext2.h
Signed-off-by: Ali Gholami Rudi <ali@rudi.ir>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-06-12 18:01:44 +02:00
Felix Blyakher
7747a0b0af xfs: fix freeing memory in xfs_getbmap()
Regression from commit 28e211700a.
Need to free temporary buffer allocated in xfs_getbmap().

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reported-by: Justin Piszcz <jpiszcz@lucidpixels.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-06-12 10:26:52 -05:00
James Bottomley
82681a318f [SCSI] Merge branch 'linus'
Conflicts:
	drivers/message/fusion/mptsas.c

fixed up conflict between req->data_len accessors and mptsas driver updates.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-12 10:02:03 -05:00
Rusty Russell
5718607bb6 eventfd: export eventfd_signal and eventfd_fget for lguest
lguest wants to attach eventfds to guest notifications, and lguest is
usually a module.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: Davide Libenzi <davidel@xmailserver.org>
2009-06-12 22:27:09 +09:30
Steven Whitehouse
3ea400581f GFS2: Remove lock_kernel from gfs2_put_super()
It is not required here.

Signed-off-by: Steven Whitehouse <swhiteho@redhat,com>
Cc: Christoph Hellwig <hch@infradead.org>
2009-06-12 13:40:47 +01:00
Steven Whitehouse
63997775b7 GFS2: Add tracepoints
This patch adds the ability to trace various aspects of the GFS2
filesystem. The trace points are divided into three groups,
glocks, logging and bmap. These points have been chosen because
they allow inspection of the major internal functions of GFS2
and they are also generic enough that they are unlikely to need
any major changes as the filesystem evolves.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-12 08:49:20 +01:00
Ryusuke Konishi
aa7dfb8954 nilfs2: get rid of bd_mount_sem use from nilfs
This will remove every bd_mount_sem use in nilfs.

The intended exclusion control was replaced by the previous patch
("nilfs2: correct exclusion control in nilfs_remount function") for
nilfs_remount(), and this patch will replace remains with a new mutex
that this inserts in nilfs object.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:18 -04:00
Ryusuke Konishi
e59399d010 nilfs2: correct exclusion control in nilfs_remount function
nilfs_remount() changes mount state of a superblock instance.  Even
though nilfs accesses other superblock instances during mount or
remount, the mount state was not properly protected in
nilfs_remount().

Moreover, nilfs_remount() has a lock order reversal problem;
nilfs_get_sb() holds:

  1. bdev->bd_mount_sem
  2. sb->s_umount  (sget acquires)

and nilfs_remount() holds:

  1. sb->s_umount  (locked by the caller in vfs)
  2. bdev->bd_mount_sem

To avoid these problems, this patch divides a semaphore protecting
super block instances from nilfs->ns_sem, and applies it to the mount
state protection in nilfs_remount().

With this change, bd_mount_sem use is removed from nilfs_remount() and
the lock order reversal will be resolved.  And the new rw-semaphore,
nilfs->ns_super_sem will properly protect the mount state except the
modification from nilfs_error function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:18 -04:00
Ryusuke Konishi
6dd4740662 nilfs2: simplify remaining sget() use
This simplifies the test function passed on the remaining sget()
callsite in nilfs.

Instead of checking mount type (i.e. ro-mount/rw-mount/snapshot mount)
in the test function passed to sget(), this patch first looks up the
nilfs_sb_info struct which the given mount type matches, and then
acquires the super block instance holding the nilfs_sb_info.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:18 -04:00
Ryusuke Konishi
3f82ff5516 nilfs2: get rid of sget use for checking if current mount is present
This stops using sget() for checking if an r/w-mount or an r/o-mount
exists on the device.  This elimination uses a back pointer to the
current mount added to nilfs object.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:17 -04:00
Ryusuke Konishi
33c8e57c86 nilfs2: get rid of sget use for acquiring nilfs object
This will change the way to obtain nilfs object in nilfs_get_sb()
function.

Previously, a preliminary sget() call was performed, and the nilfs
object was acquired from a super block instance found by the sget()
call.

This patch, instead, instroduces a new dedicated function
find_or_create_nilfs(); as the name implies, the function finds an
existent nilfs object from a global list or creates a new one if no
object is found on the device.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:17 -04:00
Ryusuke Konishi
81fc20bd0e nilfs2: remove meaningless EBUSY case from nilfs_get_sb function
The following EBUSY case in nilfs_get_sb() is meaningless.  Indeed,
this error code is never returned to the caller.

    if (!s->s_root) {
          ...
    } else if (!(s->s_flags & MS_RDONLY)) {
        err = -EBUSY;
    }

This simply removes the else case.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:17 -04:00
Christoph Hellwig
0c95ee190e remove the call to ->write_super in __sync_filesystem
Now that all filesystems provide ->sync_fs methods we can change
__sync_filesystem to only call ->sync_fs.

This gives us a clear separation between periodic writeouts which
are driven by ->write_super and data integrity syncs that go
through ->sync_fs. (modulo file_fsync which is also going away)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:17 -04:00
Christoph Hellwig
d731e06323 nilfs2: call nilfs2_write_super from nilfs2_sync_fs
The call to ->write_super from __sync_filesystem will go away, so make
sure nilfs2 performs the same actions from inside ->sync_fs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:17 -04:00
Christoph Hellwig
d579ed00aa jffs2: call jffs2_write_super from jffs2_sync_fs
The call to ->write_super from __sync_filesystem will go away, so make
sure jffs2 performs the same actions from inside ->sync_fs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:16 -04:00
Christoph Hellwig
8c8006564a ufs: add ->sync_fs
Add a ->sync_fs method for data integrity syncs, and reimplement
->write_super ontop of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:16 -04:00
Christoph Hellwig
ad43ffdeea sysv: add ->sync_fs
Add a ->sync_fs method for data integrity syncs, and reimplement
->write_super ontop of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:16 -04:00
Christoph Hellwig
7fbc6df0e7 hfsplus: add ->sync_fs
Add a ->sync_fs method for data integrity syncs, and reimplement
->write_super ontop of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:16 -04:00
Christoph Hellwig
58bc5bbb87 hfs: add ->sync_fs
Add a ->sync_fs method for data integrity syncs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:15 -04:00
Christoph Hellwig
f83d6d46e7 fat: add ->sync_fs
Add a ->sync_fs method for data integrity syncs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:15 -04:00
Christoph Hellwig
40f31dd47e ext2: add ->sync_fs
Add a ->sync_fs method for data integrity syncs, and reimplement
->write_super ontop of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:15 -04:00
Christoph Hellwig
80e09fb942 exofs: add ->sync_fs
Add a ->sync_fs method for data integrity syncs, and reimplement
->write_super ontop of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:15 -04:00
Christoph Hellwig
561e47ce72 bfs: add ->sync_fs
Add a ->sync_fs method for data integrity syncs, and reimplement
->write_super ontop of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:14 -04:00
Christoph Hellwig
e28964365f affs: add ->sync_fs
Add a ->sync_fs method for data integrity syncs.  Factor out common code
between affs_put_super, affs_write_super and the new affs_sync_fs into
a helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:14 -04:00
Al Viro
c475879556 sanitize ->fsync() for affs
unfortunately, for affs (especially for affs directories) we have
no real way to keep track of metadata ownership.  So we have to
do more or less what file_fsync() does, but we do *not* need to
call write_super() there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:14 -04:00
Al Viro
4427f0c36e repair bfs_write_inode(), switch bfs to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:14 -04:00
Al Viro
224c886643 Fix adfs GET_FRAG_ID() on big-endian
Missing conversion to host-endian before doing shifts

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:14 -04:00
Al Viro
ffdc9064f8 repair adfs ->write_inode(), switch to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:13 -04:00
Al Viro
bea6b64c27 switch omfs to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:13 -04:00
Al Viro
90de066443 switch udf to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:13 -04:00
Al Viro
a932801543 switch ufs to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:13 -04:00
Al Viro
05459ca81a repair sysv_write_inode(), switch sysv to simple_fsync()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
0d7916d7e9 switch minix to simple_fsync()
* get minix_write_inode() to honour the second argument
* now we can use simple_fsync() for minixfs

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
e1740a462e switch ext2 to simple_fsync()
kill ext2_sync_file() (along with ext2/fsync.c), get rid of
ext2_update_inode() - it's an alias of ext2_write_inode().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
b522412aea Sanitize ->fsync() for FAT
* mark directory data blocks as assoc. metadata
* add new inode to deal with FAT, mark FAT blocks as assoc. metadata of that
* now ->fsync() is trivial both for files and directories

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
964f536966 fs/qnx4: sanitize includes
fs-internal parts of qnx4_fs.h taken to fs/qnx4/qnx4.h, includes adjusted,
qnx4_fs.h doesn't need unifdef anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:12 -04:00
Al Viro
79d2576758 Sanitize qnx4 fsync handling
* have directory operations use mark_buffer_dirty_inode(),
  so that sync_mapping_buffers() would get those.
* make qnx4_write_inode() honour its last argument.
* get rid of insane copies of very ancient "walk the indirect blocks"
  in qnx4/fsync - they never matched the actual fs layout and, fortunately,
  never'd been called.  Again, all this junk is not needed; ->fsync()
  should just do sync_mapping_buffers + sync_inode (and if we implement
  block allocation for qnx4, we'll need to use mark_buffer_dirty_inode()
  for extent blocks)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Al Viro
d5aacad548 New helper - simple_fsync()
writes associated buffers, then does sync_inode() to write
the inode itself (and to make it clean).  Depends on
->write_inode() honouring the second argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Alessio Igor Bogani
337eb00a2c Push BKL down into ->remount_fs()
[xfs, btrfs, capifs, shmem don't need BKL, exempt]

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:11 -04:00
Nick Piggin
4195f73d13 fs: block_dump missing dentry locking
I think the block_dump output in __mark_inode_dirty is missing dentry locking.
Surely the i_dentry list can change any time, so we may not even *get* a
dentry there. If we do get one by chance, then it would appear to be able to
go away or get renamed at any time...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:10 -04:00
Nick Piggin
545b9fd3d7 fs: remove incorrect I_NEW warnings
Some filesystems can call in to sync an inode that is still in the
I_NEW state (eg. ext family, when mounted with -osync). This is OK
because the filesystem has sole access to the new inode, so it can
modify i_state without races (because no other thread should be
modifying it, by definition of I_NEW). Ie. a false positive, so
remove the warnings.

The races are described here 7ef0d7377c,
which is also where the warnings were introduced.

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:10 -04:00
Christoph Hellwig
f95022161d xfs: remove ->write_super and stop maintaining ->s_dirt
the write_super method is used for

 (1) writing back the superblock periodically from pdflush
 (2) called just before ->sync_fs for data integerity syncs

We don't need (1) because we have our own peridoc writeout through xfssyncd,
and we don't need (2) because xfs_fs_sync_fs performs a proper synchronous
superblock writeout after all other data and metadata has been written out.

Also remove ->s_dirt tracking as it's only used to decide when too call
->write_super.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:10 -04:00
Jens Axboe
13205fb926 ntfs: remove old debug check for dirty data in ntfs_put_super()
This should not trigger anymore, so kill it.

Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:10 -04:00
Theodore Ts'o
9fd5746fd3 fs: Remove i_cindex from struct inode
The only user of the i_cindex element in the inode structure is used
is by the firewire drivers.  As part of an attempt to slim down the
inode structure to save memory --- since a typical Linux system will
have hundreds of thousands if not millions of inodes cached, a
reduction in the size inode has high leverage.

The firewire driver does not need i_cindex in any fast path, so it's
simple enough to calculate when it is needed, instead of wasting space
in the inode structure.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: krh@redhat.com
Cc: stefanr@s5r6.in-berlin.de
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Christoph Hellwig
ebc1ac1645 ->write_super lock_super pushdown
Push down lock_super into ->write_super instances and remove it from the
caller.

Following filesystem don't need ->s_lock in ->write_super and are skipped:

 * bfs, nilfs2 - no other uses of s_lock and have internal locks in
	->write_super
 * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
 * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
 	->write_super
 * xfs - no other uses of s_lock and uses internal lock (buffer lock on
	superblock buffer) to serialize ->write_super.  Also xfs_fs_write_super
	is superflous and will go away in the next merge window

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Christoph Hellwig
01ba687577 jffs2: move jffs2_write_super to super.c
jffs2_write_super is only called from super.c and doesn't use any
functionality from fs.c.  So move it over to super.c and make it
static there.

[should go in through the vfs tree as it is a requirement for the
 next patch]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:09 -04:00
Al Viro
4aa98cf768 Push BKL down into do_remount_sb()
[folded fix from Jiri Slaby]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Al Viro
7f78d4cd4c Push BKL down beyond VFS-only parts of do_mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Al Viro
6fac98dd21 Push BKL into do_mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Al Viro
bbd6851a32 Push lock_super() into the ->remount_fs() of filesystems that care about it
Note that since we can't run into contention between remount_fs and write_super
(due to exclusion on s_umount), we have to care only about filesystems that
touch lock_super() on their own.  Out of those ext3, ext4, hpfs, sysv and ufs
do need it; fat doesn't since its ->remount_fs() only accesses assign-once
data (basically, it's "we have no atime on directories and only have atime on
files for vfat; force nodiratime and possibly noatime into *flags").

[folded a build fix from hch]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:08 -04:00
Christoph Hellwig
6cfd014842 push BKL down into ->put_super
Move BKL into ->put_super from the only caller.  A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment.  Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.

[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro
a9e220f832 No need to do lock_super() for exclusion in generic_shutdown_super()
We can't run into contention on it.  All other callers of lock_super()
either hold s_umount (and we have it exclusive) or hold an active
reference to superblock in question, which prevents the call of
generic_shutdown_super() while the reference is held.  So we can
replace lock_super(s) with get_fs_excl() in generic_shutdown_super()
(and corresponding change for unlock_super(), of course).

Since ext4 expects s_lock held for its put_super, take lock_super()
into it.  The rest of filesystems do not care at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro
62c6943b4b Trim a bit of crap from fs.h
do_remount_sb() is fs/internal.h fodder, fsync_no_super() is long gone.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Al Viro
443b94baaa Make sure that all callers of remount hold s_umount exclusive
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:07 -04:00
Christoph Hellwig
5af7926ff3 enforce ->sync_fs is only called for rw superblock
Make sure a superblock really is writeable by checking MS_RDONLY
under s_umount.  sync_filesystems needed some re-arragement for
that, but all but one sync_filesystem caller had the correct locking
already so that we could add that check there.  cachefiles grew
s_umount locking.

I've also added a WARN_ON to sync_filesystem to assert this for
future callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Christoph Hellwig
e500475338 cleanup sync_supers
Merge the write_super helper into sync_super and move the check for
->write_super earlier so that we can avoid grabbing a reference to
a superblock that doesn't have it.

While we're at it also add a little comment documenting sync_supers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Alexey Dobriyan
f3da392e9f dcache: extrace and use d_unlinked()
d_unlinked() will be used in middle-term to ban checkpointing when opened
but unlinked file is detected, and in long term, to detect such situation
and special case on it.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Christoph Hellwig
8c85e12512 remove ->write_super call in generic_shutdown_super
We just did a full fs writeout using sync_filesystem before, and if
that's not enough for the filesystem it can perform it's own writeout
in ->put_super, which many filesystems already do.

Move a call to foofs_write_super into every foofs_put_super for now to
guarantee identical behaviour until it's cleaned up by the individual
filesystem maintainers.

Exceptions:

 - affs already has identical copy & pasted code at the beginning of
   affs_put_super so no need to do it twice.
 - xfs does the right thing without it and I have changes pending for
   the xfs tree touching this are so I don't really need conflicts
   here..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:06 -04:00
Christoph Hellwig
517bfae283 qnx4: remove ->write_super
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Christoph Hellwig
94cb993f2e ocfs2: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Christoph Hellwig
b7d245de25 gfs2: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Christoph Hellwig
ca41f7b918 ext3: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Christoph Hellwig
59d697b702 btrfs: remove ->write_super and stop maintaining ->s_dirt
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:05 -04:00
Jan Kara
c3f8a40c1c quota: Introduce writeout_quota_sb() (version 4)
Introduce this function which just writes all the quota structures but
avoids all the syncing and cache pruning work to expose quota structures
to userspace. Use this function from __sync_filesystem when wait == 0.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:04 -04:00
Christoph Hellwig
850b201b08 quota: cleanup dquota sync functions (version 4)
Currently the VFS calls vfs_dq_sync to sync out disk quotas for a given
superblock.  This is a small wrapper around sync_dquots which for the
case of a non-NULL superblock is a small wrapper around quota_sync_sb.

Just make quota_sync_sb global (rename it to sync_quota_sb) and call it
directly.  Also call it directly for those cases in quota.c that have a
superblock and leave sync_dquots purely an iterator over sync_quota_sb and
remove it's superblock argument.

To make this nicer move the check for the lack of a quota_sync method
from the callers into sync_quota_sb.

[folded build fix from Alexander Beregalov <a.beregalov@gmail.com>]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:04 -04:00
Jan Kara
60b0680fa2 vfs: Rename fsync_super() to sync_filesystem() (version 4)
Rename the function so that it better describe what it really does. Also
remove the unnecessary include of buffer_head.h.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:04 -04:00
Jan Kara
c15c54f5f0 vfs: Move syncing code from super.c to sync.c (version 4)
Move sync_filesystems(), __fsync_super(), fsync_super() from
super.c to sync.c where it fits better.

[build fixes folded]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:04 -04:00
Jan Kara
5cee5815d1 vfs: Make sys_sync() use fsync_super() (version 4)
It is unnecessarily fragile to have two places (fsync_super() and do_sync())
doing data integrity sync of the filesystem. Alter __fsync_super() to
accommodate needs of both callers and use it. So after this patch
__fsync_super() is the only place where we gather all the calls needed to
properly send all data on a filesystem to disk.

Nice bonus is that we get a complete livelock avoidance and write_supers()
is now only used for periodic writeback of superblocks.

sync_blockdevs() introduced a couple of patches ago is gone now.

[build fixes folded]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:03 -04:00
Jan Kara
429479f031 vfs: Make __fsync_super() a static function (version 4)
__fsync_super() does the same thing as fsync_super(). So change the only
caller to use fsync_super() and make __fsync_super() static. This removes
unnecessarily duplicated call to sync_blockdev() and prepares ground
for the changes to __fsync_super() in the following patches.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:03 -04:00
Jan Kara
bfe881255c vfs: Call ->sync_fs() even if s_dirt is 0 (version 4)
sync_filesystems() has a condition that if wait == 0 and s_dirt == 0, then
->sync_fs() isn't called. This does not really make much sence since s_dirt is
generally used by a filesystem to mean that ->write_super() needs to be called.
But ->sync_fs() does different things. I even suspect that some filesystems
(btrfs?) sets s_dirt just to fool this logic.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:03 -04:00
Jan Kara
5a3e5cb8e0 vfs: Fix sys_sync() and fsync_super() reliability (version 4)
So far, do_sync() called:
  sync_inodes(0);
  sync_supers();
  sync_filesystems(0);
  sync_filesystems(1);
  sync_inodes(1);

This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
and others are hit by this) when racing e.g. with background writeback. A
similar problem hits also other filesystems (e.g. ext2) because of
write_supers() being called before the sync_inodes(1).

Change the ordering of calls in do_sync() - this requires a new function
sync_blockdevs() to preserve the property that block devices are always synced
after write_super() / sync_fs() call.

The same issue is fixed in __fsync_super() function used on umount /
remount read-only.

[AV: build fixes]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:03 -04:00
Christoph Hellwig
876a9f76ab remove s_async_list
Remove the unused s_async_list in the superblock, a leftover of the
broken async inode deletion code that leaked into mainline.  Having this
in the middle of the sync/unmount path is not helpful for the following
cleanups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
npiggin@suse.de
864d7c4c06 fs: move mark_files_ro into file_table.c
This function walks the s_files lock, and operates primarily on the
files in a superblock, so it better belongs here (eg. see also
fs_may_remount_ro).

[AV: ... and it shouldn't be static after that move]

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
npiggin@suse.de
96029c4e09 fs: introduce mnt_clone_write
This patch speeds up lmbench lat_mmap test by about another 2% after the
first patch.

Before:
 avg = 462.286
 std = 5.46106

After:
 avg = 453.12
 std = 9.58257

(50 runs of each, stddev gives a reasonable confidence)

It does this by introducing mnt_clone_write, which avoids some heavyweight
operations of mnt_want_write if called on a vfsmount which we know already
has a write count; and mnt_want_write_file, which can call mnt_clone_write
if the file is open for write.

After these two patches, mnt_want_write and mnt_drop_write go from 7% on
the profile down to 1.3% (including mnt_clone_write).

[AV: mnt_want_write_file() should take file alone and derive mnt from it;
not only all callers have that form, but that's the only mnt about which
we know that it's already held for write if file is opened for write]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
npiggin@suse.de
d3ef3d7351 fs: mnt_want_write speedup
This patch speeds up lmbench lat_mmap test by about 8%. lat_mmap is set up
basically to mmap a 64MB file on tmpfs, fault in its pages, then unmap it.
A microbenchmark yes, but it exercises some important paths in the mm.

Before:
 avg = 501.9
 std = 14.7773

After:
 avg = 462.286
 std = 5.46106

(50 runs of each, stddev gives a reasonable confidence, but there is quite
a bit of variation there still)

It does this by removing the complex per-cpu locking and counter-cache and
replaces it with a percpu counter in struct vfsmount. This makes the code
much simpler, and avoids spinlocks (although the msync is still pretty
costly, unfortunately). It results in about 900 bytes smaller code too. It
does increase the size of a vfsmount, however.

It should also give a speedup on large systems if CPUs are frequently operating
on different mounts (because the existing scheme has to operate on an atomic in
the struct vfsmount when switching between mounts). But I'm most interested in
the single threaded path performance for the moment.

[AV: minor cleanup]

Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:02 -04:00
Al Viro
3174c21b74 Move junk from proc_fs.h to fs/proc/internal.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
1c755af4df switch lookup_mnt()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
79ed022619 switch follow_mount()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
9393bd07cf switch follow_down()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
589ff870ed Switch collect_mounts() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:01 -04:00
Al Viro
bab77ebf51 switch follow_up() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:00 -04:00
Al Viro
e64c390ca0 switch rqst_exp_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:00 -04:00
Al Viro
91c9fa8f75 switch rqst_exp_get_by_name()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:00 -04:00
Al Viro
5bf3bd2b5c switch exp_parent() to struct path
... and lose the always-NULL last argument (non-NULL case had been
split off a while ago).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:36:00 -04:00
Al Viro
55430e2ece nfsd struct path use: exp_get_by_name()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:59 -04:00
Al Viro
dd5cae6e97 Don't bother with check_mnt() in do_add_mount() on shrinkable ones
These guys are what we add as submounts; checks for "is that attached in
our namespace" are simply irrelevant for those and counterproductive for
use of private vfsmount trees a-la what NFS folks want.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:59 -04:00
Al Viro
5b85711953 Make vfs_path_lookup() use starting point as root
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:59 -04:00
Al Viro
2a73787110 Cache root in nameidata
New field: nd->root.  When pathname resolution wants to know the root,
check if nd->root.mnt is non-NULL; use nd->root if it is, otherwise
copy current->fs->root there.  After path_walk() is finished, we check
if we'd got a cached value in nd->root and drop it.  Before calling
path_walk() we should either set nd->root.mnt to NULL *or* copy (and
pin down) some path to nd->root.  In the latter case we won't be
looking at current->fs->root at all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:59 -04:00
Al Viro
9b4a9b14a7 Preparations to caching root in path_walk()
Split do_path_lookup(), opencode the call from do_filp_open()
do_filp_open() is the only caller of do_path_lookup() that
cares about root afterwards (it keeps resolving symlinks on
O_CREAT path after it'd done LOOKUP_PARENT walk).  So when
we start caching fs->root in path_walk(), it'll need a different
treatment.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:58 -04:00
Al Viro
4e44b6852e Get rid of path_lookup in autofs4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:58 -04:00
Jeff Mahoney
73422811d2 reiserfs: allow exposing privroot w/ xattrs enabled
This patch adds an -oexpose_privroot option to allow access to the privroot.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-11 21:35:58 -04:00
Felix Blyakher
35fd035968 Merge branch 'master' of git://git.kernel.org/pub/scm/fs/xfs/xfs 2009-06-11 16:56:49 -05:00
Linus Torvalds
a525890cb6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (23 commits)
  Btrfs: fix extent_buffer leak during tree log replay
  Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
  Btrfs: fix -o nodatasum printk spelling
  Btrfs: check duplicate backrefs for both data and metadata
  Btrfs: init worker struct fields before kthread-run
  Btrfs: pin buffers during write_dev_supers
  Btrfs: avoid races between super writeout and device list updates
  Fix btrfs when ACLs are configured out
  Btrfs: fdatasync should skip metadata writeout
  Btrfs: remove crc32c.h and use libcrc32c directly.
  Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
  Btrfs: autodetect SSD devices
  Btrfs: add mount -o ssd_spread to spread allocations out
  Btrfs: avoid allocation clusters that are too spread out
  Btrfs: Add mount -o nossd
  Btrfs: avoid IO stalls behind congested devices in a multi-device FS
  Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
  Btrfs: fix metadata dirty throttling limits
  Btrfs: reduce mount -o ssd CPU usage
  Btrfs: balance btree more often
  ...
2009-06-11 14:23:12 -07:00
Linus Torvalds
3bb66d7f8c Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  fsnotify: allow groups to set freeing_mark to null
  inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD
  dnotify: do not bother to lock entry->lock when reading mask
  dnotify: do not use ?true:false when assigning to a bool
  fsnotify: move events should indicate the event was on a child
  inotify: reimplement inotify using fsnotify
  fsnotify: handle filesystem unmounts with fsnotify marks
  fsnotify: fsnotify marks on inodes pin them in core
  fsnotify: allow groups to add private data to events
  fsnotify: add correlations between events
  fsnotify: include pathnames with entries when possible
  fsnotify: generic notification queue and waitq
  dnotify: reimplement dnotify using fsnotify
  fsnotify: parent event notification
  fsnotify: add marks to inodes so groups can interpret how to handle those inodes
  fsnotify: unified filesystem notification backend
2009-06-11 14:22:55 -07:00
Linus Torvalds
512626a04e Merge branch 'for-linus' of git://linux-arm.org/linux-2.6
* 'for-linus' of git://linux-arm.org/linux-2.6:
  kmemleak: Add the corresponding MAINTAINERS entry
  kmemleak: Simple testing module for kmemleak
  kmemleak: Enable the building of the memory leak detector
  kmemleak: Remove some of the kmemleak false positives
  kmemleak: Add modules support
  kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
  kmemleak: Add the vmalloc memory allocation/freeing hooks
  kmemleak: Add the slub memory allocation/freeing hooks
  kmemleak: Add the slob memory allocation/freeing hooks
  kmemleak: Add the slab memory allocation/freeing hooks
  kmemleak: Add documentation on the memory leak detector
  kmemleak: Add the base support

Manual conflict resolution (with the slab/earlyboot changes) in:
	drivers/char/vt.c
	init/main.c
	mm/slab.c
2009-06-11 14:15:57 -07:00
Linus Torvalds
8a1ca8cedd Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
  perf_counter: Turn off by default
  perf_counter: Add counter->id to the throttle event
  perf_counter: Better align code
  perf_counter: Rename L2 to LL cache
  perf_counter: Standardize event names
  perf_counter: Rename enums
  perf_counter tools: Clean up u64 usage
  perf_counter: Rename perf_counter_limit sysctl
  perf_counter: More paranoia settings
  perf_counter: powerpc: Implement generalized cache events for POWER processors
  perf_counters: powerpc: Add support for POWER7 processors
  perf_counter: Accurate period data
  perf_counter: Introduce struct for sample data
  perf_counter tools: Normalize data using per sample period data
  perf_counter: Annotate exit ctx recursion
  perf_counter tools: Propagate signals properly
  perf_counter tools: Small frequency related fixes
  perf_counter: More aggressive frequency adjustment
  perf_counter/x86: Fix the model number of Intel Core2 processors
  perf_counter, x86: Correct some event and umask values for Intel processors
  ...
2009-06-11 14:01:07 -07:00
Eric Paris
a092ee20fd fsnotify: allow groups to set freeing_mark to null
Most fsnotify listeners (all but inotify) do not care about marks being
freed.  Allow groups to set freeing_mark to null and do not call any
function if it is set that way.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11 14:57:55 -04:00
Eric Paris
e42e27736d inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD
inotify and dnotify will both indicate that they want any event which came
from a child inode.  The fix is to mask off FS_EVENT_ON_CHILD when deciding
if inotify or dnotify is interested in a given event.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11 14:57:54 -04:00
Eric Paris
ce61856bd2 dnotify: do not bother to lock entry->lock when reading mask
entry->lock is needed to make sure entry->mask does not change while
manipulating it.  In dnotify_should_send_event() we don't care if we get an
old or a new mask value out of this entry so there is no point it taking
the lock.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11 14:57:54 -04:00
Eric Paris
5ac697b793 dnotify: do not use ?true:false when assigning to a bool
dnotify_should send event assigned a bool using ?true:false when computing
a bit operation.  This is poitless and the bool type does this for us.

Signed-off-by: Eric Paris <eparis@redhat.com>
2009-06-11 14:57:54 -04:00
Eric Paris
63c882a054 inotify: reimplement inotify using fsnotify
Reimplement inotify_user using fsnotify.  This should be feature for feature
exactly the same as the original inotify_user.  This does not make any changes
to the in kernel inotify feature used by audit.  Those patches (and the eventual
removal of in kernel inotify) will come after the new inotify_user proves to be
working correctly.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
164bc61951 fsnotify: handle filesystem unmounts with fsnotify marks
When an fs is unmounted with an fsnotify mark entry attached to one of its
inodes we need to destroy that mark entry and we also (like inotify) send
an unmount event.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
1ef5f13c6c fsnotify: fsnotify marks on inodes pin them in core
This patch pins any inodes with an fsnotify mark in core.  The idea is that
as soon as the mark is removed from the inode->fsnotify_mark_entries list
the inode will be iput.  In reality is doesn't quite work exactly this way.
The igrab will happen when the mark is added to an inode, but the iput will
happen when the inode pointer is NULL'd inside the mark.

It's possible that 2 racing things will try to remove the mark from
different directions.  One may try to remove the mark because of an
explicit request and one might try to remove it because the inode was
deleted.  It's possible that the removal because of inode deletion will
remove the mark from the inode's list, but the removal by explicit request
will actually set entry->inode == NULL; and call the iput.  This is safe.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
e4aff11736 fsnotify: allow groups to add private data to events
inotify needs per group information attached to events.  This patch allows
groups to attach private information and implements a callback so that
information can be freed when an event is being destroyed.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
47882c6f51 fsnotify: add correlations between events
As part of the standard inotify events it includes a correlation cookie
between two dentry move operations.  This patch includes the same behaviour
in fsnotify events.  It is needed so that inotify userspace can be
implemented on top of fsnotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:54 -04:00
Eric Paris
62ffe5dfba fsnotify: include pathnames with entries when possible
When inotify wants to send events to a directory about a child it includes
the name of the original file.  This patch collects that filename and makes
it available for notification.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
a2d8bc6cb4 fsnotify: generic notification queue and waitq
inotify needs to do asyc notification in which event information is stored
on a queue until the listener is ready to receive it.  This patch
implements a generic notification queue for inotify (and later fanotify) to
store events to be sent at a later time.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
3c5119c05d dnotify: reimplement dnotify using fsnotify
Reimplement dnotify using fsnotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
c28f7e56e9 fsnotify: parent event notification
inotify and dnotify both use a similar parent notification mechanism.  We
add a generic parent notification mechanism to fsnotify for both of these
to use.  This new machanism also adds the dentry flag optimization which
exists for inotify to dnotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
3be25f49b9 fsnotify: add marks to inodes so groups can interpret how to handle those inodes
This patch creates a way for fsnotify groups to attach marks to inodes.
These marks have little meaning to the generic fsnotify infrastructure
and thus their meaning should be interpreted by the group that attached
them to the inode's list.

dnotify and inotify  will make use of these markings to indicate which
inodes are of interest to their respective groups.  But this implementation
has the useful property that in the future other listeners could actually
use the marks for the exact opposite reason, aka to indicate which inodes
it had NO interest in.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:53 -04:00
Eric Paris
90586523eb fsnotify: unified filesystem notification backend
fsnotify is a backend for filesystem notification.  fsnotify does
not provide any userspace interface but does provide the basis
needed for other notification schemes such as dnotify.  fsnotify
can be extended to be the backend for inotify or the upcoming
fanotify.  fsnotify provides a mechanism for "groups" to register for
some set of filesystem events and to then deliver those events to
those groups for processing.

fsnotify has a number of benefits, the first being actually shrinking the size
of an inode.  Before fsnotify to support both dnotify and inotify an inode had

        unsigned long           i_dnotify_mask; /* Directory notify events */
        struct dnotify_struct   *i_dnotify; /* for directory notifications */
        struct list_head        inotify_watches; /* watches on this inode */
        struct mutex            inotify_mutex;  /* protects the watches list

But with fsnotify this same functionallity (and more) is done with just

        __u32                   i_fsnotify_mask; /* all events for this inode */
        struct hlist_head       i_fsnotify_mark_entries; /* marks on this inode */

That's right, inotify, dnotify, and fanotify all in 64 bits.  We used that
much space just in inotify_watches alone, before this patch set.

fsnotify object lifetime and locking is MUCH better than what we have today.
inotify locking is incredibly complex.  See 8f7b0ba1c8 as an example of
what's been busted since inception.  inotify needs to know internal semantics
of superblock destruction and unmounting to function.  The inode pinning and
vfs contortions are horrible.

no fsnotify implementers do allocation under locks.  This means things like
f04b30de3 which (due to an overabundance of caution) changes GFP_KERNEL to
GFP_NOFS can be reverted.  There are no longer any allocation rules when using
or implementing your own fsnotify listener.

fsnotify paves the way for fanotify.  In brief fanotify is a notification
mechanism that delivers the lisener both an 'event' and an open file descriptor
to the object in question.  This means that fanotify is pathname agnostic.
Some on lkml may not care for the original companies or users that pushed for
TALPA, but fanotify was designed with flexibility and input for other users in
mind.  The readahead group expressed interest in fanotify as it could be used
to profile disk access on boot without breaking the audit system.  The desktop
search groups have also expressed interest in fanotify as it solves a number
of the race conditions and problems present with managing inotify when more
than a limited number of specific files are of interest.  fanotify can provide
for a userspace access control system which makes it a clean interface for AV
vendors to hook without trying to do binary patching on the syscall table,
LSM, and everywhere else they do their things today.  With this patch series
fanotify can be implemented in less than 1200 lines of easy to review code.
Almost all of which is the socket based user interface.

This patch series builds fsnotify to the point that it can implement
dnotify and inotify_user.  Patches exist and will be sent soon after
acceptance to finish the in kernel inotify conversion (audit) and implement
fanotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
2009-06-11 14:57:52 -04:00
Linus Torvalds
871fa90791 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
  jfs: Add missing mutex_unlock call to error path
  missing unlock in jfs_quota_write()
2009-06-11 11:27:09 -07:00
Linus Torvalds
c9059598ea Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
  block: add request clone interface (v2)
  floppy: fix hibernation
  ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
  fs/bio.c: add missing __user annotation
  block: prevent possible io_context->refcount overflow
  Add serial number support for virtio_blk, V4a
  block: Add missing bounce_pfn stacking and fix comments
  Revert "block: Fix bounce limit setting in DM"
  cciss: decode unit attention in SCSI error handling code
  cciss: Remove no longer needed sendcmd reject processing code
  cciss: change SCSI error handling routines to work with interrupts enabled.
  cciss: separate error processing and command retrying code in sendcmd_withirq_core()
  cciss: factor out fix target status processing code from sendcmd functions
  cciss: simplify interface of sendcmd() and sendcmd_withirq()
  cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
  cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
  block: needs to set the residual length of a bidi request
  Revert "block: implement blkdev_readpages"
  block: Fix bounce limit setting in DM
  Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
  ...

Manually fix conflicts with tracing updates in:
	block/blk-sysfs.c
	drivers/ide/ide-atapi.c
	drivers/ide/ide-cd.c
	drivers/ide/ide-floppy.c
	drivers/ide/ide-tape.c
	include/trace/events/block.h
	kernel/trace/blktrace.c
2009-06-11 11:10:35 -07:00
Linus Torvalds
0a33f80a83 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (25 commits)
  GFS2: Merge gfs2_get_sb into gfs2_get_sb_meta
  GFS2: Fix cache coherency between truncate and O_DIRECT read
  GFS2: Fix locking issue mounting gfs2meta fs
  GFS2: Remove unused variable
  GFS2: smbd proccess hangs with flock() call.
  GFS2: Remove args subdir from gfs2 sysfs files
  GFS2: Remove lockstruct subdir from gfs2 sysfs files
  GFS2: Move gfs2_unlink_ok into ops_inode.c
  GFS2: Move gfs2_readlinki into ops_inode.c
  GFS2: Move gfs2_rmdiri into ops_inode.c
  GFS2: Merge mount.c and ops_super.c into super.c
  GFS2: Clean up some file names
  GFS2: Be more aggressive in reclaiming unlinked inodes
  GFS2: Add a rgrp bitmap full flag
  GFS2: Improve resource group error handling
  GFS2: Don't warn when delete inode fails on ro filesystem
  GFS2: Update docs
  GFS2: Umount recovery race fix
  GFS2: Remove a couple of unused sysfs entries
  GFS2: Add commit= mount option
  ...
2009-06-11 10:36:12 -07:00
Linus Torvalds
ddbb868493 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: remove never-used in6_addr option
  cifs: add addr= mount option alias for ip=
  [CIFS] Add mention of new mount parm (forceuid) to cifs readme
  cifs: make overriding of ownership conditional on new mount options
  cifs: fix IPv6 address length check
  cifs: clean up set_cifs_acl interfaces
  cifs: reorganize get_cifs_acl
  [CIFS] Update readme to indicate change to default mount (serverino)
  cifs: make serverino the default when mounting
  cifs: rename cifs_iget to cifs_root_iget
  cifs: make cnvrtDosUnixTm take a little-endian args and an offset
  cifs: have cifs_NTtimeToUnix take a little-endian arg
  cifs: tighten up default file_mode/dir_mode
  cifs: fix artificial limit on reading symlinks
2009-06-11 10:02:46 -07:00
Linus Torvalds
3296ca27f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (44 commits)
  nommu: Provide mmap_min_addr definition.
  TOMOYO: Add description of lists and structures.
  TOMOYO: Remove unused field.
  integrity: ima audit dentry_open failure
  TOMOYO: Remove unused parameter.
  security: use mmap_min_addr indepedently of security models
  TOMOYO: Simplify policy reader.
  TOMOYO: Remove redundant markers.
  SELinux: define audit permissions for audit tree netlink messages
  TOMOYO: Remove unused mutex.
  tomoyo: avoid get+put of task_struct
  smack: Remove redundant initialization.
  integrity: nfsd imbalance bug fix
  rootplug: Remove redundant initialization.
  smack: do not beyond ARRAY_SIZE of data
  integrity: move ima_counts_get
  integrity: path_check update
  IMA: Add __init notation to ima functions
  IMA: Minimal IMA policy and boot param for TCB IMA policy
  selinux: remove obsolete read buffer limit from sel_read_bool
  ...
2009-06-11 10:01:41 -07:00
Linus Torvalds
e893123c73 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (49 commits)
  ext4: Avoid corrupting the uninitialized bit in the extent during truncate
  ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
  ext4: fix dx_map_entry to support 256k directory blocks
  ext4: truncate the file properly if we fail to copy data from userspace
  ext4: Avoid leaking blocks after a block allocation failure
  ext4: Change all super.c messages to print the device
  ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
  ext4: super.c whitespace cleanup
  jbd2: Fix minor typos in comments in fs/jbd2/journal.c
  ext4: Clean up calls to ext4_get_group_desc()
  ext4: remove unused function __ext4_write_dirty_metadata
  ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
  ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
  ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
  ext4: down i_data_sem only for read when walking tree for fiemap
  ext4: Add a comprehensive block validity check to ext4_get_blocks()
  ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
  ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
  ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
  ext4: Add documentation to the ext4_*get_block* functions
  ...
2009-06-11 10:00:50 -07:00
Catalin Marinas
2e1483c995 kmemleak: Remove some of the kmemleak false positives
There are allocations for which the main pointer cannot be found but
they are not memory leaks. This patch fixes some of them. For more
information on false positives, see Documentation/kmemleak.txt.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2009-06-11 17:04:18 +01:00
Linus Torvalds
49c355617f Merge branch 'serial-from-alan'
* serial-from-alan: (79 commits)
  moxa: prevent opening unavailable ports
  imx: serial: use tty_encode_baud_rate to set true rate
  imx: serial: add IrDA support to serial driver
  imx: serial: use rational library function
  lib: isolate rational fractions helper function
  imx: serial: handle initialisation failure correctly
  imx: serial: be sure to stop xmit upon shutdown
  imx: serial: notify higher layers in case xmit IRQ was not called
  imx: serial: fix one bit field type
  imx: serial: fix whitespaces (no changes in functionality)
  tty: use prepare/finish_wait
  tty: remove sleep_on
  sierra: driver interface blacklisting
  sierra: driver urb handling improvements
  tty: resolve some sierra breakage
  timbuart: Fix the termios logic
  serial: Added Timberdale UART driver
  tty: Add URL for ttydev queue
  devpts: unregister the file system on error
  tty: Untangle termios and mm mutex dependencies
  ...
2009-06-11 08:57:47 -07:00
Ingo Molnar
940010c5a3 Merge branch 'linus' into perfcounters/core
Conflicts:
	arch/x86/kernel/irqinit.c
	arch/x86/kernel/irqinit_64.c
	arch/x86/kernel/traps.c
	arch/x86/mm/fault.c
	include/linux/sched.h
	kernel/exit.c
2009-06-11 17:55:42 +02:00
Alan Cox
93d5581e20 devpts: unregister the file system on error
Closes-bug: http://bugzilla.kernel.org/show_bug.cgi?id=13429

Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-11 08:51:06 -07:00
Chris Mason
b263c2c8bf Btrfs: fix extent_buffer leak during tree log replay
During tree log replay, we read in the tree log roots,
process them and then free them.  A recent change
takes an extra reference on the root node of the tree
when the root is read in, and stores that reference
in root->commit_root.

This reference was not being freed, leaving us with
one buffer pinned in ram for each subvol with
a tree log root after a crash.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:24:47 -04:00
Chris Mason
0b4dcea579 Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
This happens during subvol creation.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 11:13:35 -04:00
Chris Mason
067c28adc5 Btrfs: fix -o nodatasum printk spelling
It was printing nodatacsum, which was not the correct option name.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 09:30:13 -04:00
Yan Zheng
85d4198e40 Btrfs: check duplicate backrefs for both data and metadata
lookup_inline_extent_backref only checks for duplicate backref for data
extents. It assumes backrefs for tree block never conflict.

This patch makes lookup_inline_extent_backref check for duplicate backrefs
for both data and tree block, so that we can detect potential bug earlier.
This is a safety check, strictly speaking it is not required.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-11 08:51:34 -04:00
Linus Torvalds
8623661180 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
  Revert "x86, bts: reenable ptrace branch trace support"
  tracing: do not translate event helper macros in print format
  ftrace/documentation: fix typo in function grapher name
  tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
  tracing: add protection around module events unload
  tracing: add trace_seq_vprint interface
  tracing: fix the block trace points print size
  tracing/events: convert block trace points to TRACE_EVENT()
  ring-buffer: fix ret in rb_add_time_stamp
  ring-buffer: pass in lockdep class key for reader_lock
  tracing: add annotation to what type of stack trace is recorded
  tracing: fix multiple use of __print_flags and __print_symbolic
  tracing/events: fix output format of user stack
  tracing/events: fix output format of kernel stack
  tracing/trace_stack: fix the number of entries in the header
  ring-buffer: discard timestamps that are at the start of the buffer
  ring-buffer: try to discard unneeded timestamps
  ring-buffer: fix bug in ring_buffer_discard_commit
  ftrace: do not profile functions when disabled
  tracing: make trace pipe recognize latency format flag
  ...
2009-06-10 19:53:40 -07:00
James Morris
73fbad283c Merge branch 'next' into for-linus 2009-06-11 11:03:14 +10:00
Shin Hong
fd0fb038d5 Btrfs: init worker struct fields before kthread-run
This patch fixes a bug which may result race condition
between btrfs_start_workers() and worker_loop().

btrfs_start_workers() executed in a parent thread writes
on workers->worker and worker_loop() in a child thread
reads workers->worker. However, there is no synchronization
enforcing the order of two operations.

This patch makes btrfs_start_workers() fill workers->worker
before it starts a child thread with worker_loop()

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 20:11:29 -04:00
Linus Torvalds
99e97b860e Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sched: fix typo in sched-rt-group.txt file
  ftrace: fix typo about map of kernel priority in ftrace.txt file.
  sched: properly define the sched_group::cpumask and sched_domain::span fields
  sched, timers: cleanup avenrun users
  sched, timers: move calc_load() to scheduler
  sched: Don't export sched_mc_power_savings on multi-socket single core system
  sched: emit thread info flags with stack trace
  sched: rt: document the risk of small values in the bandwidth settings
  sched: Replace first_cpu() with cpumask_first() in ILB nomination code
  sched: remove extra call overhead for schedule()
  sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus())
  wait: don't use __wake_up_common()
  sched: Nominate a power-efficient ilb in select_nohz_balancer()
  sched: Nominate idle load balancer from a semi-idle package.
  sched: remove redundant hierarchy walk in check_preempt_wakeup
2009-06-10 15:32:59 -07:00
Michal Simek
0e0c62123b fs/bio.c: add missing __user annotation
As reported by sparse:

fs/bio.c:720:13: warning: incorrect type in assignment (different address spaces)
fs/bio.c:720:13:    expected char *iov_addr
fs/bio.c:720:13:    got void [noderef] <asn:1>*
fs/bio.c:724:36: warning: incorrect type in argument 2 (different address spaces)
fs/bio.c:724:36:    expected void const [noderef] <asn:1>*from
fs/bio.c:724:36:    got char *iov_addr

Signed-off-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-10 23:07:15 +02:00
Hisashi Hifumi
4eedeb75e7 Btrfs: pin buffers during write_dev_supers
write_dev_supers is called in sequence.  First is it called with wait == 0,
which starts IO on all of the super blocks for a given device.  Then it is
called with wait == 1 to make sure they all reach the disk.

It doesn't currently pin the buffers between the two calls, and it also
assumes the buffers won't go away between the two calls, leading to
an oops if the VM manages to free the buffers in the middle of the sync.

This fixes that assumption and updates the code to return an error if things
are not up to date when the wait == 1 run is done.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 16:49:25 -04:00
Chris Mason
e5e9a5206a Btrfs: avoid races between super writeout and device list updates
On multi-device filesystems, btrfs writes supers to all of the devices
before considering a sync complete.  There wasn't any additional
locking between super writeout and the device list management code
because device management was done inside a transaction and
super writeout only happened  with no transation writers running.

With the btrfs fsync log and other async transaction updates, this
has been racey for some time.  This adds a mutex to protect
the device list.  The existing volume mutex could not be reused due to
transaction lock ordering requirements.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 15:17:02 -04:00
Jeff Layton
61b6bc525a cifs: remove never-used in6_addr option
This option was never used to my knowledge. Remove it before someone
does...

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-10 18:34:35 +00:00
Aneesh Kumar K.V
a41f207169 ext4: Avoid corrupting the uninitialized bit in the extent during truncate
The unitialized bit was not properly getting preserved in in an extent
which is partially truncated because the it was geting set to the
value of the first extent to be removed or truncated as part of the
truncate operation, and if there are multiple extents are getting
removed or modified as part of the truncate operation, it is only the
last extent which will might be partially truncated, and its
uninitalized bit is not necessarily the same as the first extent to be
truncated.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-10 14:22:55 -04:00
Jeff Layton
58f7f68f22 cifs: add addr= mount option alias for ip=
When you look in /proc/mounts, the address of the server gets displayed
as "addr=". That's really a better option to use anyway since it's more
generic. What if we eventually want to support non-IP transports? It
also makes CIFS option consistent with the NFS option of the same name.

Begin the migration to that option name by adding an alias for ip=
called addr=.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-10 15:39:14 +00:00
Al Viro
7df336ec12 Fix btrfs when ACLs are configured out
... otherwise generic_permission() will allow *anything* for all
files you don't own and that have some group permissions.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:36:43 -04:00
Hisashi Hifumi
524724ed1f Btrfs: fdatasync should skip metadata writeout
In btrfs, fdatasync and fsync are identical, but
fdatasync should skip committing transaction when
inode->i_state is set just I_DIRTY_SYNC and this indicates
only atime or/and mtime updates.
Following patch improves fdatasync throughput.

--file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
--file-fsync-mode=fdatasync run

Results:
-2.6.30-rc8
Test execution summary:
    total time:                          1980.6540s
    total number of events:              10001
    total time taken by event execution: 1192.9804
    per-request statistics:
         min:                            0.0000s
         avg:                            0.1193s
         max:                            15.3720s
         approx.  95 percentile:         0.7257s

Threads fairness:
    events (avg/stddev):           625.0625/151.32
    execution time (avg/stddev):   74.5613/9.46

-2.6.30-rc8-patched
Test execution summary:
    total time:                          1695.9118s
    total number of events:              10000
    total time taken by event execution: 871.3214
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0871s
         max:                            10.4644s
         approx.  95 percentile:         0.4787s

Threads fairness:
    events (avg/stddev):           625.0000/131.86
    execution time (avg/stddev):   54.4576/8.98

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
David Woodhouse
163e783e6a Btrfs: remove crc32c.h and use libcrc32c directly.
There's no need to preserve this abstraction; it used to let us use
hardware crc32c support directly, but libcrc32c is already doing that for us
through the crypto API -- so we're already using the Intel crc32c
acceleration where appropriate.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:53 -04:00
Christoph Hellwig
6cbff00f46 Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
Add support for the standard attributes set via chattr and read via
lsattr.  Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.

Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.

Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
c289811cc0 Btrfs: autodetect SSD devices
During mount, btrfs will check the queue nonrot flag
for all the devices found in the FS.  If they are all
non-rotating, SSD mode is enabled by default.

If the FS was mounted with -o nossd, the non-rotating
flag is ignored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
451d7585a8 Btrfs: add mount -o ssd_spread to spread allocations out
Some SSDs perform best when reusing block numbers often, while
others perform much better when clustering strictly allocates
big chunks of unused space.

The default mount -o ssd will find rough groupings of blocks
where there are a bunch of free blocks that might have some
allocated blocks mixed in.

mount -o ssd_spread will make sure there are no allocated blocks
mixed in.  It should perform better on lower end SSDs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:52 -04:00
Chris Mason
c604480171 Btrfs: avoid allocation clusters that are too spread out
In SSD mode for data, and all the time for metadata the allocator
will try to find a cluster of nearby blocks for allocations.  This
commit adds extra checks to make sure that each free block in the
cluster is close to the last one.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:51 -04:00
Chris Mason
3b30c22f64 Btrfs: Add mount -o nossd
This allows you to turn off the ssd mode via remount.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:50 -04:00
Chris Mason
d644d8a1e3 Btrfs: avoid IO stalls behind congested devices in a multi-device FS
The btrfs IO submission threads try to service a bunch of devices with a small
number of threads.  They do a congestion check to try and avoid waiting
on requests for a busy device.

The checks make sure we've sent a few requests down to a given device just so
that we aren't bouncing between busy devices without actually sending down
any IO.  The counter used to decide if we can switch to the next device
is somewhat overloaded.  It is also being used to decide if we've done
a good batch of requests between the WRITE_SYNC or regular priority lists.
It may get reset to zero often, leaving us hammering on a busy device
instead of moving on to another disk.

This commit adds a new counter for the number of bios sent while
servicing a device.  It doesn't get reset or fiddled with.  On
multi-device filesystems, this fixes IO stalls in streaming
write workloads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:49 -04:00
Chris Mason
d84275c938 Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
Btrfs uses dedicated threads to submit bios when checksumming is on,
which allows us to make sure the threads dedicated to checksumming don't get
stuck waiting for requests.  For each btrfs device, there are
two lists of bios.  One list is for WRITE_SYNC bios and the other
is for regular priority bios.

The IO submission threads used to process all of the WRITE_SYNC bios first and
then switch to the regular bios.  This commit makes sure we don't completely
starve the regular bios by rotating between the two lists.

WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries
to run in batches to avoid seeking.  Benchmarking shows this eliminates
stalls during streaming buffered writes on both multi-device and
single device filesystems.

If the regular bios starve, the system can end up with a large amount of ram
pinned down in writeback pages.  If we are a little more fair between the two
classes, we're able to keep throughput up and make progress on the bulk of
our dirty ram.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:49 -04:00
Chris Mason
585ad2c379 Btrfs: fix metadata dirty throttling limits
Once a metadata block has been written, it must be recowed, so the
btrfs dirty balancing call has a check to make sure a fair amount of metadata
was actually dirty before it started writing it back to disk.

A previous commit had changed the dirty tracking for metadata without
updating the btrfs dirty balancing checks.  This commit switches it
to use the correct counter.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:48 -04:00
Chris Mason
2c943de6ad Btrfs: reduce mount -o ssd CPU usage
The block allocator in SSD mode will try to find groups of free blocks
that are close together.  This commit makes it loop less on a given
group size before bumping it.

The end result is that we are less likely to fill small holes in the
available free space, but we don't waste as much CPU building the
large cluster used by ssd mode.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:48 -04:00
Chris Mason
cfbb930846 Btrfs: balance btree more often
With the new back reference code, the cost of a balance has gone down
in terms of the number of back reference updates done.  This commit
makes us more aggressively balance leaves and nodes as they become
less full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Chris Mason
b361242102 Btrfs: stop avoiding balancing at the end of the transaction.
When the delayed reference code was added, some checks were added
to avoid extra balancing while the delayed references were being flushed.
This made for less efficient btrees, but it reduced the chances of
loops where no forward progress was made because the balances made
more delayed ref updates.

With the new dead root removal code and the mixed back references,
the extent allocation tree is no longer using precise back refs, and
the delayed reference updates don't carry the risk of looping forever
anymore.  So, the balance avoidance is no longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:47 -04:00
Yan Zheng
5d4f98a28c Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.

When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one.  At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.

The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root.  This commit reduces the
transaction overhead by avoiding the need for dead root records.

When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.

This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.

We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.

This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.

This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.

This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.

The improved balancing code scales significantly better with a large
number of snapshots.

This is a very large commit and was written in a number of
pieces.  But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Yan Zheng
5c939df56c btrfs: Fix set/clear_extent_bit for 'end == (u64)-1'
There are some 'start = state->end + 1;' like code in set_extent_bit
and clear_extent_bit. They overflow when end == (u64)-1.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 11:29:46 -04:00
Christoph Hellwig
ef14f0c157 xfs: use generic Posix ACL code
This patch rips out the XFS ACL handling code and uses the generic
fs/posix_acl.c code instead.  The ondisk format is of course left
unchanged.

This also introduces the same ACL caching all other Linux filesystems do
by adding pointers to the acl and default acl in struct xfs_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-10 17:07:47 +02:00
Ryusuke Konishi
c3a7abf06c nilfs2: support contiguous lookup of blocks
Although get_block() callback function can return extent of contiguous
blocks with bh->b_size, nilfs_get_block() function did not support
this feature.

This adds contiguous lookup feature to the block mapping codes of
nilfs, and allows the nilfs_get_blocks() function to return the extent
information by applying the feature.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:12 +09:00
Ryusuke Konishi
fa032744ad nilfs2: add sync_page method to page caches of meta data
This applies block_sync_page() function to the sync_page method of
page caches for meta data files, gc page caches, and btree node
buffers.  This is a companion patch of ("nilfs2: enable sync_page
mothod") which applied the function for data pages.

This allows lock_page() for those meta data to unplug pending bio
requests.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:12 +09:00
Ryusuke Konishi
a53b4751ae nilfs2: use device's backing_dev_info for btree node caches
Previously, default_backing_dev_info was used for the mapping of btree
node caches.  This uses device dependent backing_dev_info to allow
detailed control of the device for the btree node pages.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:12 +09:00
Ryusuke Konishi
30c25be71f nilfs2: return EBUSY against delete request on snapshot
This helps userland programs like the rmcp command to distinguish
error codes returned against a checkpoint removal request.

Previously -EPERM was returned, and not discriminable from real
permission errors.  This also allows removal of the latest checkpoint
because the deletion leads to create a new checkpoint, and thus it's
harmless for the filesystem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:12 +09:00
Ryusuke Konishi
e85dc1d529 nilfs2: enable sync_page method
This adds a missing sync_page method which unplugs bio requests when
waiting for page locks. This will improve read performance of nilfs.

Here is a measurement result using dd command.

Without this patch:

 # mount -t nilfs2 /dev/sde1 /test
 # dd if=/test/aaa of=/dev/null bs=512k
 1024+0 records in
 1024+0 records out
 536870912 bytes (537 MB) copied, 6.00688 seconds, 89.4 MB/s

With this patch:

 # mount -t nilfs2 /dev/sde1 /test
 # dd if=/test/aaa of=/dev/null bs=512k
 1024+0 records in
 1024+0 records out
 536870912 bytes (537 MB) copied, 3.54998 seconds, 151 MB/s

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
30bda0b8ae nilfs2: set bio unplug flag for the last bio in segment
This sets BIO_RW_UNPLUG flag on the last bio of each segment during
write.  The last bio should be unplugged immediately because the
caller waits for the completion after the submission.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
003ff182fd nilfs2: allow future expansion of metadata read out via get info ioctl
Nilfs has some ioctl commands to read out metadata from meta data
files:

 - NILFS_IOCTL_GET_CPINFO for checkpoint file,
 - NILFS_IOCTL_GET_SUINFO for segment usage file, and
 - NILFS_IOCTL_GET_VINFO for Disk Address Transalation (DAT) file,
   respectively.

Every routine on these metadata files is implemented so that it allows
future expansion of on-disk format.  But, the above ioctl commands do
not support expansion even though nilfs_argv structure can handle
arbitrary size for data exchanged via ioctl.

This allows future expansion of the following structures which give
basic format of the "get information" ioctls:

 - struct nilfs_cpinfo
 - struct nilfs_suinfo
 - struct nilfs_vinfo

So, this introduces forward compatility of such ioctl commands.

In this patch, a sanity check in nilfs_ioctl_get_info() function is
changed to accept larger data structure [1], and metadata read
routines are rewritten so that they become compatible for larger
structures; the routines will just ignore the remaining fields which
the current version of nilfs doesn't know.

[1] The ioctl function already has another upper limit (PAGE_SIZE
    against a structure, which appears in nilfs_ioctl_wrap_copy
    function), and this will not cause security problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Hisashi Hifumi
258ef67e24 NILFS2: Pagecache usage optimization on NILFS2
Hi,

I introduced "is_partially_uptodate" aops for NILFS2.

A page can have multiple buffers and even if a page is not uptodate, some buffers
can be uptodate on pagesize != blocksize environment.
This aops checks that all buffers which correspond to a part of a file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to HDD even if a page is not uptodate because the portion we
want to read are uptodate.
"block_is_partially_uptodate" function is already used by ext2/3/4.
With the following patch random read/write mixed workloads or random read after
random write workloads can be optimized and we can get performance improvement.

I did a performance test using the sysbench.

1 --file-block-size=8K --file-total-size=2G --file-test-mode=rndrw --file-fsync-freq=0 --fil
e-rw-ratio=1 run

-2.6.30-rc5

Test execution summary:
    total time:                          151.2907s
    total number of events:              200000
    total time taken by event execution: 2409.8387
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0120s
         max:                            0.9306s
         approx.  95 percentile:         0.0439s

Threads fairness:
    events (avg/stddev):           12500.0000/238.52
    execution time (avg/stddev):   150.6149/0.01

-2.6.30-rc5-patched

Test execution summary:
    total time:                          140.8828s
    total number of events:              200000
    total time taken by event execution: 2240.8577
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0112s
         max:                            0.8750s
         approx.  95 percentile:         0.0418s

Threads fairness:
    events (avg/stddev):           12500.0000/218.43
    execution time (avg/stddev):   140.0536/0.01

arch: ia64
pagesize: 16k

Thanks.

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
7cde31d7d6 nilfs2: remove nilfs_btree_operations from btree mapping
will remove indirect function calls using nilfs_btree_operations
table.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
355c6b6103 nilfs2: remove nilfs_direct_operations from direct mapping
will remove indirect function calls using nilfs_direct_operations
table.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:11 +09:00
Ryusuke Konishi
d4b961576d nilfs2: remove bmap pointer operations
Previously, the bmap codes of nilfs used three types of function
tables.  The abuse of indirect function calls decreased source
readability and suffered many indirect jumps which would confuse
branch prediction of processors.

This eliminates one type of the function tables,
nilfs_bmap_ptr_operations, which was used to dispatch low level
pointer operations of the nilfs bmap.

This adds a new integer variable "b_ptr_type" to nilfs_bmap struct,
and uses the value to select the pointer operations.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
3033342a0b nilfs2: remove useless b_low and b_high fields from nilfs_bmap struct
This will cut off 16 bytes from the nilfs_bmap struct which is
embedded in the on-memory inode of nilfs.

The b_high field was never used, and the b_low field stores a constant
value which can be determined by whether the inode uses btree for
block mapping or not.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
e473c1f265 nilfs2: remove pointless NULL check of bpop_commit_alloc_ptr function
This indirect function is set to NULL only for gc cache inodes, but
the gc cache inodes never call this function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
f198dbb9cf nilfs2: move get block functions in bmap.c into btree codes
Two get block function for btree nodes, nilfs_bmap_get_block() and
nilfs_bmap_get_new_block(), are called only from the btree codes.
This relocation will increase opportunities of compiler optimization.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
9f098900ad nilfs2: remove nilfs_bmap_delete_block
nilfs_bmap_delete_block() is a wrapper function calling
nilfs_btnode_delete().  This removes it for simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
087d01b425 nilfs2: remove nilfs_bmap_put_block
nilfs_bmap_put_block() is a wrapper function calling brelse().  This
eliminates the wrapper for simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:10 +09:00
Ryusuke Konishi
654137dd46 nilfs2: remove header file for segment list operations
This will eliminate obsolete list operations of nilfs_segment_entry
structure which has been used to handle mutiple segment numbers.

The patch ("nilfs2: remove list of freeing segments") removed use of
the structure from the segment constructor code, and this patch
simplifies the remaining code by integrating it into recovery.c.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Ryusuke Konishi
071cb4b819 nilfs2: eliminate removal list of segments
This will clean up the removal list of segments and the related
functions from segment.c and ioctl.c, which have hurt code
readability.

This elimination is applied by using nilfs_sufile_updatev() previously
introduced in the patch ("nilfs2: add sufile function that can modify
multiple segment usages").

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Ryusuke Konishi
dda54f4b87 nilfs2: add sufile function that can modify multiple segment usages
This is a preparation for the later cleanup patch ("nilfs2: remove
list of freeing segments").

This adds nilfs_sufile_updatev() to sufile, which can modify multiple
segment usages at a time.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Ryusuke Konishi
d97a51a7e3 nilfs2: unify bmap operations starting use of indirect block address
This simplifies some low level functions of bmap.

Three bmap pointer operations, nilfs_bmap_start_v(),
nilfs_bmap_commit_v(), and nilfs_bmap_abort_v(), are unified into one
nilfs_bmap_start_v() function. And the related indirect function calls
are replaced with it.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Ryusuke Konishi
6582207064 nilfs2: remove nilfs_dat_prepare_free function
This function is unused.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-06-10 23:41:09 +09:00
Boaz Harrosh
fc2fac5b5f [SCSI] libosd: Define an osd_dev wrapper to retrieve the request_queue
libosd users that need to work with bios, must sometime use
the request_queue associated with the osd_dev. Make a wrapper for
that, and convert all in-tree users.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10 09:00:13 -05:00
Boaz Harrosh
62f469b596 [SCSI] libosd: osd_req_{read,write} takes a length parameter
For supporting of chained-bios we can not inspect the first
bio only, as before. Caller shall pass the total length of the
request, ie. sum_bytes(bio-chain).

Also since the bio might be a chain we don't set it's direction
on behalf of it's callers. The bio direction should be properly
set prior to this call. So fix a couple of write users that now
need to set the bio direction properly

[In this patch I change both library code and user sites at
 exofs, to make it easy on integration. It should be submitted
 via James's scsi-misc tree.]

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10 08:59:52 -05:00
Boaz Harrosh
0e35afbc8b [SCSI] libosd: osd_req_{read,write}_kern new API
By popular demand, define usefull wrappers for osd_req_read/write
that recieve kernel pointers. All users had their own.

Also remove these from exofs

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
2009-06-10 08:57:07 -05:00
Steven Whitehouse
003dec8913 GFS2: Merge gfs2_get_sb into gfs2_get_sb_meta
These don't need to be separate functions.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-10 10:31:45 +01:00
Steven Whitehouse
40bc9a27e0 GFS2: Fix cache coherency between truncate and O_DIRECT read
If a page was partially zeroed as the result of a truncate, then it was
not being correctly marked dirty. This resulted in the deleted data
reappearing if the file was read back via direct I/O.

Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-10 09:09:40 +01:00
Jan Kara
a61d90d75d jbd: fix race in buffer processing in commit code
In commit code, we scan buffers attached to a transaction.  During this
scan, we sometimes have to drop j_list_lock and then we recheck whether
the journal buffer head didn't get freed by journal_try_to_free_buffers().
 But checking for buffer_jbd(bh) isn't enough because a new journal head
could get attached to our buffer head.  So add a check whether the journal
head remained the same and whether it's still at the same transaction and
list.

This is a nasty bug and can cause problems like memory corruption (use after
free) or trigger various assertions in JBD code (observed).

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-09 16:59:03 -07:00
Ian Kent
463aea1a1c autofs4: remove hashed check in validate_wait()
The recent ->lookup() deadlock correction required the directory inode
mutex to be dropped while waiting for expire completion.  We were
concerned about side effects from this change and one has been identified.

I saw several error messages.

They cause autofs to become quite confused and don't really point to the
actual problem.

Things like:

handle_packet_missing_direct:1376: can't find map entry for (43,1827932)

which is usually totally fatal (although in this case it wouldn't be
except that I treat is as such because it normally is).

do_mount_direct: direct trigger not valid or already mounted
/test/nested/g3c/s1/ss1

which is recoverable, however if this problem is at play it can cause
autofs to become quite confused as to the dependencies in the mount tree
because mount triggers end up mounted multiple times.  It's hard to
accurately check for this over mounting case and automount shouldn't need
to if the kernel module is doing its job.

There was one other message, similar in consequence of this last one but I
can't locate a log example just now.

When checking if a mount has already completed prior to adding a new mount
request to the wait queue we check if the dentry is hashed and, if so, if
it is a mount point.  But, if a mount successfully completed while we
slept on the wait queue mutex the dentry must exist for the mount to have
completed so the test is not really needed.

Mounts can also be done on top of a global root dentry, so for the above
case, where a mount request completes and the wait queue entry has already
been removed, the hashed test returning false can cause an incorrect
callback to the daemon.  Also, d_mountpoint() is not sufficient to check
if a mount has completed for the multi-mount case when we don't have a
real mount at the base of the tree.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-09 16:59:03 -07:00
Hisashi Hifumi
e04cc15f52 ocfs2: fdatasync should skip unimportant metadata writeout
In ocfs2, fdatasync and fsync are identical.
I think fdatasync should skip committing transaction when
inode->i_state is set just I_DIRTY_SYNC and this indicates
only atime or/and mtime updates.
Following patch improves fdatasync throughput.

#sysbench --num-threads=16 --max-requests=300000 --test=fileio
--file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
--file-fsync-mode=fdatasync run

Results:
-2.6.30-rc8
Test execution summary:
    total time:                          107.1445s
    total number of events:              119559
    total time taken by event execution: 116.1050
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0010s
         max:                            0.1220s
         approx.  95 percentile:         0.0016s

Threads fairness:
    events (avg/stddev):           7472.4375/303.60
    execution time (avg/stddev):   7.2566/0.64

-2.6.30-rc8-patched
Test execution summary:
    total time:                          86.8529s
    total number of events:              300016
    total time taken by event execution: 24.3077
    per-request statistics:
         min:                            0.0000s
         avg:                            0.0001s
         max:                            0.0336s
         approx.  95 percentile:         0.0001s

Threads fairness:
    events (avg/stddev):           18751.0000/718.75
    execution time (avg/stddev):   1.5192/0.05

Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-09 10:45:47 -07:00
Li Zefan
55782138e4 tracing/events: convert block trace points to TRACE_EVENT()
TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
these new capabilities to this tracepoint:

  - zero-copy and per-cpu splice() tracing
  - binary tracing without printf overhead
  - structured logging records exposed under /debug/tracing/events
  - trace events embedded in function tracer output and other plugins
  - user-defined, per tracepoint filter expressions
  ...

Cons:

  - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the deivce from a request queue.
    But this may change in the future.

  - A packet command is converted to a string in TP_assign, not TP_print.
    While blktrace do the convertion just before output.

    Since pc requests should be rather rare, this is not a big issue.

  - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

      dd                   dd + ioctl blktrace       dd + TRACE_EVENT (splice)
1     7.36s, 42.7 MB/s     7.50s, 42.0 MB/s          7.41s, 42.5 MB/s
2     7.43s, 42.3 MB/s     7.48s, 42.1 MB/s          7.43s, 42.4 MB/s
3     7.38s, 42.6 MB/s     7.45s, 42.2 MB/s          7.41s, 42.5 MB/s

So the overhead of tracing is very small, and no regression when using
those trace events vs blktrace.

And the binary output of TRACE_EVENT is much smaller than blktrace:

 # ls -l -h
 -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

Following are some comparisons between TRACE_EVENT and blktrace:

plug:
  kjournald-480   [000]   303.084981: block_plug: [kjournald]
  kjournald-480   [000]   303.084981:   8,0    P   N [kjournald]

unplug_io:
  kblockd/0-118   [000]   300.052973: block_unplug_io: [kblockd/0] 1
  kblockd/0-118   [000]   300.052974:   8,0    U   N [kblockd/0] 1

remap:
  kjournald-480   [000]   303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384
  kjournald-480   [000]   303.085043:   8,0    A   W 102736992 + 8 <- (8,8) 33384

bio_backmerge:
  kjournald-480   [000]   303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald]
  kjournald-480   [000]   303.085086:   8,0    M   W 102737032 + 8 [kjournald]

getrq:
  kjournald-480   [000]   303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084975:   8,0    G   W 102736984 + 8 [kjournald]

  bash-2066  [001]  1072.953770:   8,0    G   N [bash]
  bash-2066  [001]  1072.953773: block_getrq: 0,0 N 0 + 0 [bash]

rq_complete:
  konsole-2065  [001]   300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0]
  konsole-2065  [001]   300.053191:   8,0    C   W 103669040 + 16 [0]

  ksoftirqd/1-7   [001]  1072.953811:   8,0    C   N (5a 00 08 00 00 00 00 00 24 00) [0]
  ksoftirqd/1-7   [001]  1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0]

rq_insert:
  kjournald-480   [000]   303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald]
  kjournald-480   [000]   303.084986:   8,0    I   W 102736984 + 8 [kjournald]

Changelog from v2 -> v3:

- use the newly introduced __dynamic_array().

Changelog from v1 -> v2:

- use __string() instead of __array() to minimize the memory required
  to store hex dump of rq->cmd().

- support large pc requests.

- add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

- some cleanups.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-06-09 12:34:23 -04:00
Theodore Ts'o
0eab928221 ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
If a non-existent file is opened via O_WRONLY|O_CREAT|O_TRUNC, there's
no need to treat this as a true file truncation, so we shouldn't
activate the replace-via-truncate hueristic.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09 09:54:40 -04:00
Tejun Heo
151060ac13 CUSE: implement CUSE - Character device in Userspace
CUSE enables implementing character devices in userspace.  With recent
additions of ioctl and poll support, FUSE already has most of what's
necessary to implement character devices.  All CUSE has to do is
bonding all those components - FUSE, chardev and the driver model -
nicely.

When client opens /dev/cuse, kernel starts conversation with
CUSE_INIT.  The client tells CUSE which device it wants to create.  As
the previous patch made fuse_file usable without associated
fuse_inode, CUSE doesn't create super block or inodes.  It attaches
fuse_file to cdev file->private_data during open and set ff->fi to
NULL.  The rest of the operation is almost identical to FUSE direct IO
case.

Each CUSE device has a corresponding directory /sys/class/cuse/DEVNAME
(which is symlink to /sys/devices/virtual/class/DEVNAME if
SYSFS_DEPRECATED is turned off) which hosts "waiting" and "abort"
among other things.  Those two files have the same meaning as the FUSE
control files.

The only notable lacking feature compared to in-kernel implementation
is mmap support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-06-09 11:24:11 +02:00
James Morris
0b4ec6e4e0 Merge branch 'master' into next 2009-06-09 09:27:53 +10:00
Toshiyuki Okajima
9aee228607 ext4: fix dx_map_entry to support 256k directory blocks
The dx_map_entry structure doesn't support over 64KB block size by
current usage of its member("offs"). Because "offs" treats an offset
of copies of the ext4_dir_entry_2 structure as is. This member size is
16 bits. But real offset for over 64KB(256KB) block size needs 18
bits. However, real offset keeps 4 byte boundary, so lower 2 bits is
not used.

Therefore, we do the following to fix this limitation:
For "store": 
	we divide the real offset by 4 and then store this result to "offs" 
	member.
For "use":
	we multiply "offs" member by 4 and then use this result 
	as real offset.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-08 12:41:35 -04:00
Christoph Hellwig
8b5403a6d7 xfs: remove SYNC_BDFLUSH
SYNC_BDFLUSH is a leftover from IRIX and rather misnamed for todays
code.  Make xfs_sync_fsdata and xfs_dq_sync use the SYNC_TRYLOCK flag
for not blocking on logs just as the inode sync code already does.

For xfs_sync_fsdata it's a trivial 1:1 replacement, but for xfs_qm_sync
I use the opportunity to decouple the non-blocking lock case from the
different flushing modes, similar to the inode sync code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:37:16 +02:00
Christoph Hellwig
b0710ccc6d xfs: remove SYNC_IOWAIT
We want to wait for all I/O to finish when we do data integrity syncs.  So
there is no reason to keep SYNC_WAIT separate from SYNC_IOWAIT.  This
causes a little change in behaviour for the ENOSPC flushing code which now
does a second submission and wait of buffered I/O, but that should finish
ASAP as we already did an asynchronous writeout earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:37:11 +02:00
Christoph Hellwig
075fe10286 xfs: split xfs_sync_inodes
xfs_sync_inodes is used to write back either file data or inode metadata.
In general we always do these separately, except for one fishy case in
xfs_fs_put_super that does both.  So separate xfs_sync_inodes into
separate xfs_sync_data and xfs_sync_attr functions.  In xfs_fs_put_super
we first call the data sync and then the attr sync as that was the previous
order.  The moved log force in that path doesn't make a difference because
we will force the log again as part of the real unmount process.

The filesystem readonly checks are not performed by the new function but
instead moved into the callers, given that most callers alredy have it
further up in the stack.  Also add debug checks that we do not pass in
incorrect flags in the new xfs_sync_data and xfs_sync_attr function and
fix the one place that did pass in a wrong flag.

Also remove a comment mentioning xfs_sync_inodes that has been incorrect
for a while because we always take either the iolock or ilock in the
sync path these days.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:48 +02:00
Christoph Hellwig
fe588ed328 xfs: use generic inode iterator in xfs_qm_dqrele_all_inodes
Use xfs_inode_ag_iterator instead of opencoding the inode walk in the
quota code.  Mark xfs_inode_ag_iterator and xfs_sync_inode_valid non-static
to allow using them from the quota code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:27 +02:00
Dave Chinner
75f3cb1393 xfs: introduce a per-ag inode iterator
Given that we walk across the per-ag inode lists so often, it makes sense to
introduce an iterator for this.

Convert the sync and reclaim code to use this new iterator, quota code will
follow in the next patch.

Also change xfs_reclaim_inode to return -EGAIN instead of 1 for an inode
already under reclaim.  This simplifies the AG iterator and doesn't
matter for the only other caller.

[hch: merged the lookup and execute callbacks back into one to get the
 pag_ici_lock locking correct and simplify the code flow]

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:14 +02:00
Dave Chinner
abc1064742 xfs: remove unused parameter from xfs_reclaim_inodes
The noblock parameter of xfs_reclaim_inodes is only ever set to zero. Remove
it and all the conditional code that is never executed.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:12 +02:00
Dave Chinner
1da8eecab5 xfs: factor out inode validation for sync
Separate the validation of inodes found by the radix
tree walk from the radix tree lookup.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:07 +02:00
Christoph Hellwig
845b6d0cbb xfs: split inode flushing from xfs_sync_inodes_ag
In many cases we only want to sync inode metadata. Split out the inode
flushing into a separate helper to prepare factoring the inode sync code.

Based on a patch from Dave Chinner, but redone to keep the current behaviour
exactly and leave changes to the flushing logic to another patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:05 +02:00
Dave Chinner
5a34d5cd09 xfs: split inode data writeback from xfs_sync_inodes_ag
In many cases we only want to sync inode data. Start spliting the inode sync
into data sync and inode sync by factoring out the inode data flush.

[hch: minor cleanups]

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:35:03 +02:00
Christoph Hellwig
7d095257e3 xfs: kill xfs_qmops
Kill the quota ops function vector and replace it with direct calls or
stubs in the CONFIG_XFS_QUOTA=n case.

Make sure we check XFS_IS_QUOTA_RUNNING in the right spots.  We can remove
the number of those checks because the XFS_TRANS_DQ_DIRTY flag can't be set
otherwise.

This brings us back closer to the way this code worked in IRIX and earlier
Linux versions, but we keep a lot of the more useful factoring of common
code.

Eventually we should also kill xfs_qm_bhv.c, but that's left for a later
patch.

Reduces the size of the source code by about 250 lines and the size of
XFS module by about 1.5 kilobytes with quotas enabled:

   text	   data	    bss	    dec	    hex	filename
 615957	   2960	   3848	 622765	  980ad	fs/xfs/xfs.o
 617231	   3152	   3848	 624231	  98667	fs/xfs/xfs.o.old

Fallout:

 - xfs_qm_dqattach is split into xfs_qm_dqattach_locked which expects
   the inode locked and xfs_qm_dqattach which does the locking around it,
   thus removing XFS_QMOPT_ILOCKED.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:33:32 +02:00
Christoph Hellwig
0c5e1ce89f xfs: validate quota log items during log recovery
Arkadiusz has seen really strange crashes in xfs_qm_dqcheck that
I can only explain by a log item being too smal to actually fit the
xfs_dqblk_t we're dereferencing all over xfs_qm_dqcheck.  So add
graceful checks for NULL or too small quota items to the log recovery
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:33:21 +02:00
Christoph Hellwig
e1696834e8 xfs: update max log size
Commit a6634fba3dec4a92f0a2c4e30c80b634c0576ad5 in xfsprogs increased the
maximum log size supported by mkfs.  Merged back the changes to xfs_fs.h
so the growfs enforced the same limit and the headers are in sync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:32:59 +02:00
Christoph Hellwig
21bea49594 fat: split fat_generic_ioctl
Split up fat_generic_ioctl and add separate functions for the two
implemented ioctls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-06-08 21:39:50 +09:00
David Woodhouse
e635a01ea0 Merge branch 'next-mtd' of git://aeryn.fluff.org.uk/bjdooks/linux 2009-06-08 12:21:27 +01:00
Artem Bityutskiy
f2c5dbd7b7 UBIFS: start using hrtimers
UBIFS uses timers for write-buffer write-back. It is not
crucial for us to write-back exactly on time. We are fine
to write-back a little earlier or later. And this means
we may optimize UBIFS timer so that it could be groped
with a close timer event, so that the CPU would not be
waken up just to do the write back. This is optimization
to lessen power consumption, which is important in
embedded devices UBIFS is used for.

hrtimers have a nice feature: they are effectively range
timers, and we may defind the soft and hard limits for
it. Standard timers do not have these feature. They may
only be made deferrable, but this means there is effectively
no hard limit. So, we will better use hrtimers.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-08 11:14:58 +03:00
Artem Bityutskiy
3f36406f26 UBIFS: do not forget to register BDI device
Reviewed-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-08 11:14:21 +03:00
Bartlomiej Zolnierkiewicz
db429e9ec0 partitions: add ->set_capacity block device method
* Add ->set_capacity block device method and use it in rescan_partitions()
  to attempt enabling native capacity of the device upon detecting the
  partition which exceeds device capacity.

* Add GENHD_FL_NATIVE_CAPACITY flag to try limit attempts of enabling
  native capacity during partition scan.

Together with the consecutive patch implementing ->set_capacity method in
ide-gd device driver this allows automatic disabling of Host Protected Area
(HPA) if any partitions overlapping HPA are detected.

Cc: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07 13:52:52 +02:00
Bartlomiej Zolnierkiewicz
02c33b123e partitions: warn about the partition exceeding device capacity
The current warning message says only about the kernel's action taken
without mentioning the underlying reason behind it.

Noticed-by: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07 13:52:51 +02:00
Hugh Dickins
f07502dae2 integrity: fix IMA inode leak
CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects
until the system runs out of memory.  Nowhere is calling ima_inode_free()
a.k.a. ima_iint_delete().  Fix that by calling it from destroy_inode().

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-06 14:33:41 -07:00
Steve French
f0472d0ec8 [CIFS] Add mention of new mount parm (forceuid) to cifs readme
Also update fs/cifs/CHANGES

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-06 21:09:39 +00:00
Jeff Layton
4ae1507f6d cifs: make overriding of ownership conditional on new mount options
We have a bit of a problem with the uid= option. The basic issue is that
it means too many things and has too many side-effects.

It's possible to allow an unprivileged user to mount a filesystem if the
user owns the mountpoint, /bin/mount is setuid root, and the mount is
set up in /etc/fstab with the "user" option.

When doing this though, /bin/mount automatically adds the "uid=" and
"gid=" options to the share. This is fortunate since the correct uid=
option is needed in order to tell the upcall what user's credcache to
use when generating the SPNEGO blob.

On a mount without unix extensions this is fine -- you generally will
want the files to be owned by the "owner" of the mount. The problem
comes in on a mount with unix extensions. With those enabled, the
uid/gid options cause the ownership of files to be overriden even though
the server is sending along the ownership info.

This means that it's not possible to have a mount by an unprivileged
user that shows the server's file ownership info. The result is also
inode permissions that have no reflection at all on the server. You
simply cannot separate ownership from the mode in this fashion.

This behavior also makes MultiuserMount option less usable. Once you
pass in the uid= option for a mount, then you can't use unix ownership
info and allow someone to share the mount.

While I'm not thrilled with it, the only solution I can see is to stop
making uid=/gid= force the overriding of ownership on mounts, and to add
new mount options that turn this behavior on.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-06 21:03:27 +00:00
Ingo Molnar
75b5032212 Merge branch 'linus' into perfcounters/core
Merge reason: Pick up the latest fixes before the -v8 perfcounters
	      release.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-06 20:21:28 +02:00
Al Viro
72a43d63cb ext3/4 with synchronous writes gets wedged by Postfix
OK, that's probably the easiest way to do that, as much as I don't like it...
Since iget() et.al. will not accept I_FREEING (will wait to go away
and restart), and since we'd better have serialization between new/free
on fs data structures anyway, we can afford simply skipping I_FREEING
et.al. in insert_inode_locked().

We do that from new_inode, so it won't race with free_inode in any interesting
ways and it won't race with iget (of any origin; nfsd or in case of fs
corruption a lookup) since both still will wait for I_LOCK.

Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Tested-by: David Watson <dbwatson@ukfsn.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-06 06:17:26 -04:00
Theodore Ts'o
460bcf57b1 Fix nobh_truncate_page() to not pass stack garbage to get_block()
The nobh_truncate_page() function is used by ext2, exofs, and jfs.  Of
these three, only ext2 and jfs's get_block() function pays attention
to bh->b_size --- which is normally always the filesystem blocksize
except when the get_block() function is called by either
mpage_readpage(), mpage_readpages(), or the direct I/O routines in
fs/direct_io.c.

Unfortunately, nobh_truncate_page() does not initialize map_bh before
calling the filesystem-supplied get_block() function.  So ext2 and jfs
will try to calculate the number of blocks to map by taking stack
garbage and shifting it left by inode->i_blkbits.  This should be
*mostly* harmless (except the filesystem will do some unnneeded work)
unless the stack garbage is less than filesystem's blocksize, in which
case maxblocks will be zero, and the attempt to find out whether or
not the filesystem has a hole at a given logical block will fail, and
the page cache entry might not get zero'ed out.

Also if the stack garbage in in map_bh->state happens to have the
BH_Mapped bit set, there could be an attempt to call readpage() on a
non-existent page, which could cause nobh_truncate_page() to return an
error when it should not.

Fix this by initializing map_bh->state and map_bh->size.

Fortunately, it's probably fairly unlikely that ext2 and jfs users
mount with nobh these days.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-06 06:17:25 -04:00
Linus Torvalds
064e38aade Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Fix oops and use after free during space balancing
  Btrfs: set device->total_disk_bytes when adding new device
2009-06-05 11:54:28 -07:00
Steven Whitehouse
f6d03139d7 GFS2: Fix locking issue mounting gfs2meta fs
This patch uses sget() to get a reference to the
existing gfs2 sb when mouting the gfs2meta filesystem
(in fact thats just another mount of the gfs2
filesystem with a different root and this interface
is for backward compatibility).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Tested-by: Benjamin Marzinski <bmarzins@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
2009-06-05 08:35:15 +01:00
Aneesh Kumar K.V
f8514083cd ext4: truncate the file properly if we fail to copy data from userspace
In generic_perform_write if we fail to copy the user data we don't
update the inode->i_size.  We should truncate the file in the above
case so that we don't have blocks allocated outside inode->i_size.  Add
the inode to orphan list in the same transaction as block allocation
This ensures that if we crash in between the recovery would do the
truncate.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05 00:56:49 -04:00
Aneesh Kumar K.V
1938a150c2 ext4: Avoid leaking blocks after a block allocation failure
We should add inode to the orphan list in the same transaction
as block allocation.  This ensures that if we crash after a failed
block allocation and before we do a vmtruncate we don't leak block
(ie block marked as used in bitmap but not claimed by the inode).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05 01:00:26 -04:00
Eric Sandeen
b31e15527a ext4: Change all super.c messages to print the device
This patch changes ext4 super.c to include the device name with all 
warning/error messages, by using a new utility function ext4_msg. 
It's a rather large patch, but very mechanic. I left debug printks
alone.

This is a straightforward port of a patch which Andi Kleen did for
ext3.

Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-04 17:36:36 -04:00
Jan Kara
03f5d8bcf0 ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This
seems to be a relict from some old days and setting disksize in this
function does not make much sense.  Currently it was set only by
ext4_getblk().  Since the parameter has some effect only if create ==
1, it is easy to check by grepping through the sources that the three
callers which end up calling ext4_getblk() with create == 1
(ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set
disksize themselves.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09 00:17:05 -04:00
Jens Axboe
172124e220 Revert "block: implement blkdev_readpages"
This reverts commit db2dbb12dc.

It apparently causes problems with partition table read-ahead
on archs with large page sizes. Until that problem is diagnosed
further, just drop the readpages support on block devices.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-04 22:34:44 +02:00
Chris Mason
44fb551163 Btrfs: Fix oops and use after free during space balancing
The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks.  It starts off with a hint
to help find the best block group for a given allocation.

The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing.  If it is being
freed, the list pointers in it are bogus and can't be trusted.  But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.

The fix used here is to check to make sure the block group we find really
is on the list before we use it.  list_del_init is used when removing
it from the list, so we can do a proper check.

The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster.  If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.

The fix used here is to check the current cluster against the
current allocation flags.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 15:41:27 -04:00
Yan Zheng
2cc3c559fb Btrfs: set device->total_disk_bytes when adding new device
It was not being properly initialized, and so the size saved to
disk was not correct.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 09:23:57 -04:00
Aneesh Kumar K.V
b767e78a17 ext4: Don't look at buffer_heads outside i_size.
Buffer heads outside i_size will be unmapped. So when we
are doing "walk_page_buffers" limit ourself to i_size.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Josef Bacik <jbacik@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
----
2009-06-04 08:06:06 -04:00
Johann Lombardi
e6462869e4 ext4: Fix goal inum check in the inode allocator
The goal inode is specificed by inode number which belongs
to [1; s_inodes_count].

Signed-off-by: Johann Lombardi <johann@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 23:45:11 -04:00
Theodore Ts'o
5adfee9c17 ext4: fix no journal corruption with locale-gen
If there is no journal, ext4_should_writeback_data() should return
TRUE.  This will fix ext4_set_aops() to set ext4_da_ops in the case of
delayed allocation; otherwise ext4_journaled_aops gets used by
default, which doesn't handle delayed allocation properly.

The advantage of using ext4_should_writeback_data() approach is that
it should handle nobh better as well.

Thanks to Curt Wohlgemuth for investigating this problem, and Aneesh
Kumar for suggesting this approach.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-08 17:11:24 -04:00
Aneesh Kumar K.V
5887e98b60 ext4: Calculate required journal credits for inserting an extent properly
When we have space in the extent tree leaf node we should be able to
insert the extent with much less journal credits. The code was doing
proper calculation but missed a return statement.

Reported-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 23:12:04 -04:00
Jan Kara
ffacfa7a79 ext4: Fix truncation of symlinks after failed write
Contents of long symlinks is written via standard write methods. So
when the write fails, we add inode to orphan list. But symlinks don't
have .truncate method defined so nobody properly removes them from the
on disk orphan list.

Fix this by calling ext4_truncate() directly instead of calling
vmtruncate() (which is saner anyway since we don't need anything
vmtruncate() does except from calling .truncate in these paths).  We
also add inode to orphan list only if ext4_can_truncate() is true
(currently, it can be false for symlinks when there are no blocks
allocated) - otherwise orphan list processing will complain and
ext4_truncate() will not remove inode from on-disk orphan list.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 16:22:22 -04:00
Jan Kara
f91d1d0417 jbd2: Fix a race between checkpointing code and journal_get_write_access()
The following race can happen:

 CPU1                          CPU2
                               checkpointing code checks the buffer, adds
                                 it to an array for writeback
 do_get_write_access()
 ...
 lock_buffer()
 unlock_buffer()
                               flush_batch() submits the buffer for IO
 __jbd2_journal_file_buffer()

So a buffer under writeout is returned from
do_get_write_access(). Since the filesystem code relies on the fact
that journaled buffers cannot be written out, it does not take the
buffer lock and so it can modify buffer while it is under
writeout. That can lead to a filesystem corruption if we crash at the
right moment.

We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on BJ_None list. Actually, we clear the dirty
bit regardless the list the buffer is in and warn about the fact if
the buffer is already journalled.

Thanks for spotting the problem goes to dingdinghua <dingdinghua85@gmail.com>.

Reported-by: dingdinghua <dingdinghua85@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 16:16:20 -04:00
Jesper Dangaard Brouer
3e03f9ca6a ext4: Use rcu_barrier() on module unload.
The ext4 module uses rcu_call() thus it should use rcu_barrier()on
module unload.

The kmem cache ext4_pspace_cachep is sometimes free'ed using
call_rcu() callbacks.  Thus, we must wait for completion of call_rcu()
before doing kmem_cache_destroy().

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 22:29:27 -04:00
Eric Sandeen
726447d803 ext4: naturally align struct ext4_allocation_request
As Ted noted, the ext4_allocation_request isn't well aligned.  Looking
at it with pahole we're wasting space on 64-bit arches:

struct ext4_allocation_request {
        struct inode *             inode;              /*     0     8 */
        ext4_lblk_t                logical;            /*     8     4 */

        /* XXX 4 bytes hole, try to pack */

        ext4_fsblk_t               goal;               /*    16     8 */
        ext4_lblk_t                lleft;              /*    24     4 */

        /* XXX 4 bytes hole, try to pack */

        ext4_fsblk_t               pleft;              /*    32     8 */
        ext4_lblk_t                lright;             /*    40     4 */

        /* XXX 4 bytes hole, try to pack */

        ext4_fsblk_t               pright;             /*    48     8 */
        unsigned int               len;                /*    56     4 */
        unsigned int               flags;              /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */

        /* size: 64, cachelines: 1, members: 9 */
        /* sum members: 52, holes: 3, sum holes: 12 */
};

Grouping 32-bit members together closes these holes and shrinks the
structure by 12 bytes. which is important since ext4 can get on the
hairy edge of stack overruns.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-13 10:24:17 -04:00
Eric Sandeen
089ceecc1e ext4: mark several more functions in mballoc.c as noinline
Ted noticed a stack-deep callchain through
writepages->ext4_mb_regular_allocator->ext4_mb_init_cache->submit_bh ...

With all the static functions in mballoc.c, gcc helpfully
inlines for us, and we get something like this:

ext4_mb_regular_allocator	(232 bytes stack)
	ext4_mb_init_cache	(232 bytes stack)
		submit_bh	(starts 464 deeper)

the 2 ext4 functions here get several others inlined; by telling
gcc not to inline them, we can save stack space for when we
head off into submit_bh land and associated block layer callchains.
The following noinlined functions are only called once, so this
won't impact any other callchains:

ext4_mb_regular_allocator 			(104) (was 232)
	ext4_mb_find_by_goal			 (56) (noinlined)
	ext4_mb_init_group			 (24) (noinlined)
		ext4_mb_init_cache		(136) (was 232)
			ext4_mb_generate_buddy	 (88) (noinlined)
			ext4_mb_generate_from_pa (40) (noinlined)
			submit_bh
	ext4_mb_simple_scan_group		 (24) (noinlined)
	ext4_mb_scan_aligned			 (56) (noinlined)
	ext4_mb_complex_scan_group		 (40) (noinlined)
	ext4_mb_try_best_found			 (24) (noinlined)

now when we head off into submit_bh() we're only 264 bytes deeper
in stack than when we entered ext4_mb_regular_allocator()
(vs. 464 bytes before).  Every 200 bytes helps.  :)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 22:17:31 -04:00
Theodore Ts'o
f4a01017d6 ext4: Fix potential reclaim deadlock when truncating partial block
The ext4_block_truncate_page() function previously called
grab_cache_page(), which called find_or_create_page() with the
__GFP_FS flag potentially set.  This could cause a deadlock if the
system is low on memory and it attempts a memory reclaim, which could
potentially call back into ext4.  So we need to call
find_or_create_page() directly, and remove the __GFP_FP flag to avoid
this potential deadlock.

Thanks to Roland Dreier for reporting a lockdep warning which showed
this problem.

[20786.363249] =================================
[20786.363257] [ INFO: inconsistent lock state ]
[20786.363265] 2.6.31-2-generic #14~rbd4gitd960eea9
[20786.363270] ---------------------------------
[20786.363276] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[20786.363285] http/8397 [HC0[0]:SC0[0]:HE1:SE1] takes:
[20786.363291]  (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150
[20786.363314] {IN-RECLAIM_FS-W} state was registered at:
[20786.363320]   [<ffffffff8108bef6>] mark_irqflags+0xc6/0x1a0
[20786.363334]   [<ffffffff8108d347>] __lock_acquire+0x287/0x430
[20786.363345]   [<ffffffff8108d595>] lock_acquire+0xa5/0x150
[20786.363355]   [<ffffffff812008da>] jbd2_journal_start+0xfa/0x150
[20786.363365]   [<ffffffff811d98a8>] ext4_journal_start_sb+0x58/0x90
[20786.363377]   [<ffffffff811cce85>] ext4_delete_inode+0xc5/0x2c0
[20786.363389]   [<ffffffff81146fa3>] generic_delete_inode+0xd3/0x1a0
[20786.363401]   [<ffffffff81147095>] generic_drop_inode+0x25/0x30
[20786.363411]   [<ffffffff81145ce2>] iput+0x62/0x70
[20786.363420]   [<ffffffff81142878>] dentry_iput+0x98/0x110
[20786.363429]   [<ffffffff81142a00>] d_kill+0x50/0x80
[20786.363438]   [<ffffffff811444c5>] dput+0x95/0x180
[20786.363447]   [<ffffffff8120de4b>] ecryptfs_d_release+0x2b/0x70
[20786.363459]   [<ffffffff81142978>] d_free+0x28/0x60
[20786.363468]   [<ffffffff81142a18>] d_kill+0x68/0x80
[20786.363477]   [<ffffffff81142ad3>] prune_one_dentry+0xa3/0xc0
[20786.363487]   [<ffffffff81142d61>] __shrink_dcache_sb+0x271/0x290
[20786.363497]   [<ffffffff81142e89>] prune_dcache+0x109/0x1b0
[20786.363506]   [<ffffffff81142f6f>] shrink_dcache_memory+0x3f/0x50
[20786.363516]   [<ffffffff810f6d3d>] shrink_slab+0x12d/0x190
[20786.363527]   [<ffffffff810f97d7>] balance_pgdat+0x4d7/0x640
[20786.363537]   [<ffffffff810f9a57>] kswapd+0x117/0x170
[20786.363546]   [<ffffffff810773ce>] kthread+0x9e/0xb0
[20786.363558]   [<ffffffff8101430a>] child_rip+0xa/0x20
[20786.363569]   [<ffffffffffffffff>] 0xffffffffffffffff
[20786.363598] irq event stamp: 15997
[20786.363603] hardirqs last  enabled at (15997): [<ffffffff81125f9d>] kmem_cache_alloc+0xfd/0x1a0
[20786.363617] hardirqs last disabled at (15996): [<ffffffff81125f01>] kmem_cache_alloc+0x61/0x1a0
[20786.363628] softirqs last  enabled at (15966): [<ffffffff810631ea>] __do_softirq+0x14a/0x220
[20786.363641] softirqs last disabled at (15861): [<ffffffff8101440c>] call_softirq+0x1c/0x30
[20786.363651] 
[20786.363653] other info that might help us debug this:
[20786.363660] 3 locks held by http/8397:
[20786.363665]  #0:  (&sb->s_type->i_mutex_key#8){+.+.+.}, at: [<ffffffff8112ed24>] do_truncate+0x64/0x90
[20786.363685]  #1:  (&sb->s_type->i_alloc_sem_key#5){+++++.}, at: [<ffffffff81147f90>] notify_change+0x250/0x350
[20786.363707]  #2:  (jbd2_handle){+.+.?.}, at: [<ffffffff812008bb>] jbd2_journal_start+0xdb/0x150
[20786.363724] 
[20786.363726] stack backtrace:
[20786.363734] Pid: 8397, comm: http Tainted: G         C 2.6.31-2-generic #14~rbd4gitd960eea9
[20786.363741] Call Trace:
[20786.363752]  [<ffffffff8108ad7c>] print_usage_bug+0x18c/0x1a0
[20786.363763]  [<ffffffff8108b0c0>] ? check_usage_backwards+0x0/0xb0
[20786.363773]  [<ffffffff8108bad2>] mark_lock_irq+0xf2/0x280
[20786.363783]  [<ffffffff8108bd97>] mark_lock+0x137/0x1d0
[20786.363793]  [<ffffffff8108c03c>] mark_held_locks+0x6c/0xa0
[20786.363803]  [<ffffffff8108c11f>] lockdep_trace_alloc+0xaf/0xe0
[20786.363813]  [<ffffffff810efbac>] __alloc_pages_nodemask+0x7c/0x180
[20786.363824]  [<ffffffff810e9411>] ? find_get_page+0x91/0xf0
[20786.363835]  [<ffffffff8111d3b7>] alloc_pages_current+0x87/0xd0
[20786.363845]  [<ffffffff810e9827>] __page_cache_alloc+0x67/0x70
[20786.363856]  [<ffffffff810eb7df>] find_or_create_page+0x4f/0xb0
[20786.363867]  [<ffffffff811cb3be>] ext4_block_truncate_page+0x3e/0x460
[20786.363876]  [<ffffffff812008da>] ? jbd2_journal_start+0xfa/0x150
[20786.363885]  [<ffffffff812008bb>] ? jbd2_journal_start+0xdb/0x150
[20786.363895]  [<ffffffff811c6415>] ? ext4_meta_trans_blocks+0x75/0xf0
[20786.363905]  [<ffffffff811e8d8b>] ext4_ext_truncate+0x1bb/0x1e0
[20786.363916]  [<ffffffff811072c5>] ? unmap_mapping_range+0x75/0x290
[20786.363926]  [<ffffffff811ccc28>] ext4_truncate+0x498/0x630
[20786.363938]  [<ffffffff8129b4ce>] ? _raw_spin_unlock+0x5e/0xb0
[20786.363947]  [<ffffffff81107306>] ? unmap_mapping_range+0xb6/0x290
[20786.363957]  [<ffffffff8108c3ad>] ? trace_hardirqs_on+0xd/0x10
[20786.363966]  [<ffffffff811ffe58>] ? jbd2_journal_stop+0x1f8/0x2e0
[20786.363976]  [<ffffffff81107690>] vmtruncate+0xb0/0x110
[20786.363986]  [<ffffffff81147c05>] inode_setattr+0x35/0x170
[20786.363995]  [<ffffffff811c9906>] ext4_setattr+0x186/0x370
[20786.364005]  [<ffffffff81147eab>] notify_change+0x16b/0x350
[20786.364014]  [<ffffffff8112ed30>] do_truncate+0x70/0x90
[20786.364021]  [<ffffffff8112f48b>] T.657+0xeb/0x110
[20786.364021]  [<ffffffff8112f4be>] sys_ftruncate+0xe/0x10
[20786.364021]  [<ffffffff81013132>] system_call_fastpath+0x16/0x1b

Reported-by: Roland Dreier <roland@digitalvampire.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-07-05 22:08:16 -04:00
Theodore Ts'o
b574480507 jbd2: Remove GFP_ATOMIC kmalloc from inside spinlock critical region
Fix jbd2_dev_to_name(), a function used when pretty-printting jbd2 and
ext4 tracepoints.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-20 23:34:44 -04:00
Tao Ma
06c59bb896 ocfs2: Remove redundant gotos in ocfs2_mount_volume()
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:20:15 -07:00
Joel Becker
73be192b17 ocfs2: Add statistics for the checksum and ecc operations.
It would be nice to know how often we get checksum failures.  Even
better, how many of them we can fix with the single bit ecc.  So, we add
a statistics structure.  The structure can be installed into debugfs
wherever the user wants.

For ocfs2, we'll put it in the superblock-specific debugfs directory and
pass it down from our higher-level functions.  The stats are only
registered with debugfs when the filesystem supports metadata ecc.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:15:36 -07:00
Srinivas Eeda
15633a220f ocfs2 patch to track delayed orphan scan timer statistics
Patch to track delayed orphan scan timer statistics.

Modifies ocfs2_osb_dump to print the following:
  Orphan Scan=> Local: 10  Global: 21  Last Scan: 67 seconds ago

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:31 -07:00
Srinivas Eeda
83273932fb ocfs2: timer to queue scan of all orphan slots
When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
before moving the dentry to the orphan directory. Other nodes that have
this dentry in cache have a PR on the same dentry lock.  When the EX is
requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
during downconvert.  The inode is finally deleted when the last node to iput
the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.

A problem arises if a node is forced to free dentry locks because of memory
pressure. If this happens, the node will no longer get downconvert
notifications for the dentries that have been unlinked on another node.
If it also happens that node is actively using the corresponding inode and
happens to be the one performing the last iput on that inode, it will fail
to delete the inode as it will not have the MAYBE_ORPHANED flag set.

This patch fixes this shortcoming by introducing a periodic scan of the
orphan directories to delete such inodes. Care has been taken to distribute
the workload across the cluster so that no one node has to perform the task
all the time.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:31 -07:00
Jan Kara
edd45c0849 ocfs2: Correct ordering of ip_alloc_sem and localloc locks for directories
We use ordering ip_alloc_sem -> local alloc locks in ocfs2_write_begin().
So change lock ordering in ocfs2_extend_dir() and ocfs2_expand_inline_dir()
to also use this lock ordering.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:30 -07:00
Jan Kara
80d73f15d1 ocfs2: Fix possible deadlock in quota recovery
In ocfs2_finish_quota_recovery() we acquired global quota file lock and started
recovering local quota file. During this process we need to get quota
structures, which calls ocfs2_dquot_acquire() which gets global quota file lock
again. This second lock can block in case some other node has requested the
quota file lock in the mean time. Fix the problem by moving quota file locking
down into the function where it is really needed.  Then dqget() or dqput()
won't be called with the lock held.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:30 -07:00
Jan Kara
65bac575e3 ocfs2: Fix possible deadlock with quotas in ocfs2_setattr()
We called vfs_dq_transfer() with global quota file lock held. This can lead
to deadlocks as if vfs_dq_transfer() has to allocate new quota structure,
it calls ocfs2_dquot_acquire() which tries to get quota file lock again and
this can block if another node requested the lock in the mean time.

Since we have to call vfs_dq_transfer() with transaction already started
and quota file lock ranks above the transaction start, we cannot just rely
on ocfs2_dquot_acquire() or ocfs2_dquot_release() on getting the lock
if they need it. We fix the problem by acquiring pointers to all quota
structures needed by vfs_dq_transfer() already before calling the function.
By this we are sure that all quota structures are properly allocated and
they can be freed only after we drop references to them. Thus we don't need
quota file lock anywhere inside vfs_dq_transfer().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:29 -07:00
Jan Kara
b4c30de39a ocfs2: Fix lock inversion in ocfs2_local_read_info()
This function is called with dqio_mutex held but it has to acquire lock
from global quota file which ranks above this lock. This is not deadlockable
lock inversion since this code path is take only during mount when noone
else can race with us but let's clean this up to silence lockdep.

We just drop the dqio_mutex in the beginning of the function and reacquire
it in the end since we don't need it - noone can race with us at this moment.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:29 -07:00
Jan Kara
4e8a301929 ocfs2: Fix possible deadlock in ocfs2_global_read_dquot()
It is not possible to get a read lock and then try to get the same write lock
in one thread as that can block on downconvert being requested by other node
leading to deadlock. So first drop the quota lock for reading and only after
that get it for writing.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:28 -07:00
Andreas Dilger
0b8e58a140 ext4: super.c whitespace cleanup
Cleanup of whitespace and formatting.  Initially driven by confusing indents
for the ext4_{block,inode}_bitmap() et. al. helper routines, but figured I'd
cleanup some other 80-column wrapping and other indenting problems at the
same time.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-03 17:59:28 -04:00
Alberto Bertogli
bfcd3555af jbd2: Fix minor typos in comments in fs/jbd2/journal.c
Signed-off-by: Alberto Bertogli <albertito@blitiri.com.ar>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2009-06-09 00:06:20 -04:00
Denis Karpov
85c7859190 FAT: add 'errors' mount option
On severe errors FAT remounts itself in read-only mode. Allow to
specify FAT fs desired behavior through 'errors' mount option:
panic, continue or remount read-only.

`mount -t [fat|vfat] -o errors=[panic,remount-ro,continue] \
	<bdev> <mount point>`

This is analog to ext2 fs 'errors' mount option.

Signed-off-by: Denis Karpov <ext-denis.2.karpov@nokia.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-06-04 02:34:51 +09:00
Steven Whitehouse
e09f9446b9 GFS2: Remove unused variable
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-03 10:07:44 +01:00
Linus Torvalds
4157fd85fc Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: prevent deadlock in xfs_qm_shake()
  xfs: fix overflow in xfs_growfs_data_private
  xfs: fix double unlock in xfs_swap_extents()
2009-06-02 09:47:21 -07:00
Jeff Layton
50b64e3b77 cifs: fix IPv6 address length check
For IPv6 the userspace mount helper sends an address in the "ip="
option.  This check fails if the length is > 35 characters. I have no
idea where the magic 35 character limit came from, but it's clearly not
enough for IPv6. Fix it by making it use the INET6_ADDRSTRLEN #define.

While we're at it, use the same #define for the address length in SPNEGO
upcalls.

Reported-by: Charles R. Anderson <cra@wpi.edu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-02 15:45:40 +00:00
Artem Bityutskiy
8379ea31e9 UBIFS: allow sync option in rootflags
When passing UBIFS parameters via kernel command line, the
sync option will be passed to UBIFS as a string, not as an
MS_SYNCHRONOUS flag. Teach UBIFS interpreting this flag.

Reported-by: Aurélien GÉRÔME <ag@debian.org>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-02 11:08:07 +03:00
Abhijith Das
a12af1ebe6 GFS2: smbd proccess hangs with flock() call.
GFS2 currently does not support mandatory flocks. An flock() call with
LOCK_MAND triggers unexpected behavior because gfs2 is not checking for
this lock type. This patch corrects that.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-02 08:01:12 +01:00
Felix Blyakher
1b17d76646 xfs: prevent deadlock in xfs_qm_shake()
It's possible to recurse into filesystem from the memory
allocation, which deadlocks in xfs_qm_shake(). Add check
for __GFP_FS, and bail out if it is not set.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:45 -05:00
Eric Sandeen
e6da7c9fed xfs: fix overflow in xfs_growfs_data_private
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number.  If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:

 # xfs_io -f -c "truncate 19998630180864" fsfile
 # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
 # mount -o loop fsfile /mnt
 # xfs_growfs /mnt

meta-data=/dev/loop0             isize=256    agcount=52,
agsize=76288719 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3905982455, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument

Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:38 -05:00
Felix Blyakher
1f23920dbf xfs: fix double unlock in xfs_swap_extents()
Regreesion from commit ef8f7fc, which rearranged the code in
xfs_swap_extents() leading to double unlock of xfs inode ilock.
That resulted in xfs_fsr deadlocking itself on platforms, which
don't handle double unlock of rw_semaphore nicely. It caused the
count go negative, which represents the write holder, without
really having one. ia64 is one of the platforms where deadlock
was easily reproduced and the fix was tested.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:29 -05:00
Yu Zhiguo
0a93a47f04 NFSv4: kill off complicated macro 'PROC'
J. Bruce Fields wrote:
...
> (This is extremely confusing code to track down: note that
> proc->pc_decode is set to nfs4svc_decode_compoundargs() by the PROC()
> macro at the end of fs/nfsd/nfs4proc.c.  Which means, for example, that
> grepping for nfs4svc_decode_compoundargs() gets you nowhere.  Patches to
> kill off that macro would be welcomed....)

the macro 'PROC' is complicated and obscure, it had better
be killed off in order to make the code more clear.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-01 18:09:20 -04:00
Yu Zhiguo
3c8e03166a NFSv4: do exact check about attribute specified
Server should return NFS4ERR_ATTRNOTSUPP if an attribute specified is
not supported in current environment.
Operations CREATE, NVERIFY, OPEN, SETATTR and VERIFY should do this check.

This bug is found when do newpynfs tests. The names of the tests that failed
are following:
  CR12 NVF7a NVF7b NVF7c NVF7d NVF7f NVF7r NVF7s
  OPEN15 VF7a VF7b VF7c VF7d VF7f VF7r VF7s

Add function do_check_fattr() to do exact check:
1, Check attribute specified is supported by the NFSv4 server or not.
2, Check FATTR4_WORD0_ACL & FATTR4_WORD0_FS_LOCATIONS are supported
   in current environment or not.
3, Check attribute specified is writable or not.

step 1 and 3 are done in function nfsd4_decode_fattr() but removed
to this function now.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-06-01 18:01:54 -04:00
Felix Blyakher
4156e735d3 xfs: prevent deadlock in xfs_qm_shake()
It's possible to recurse into filesystem from the memory
allocation, which deadlocks in xfs_qm_shake(). Add check
for __GFP_FS, and bail out if it is not set.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 13:13:24 -05:00
Ingo Molnar
23db9f430b Merge branch 'linus' into perfcounters/core
Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6
              based - to pick up the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-01 10:01:39 +02:00
Linus Torvalds
b4566ac524 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix bh leak in nilfs_cpfile_delete_checkpoints function
2009-05-30 08:04:15 -07:00
Ryusuke Konishi
62013ab5d5 nilfs2: fix bh leak in nilfs_cpfile_delete_checkpoints function
The nilfs_cpfile_delete_checkpoints() wrongly skips brelse() for the
header block of checkpoint file in case of errors.  This fixes the
leak bug.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-30 22:07:50 +09:00
Linus Torvalds
3218911f83 Merge git://git.infradead.org/~dwmw2/mtd-2.6.30
* git://git.infradead.org/~dwmw2/mtd-2.6.30:
  jffs2: Fix corruption when flash erase/write failure
  mtd: MXC NAND driver fixes (v5)
2009-05-29 08:52:13 -07:00
Linus Torvalds
deeb103412 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  Driver Core: do not oops when driver_unregister() is called for unregistered drivers
  sysfs: file.c: use create_singlethread_workqueue()
2009-05-29 08:49:52 -07:00
Linus Torvalds
c8bce3d3bd Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
  svcrdma: dma unmap the correct length for the RPCRDMA header page.
  nfsd: Revert "svcrpc: take advantage of tcp autotuning"
  nfsd: fix hung up of nfs client while sync write data to nfs server
2009-05-29 08:49:09 -07:00
Oskar Schirmer
c3dc5bec05 flat: fix data sections alignment
The flat loader uses an architecture's flat_stack_align() to align the
stack but assumes word-alignment is enough for the data sections.

However, on the Xtensa S6000 we have registers up to 128bit width
which can be used from userspace and therefor need userspace stack and
data-section alignment of at least this size.

This patch drops flat_stack_align() and uses the same alignment that
is required for slab caches, ARCH_SLAB_MINALIGN, or wordsize if it's
not defined by the architecture.

It also fixes m32r which was obviously kaput, aligning an
uninitialized stack entry instead of the stack pointer.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oskar Schirmer <os@emlix.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <cooloney@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Johannes Weiner <jw@emlix.com>
Acked-by: Mike Frysinger <vapier.adi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29 08:40:02 -07:00
KOSAKI Motohiro
bd6daba909 procfs: make errno values consistent when open pident vs exit(2) race occurs
proc_pident_instantiate() has following call flow.

proc_pident_lookup()
  proc_pident_instantiate()
    proc_pid_make_inode()

And, proc_pident_lookup() has following error handling.

	const struct pid_entry *p, *last;
	error = ERR_PTR(-ENOENT);
	if (!task)
		goto out_no_task;

Then, proc_pident_instantiate should return ENOENT too when racing against
exit(2) occur.

EINAL has two bad reason.
  - it implies caller is wrong. bad the race isn't caller's mistake.
  - man 2 open don't explain EINVAL. user often don't handle it.

Note: Other proc_pid_make_inode() caller already use ENOENT properly.

Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29 08:40:02 -07:00
Kevin Cernekee
668ff9ab45 mtd: Handle compat ioctls directly; remove all trace from compat_ioctl.c
Remove all references to MTD ioctls from fs/compat_ioctl.c and let
them all be handled by mtd_compat_ioctl().

Signed-off-by: Kevin Cernekee <kpc.mtd@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 15:58:25 +01:00
Kevin Cernekee
aea7cea9fa mtd: add OOB ioctls for >4GiB devices
Signed-off-by: Kevin Cernekee <kpc.mtd@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 15:27:07 +01:00
Kevin Cernekee
9771854040 mtd: compat_ioctl cleanup
1) Move the MEMREADOOB/MEMWRITEOOB compat_ioctl wrappers from
fs/compat_ioctl.c into mtdchar.c .  Original request was here:

http://lkml.org/lkml/2009/4/1/295

2) Add missing COMPATIBLE_IOCTL lines, so that mtd-utils does not error
out when running in 64/32 compatibility mode.

LKML-Reference: <200904011650.22928.arnd@arndb.de>
Signed-off-by: Kevin Cernekee <kpc.mtd@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 15:24:48 +01:00
Kevin Cernekee
0dc54e9f33 mtd: add MEMERASE64 ioctl for >4GiB devices
New MEMERASE/MEMREADOOB/MEMWRITEOOB ioctls are needed in order to support
64-bit offsets into large NAND flash devices.

Signed-off-by: Kevin Cernekee <kpc.mtd@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 15:13:47 +01:00
Artem Bityutskiy
428ff9d2e3 UBIFS: remove dead code
UBIFS assumes that @c->min_io_size is 8 in case of NOR flash. This
is because UBIFS alignes all nodes to 8-byte boundary, and maintaining
@c->min_io_size introduced unnecessary complications.

This patch removes senseless constructs like:

if (c->min_io_size == 1)
	NOR-specific code

Also, few commentaries amendments.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-29 14:38:37 +03:00
Joakim Tjernlund
81e2962801 jffs2: Fix corruption when flash erase/write failure
Erase errors such as:
"Newly-erased block contained word 0xa4ef223e at offset 0x0296a014"
and failure to write the clean marker,
moves the offending erase block to erasing list before calling
jffs2_erase_failed(). This is bad as jffs2_erase_failed() will
also move the block to the bad_list, but is now moving the
wrong block, causing FS corruption.

Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 10:44:46 +01:00
Andrew Morton
086a377edc sysfs: file.c: use create_singlethread_workqueue()
We don't need a kernel thread per CPU for this application.

Acked-by: Alex Chiang <achiang@hp.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-28 14:24:07 -07:00
Christoph Hellwig
b96d31a62f cifs: clean up set_cifs_acl interfaces
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 18:41:56 +00:00
Christoph Hellwig
1bf4072da6 cifs: reorganize get_cifs_acl
Thus spake Christoph:

"But this whole set_cifs_acl function is a real mess anyway and needs
some splitting up."

With this change too, it's possible to call acl_to_uid_mode() with a
NULL inode pointer. That (or something close to it) will eventually be
necessary when cifs_get_inode_info is reorganized.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 17:08:02 +00:00
Steve French
c5077ec423 [CIFS] Update readme to indicate change to default mount (serverino)
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 15:09:04 +00:00
Jeff Layton
a0c9217f64 cifs: make serverino the default when mounting
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 15:04:17 +00:00
Jeff Layton
bd433d4cf4 cifs: rename cifs_iget to cifs_root_iget
The current cifs_iget isn't suitable for anything but the root inode.
Rename it with a more appropriate name.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:57:29 +00:00
Jeff Layton
c4a2c08db7 cifs: make cnvrtDosUnixTm take a little-endian args and an offset
The callers primarily end up converting the args from le anyway. Also,
most of the callers end up needing to add an offset to the result. The
exception to these rules is cnvrtDosCifsTm, but there are no callers of
that function, so we might as well remove it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:57:20 +00:00
Jeff Layton
07119a4df8 cifs: have cifs_NTtimeToUnix take a little-endian arg
...and just have the function call le64_to_cpu.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:32:31 +00:00
Mimi Zohar
14dba5331b integrity: nfsd imbalance bug fix
An nfsd exported file is opened/closed by the kernel causing the
integrity imbalance message.

Before a file is opened, there normally is permission checking, which
is done in inode_permission().  However, as integrity checking requires
a dentry and mount point, which is not available in inode_permission(),
the integrity (permission) checking must be called separately.

In order to detect any missing integrity checking calls, we keep track
of file open/closes.  ima_path_check() increments these counts and
does the integrity (permission) checking. As a result, the number of
calls to ima_path_check()/ima_file_free() should be balanced.  An extra
call to fput(), indicates the file could have been accessed without first
calling ima_path_check().

In nfsv3 permission checking is done once, followed by multiple reads,
which do an open/close for each read.  The integrity (permission) checking
call should be in nfsd_permission() after the inode_permission() call, but
as there is no correlation between the number of permission checking and
open calls, the integrity checking call should not increment the counters,
but defer it to when the file is actually opened.

This patch adds:
- integrity (permission) checking for nfsd exported files in nfsd_permission().
- a call to increment counts for files opened by nfsd.

This patch has been updated to return the nfs error types.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-28 09:32:43 +10:00
Wei Yongjun
a0d24b295a nfsd: fix hung up of nfs client while sync write data to nfs server
Commit 'Short write in nfsd becomes a full write to the client'
(31dec2538e) broken the sync write.
With the following commands to reproduce:

  $ mount -t nfs -o sync 192.168.0.21:/nfsroot /mnt
  $ cd /mnt
  $ echo aaaa > temp.txt

Then nfs client is hung up.

In SYNC mode the server alaways return the write count 0 to the
client. This is because the value of host_err in nfsd_vfs_write()
will be overwrite in SYNC mode by 'host_err=nfsd_sync(file);',
and then we return host_err(which is now 0) as write count.

This patch fixed the problem.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 17:40:06 -04:00
Greg Banks
1dbd0d53f3 knfsd: remove unreported filehandle stats counters
The file nfsfh.c contains two static variables nfsd_nr_verified and
nfsd_nr_put.  These are counters which are incremented as a side
effect of the fh_verify() fh_compose() and fh_put() operations,
i.e. at least twice per NFS call for any non-trivial workload.
Needless to say this makes the cacheline that contains them (and any
other innocent victims) a very hot contention point indeed under high
call-rate workloads on multiprocessor NFS server.  It also turns out
that these counters are not used anywhere.  They're not reported to
userspace, they're not used in logic, they're not even exported from
the object file (let alone the module).  All they do is waste CPU time.

So this patch removes them.

Tests on a 16 CPU Altix A4700 with 2 10gige Myricom cards, configured
separately (no bonding).  Workload is 640 client threads doing directory
traverals with random small reads, from server RAM.

Before
======

Kernel profile:

  %   cumulative   self              self     total
 time   samples   samples    calls   1/call   1/call  name
  6.05   2716.00  2716.00    30406     0.09     1.02  svc_process
  4.44   4706.00  1990.00     1975     1.01     1.01  spin_unlock_irqrestore
  3.72   6376.00  1670.00     1666     1.00     1.00  svc_export_put
  3.41   7907.00  1531.00     1786     0.86     1.02  nfsd_ofcache_lookup
  3.25   9363.00  1456.00    10965     0.13     1.01  nfsd_dispatch
  3.10  10752.00  1389.00     1376     1.01     1.01  nfsd_cache_lookup
  2.57  11907.00  1155.00     4517     0.26     1.03  svc_tcp_recvfrom
  ...
  2.21  15352.00  1003.00     1081     0.93     1.00  nfsd_choose_ofc  <----
  ^^^^

Here the function nfsd_choose_ofc() reads a global variable
which by accident happened to be located in the same cacheline as
nfsd_nr_verified.

Call rate:

nullarbor:~ # pmdumptext nfs3.server.calls
...
Thu Dec 13 00:15:27     184780.663
Thu Dec 13 00:15:28     184885.881
Thu Dec 13 00:15:29     184449.215
Thu Dec 13 00:15:30     184971.058
Thu Dec 13 00:15:31     185036.052
Thu Dec 13 00:15:32     185250.475
Thu Dec 13 00:15:33     184481.319
Thu Dec 13 00:15:34     185225.737
Thu Dec 13 00:15:35     185408.018
Thu Dec 13 00:15:36     185335.764

After
=====

kernel profile:

  %   cumulative   self              self     total
 time   samples   samples    calls   1/call   1/call  name
  6.33   2813.00  2813.00    29979     0.09     1.01  svc_process
  4.66   4883.00  2070.00     2065     1.00     1.00  spin_unlock_irqrestore
  4.06   6687.00  1804.00     2182     0.83     1.00  nfsd_ofcache_lookup
  3.20   8110.00  1423.00    10932     0.13     1.00  nfsd_dispatch
  3.03   9456.00  1346.00     1343     1.00     1.00  nfsd_cache_lookup
  2.62  10622.00  1166.00     4645     0.25     1.01  svc_tcp_recvfrom
[...]
  0.10  42586.00    44.00       74     0.59     1.00  nfsd_choose_ofc  <--- HA!!
  ^^^^

Call rate:

nullarbor:~ # pmdumptext nfs3.server.calls
...
Thu Dec 13 01:45:28     194677.118
Thu Dec 13 01:45:29     193932.692
Thu Dec 13 01:45:30     194294.364
Thu Dec 13 01:45:31     194971.276
Thu Dec 13 01:45:32     194111.207
Thu Dec 13 01:45:33     194999.635
Thu Dec 13 01:45:34     195312.594
Thu Dec 13 01:45:35     195707.293
Thu Dec 13 01:45:36     194610.353
Thu Dec 13 01:45:37     195913.662
Thu Dec 13 01:45:38     194808.675

i.e. about a 5.3% improvement in call rate.

Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Reviewed-by: David Chinner <dgc@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 14:14:03 -04:00
Greg Banks
cf0a586cf4 knfsd: fix reply cache memory corruption
Fix a regression in the reply cache introduced when the code was
converted to use proper Linux lists.  When a new entry needs to be
inserted, the case where all the entries are currently being used
by threads is not correctly detected.  This can result in memory
corruption and a crash.  In the current code this is an extremely
unlikely corner case; it would require the machine to have 1024
nfsd threads and all of them to be busy at the same time.  However,
upcoming reply cache changes make this more likely; a crash due to
this problem was actually observed in field.

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 14:14:02 -04:00
Greg Banks
fca4217c5b knfsd: reply cache cleanups
Make REQHASH() an inline function.  Rename hash_list to cache_hash.
Fix an obsolete comment.

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 14:14:02 -04:00
David Howells
911e690e70 CacheFiles: Fixup renamed filenames in comments in internal.h
Fix up renamed filenames in comments in fs/cachefiles/internal.h.

Originally, the files were all called cf-xxx.c, but they got renamed to
just xxx.c.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-27 10:20:13 -07:00
David Howells
348ca1029e FS-Cache: Fixup renamed filenames in comments in internal.h
Fix up renamed filenames in comments in fs/fscache/internal.h.

Originally, the files were all called fsc-xxx.c, but they got renamed to
just xxx.c.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-27 10:20:13 -07:00
Eric Sandeen
096324873f xfs: fix overflow in xfs_growfs_data_private
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number.  If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:

 # xfs_io -f -c "truncate 19998630180864" fsfile
 # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
 # mount -o loop fsfile /mnt
 # xfs_growfs /mnt

meta-data=/dev/loop0             isize=256    agcount=52,
agsize=76288719 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3905982455, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument

Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-05-26 17:46:37 -05:00
Jeff Layton
f55ed1a83d cifs: tighten up default file_mode/dir_mode
The current default file mode is 02767 and dir mode is 0777. This is
extremely "loose". Given that CIFS is a single-user protocol, these
permissions allow anyone to use the mount -- in effect, giving anyone on
the machine access to the credentials used to mount the share.

Change this by making the default permissions restrict write access to
the default owner of the mount. Give read and execute permissions to
everyone else. These are the same permissions that VFAT mounts get by
default so there is some precedent here.

Note that this patch also removes the mandatory locking flags from the
default file_mode. After having looked at how these flags are used by
the kernel, I don't think that keeping them as the default offers any
real benefit. That flag combination makes it so that the kernel enforces
mandatory locking.

Since the server is going to do that for us anyway, I don't think we
want the client to enforce this by default on applications that just
want advisory locks. Anyone that does want this behavior can always
enable it by setting the file_mode appropriately.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-26 21:10:55 +00:00
Jeff Layton
46a7574caf cifs: fix artificial limit on reading symlinks
There's no reason to limit the size of a symlink that we can read to
4000 bytes. That may be nowhere near PATH_MAX if the server is sending
UCS2 strings. CIFS should be able to read in a symlink up to the size of
the buffer. The size of the header has already been accounted for when
creating the slabcache, so CIFSMaxBufSize should be the correct size to
pass in.

Fixes samba bug #6384.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-26 21:09:14 +00:00
Trond Myklebust
95baa25c73 NFSv4: Fix the case where NFSv4 renewal fails
If the asynchronous lease renewal fails (usually due to a soft timeout),
then we _must_ schedule state recovery in order to ensure that we don't
lose the lease unnecessarily or, if the lease is already lost, that we
recover the locking state promptly...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-05-26 14:51:00 -04:00
Sam Ravnborg
d0367a508a nfs: fix build error in nfsroot with initconst
fix build error with latest kbuild adjustments to initconst.

The commit a447c09324 ("vfs: Use
const for kernel parser table") changed:

    static match_table_t __initdata tokens = {
to
    static match_table_t __initconst tokens = {

But the missing const causes popwerpc to fail with latest
updates to __initconst like this:

fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict
fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict

The bug is only present with kbuild-next.
Following patch has been build tested.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-05-26 14:51:00 -04:00
Steven Whitehouse
f6eb53498e GFS2: Remove args subdir from gfs2 sysfs files
Since we can cat /proc/mounts there is no need to have this
subdirectory in the gfs2 sysfs files. In fact this does not
reflect the full range of possible mount argumenmts, where
as /proc/mounts does.

There was only one userland user of this set of sysfs files
and it will function perfectly well without these files
being present (in fact that subcommand of gfs2_tool is
obsolete anyway).

The tune/* subdirectory is also considered mostly obsolete,
but there are a few uses of this until mount arguments can
be added for the last few functions for which there are no
equivalents currently. However the tune/* directory is still
in my sights and new code should avoid using it. Only the gfs2_quota
and gfs2_tool programs are know to use tune/* at the moment.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-26 15:50:25 +01:00
Steven Whitehouse
e1b28aab58 GFS2: Remove lockstruct subdir from gfs2 sysfs files
The lockstruct sub directory contained two entries, both of
which are duplicated elsewhere in the gfs2 sysfs files as
well as being available via /proc/mounts. There is no userland program
using either of them, so this patch removes them.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-26 15:41:27 +01:00
Artem Bityutskiy
7c83f5cb55 UBIFS: use anonymous device
UBIFS has erroneuosly set 'sb->s_dev' to the UBI volume
character device major/minor. This may lead to clashes
if there is another FS mounted to a block device with
the same major/minor numbers. User-space programs which
use 'stat->st_dev' may get confused because of this.

This problem was found by Al Viro. He also pointed the
way to fix the problem - use 'set_anon_super()' and
'kill_anon_super()' VFS helpers.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-26 12:37:11 +03:00
Theodore Ts'o
88b6edd17c ext4: Clean up calls to ext4_get_group_desc()
If the caller isn't planning on modifying the block group descriptors,
there's no need to pass in a pointer to a struct buffer_head.  Nuking
this saves a tiny amount of CPU time and stack space usage.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25 11:50:39 -04:00
Theodore Ts'o
759d427aa5 ext4: remove unused function __ext4_write_dirty_metadata
The __ext4_write_dirty_metadata() function was introduced by commit
0390131b, "ext4: Allow ext4 to run without a journal", but nothing
ever used the function, either then or since.  So let's remove it and
save a bit of space.

Cc: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25 11:51:00 -04:00
Corentin Chary
8eec2f36fb UBIFS: return proper error code if the compr is not present
If the compressor is not present, mount_ubifs need
to return an error code. This way ubifs_fill_super
will stop and handle the error.

Signed-off-by: Corentin Chary <corentincj@iksaif.net>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-25 12:28:27 +03:00
Dave Kleikamp
79f52b77b8 jfs: Add missing mutex_unlock call to error path
Jan Kucera found an missing call to mutex_unlock() with his static code
checker.  It's an unlikely error path to hit in the real world, but it
should be fixed.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Reported-by: Jan Kucera <kucera.jan.cz@gmail.com>
2009-05-23 20:28:41 -05:00
Linus Torvalds
3eb9c8be0c Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] Avoid open on possible directories since Samba now rejects them
2009-05-23 13:42:53 -07:00
Steve French
8db14ca125 [CIFS] Avoid open on possible directories since Samba now rejects them
Small change (mostly formatting) to limit lookup based open calls to
file create only.

After discussion yesteday on samba-technical about the posix lookup
regression,  and looking at a problem with cifs posix open to one
particular Samba version, Jeff and JRA realized that Samba server's
behavior changed in this area (posix open behavior on files vs.
directories).   To make this behavior consistent, JRA just made a
fix to Samba server to alter how it handles open of directories (now
returning the equivalent of EISDIR instead of success). Since we don't
know at lookup time whether the inode is a directory or file (and
thus whether posix open will succeed with most current Samba server),
this change avoids the posix open code on lookup open (just issues
posix open on creates).    This gets the semantic benefits we want
(atomicity, posix byte range locks, improved write semantics on newly
created files) and file create still is fast, and we avoid the problem
that Jeff noticed yesterday with "openat" (and some open directory
calls) of non-cached directories to one version of Samba server, and
will work with future Samba versions (which include the fix jra just
pushed into Samba server).  I confirmed this approach with jra
yesterday and with Shirish today.

Posix open is only called (at lookup time) for file create now.
For opens (rather than creates), because we do not know if it
is a file or directory yet, and current Samba no longer allows
us to do posix open on dirs, we could end up wasting an open call
on what turns out to be a dir. For file opens, we wait to call posix
open till cifs_open.  It could be added here (lookup) in the future
but the performance tradeoff of the extra network request when EISDIR
or EACCES is returned would have to be weighed against the 50%
reduction in network traffic in the other paths.

Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
CC: Jeremy Allison <jra@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-23 18:57:25 +00:00
Martin K. Petersen
c72758f337 block: Export I/O topology for block devices and partitions
To support devices with physical block sizes bigger than 512 bytes we
need to ensure proper alignment.  This patch adds support for exposing
I/O topology characteristics as devices are stacked.

  logical_block_size is the smallest unit the device can address.

  physical_block_size indicates the smallest I/O the device can write
  without incurring a read-modify-write penalty.

  The io_min parameter is the smallest preferred I/O size reported by
  the device.  In many cases this is the same as the physical block
  size.  However, the io_min parameter can be scaled up when stacking
  (RAID5 chunk size > physical block size).

  The io_opt characteristic indicates the optimal I/O size reported by
  the device.  This is usually the stripe width for arrays.

  The alignment_offset parameter indicates the number of bytes the start
  of the device/partition is offset from the device's natural alignment.
  Partition tools and MD/DM utilities can use this to pad their offsets
  so filesystems start on proper boundaries.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:55 +02:00
Martin K. Petersen
ae03bf639a block: Use accessor functions for queue limits
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Martin K. Petersen
e1defc4ff0 block: Do away with the notion of hardsect_size
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case.  The
sector size will be 4KB but the logical block size will remain
512-bytes.  Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Jens Axboe
9bd7de51ee Merge branch 'master' into for-2.6.31
Conflicts:
	drivers/ide/ide-io.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 20:28:35 +02:00
Jens Axboe
e4b636366c Merge branch 'master' into for-2.6.31
Conflicts:
	drivers/block/hd.c
	drivers/block/mg_disk.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 20:25:34 +02:00
Linus Torvalds
6a44587ee7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix memory leak in nilfs_ioctl_clean_segments
2009-05-22 08:41:13 -07:00
Ryusuke Konishi
d504685363 nilfs2: fix memory leak in nilfs_ioctl_clean_segments
This fixes a new memory leak problem in garbage collection.  The
problem was brought by the bugfix patch ("nilfs2: fix lock order
reversal in nilfs_clean_segments ioctl").

Thanks to Kentaro Suzuki for finding this problem.

Reported-by: Kentaro Suzuki <k_suzuki@ms.sylc.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-22 20:49:04 +09:00
Steven Whitehouse
87ec217411 GFS2: Move gfs2_unlink_ok into ops_inode.c
Another function which is only called from one ops_inode.c so
we move it and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:54:50 +01:00
Steven Whitehouse
536baf02f6 GFS2: Move gfs2_readlinki into ops_inode.c
Move gfs2_readlinki into ops_inode.c and make it static

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:48:59 +01:00
Steven Whitehouse
2286dbfad1 GFS2: Move gfs2_rmdiri into ops_inode.c
Move gfs2_rmdiri() into ops_inode.c and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:45:09 +01:00
Steven Whitehouse
9e6e0a128b GFS2: Merge mount.c and ops_super.c into super.c
mount.c only contained a single function, so is not really
worth retaining on its own. All of the super related code
is now either in super.c or ops_fstype.c

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:36:01 +01:00
Steven Whitehouse
b1e71b0622 GFS2: Clean up some file names
This patch renames the ops_*.c files which have no counterpart
without the ops_ prefix in order to shorten the name and make
it more readable. In addition, ops_address.h (which was very
small) is moved into inode.h and inode.h is cleaned up by
adding extern where required.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:01:55 +01:00
James Morris
2c9e703c61 Merge branch 'master' into next
Conflicts:
	fs/exec.c

Removed IMA changes (the IMA checks are now performed via may_open()).

Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 18:40:59 +10:00
Mimi Zohar
c9d9ac525a integrity: move ima_counts_get
Based on discussion on lkml (Andrew Morton and Eric Paris),
move ima_counts_get down a layer into shmem/hugetlb__file_setup().
Resolves drm shmem_file_setup() usage case as well.

HD comment:
  I still think you're doing this at the wrong level, but recognize
  that you probably won't be persuaded until a few more users of
  alloc_file() emerge, all wanting your ima_counts_get().

  Resolving GEM's shmem_file_setup() is an improvement, so I'll say

Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 09:45:33 +10:00
Mimi Zohar
b9fc745db8 integrity: path_check update
- Add support in ima_path_check() for integrity checking without
incrementing the counts. (Required for nfsd.)
- rename and export opencount_get to ima_counts_get
- replace ima_shm_check calls with ima_counts_get
- export ima_path_check

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 09:43:41 +10:00
Steve French
703a3b8e5c [CIFS] fix posix open regression
Posix open code was not properly adding the file to the
list of open files.  Fix  allocating cifsFileInfo
more than once, and adding twice to flist and tlist.
Also fix mode setting to be done in one place in these
paths.

Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luca Tettamanti <kronos.it@gmail.com>
2009-05-21 22:38:08 +00:00
Steven Whitehouse
1ce97e564b GFS2: Be more aggressive in reclaiming unlinked inodes
This patch increases the frequency with which gfs2 looks
for unlinked, but still allocated inodes. Its the equivalent
operation to ext3's orphan list, but done with bitmaps in
the resource groups.

This also fixes a bug where a field in the rgrp was too small.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-21 15:18:19 +01:00
Steven Whitehouse
60a0b8f936 GFS2: Add a rgrp bitmap full flag
During block allocation, it is useful to know if sections of disk
are full on a finer grained basis than a single resource group.
This can make a performance difference when resource groups have
larger numbers of bitmap blocks, since we no longer have to search
them all block by block in each individual bitmap.

The full flag is set on a per-bitmap basis when it has been
searched and found to have no free space. It is then skipped in
subsequent searches until the flag is reset. The resetting
occurs if we have to drop the glock on the resource group for any
reason, or if we deallocate some blocks within that resource
group and thus free up some space.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-21 12:23:12 +01:00
Linus Torvalds
929a8651f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix pointer initialization and checks in cifs_follow_symlink (try #4)
2009-05-20 08:36:53 -07:00
Steven Whitehouse
0901097834 GFS2: Improve resource group error handling
This patch improves the error handling in the case where we
discover that the summary information in the resource group
doesn't match the bitmap information while in the process of
allocating blocks. Originally this resulted in a kernel bug,
but this patch changes that so that we return -EIO and print
some messages explaining what went wrong, and how to fix it.

We also remember locally not to try and allocate from the
same rgrp again, so that a subsequent allocation in a
different rgrp should succeed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-20 10:48:47 +01:00
Jeff Layton
8b6427a2a8 cifs: fix pointer initialization and checks in cifs_follow_symlink (try #4)
This is the third respin of the patch posted yesterday to fix the error
handling in cifs_follow_symlink. It also includes a fix for a bogus NULL
pointer check in CIFSSMBQueryUnixSymLink that Jeff Moyer spotted.

It's possible for CIFSSMBQueryUnixSymLink to return without setting
target_path to a valid pointer. If that happens then the current value
to which we're initializing this pointer could cause an oops when it's
kfree'd.

This patch is a little more comprehensive than the last patches. It
reorganizes cifs_follow_link a bit for (hopefully) better readability.
It should also eliminate the uneeded allocation of full_path on servers
without unix extensions (assuming they can get to this point anyway, of
which I'm not convinced).

On a side note, I'm not sure I agree with the logic of enabling this
query even when unix extensions are disabled on the client. It seems
like that should disable this as well. But, changing that is outside the
scope of this fix, so I've left it alone for now.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@inraded.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-19 15:31:20 +00:00
Steven Whitehouse
ef9e8b14a5 GFS2: Don't warn when delete inode fails on ro filesystem
If the filesystem is read-only, then we expect that delete inode
will fail, so there is no need to warn about it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-19 14:25:16 +01:00
Miklos Szeredi
b2858d7d16 splice: fix kmaps in default_file_splice_write()
Unfortunately multiple kmap() within a single thread are deadlockable,
so writing out multiple buffers with writev() isn't possible.

Change the implementation so that it does a separate write() for each
buffer.  This actually simplifies the code a lot since the
splice_from_pipe() helper can be used.

This limitation is caused by HIGHMEM pages, and so only affects a
subset of architectures and configurations.  In the future it may be
worth to implement default_file_splice_write() in a more efficient way
on configs that allow it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-19 11:37:46 +02:00
Tejun Heo
4fc981ef9e bio: always copy back data for copied kernel requests
When a read bio_copy_kern() request fails, the content of the bounce
buffer is not copied back.  However, as request failure doesn't
necessarily mean complete failure, the buffer state can be useful.
This behavior is also inconsistent with the user map counterpart and
causes the subtle difference between bounced and unbounced IO causes
confusion.

This patch makes bio_copy_kern_endio() ignore @err and always copy
back data on request completion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-19 11:36:08 +02:00
Steven Whitehouse
fe64d517df GFS2: Umount recovery race fix
This patch fixes a race condition where we can receive recovery
requests part way through processing a umount. This was causing
problems since the recovery thread had already gone away.

Looking in more detail at the recovery code, it was really trying
to implement a slight variation on a work queue, and that happens to
align nicely with the recently introduced slow-work subsystem. As a
result I've updated the code to use slow-work, rather than its own home
grown variety of work queue.

When using the wait_on_bit() function, I noticed that the wait function
that was supplied as an argument was appearing in the WCHAN field, so
I've updated the function names in order to produce more meaningful
output.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-19 10:01:18 +01:00
Hunter Adrian
8b3884a841 UBIFS: return error if link and unlink race
Consider a scenario when 'vfs_link(dirA/fileA)' and
'vfs_unlink(dirA/fileA, dirB/fileB)' race. 'vfs_link()' does not
lock 'dirA->i_mutex', so this is possible. Both of the functions
lock 'fileA->i_mutex' though. Suppose 'vfs_unlink()' wins, and takes
'fileA->i_mutex' mutex first. Suppose 'fileA->i_nlink' is 1. In this
case 'ubifs_unlink()' will drop the last reference, and put 'inodeA'
to the list of orphans. After this, 'vfs_link()' will link
'dirB/fileB' to 'inodeA'. Thir is a problem because, for example,
the subsequent 'vfs_unlink(dirB/fileB)' will add the same inode
to the list of orphans.

This problem was reported by J. R. Okajima <hooanon05@yahoo.co.jp>

[Artem: add more comments, amended commit message]

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-19 11:01:31 +03:00
Frank Filz
7ee2cb7f32 nfs: Fix NFS v4 client handling of MAY_EXEC in nfs_permission.
The problem is that permission checking is skipped if atomic open is
possible, but when exec opens a file, it just opens it O_READONLY which
means EXEC permission will not be checked at that time.

This problem is observed by the following sequence (executed as root):

  mount -t nfs4 server:/ /mnt4
  echo "ls" >/mnt4/foo
  chmod 744 /mnt4/foo
  su guest -c "mnt4/foo"

Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Tested-by: Eugene Teo <eugeneteo@kernel.sg>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-18 20:11:12 -07:00
Ingo Molnar
dc3f81b129 Merge commit 'v2.6.30-rc6' into perfcounters/core
Merge reason: this branch was on an -rc4 base, merge it up to -rc6
              to get the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-18 07:37:49 +02:00
Manish Katiyar
0f7ee7c172 ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:51 -04:00
Manish Katiyar
de5ce03730 ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:47 -04:00
Manish Katiyar
f68301656b ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:44 -04:00
Theodore Ts'o
0568c51893 ext4: down i_data_sem only for read when walking tree for fiemap
Not sure why I put this in as down_write originally; all we are
doing is walking the tree, nothing will change under us and
concurrent reads should be no problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:31:23 -04:00
Theodore Ts'o
6fd058f779 ext4: Add a comprehensive block validity check to ext4_get_blocks()
To catch filesystem bugs or corruption which could lead to the
filesystem getting severly damaged, this patch adds a facility for
tracking all of the filesystem metadata blocks by contiguous regions
in a red-black tree.  This allows quick searching of the tree to
locate extents which might overlap with filesystem metadata blocks.

This facility is also used by the multi-block allocator to assure that
it is not allocating blocks out of the system zone, as well as by the
routines used when reading indirect blocks and extents information
from disk to make sure their contents are valid.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 15:38:01 -04:00
Jeff Mahoney
b83674c0da reiserfs: fixup perms when xattrs are disabled
This adds CONFIG_REISERFS_FS_XATTR protection from reiserfs_permission.

This is needed to avoid warnings during file deletions and chowns with
xattrs disabled.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
Jeff Mahoney
ceb5edc457 reiserfs: deal with NULL xattr root w/ xattrs disabled
This avoids an Oops in open_xa_root that can occur when deleting a file
with xattrs disabled.  It assumes that the xattr root will be there, and
that is not guaranteed.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
Jeff Mahoney
12abb35a03 reiserfs: clean up ifdefs
With xattr cleanup even with xattrs disabled, much of the initial setup
is still performed.  Some #ifdefs are just not needed since the options
they protect wouldn't be available anyway.

This cleans those up.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
David Teigland
748285ccf7 dlm: use more NOFS allocation
Change some GFP_KERNEL allocations to use either GFP_NOFS or
ls_allocation (when available) which the fs sets to GFP_NOFS.
The point is to prevent allocations from going back into the
cluster fs in places where that might lead to deadlock.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-15 11:24:59 -05:00
Linus Torvalds
5d41343ac8 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix race in ext4_inode_info.i_cached_extent
  ext4: Clear the unwritten buffer_head flag after the extent is initialized
  ext4: Use a fake block number for delayed new buffer_head
  ext4: Fix sub-block zeroing for writes into preallocated extents
2009-05-15 08:07:25 -07:00
Sukadev Bhattiprolu
1f71ebedb3 devpts: correctly set default options
devpts_get_sb() calls memset(0) to clear mount options and calls
parse_mount_options() if user specified any mount options.

The memset(0) is bogus since the 'mode' and 'ptmxmode' options are
non-zero by default.  parse_mount_options() restores options to default
anyway and can properly deal with NULL mount options.

So in devpts_get_sb() remove memset(0) and call parse_mount_options() even
for NULL mount options.

Bug reported by Eric Paris: http://lkml.org/lkml/2009/5/7/448.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Reported-by: Eric Paris <eparis@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Reviewed-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-15 08:03:23 -07:00
Christine Caulfield
391fbdc5d5 dlm: connect to nodes earlier
Make network connections to other nodes earlier, in the context of
dlm_recoverd.  This avoids connecting to nodes from dlm_send where we
try to avoid allocations which could possibly deadlock if memory reclaim
goes into the cluster fs which may try to do a dlm operation.

Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-15 09:34:12 -05:00
Thomas Gleixner
2d02494f5a sched, timers: cleanup avenrun users
avenrun is an rough estimate so we don't have to worry about
consistency of the three avenrun values. Remove the xtime lock
dependency and provide a function to scale the values. Cleanup the
users.

[ Impact: cleanup ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2009-05-15 15:32:45 +02:00
Theodore Ts'o
2ec0ae3ace ext4: Fix race in ext4_inode_info.i_cached_extent
If two CPU's simultaneously call ext4_ext_get_blocks() at the same
time, there is nothing protecting the i_cached_extent structure from
being used and updated at the same time.  This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.

This bug has been in the ext4 code since almost the very beginning of
ext4's development.  Fortunately once the data is stored in the page
cache cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult.  Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
2009-05-15 09:07:28 -04:00
Linus Torvalds
bd67ce0f66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix error handling in parse_DFS_referrals
2009-05-14 19:20:04 -07:00
Linus Torvalds
5732c46849 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
  Btrfs: make show_options result match actual option names
  Btrfs: remove outdated comment in btrfs_ioctl_resize()
  Btrfs: remove some WARN_ONs in the IO failure path
  Btrfs: Don't loop forever on metadata IO failures
  Btrfs: init inode ordered_data_close flag properly
2009-05-14 19:18:44 -07:00
Aneesh Kumar K.V
2a8964d63d ext4: Clear the unwritten buffer_head flag after the extent is initialized
The BH_Unwritten flag indicates that the buffer is allocated on disk
but has not been written; that is, the disk was part of a persistent
preallocation area.  That flag should only be set when a get_blocks()
function is looking up a inode's logical to physical block mapping.

When ext4_get_blocks_wrap() is called with create=1, the uninitialized
extent is converted into an initialized one, so the BH_Unwritten flag
is no longer appropriate.  Hence, we need to make sure the
BH_Unwritten is not left set, since the combination of BH_Mapped and
BH_Unwritten is not allowed; among other things, it will result ext4's
get_block() to be called over and over again during the write_begin
phase of write(2).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 17:05:39 -04:00
Sankar P
9f55684c2d Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
Signed-off-by: Sankar P <sankar.curiosity@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Sage Weil
6b65c5c61b Btrfs: make show_options result match actual option names
The notreelog and flushoncommit mount options were being printed slightly
differently.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Li Hong
5d847a8ed9 Btrfs: remove outdated comment in btrfs_ioctl_resize()
In Li Zefan's commit dae7b665cf,
a combination call of kmalloc() and copy_from_user() is replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason
cc7b0c9b70 Btrfs: remove some WARN_ONs in the IO failure path
These debugging WARN_ONs make too much console noise during regular
IO failures.  An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason
76a05b35a3 Btrfs: Don't loop forever on metadata IO failures
When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block.  If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.

The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads.  The end result was looping forever on buffers that were never
going to become up to date.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:32 -04:00
Chris Mason
2757495c90 Btrfs: init inode ordered_data_close flag properly
This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits.  It was not being properly set to zero when the inode was
being setup.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:31 -04:00
Theodore Ts'o
2ac3b6e00a ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
The ext4_get_blocks() function was depending on the value of
bh_result->b_state as an input parameter to decide whether or not
update the delalloc accounting statistics by calling
ext4_da_update_reserve_space().  We now use a separate flag,
EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE, to requests this update, so that
all callers of ext4_get_blocks() can clear map_bh.b_state before
calling ext4_get_blocks() without worrying about any consistency
issues.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 13:57:08 -04:00
Jeff Layton
d8e2f53ac9 cifs: fix error handling in parse_DFS_referrals
cifs_strndup_from_ucs returns NULL on error, not an ERR_PTR

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-14 13:55:32 +00:00
Theodore Ts'o
2fa3cdfb31 ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
The static function ext4_da_get_block_write() was only used by
mpage_da_map_blocks().  So to simplify the code, merge that function
into mpage_da_map_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 09:29:45 -04:00
Andrew Morton
77f6bf57ba splice: fix error return code
fs/splice.c: In function 'default_file_splice_read':
fs/splice.c:566: warning: 'error' may be used uninitialized in this function

which is sort-of true.  The code will in fact return -ENOMEM instead of the
kernel_readv() return value.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-14 09:49:44 +02:00
Linus Torvalds
210af919c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: cody tidying, remove commented out line in Makefile
  Squashfs: check page size is not larger than the filesystem block size
  Squashfs: fix breakage when page size > metadata block size
2009-05-13 16:33:25 -07:00
Linus Torvalds
a6aeeebf51 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: destroy bdi on error
2009-05-13 16:32:57 -07:00
Linus Torvalds
014049a1c7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: check size of array structured data exchanged via ioctls
  nilfs2: fix lock order reversal in nilfs_clean_segments ioctl
  nilfs2: fix possible circular locking for get information ioctls
  nilfs2: ensure to clear dirty state when deleting metadata file block
  nilfs2: fix circular locking dependency of writer mutex
  nilfs2: fix possible recovery failure due to block creation without writer
2009-05-13 16:32:16 -07:00
Randy Dunlap
dd4dc82d4c lockd: fix FILE_LOCKING=n build error
lockd/svclock.c is missing a header file <linux/fs.h>.

<linux/fs.h> is missing a definition of locks_release_private()
for the config case of FILE_LOCKING=n, causing a build error:

fs/lockd/svclock.c:330: error: implicit declaration of function 'locks_release_private'

lockd without FILE_LOCKING doesn't make sense, so make LOCKD and LOCKD_V4
depend on FILE_LOCKING, and make NFS depend on FILE_LOCKING.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-13 15:59:10 -04:00
Mel Gorman
f2deae9d4e Remove implementation of readpage from the hugetlbfs_aops
The core VM assumes the page size used by the address_space in
inode->i_mapping is PAGE_SIZE but hugetlbfs breaks this assumption by
inserting pages into the page cache at offsets the core VM considers
unexpected.

This would not be a problem except that hugetlbfs also provide a
->readpage implementation.  As it exists, the core VM can assume the
base page size is being used, allocate pages on behalf of the
filesystem, insert them into the page cache and call ->readpage to
populate them.  These pages are the wrong size and at the wrong offset
for hugetlbfs causing confusion.

This patch deletes the ->readpage implementation for hugetlbfs on the
grounds the core VM should not be allocating and populating pages on
behalf of hugetlbfs.  There should be no existing users of the
->readpage implementation so it should not cause a regression.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-13 08:04:45 -07:00
Steven Whitehouse
9582d41135 GFS2: Remove a couple of unused sysfs entries
These two tunables are pointless and would never need to be
changed anyway. There is also a race between them and umount
as the deamons which they refer to might have gone away. The
easiest way to fix the race is to remove the interface.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 15:06:25 +01:00
Steven Whitehouse
48c2b61361 GFS2: Add commit= mount option
It has always been possible to adjust the gfs2 log commit
interval, but only from the sysfs interface. This adds a
mount option, commit=<nn>, which will be familar to ext3
users.

The sysfs interface continues to be available as well, although
this might be removed in the future.

Also this patch cleans up some duplicated structures in the GFS2
sysfs code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 14:49:48 +01:00
Steven Whitehouse
a1c0643ff9 GFS2: Move journal live test at transaction start
There seems little point grabbing the transaction glock
only to have to release it again if the journal isn't
live. This moves the test earlier to avoid grabbing the lock
when we don't need it in the first place.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 10:56:52 +01:00
Jens Axboe
4f23122858 splice: fix repeated kmap()'s in default_file_splice_read()
We cannot reliably map more than one page at the time, or we risk
deadlocking. Just allocate the pages from low mem instead.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-13 08:35:35 +02:00
Phillip Lougher
e5d287539d Squashfs: cody tidying, remove commented out line in Makefile
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 03:25:20 +01:00
Phillip Lougher
fffb47b80e Squashfs: check page size is not larger than the filesystem block size
Normally the block size (by default 128K) will be larger than the
page size, unless a non-standard block size has been specified in
Mksquashfs, and the page size is larger than 4K.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 02:59:26 +01:00
Doug Chapman
a37b06d589 Squashfs: fix breakage when page size > metadata block size
Squashfs is broken on any system where the page size is larger than
the metadata size (8192).  This is easily fixed by ensuring cache->pages
is always > 0.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Doug Chapman <doug.chapman@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 02:56:39 +01:00
Linus Torvalds
2ea3f86848 Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
  nfsd: silence lockdep warning
  lockd: fix list corruption on lockd restart
  nfsd4: check for negative dentry before use in nfsv4 readdir
  nfsd41: slots are freed with session
  svcrdma: clean up error paths.
  svcrdma: Fix dma map direction for rdma read targets
2009-05-12 17:11:56 -07:00
Davide Libenzi
bfe3891a5f epoll: fix size check in epoll_create()
Fix a size check WRT the manual pages.  This was inadvertently broken by
commit 9fe5ad9c8c ("flag parameters
add-on: remove epoll_create size param").

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <Hiroyuki.Mach@gmail.com>
Cc: rohit verma <rohit.170309@gmail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-12 14:11:35 -07:00
Aneesh Kumar K.V
33b9817e2a ext4: Use a fake block number for delayed new buffer_head
Use a very large unsigned number (~0xffff) as as the fake block number
for the delayed new buffer. The VFS should never try to write out this
number, but if it does, this will make it obvious.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 14:40:37 -04:00
Aneesh Kumar K.V
9c1ee184a3 ext4: Fix sub-block zeroing for writes into preallocated extents
We need to mark the buffer_head mapping preallocated space as new
during write_begin. Otherwise we don't zero out the page cache content
properly for a partial write. This will cause file corruption with
preallocation.

Now that we mark the buffer_head new we also need to have a valid
buffer_head blocknr so that unmap_underlying_metadata() unmaps the
correct block.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13 18:36:58 -04:00
Theodore Ts'o
a2dc52b5d1 ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
Enforce that noalloc_get_block_write() is only called to map one block
at a time, and that it always is successful in finding a mapping for
given an inode's logical block block number if it is called with
create == 1.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 13:51:29 -04:00
Theodore Ts'o
b920c75502 ext4: Add documentation to the ext4_*get_block* functions
This adds more documentation to various internal functions in
fs/ext4/inode.c, most notably ext4_ind_get_blocks(),
ext4_da_get_block_write(), ext4_da_get_block_prep(),
ext4_normal_get_block_write().

In addition, the static function ext4_normal_get_block_write() has
been renamed noalloc_get_block_write(), since it is used in many
places far beyond ext4_normal_writepage().

Plenty of warnings have been added to the noalloc_get_block_write()
function, since the way it is used is amazingly fragile.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:54:29 -04:00
Theodore Ts'o
c217705733 ext4: Define a new set of flags for ext4_get_blocks()
The functions ext4_get_blocks(), ext4_ext_get_blocks(), and
ext4_ind_get_blocks() used an ad-hoc set of integer variables used as
boolean flags passed in as arguments.  Use a single flags parameter
and a setandard set of bitfield flags instead.  This saves space on
the call stack, and it also makes the code a bit more understandable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:58:52 -04:00
Theodore Ts'o
12b7ac1768 ext4: Rename ext4_get_blocks_wrap() to be ext4_get_blocks()
Another function rename for clarity's sake.  The _wrap prefix simply
confuses people, and didn't add much people trying to follow the code
paths.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:57:44 -04:00
Abhijith Das
7537d81aa7 GFS2: Fix timestamps on write
This patch copies the timestamps from the vfs inode into gfs2 and syncs
it to the disk inode during writes.

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-12 16:14:05 +01:00
Theodore Ts'o
e4d996ca80 ext4: Rename ext4_get_blocks_handle() to be ext4_ind_get_blocks()
The static function ext4_get_blocks_handle() is badly named.  Of
*course* it takes a handle.  Since its counterpart for extent-based
file is ext4_ext_get_blocks(), rename it to be ext4_ind_get_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 00:25:28 -04:00
Theodore Ts'o
f888e652d7 ext4: Simplify function signature for ext4_da_get_block_write()
The function ext4_da_get_block_write() is called in exactly one write,
and the last argument, create, is always 1.  Remove it to simplify the
code slightly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 00:21:29 -04:00
Vincent Minet
bc8e67409c ext4: Fix spinlock assertions on UP systems
On UP systems without DEBUG_SPINLOCK, ext4_is_group_locked always fails
which triggers a BUG_ON() call.
This patch fixes it by using assert_spin_locked instead.

Signed-off-by: Vincent Minet <vincent@vincent-minet.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-15 08:33:18 -04:00
J. Bruce Fields
8daed1e549 nfsd: silence lockdep warning
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-11 17:23:14 -04:00
Jeff Mahoney
2b79bc4f7e dup2: Fix return value with oldfd == newfd and invalid fd
The return value of dup2 when oldfd == newfd and the fd isn't valid is
not getting properly sign extended.  We end up with 4294967287 instead
of -EBADF.

I've reproduced this on SLE11 (2.6.27.21), openSUSE Factory
(2.6.29-rc5), and Ubuntu 9.04 (2.6.28).

This patch uses a signed int for the error value so it is properly
extended.

Commit 6c5d0512a0 introduced this
regression.

Reported-by: Jiri Dluhos <jdluhos@novell.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-11 12:18:06 -07:00
Ryusuke Konishi
83aca8f480 nilfs2: check size of array structured data exchanged via ioctls
Although some ioctls of nilfs2 exchange data in the form of indirectly
referenced array, some of them lack size check on the array elements.

This inserts the missing checks and rejects requests if data of ioctl
does not have a valid format.

We usually don't have to check size of structures that we associated
with ioctl commands because the size is tested implicitly for
identifying ioctl command; the checks this patch adds are for the
cases where the implicit check is not applied.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-12 01:48:54 +09:00
Miklos Szeredi
0b0a47f5c4 splice: implement default splice_write method
If f_op->splice_write() is not implemented, fall back to a plain write.
Use vfs_writev() to write from the pipe buffers.

This will allow splice on all filesystems and file types.  This
includes "direct_io" files in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:10 +02:00
Miklos Szeredi
6818173bd6 splice: implement default splice_read method
If f_op->splice_read() is not implemented, fall back to a plain read.
Use vfs_readv() to read into previously allocated pages.

This will allow splice and functions using splice, such as the loop
device, to work on all filesystems.  This includes "direct_io" files
in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:10 +02:00
Miklos Szeredi
7c77f0b3f9 splice: implement pipe to pipe splicing
Allow splice(2) to work when both the input and the output is a pipe.

Based on the impementation of the tee(2) syscall, but instead of
duplicating the buffer references move the buffers from the input pipe
to the output pipe.

Moving the whole buffer only succeeds if the full length of the buffer
is spliced.  Otherwise duplicate the buffer, just like tee(2), set the
length of the output buffer and advance the offset on the input
buffer.

Since splice is operating on two pipes, special care needs to be taken
with locking to prevent AN ABBA deadlock.  Again this is done
similarly to the tee(2) syscall, first preparing the input and output
pipes so there's data to consume and space for that data, and then
doing the move operation while holding both locks.

If other processes are doing I/O on the same pipes parallel to the
splice, then by the time both inodes are locked there might be no
buffers left to move, or no space to move them to.  In this case retry
the whole operation, including the preparation phase.  This could lead
to starvation, but I'm not sure if that's serious enough to worry
about.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:09 +02:00
Steven Whitehouse
48bf2b1711 GFS2: Something nonlinear this way comes!
For some reason GFS2 has been missing support for non-linear
mappings. This patch fixes that, and also avoids taking any
locks for mmap in the O_NOATIME case. In fact we don't actually need
to take the lock here at all - just doing file_accessed() would be
enough, but we have to take the lock eventually and this helps
it hit disk (and thus be seen by other nodes) faster.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:44 +01:00
Steven Whitehouse
4a0f9a321a GFS2: Optimise writepage for metadata
This adds a GFS2 specific writepage for metadata, rather than
continuing to use the VFS function. As a result we now tag all
our metadata I/O with the correct flag so that blktraces will
now be less confusing.

Also, the generic function was checking for a number of corner
cases which cannot happen on the metadata address spaces so that
this should be faster too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:43 +01:00
Steven Whitehouse
c969f58ca4 GFS2: Update the rw flags
After Jens recent updates:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a1f242524c3c1f5d40f1c9c343427e34d1aadd6e
et al. this is a patch to bring gfs2 uptodate with the core
code. Also I've managed to squash another call to ll_rw_block()
along the way.

There is still one part of the GFS2 I/O paths which are not correctly
annotated and that is due to the sharing of the writeback code between
the data and metadata address spaces. I would like to change that too,
but this patch is still worth doing on its own, I think.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:41 +01:00
Tejun Heo
c3a4d78c58 block: add rq->resid_len
rq->data_len served two purposes - the length of data buffer on issue
and the residual count on completion.  This duality creates some
headaches.

First of all, block layer and low level drivers can't really determine
what rq->data_len contains while a request is executing.  It could be
the total request length or it coulde be anything else one of the
lower layers is using to keep track of residual count.  This
complicates things because blk_rq_bytes() and thus
[__]blk_end_request_all() relies on rq->data_len for PC commands.
Drivers which want to report residual count should first cache the
total request length, update rq->data_len and then complete the
request with the cached data length.

Secondly, it makes requests default to reporting full residual count,
ie. reporting that no data transfer occurred.  The residual count is
an exception not the norm; however, the driver should clear
rq->data_len to zero to signify the normal cases while leaving it
alone means no data transfer occurred at all.  This reverse default
behavior complicates code unnecessarily and renders block PC on some
drivers (ide-tape/floppy) unuseable.

This patch adds rq->resid_len which is used only for residual count.

While at it, remove now unnecessasry blk_rq_bytes() caching in
ide_pc_intr() as rq->data_len is not changed anymore.

Boaz	: spotted missing conversion in osd
Sergei	: spotted too early conversion to blk_rq_bytes() in ide-tape

[ Impact: cleanup residual count handling, report 0 resid by default ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Darrick J. Wong <djwong@us.ibm.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:53 +02:00
Ryusuke Konishi
4f6b828837 nilfs2: fix lock order reversal in nilfs_clean_segments ioctl
This is a companion patch to ("nilfs2: fix possible circular locking
for get information ioctls").

This corrects lock order reversal between mm->mmap_sem and
nilfs->ns_segctor_sem in nilfs_clean_segments() which was detected by
lockdep check:

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc3-nilfs-00003-g360bdc1 #7
 -------------------------------------------------------
 mmap/5294 is trying to acquire lock:
  (&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]

 but task is already holding lock:
  (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&mm->mmap_sem){++++++}:
        [<c01470a5>] __lock_acquire+0x1066/0x13b0
        [<c01474a9>] lock_acquire+0xba/0xdd
        [<c01836bc>] might_fault+0x68/0x88
        [<c023c61d>] copy_from_user+0x2a/0x111
        [<d0d120d0>] nilfs_ioctl_prepare_clean_segments+0x1d/0xf1 [nilfs2]
        [<d0d0e2aa>] nilfs_clean_segments+0x6d/0x1b9 [nilfs2]
        [<d0d11f68>] nilfs_ioctl+0x2ad/0x318 [nilfs2]
        [<c01a3be7>] vfs_ioctl+0x22/0x69
        [<c01a408e>] do_vfs_ioctl+0x460/0x499
        [<c01a4107>] sys_ioctl+0x40/0x5a
        [<c01031a4>] sysenter_do_call+0x12/0x38
        [<ffffffff>] 0xffffffff

 -> #0 (&nilfs->ns_segctor_sem){++++.+}:
        [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
        [<c01474a9>] lock_acquire+0xba/0xdd
        [<c0433f1d>] down_read+0x2a/0x3e
        [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
        [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
        [<c0183b0b>] __do_fault+0x165/0x376
        [<c01855cd>] handle_mm_fault+0x287/0x5d1
        [<c043712d>] do_page_fault+0x2fb/0x30a
        [<c0435462>] error_code+0x72/0x78
        [<ffffffff>] 0xffffffff

where nilfs_clean_segments() holds:

  nilfs->ns_segctor_sem -> copy_from_user()
                             --> page fault -> mm->mmap_sem

And, page fault path may hold:

  page fault -> mm->mmap_sem
         --> nilfs_page_mkwrite() -> nilfs->ns_segctor_sem

Even though nilfs_clean_segments() does not perform write access on
given user pages, it may cause deadlock because nilfs->ns_segctor_sem
is shared per device and mm->mmap_sem can be shared with other tasks.

To avoid this problem, this patch moves all calls of copy_from_user()
outside the nilfs->ns_segctor_sem lock in the ioctl.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-11 14:54:41 +09:00
Ryusuke Konishi
47eb6b9c8f nilfs2: fix possible circular locking for get information ioctls
This is one of two patches which are to correct possible circular
locking between mm->mmap_sem and nilfs->ns_segctor_sem.

The problem was detected by lockdep check as follows:

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc3-nilfs-00002-g3552613 #6
 -------------------------------------------------------
 mmap/5418 is trying to acquire lock:
 (&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]

 but task is already holding lock:
 (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&mm->mmap_sem){++++++}:
 [<c01470a5>] __lock_acquire+0x1066/0x13b0
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<c01836bc>] might_fault+0x68/0x88
 [<c023c730>] copy_to_user+0x2c/0xfc
 [<d0d11b4f>] nilfs_ioctl_wrap_copy+0x103/0x160 [nilfs2]
 [<d0d11fa9>] nilfs_ioctl+0x30a/0x3b0 [nilfs2]
 [<c01a3be7>] vfs_ioctl+0x22/0x69
 [<c01a408e>] do_vfs_ioctl+0x460/0x499
 [<c01a4107>] sys_ioctl+0x40/0x5a
 [<c01031a4>] sysenter_do_call+0x12/0x38
 [<ffffffff>] 0xffffffff

 -> #0 (&nilfs->ns_segctor_sem){++++.+}:
 [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<c0433f1d>] down_read+0x2a/0x3e
 [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
 [<c0183b0b>] __do_fault+0x165/0x376
 [<c01855cd>] handle_mm_fault+0x287/0x5d1
 [<c043712d>] do_page_fault+0x2fb/0x30a
 [<c0435462>] error_code+0x72/0x78
 [<ffffffff>] 0xffffffff

 other info that might help us debug this:

 1 lock held by mmap/5418:
 #0:  (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 stack backtrace:
 Pid: 5418, comm: mmap Not tainted 2.6.30-rc3-nilfs-00002-g3552613 #6
 Call Trace:
 [<c0432145>] ? printk+0xf/0x12
 [<c0145c48>] print_circular_bug_tail+0xaa/0xb5
 [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
 [<d0d10149>] ? nilfs_sufile_get_stat+0x1e/0x105 [nilfs2]
 [<c013b59a>] ? up_read+0x16/0x2c
 [<d0d10225>] ? nilfs_sufile_get_stat+0xfa/0x105 [nilfs2]
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<c0433f1d>] down_read+0x2a/0x3e
 [<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
 [<c0183b0b>] __do_fault+0x165/0x376
 [<c01855cd>] handle_mm_fault+0x287/0x5d1
 [<c043700a>] ? do_page_fault+0x1d8/0x30a
 [<c013b54f>] ? down_read_trylock+0x39/0x43
 [<c043712d>] do_page_fault+0x2fb/0x30a
 [<c0436e32>] ? do_page_fault+0x0/0x30a
 [<c0435462>] error_code+0x72/0x78
 [<c0436e32>] ? do_page_fault+0x0/0x30a

This makes the lock granularity of nilfs->ns_segctor_sem finer than
that of the mmap semaphore for ioctl commands except
nilfs_clean_segments().

The successive patch ("nilfs2: fix lock order reversal in
nilfs_clean_segments ioctl") is required to fully resolve the problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-11 12:57:46 +09:00
David Howells
107db7c7dd CRED: Guard the setprocattr security hook against ptrace
Guard the setprocattr security hook against ptrace by taking the target task's
cred_guard_mutex around it.  The problem is that setprocattr() may otherwise
note the lack of a debugger, and then perform an action on that basis whilst
letting a debugger attach between the two points.  Holding cred_guard_mutex
across the test and the action prevents ptrace_attach() from doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-11 08:15:39 +10:00
David Howells
5e751e992f CRED: Rename cred_exec_mutex to reflect that it's a guard against ptrace
Rename cred_exec_mutex to reflect that it's a guard against foreign
intervention on a process's credential state, such as is made by ptrace().  The
attachment of a debugger to a process affects execve()'s calculation of the new
credential state - _and_ also setprocattr()'s calculation of that state.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-11 08:15:36 +10:00
Linus Torvalds
93b49d45eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (22 commits)
  Fix the race between capifs remount and node creation
  Fix races around the access to ->s_options
  switch ufs directories to ufs_sync_file()
  Switch open_exec() and sys_uselib() to do_open_filp()
  Make open_exec() and sys_uselib() use may_open(), instead of duplicating its parts
  Reduce path_lookup() abuses
  Make checkpatch.pl shut up on fs/inode.c
  NULL noise in fs/super.c:kill_bdev_super()
  romfs: cleanup romfs_fs.h
  ROMFS: romfs_dev_read() error ignored
  fs: dcache fix LRU ordering
  ocfs2: Use nd_set_link().
  Fix deadlock in ipathfs ->get_sb()
  Fix a leak in failure exit in 9p ->get_sb()
  Convert obvious places to deactivate_locked_super()
  New helper: deactivate_locked_super()
  reiserfs: remove privroot hiding in lookup
  reiserfs: dont associate security.* with xattr files
  reiserfs: fixup xattr_root caching
  Always lookup priv_root on reiserfs mount and keep it
  ...
2009-05-10 10:49:08 -07:00
Ryusuke Konishi
843382370e nilfs2: ensure to clear dirty state when deleting metadata file block
This would fix the following failure during GC:

 nilfs_cpfile_delete_checkpoints: cannot delete block
 NILFS: GC failed during preparation: cannot delete checkpoints: err=-2

The problem was caused by a break in state consistency between page
cache and btree; the above block was removed from the btree but the
page buffering the block was remaining in the page cache in dirty
state.

This resolves the inconsistency by ensuring to clear dirty state of
the page buffering the deleted block.

Reported-by: David Arendt <admin@prnet.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-10 17:04:42 +09:00
Al Viro
2a32cebd6c Fix races around the access to ->s_options
Put generic_show_options read access to s_options under rcu_read_lock,
split save_mount_options() into "we are setting it the first time"
(uses in foo_fill_super()) and "we are relacing and freeing the old one",
synchronize_rcu() before kfree() in the latter.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:51:34 -04:00
Al Viro
f9dbd05bc9 switch ufs directories to ufs_sync_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:42 -04:00
Al Viro
6e8341a11e Switch open_exec() and sys_uselib() to do_open_filp()
... and make path_lookup_open() static

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:42 -04:00
Al Viro
a44ddbb6d8 Make open_exec() and sys_uselib() use may_open(), instead of duplicating its parts
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:42 -04:00
Al Viro
e24977d45f Reduce path_lookup() abuses
... use kern_path() where possible

[folded a fix from rdd]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:42 -04:00
Manish Katiyar
6b3304b531 Make checkpatch.pl shut up on fs/inode.c
Code Quality According To Mingo(tm) has been vastly improved,
no code has been damaged^Wchanged^Wdamaged.

[commit message rewritten -- AV]

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:41 -04:00
H Hartley Sweeten
ddbaaf3024 NULL noise in fs/super.c:kill_bdev_super()
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Subrata Modak <subrata@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:41 -04:00
Roel Kluin
774e33e70b ROMFS: romfs_dev_read() error ignored
romfs_dev_read() may return -EIO, but ret is unsigned, so the errorpath
isn't taken.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:41 -04:00
npiggin@suse.de
c490d79bb7 fs: dcache fix LRU ordering
Fix ordering of LRU when moving referenced dentries to the head of the list
(they should go to the head of the list in the same order as they were found
from the tail, rather than reverse order).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Joel Becker
a731d12d6d ocfs2: Use nd_set_link().
ocfs2 was hand-calling vfs_follow_link(), but there's no point to that.
Let's use page_follow_link_light() and nd_set_link().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Al Viro
c96f585737 Fix a leak in failure exit in 9p ->get_sb()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Al Viro
6f5bbff9a1 Convert obvious places to deactivate_locked_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Al Viro
74dbbdd7fd New helper: deactivate_locked_super()
Does equivalent of up_write(&s->s_umount); deactivate_super(s);
However, it does not does not unlock it until it's all over.
As the result, it's safe to use to dispose of new superblock on ->get_sb()
failure exits - nobody will see the sucker until it's all over.
Equivalent using up_write/deactivate_super is safe for that purpose
if superblock is either	safe to use or has NULL ->s_root when we unlock.
Normally filesystems take the required precautions, but
	a) we do have bugs in that area in some of them.
	b) up_write/deactivate_super sequence is extremely common,
so the helper makes sense anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:39 -04:00
Jeff Mahoney
677c9b2e39 reiserfs: remove privroot hiding in lookup
With Al Viro's patch to move privroot lookup to fs mount, there's no need
 to have special code to hide the privroot in reiserfs_lookup.

 I've also cleaned up the privroot hiding in reiserfs_readdir_dentry and
 removed the last user of reiserfs_xattrs().

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:39 -04:00
Jeff Mahoney
b82bb72ba7 reiserfs: dont associate security.* with xattr files
The security.* xattrs are ignored for xattr files, so don't create them.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:39 -04:00
Jeff Mahoney
ab17c4f021 reiserfs: fixup xattr_root caching
The xattr_root caching was broken from my previous patch set. It wouldn't
 cause corruption, but could cause decreased performance due to allocating
 a larger chunk of the journal (~ 27 blocks) than it would actually use.

 This patch loads the xattr root dentry at xattr initialization and creates
 it on-demand. Since we're using the cached dentry, there's no point
 in keeping lookup_or_create_dir around, so that's removed.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:39 -04:00
Al Viro
edcc37a047 Always lookup priv_root on reiserfs mount and keep it
... even if it's a negative dentry.  That way we can set ->d_op on
root before anyone could race with us.  Simplify d_compare(), while
we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:38 -04:00
Jeff Mahoney
5a6059c358 reiserfs: Expand i_mutex to enclose lookup_one_len
2.6.30-rc3 introduced some sanity checks in the VFS code to avoid NFS
 bugs by ensuring that lookup_one_len is always called under i_mutex.

 This patch expands the i_mutex locking to enclose lookup_one_len. This was
 always required, but not not enforced in the reiserfs code since it
 does locking around the xattr interactions with the xattr_sem.

 This is obvious enough, and it survived an overnight 50 thread ACL test.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:38 -04:00
Alessio Igor Bogani
67e55205ec vfs: umount_begin BKL pushdown
Push BKL down into ->umount_begin()

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:38 -04:00
Steven Whitehouse
0c7a531a20 GFS2: Fix glock ref counting bug
Depending on the ordering of events as we go around the
glock shrinker loop, it is possible to drop the ref count
of a glock incorrectly. It doesn't happen very often. This
patch corrects the got_ref variable, fixing the problem.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-09 15:15:17 +01:00
Ryusuke Konishi
201913ed74 nilfs2: fix circular locking dependency of writer mutex
This fixes the following circular locking dependency problem:

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc3 #5
 -------------------------------------------------------
 segctord/3895 is trying to acquire lock:
  (&nilfs->ns_writer_mutex){+.+...}, at: [<d0d02172>]
   nilfs_mdt_get_block+0x89/0x20f [nilfs2]

 but task is already holding lock:
  (&bmap->b_sem){++++..}, at: [<d0d02d99>]
   nilfs_bmap_propagate+0x14/0x2e [nilfs2]

 which lock already depends on the new lock.

The bugfix is done by replacing call sites of nilfs_get_writer() which
are never called from read-only context with direct dereferencing of
pointer to a writable FS-instance.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-09 13:36:57 +09:00
Ryusuke Konishi
85c2a74fab nilfs2: fix possible recovery failure due to block creation without writer
Some function calls in nilfs_prepare_segment_for_recovery() may fail
because they can create blocks on meta data files without configuring
a writable FS-instance.  Concretely, nilfs_mdt_create_block() routine
of meta data files will fail in that case.

This fixes the problem by temporarily attaching a writable FS-instace
during the function is called.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-09 13:36:56 +09:00
Felix Blyakher
ec91d1335f xfs: fix double unlock in xfs_swap_extents()
Regreesion from commit ef8f7fc, which rearranged the code in
xfs_swap_extents() leading to double unlock of xfs inode ilock.
That resulted in xfs_fsr deadlocking itself on platforms, which
don't handle double unlock of rw_semaphore nicely. It caused the
count go negative, which represents the write holder, without
really having one. ia64 is one of the platforms where deadlock
was easily reproduced and the fix was tested.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-05-08 00:29:44 -05:00
Linus Torvalds
d7a5926978 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (32 commits)
  [CIFS] Fix double list addition in cifs posix open code
  [CIFS] Allow raw ntlmssp code to be enabled with sec=ntlmssp
  [CIFS] Fix SMB uid in NTLMSSP authenticate request
  [CIFS] NTLMSSP reenabled after move from connect.c to sess.c
  [CIFS] Remove sparse warning
  [CIFS] remove checkpatch warning
  [CIFS] Fix final user of old string conversion code
  [CIFS] remove cifs_strfromUCS_le
  [CIFS] NTLMSSP support moving into new file, old dead code removed
  [CIFS] Fix endian conversion of vcnum field
  [CIFS] Remove trailing whitespace
  [CIFS] Remove sparse endian warnings
  [CIFS] Add remaining ntlmssp flags and standardize field names
  [CIFS] Fix build warning
  cifs: fix length handling in cifs_get_name_from_search_buf
  [CIFS] Remove unneeded QuerySymlink call and fix mapping for unmapped status
  [CIFS] rename cifs_strndup to cifs_strndup_from_ucs
  Added loop check when mounting DFS tree.
  Enable dfs submounts to handle remote referrals.
  [CIFS] Remove older session setup implementation
  ...
2009-05-07 21:13:24 -07:00
Steve French
90e4ee5d31 [CIFS] Fix double list addition in cifs posix open code
Remove adding open file entry twice to lists in the file
Do not fill file info twice in case of posix opens and creates

Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-08 03:04:30 +00:00
David Teigland
8511a2728a dlm: fix use count with multiple joins
When a lockspace was joined multiple times, the global dlm
use count was incremented when it should not have been.  This
caused the global dlm threads to not be stopped when all
lockspaces were eventually be removed.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-07 10:14:42 -05:00
Geert Uytterhoeven
08ce4c91e4 dlm: Make name input parameter of {,dlm_}new_lockspace() const
| fs/gfs2/lock_dlm.c:207: warning: passing argument 1 of 'dlm_new_lockspace' discards qualifiers from pointer target type

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-07 10:14:26 -05:00
Josef Bacik
df3935ffd6 fiemap: fix problem with setting FIEMAP_EXTENT_LAST
Fix a problem where the generic block based fiemap stuff would not
properly set FIEMAP_EXTENT_LAST on the last extent.  I've reworked things
to keep track if we go past the EOF, and mark the last extent properly.
The problem was reported by and tested by Eric Sandeen.

Tested-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-btrfs@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <Joel.Becker@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06 16:36:09 -07:00
Wu Fengguang
381a80e6df inotify: use GFP_NOFS in kernel_event() to work around a lockdep false-positive
There is what we believe to be a false positive reported by lockdep.

inotify_inode_queue_event() => take inotify_mutex => kernel_event() =>
kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim =>
dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock

The plan is to fix this via lockdep annotation, but that is proving to be
quite involved.

The patch flips the allocation over to GFP_NFS to shut the warning up, for
the 2.6.30 release.

Hopefully we will fix this for real in 2.6.31.  I'll queue a patch in -mm
to switch it back to GFP_KERNEL so we don't forget.

  =================================
  [ INFO: inconsistent lock state ]
  2.6.30-rc2-next-20090417 #203
  ---------------------------------
  inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
  kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes:
   (&inode->inotify_mutex){+.+.?.}, at: [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
  {RECLAIM_FS-ON-W} state was registered at:
    [<ffffffff81079188>] mark_held_locks+0x68/0x90
    [<ffffffff810792a5>] lockdep_trace_alloc+0xf5/0x100
    [<ffffffff810f5261>] __kmalloc_node+0x31/0x1e0
    [<ffffffff81130652>] kernel_event+0xe2/0x190
    [<ffffffff81130826>] inotify_dev_queue_event+0x126/0x230
    [<ffffffff8112f096>] inotify_inode_queue_event+0xc6/0x110
    [<ffffffff8110444d>] vfs_create+0xcd/0x140
    [<ffffffff8110825d>] do_filp_open+0x88d/0xa20
    [<ffffffff810f6b68>] do_sys_open+0x98/0x140
    [<ffffffff810f6c50>] sys_open+0x20/0x30
    [<ffffffff8100c272>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff
  irq event stamp: 690455
  hardirqs last  enabled at (690455): [<ffffffff81564fe4>] _spin_unlock_irqrestore+0x44/0x80
  hardirqs last disabled at (690454): [<ffffffff81565372>] _spin_lock_irqsave+0x32/0xa0
  softirqs last  enabled at (690178): [<ffffffff81052282>] __do_softirq+0x202/0x220
  softirqs last disabled at (690157): [<ffffffff8100d50c>] call_softirq+0x1c/0x50

  other info that might help us debug this:
  2 locks held by kswapd0/380:
   #0:  (shrinker_rwsem){++++..}, at: [<ffffffff810d0bd7>] shrink_slab+0x37/0x180
   #1:  (&type->s_umount_key#17){++++..}, at: [<ffffffff8110cfbf>] shrink_dcache_memory+0x11f/0x1e0

  stack backtrace:
  Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203
  Call Trace:
   [<ffffffff810789ef>] print_usage_bug+0x19f/0x200
   [<ffffffff81018bff>] ? save_stack_trace+0x2f/0x50
   [<ffffffff81078f0b>] mark_lock+0x4bb/0x6d0
   [<ffffffff810799e0>] ? check_usage_forwards+0x0/0xc0
   [<ffffffff8107b142>] __lock_acquire+0xc62/0x1ae0
   [<ffffffff810f478c>] ? slob_free+0x10c/0x370
   [<ffffffff8107c0a1>] lock_acquire+0xe1/0x120
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff81562d43>] mutex_lock_nested+0x63/0x420
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff81012fe9>] ? sched_clock+0x9/0x10
   [<ffffffff81077165>] ? lock_release_holdtime+0x35/0x1c0
   [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
   [<ffffffff8110c9dc>] dentry_iput+0xbc/0xe0
   [<ffffffff8110cb23>] d_kill+0x33/0x60
   [<ffffffff8110ce23>] __shrink_dcache_sb+0x2d3/0x350
   [<ffffffff8110cffa>] shrink_dcache_memory+0x15a/0x1e0
   [<ffffffff810d0cc5>] shrink_slab+0x125/0x180
   [<ffffffff810d1540>] kswapd+0x560/0x7a0
   [<ffffffff810ce160>] ? isolate_pages_global+0x0/0x2c0
   [<ffffffff81065a30>] ? autoremove_wake_function+0x0/0x40
   [<ffffffff8107953d>] ? trace_hardirqs_on+0xd/0x10
   [<ffffffff810d0fe0>] ? kswapd+0x0/0x7a0
   [<ffffffff8106555b>] kthread+0x5b/0xa0
   [<ffffffff8100d40a>] child_rip+0xa/0x20
   [<ffffffff8100cdd0>] ? restore_args+0x0/0x30
   [<ffffffff81065500>] ? kthread+0x0/0xa0
   [<ffffffff8100d400>] ? child_rip+0x0/0x20

[eparis@redhat.com: fix audit too]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06 16:36:09 -07:00
J. Bruce Fields
89996df4b5 lockd: fix list corruption on lockd restart
If lockd is signalled soon enough after restart then locks_start_grace()
will try to re-add an entry to a list and trigger a lock corruption
warning.

Thanks to Wang Chen for the problem report and diagnosis.

WARNING: at lib/list_debug.c:26 __list_add+0x27/0x5c()
...
list_add corruption. next->prev should be prev (ef8fe958), but was ef8ff128.  (next=ef8ff128).
...
Pid: 23062, comm: lockd Tainted: G        W  2.6.30-rc2 #3
Call Trace:
[<c042d5b5>] warn_slowpath+0x71/0xa0
[<c0422a96>] ? update_curr+0x11d/0x125
[<c044b12d>] ? trace_hardirqs_on_caller+0x18/0x150
[<c044b270>] ? trace_hardirqs_on+0xb/0xd
[<c051c61a>] ? _raw_spin_lock+0x53/0xfa
[<c051c89f>] __list_add+0x27/0x5c
[<ef8f6daa>] locks_start_grace+0x22/0x30 [lockd]
[<ef8f34da>] set_grace_period+0x39/0x53 [lockd]
[<c06b8921>] ? lock_kernel+0x1c/0x28
[<ef8f3558>] lockd+0x64/0x164 [lockd]
[<c044b12d>] ? trace_hardirqs_on_caller+0x18/0x150
[<c04227b0>] ? complete+0x34/0x3e
[<ef8f34f4>] ? lockd+0x0/0x164 [lockd]
[<ef8f34f4>] ? lockd+0x0/0x164 [lockd]
[<c043dd42>] kthread+0x45/0x6b
[<c043dcfd>] ? kthread+0x0/0x6b
[<c0403c23>] kernel_thread_helper+0x7/0x10

Reported-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: stable@kernel.org
2009-05-06 17:19:36 -04:00
Wang Chen
02cb2858db nfsd: nfs4_stat_init cleanup
Save some loop time.

Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-06 16:22:41 -04:00
J. Bruce Fields
b2c0cea6b1 nfsd4: check for negative dentry before use in nfsv4 readdir
After 2f9092e102 "Fix i_mutex vs.  readdir
handling in nfsd" (and 14f7dd63 "Copy XFS readdir hack into nfsd code"),
an entry may be removed between the first mutex_unlock and the second
mutex_lock. In this case, lookup_one_len() will return a negative
dentry.  Check for this case to avoid a NULL dereference.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reviewed-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Cc: stable@kernel.org
2009-05-06 16:16:36 -04:00
Adrian Hunter
6d6cb0d688 UBIFS: reset no_space flag after inode deletion
When UBIFS runs out of space it spends a lot of time trying to
find more space before returning ENOSPC.  As there is no point
repeating that unless something has changed, UBIFS has an
optimization to record that the file system is 100% full and not
try to find space.  That flag was not being reset when a pending
deletion was finally done.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Reviewed-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-06 10:37:56 +03:00
Steve French
ac68392460 [CIFS] Allow raw ntlmssp code to be enabled with sec=ntlmssp
On mount, "sec=ntlmssp" can now be specified to allow
"rawntlmssp" security to be enabled during
CIFS session establishment/authentication (ntlmssp used to
require specifying krb5 which was counterintuitive).

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-06 04:16:04 +00:00
Steve French
844823cb82 [CIFS] Fix SMB uid in NTLMSSP authenticate request
We were not setting the SMB uid in NTLMSSP authenticate
request which could lead to INVALID_PARAMETER error
on 2nd session setup.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-06 00:48:30 +00:00
Coly Li
2b53bc7bff ocfs2: update comments in masklog.h
In the mainline ocfs2 code, the interface for masklog is in files under
/sys/fs/o2cb/masklog, but the comments in fs/ocfs2/cluster/masklog.h
reference the old /proc interface.  They are out of date.

This patch modifies the comments in cluster/masklog.h, which also provides
a bash script example on how to change the log mask bits.

Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-05-05 14:48:11 -07:00
Tao Ma
a46fa684fc ocfs2: Don't printk the error when listing too many xattrs.
Currently the kernel defines XATTR_LIST_MAX as 65536
in include/linux/limits.h.  This is the largest buffer that is used for
listing xattrs.

But with ocfs2 xattr tree, we actually have no limit for the number.  If
filesystem has more names than can fit in the buffer, the kernel
logs will be pollluted with something like this when listing:

(27738,0):ocfs2_iterate_xattr_buckets:3158 ERROR: status = -34
(27738,0):ocfs2_xattr_tree_list_index_block:3264 ERROR: status = -34

So don't print "ERROR" message as this is not an ocfs2 error.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-05-05 14:43:24 -07:00
Jake Edge
f83ce3e6b0 proc: avoid information leaks to non-privileged processes
By using the same test as is used for /proc/pid/maps and /proc/pid/smaps,
only allow processes that can ptrace() a given process to see information
that might be used to bypass address space layout randomization (ASLR).
These include eip, esp, wchan, and start_stack in /proc/pid/stat as well
as the non-symbolic output from /proc/pid/wchan.

ASLR can be bypassed by sampling eip as shown by the proof-of-concept
code at http://code.google.com/p/fuzzyaslr/ As part of a presentation
(http://www.cr0.org/paper/to-jt-linux-alsr-leak.pdf) esp and wchan were
also noted as possibly usable information leaks as well.  The
start_stack address also leaks potentially useful information.

Cc: Stable Team <stable@kernel.org>
Signed-off-by: Jake Edge <jake@lwn.net>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-04 15:14:23 -07:00
Steve French
0b3cc85800 [CIFS] NTLMSSP reenabled after move from connect.c to sess.c
The NTLMSSP code was removed from fs/cifs/connect.c and merged
(75% smaller, cleaner) into fs/cifs/sess.c

As with the old code it requires that cifs be built with
CONFIG_CIFS_EXPERIMENTAL, the /proc/fs/cifs/Experimental flag
must be set to 2, and mount must turn on extended security
(e.g. with sec=krb5).

Although NTLMSSP encapsulated in SPNEGO is not enabled yet,
"raw" ntlmssp is common and useful in some cases since it
offers more complete security negotiation, and is the
default way of negotiating security for many Windows systems.
SPNEGO encapsulated NTLMSSP will be able to reuse the same
code.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-04 08:37:12 +00:00
Randy Dunlap
9064caae8f nfsd: use C99 struct initializers
Eliminate 56 sparse warnings like this one:

fs/nfsd/nfs4xdr.c:1331:15: warning: obsolete array initializer, use C99 syntax

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-03 15:09:12 -04:00
J. Bruce Fields
63e4863fab nfsd4: make recall callback an asynchronous rpc
As with the probe, this removes the need for another kthread.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-03 15:08:56 -04:00
Andy Adamson
ccecee1e5e nfsd41: slots are freed with session
The session and slots are allocated all in one piece.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-03 14:45:02 -04:00
Trond Myklebust
7fdf523067 NFS: Close page_mkwrite() races
Follow up to Nick Piggin's patches to ensure that nfs_vm_page_mkwrite
returns with the page lock held, and sets the VM_FAULT_LOCKED flag.

See http://bugzilla.kernel.org/show_bug.cgi?id=12913

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 19:42:39 -07:00
Aneesh Kumar K.V
955ce5f5be ext4: Convert ext4_lock_group to use sb_bgl_lock
We have sb_bgl_lock() and ext4_group_info.bb_state
bit spinlock to protech group information. The later is only
used within mballoc code. Consolidate them to use sb_bgl_lock().
This makes the mballoc.c code much simpler and also avoid
confusion with two locks protecting same info.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 20:35:09 -04:00
Linus Torvalds
b4348f32da Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix getbmap vs mmap deadlock
  xfs: a couple getbmap cleanups
  xfs: add more checks to superblock validation
  xfs_file_last_byte() needs to acquire ilock
2009-05-02 16:52:50 -07:00
Linus Torvalds
afc1e702e8 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/configfs
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/configfs:
  configfs: Fix Trivial Warning in fs/configfs/symlink.c
2009-05-02 16:50:46 -07:00
Linus Torvalds
7e567b44e6 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Change repository in MAINTAINERS.
  ocfs2: Fix a missing credit when deleting from indexed directories.
  ocfs2/trivial: Remove unused variable in ocfs2_rename.
  ocfs2: Add missing iput() during error handling in ocfs2_dentry_attach_lock()
  ocfs2: Fix some printk() warnings.
  ocfs2: Fix 2 warning during ocfs2 make.
  ocfs2: Reserve 1 more cluster in expanding_inline_dir for indexed dir.
2009-05-02 16:30:47 -07:00
Theodore Ts'o
eefd7f03b8 ext4: fix the length returned by fiemap for an unallocated extent
If the file's blocks have not yet been allocated because of delayed
allocation, the length of the extent returned by fiemap is incorrect.
This commit fixes this bug.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 19:05:37 -04:00
Oleg Nesterov
0ae05fb254 ptrace: s/parent/real_parent/ in binfmt_elf_fdpic.c
->real_parent is the parent. ->parent may be the tracer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:10 -07:00
KOSAKI Motohiro
00a62ce91e mm: fix Committed_AS underflow on large NR_CPUS environment
The Committed_AS field can underflow in certain situations:

>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB

Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
not check for underflow.

But NR_CPUS proportional isn't good calculation.  In general,
possibility of lock contention is proportional to the number of online
cpus, not theorical maximum cpus (NR_CPUS).

The current kernel has generic percpu-counter stuff.  using it is right
way.  it makes code simplify and percpu_counter_read_positive() don't
make underflow issue.

Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>		[All kernel versions]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:10 -07:00
Ivan Kokshaysky
74641f584d alpha: binfmt_aout fix
This fixes the problem introduced by commit 3bfacef412 (get rid of
special-casing the /sbin/loader on alpha): osf/1 ecoff binary segfaults
when binfmt_aout built as module.  That happens because aout binary
handler gets on the top of the binfmt list due to late registration, and
kernel attempts to execute the binary without preparatory work that must
be done by binfmt_loader.

Fixed by changing the registration order of the default binfmt handlers
using list_add_tail() and introducing insert_binfmt() function which
places new handler on the top of the binfmt list.  This might be generally
useful for installing arch-specific frontends for default handlers or just
for overriding them.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:10 -07:00
Vitaly Mayatskikh
0816178638 pagemap: require aligned-length, non-null reads of /proc/pid/pagemap
The intention of commit aae8679b0e
("pagemap: fix bug in add_to_pagemap, require aligned-length reads of
/proc/pid/pagemap") was to force reads of /proc/pid/pagemap to be a
multiple of 8 bytes, but now it allows to read 0 bytes, which actually
puts some data to user's buffer.  According to POSIX, if count is zero,
read() should return zero and has no other results.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Cc: Thomas Tuttle <ttuttle@google.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
Nick Piggin
b827e496c8 mm: close page_mkwrite races
Change page_mkwrite to allow implementations to return with the page
locked, and also change it's callers (in page fault paths) to hold the
lock until the page is marked dirty.  This allows the filesystem to have
full control of page dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow callers to return with
it locked, so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will
perform this manipulation in its ->page_mkwrite.  It currently then must
return with the page unlocked and may not hold any other locks (according
to existing page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty.  The
filesystem has no good way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
filesystem may manipulate the metadata to reflect that the page is no
longer dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail.  The
filesystem may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being
initiated in that window.  This provides the filesystem with a strong
synchronisation against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial page
  at the end of file (we also have a buffer.c-side problem here, but it cannot
  be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should
  eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
  filesystem calldown and page lock/unlock cycle in __do_fault. ]

[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
Ian Kent
a8985f3ac5 autofs4: fix incorrect return in autofs4_mount_busy()
Fix an obvious incorrect return status in autofs4_mount_busy().

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
Steve French
24d35add2b [CIFS] Remove sparse warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 05:40:39 +00:00
Steve French
989c7e512f [CIFS] remove checkpatch warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 05:32:20 +00:00
Steve French
afe48c31ea [CIFS] Fix final user of old string conversion code
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 05:25:46 +00:00
Jeff Layton
3410602732 [CIFS] remove cifs_strfromUCS_le
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 04:59:34 +00:00
Steve French
2edd6c5b05 [CIFS] NTLMSSP support moving into new file, old dead code removed
Remove dead NTLMSSP support from connect.c prior to addition of
the new code to replace it.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 04:55:39 +00:00
Eric Sandeen
c9877b205f ext4: fix for fiemap last-block test
Carl Henrik Lunde reported and debugged this; the test for the
last allocated block was comparing bytes to blocks in this test:

	if (logical + length - 1 == EXT_MAX_BLOCK ||
	    ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK)
		flags |= FIEMAP_EXTENT_LAST;

so any extent which ended right at 4G was stopping the extent
walk.  Just replacing these values with the extent block &
length should fix it.

Also give blksize_bits a saner type, and reverse the order 
of the tests to make the more likely case tested first.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Tested-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 23:32:06 -04:00
Aneesh Kumar K.V
19ba0559f9 vfs: Enable FS_IOC_FIEMAP and FIGETBSZ for all filetypes
The fiemap and get_blk_size ioctls should be enabled even for
directories.  So move it outisde file_ioctl.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13 18:12:05 -04:00
Aneesh Kumar K.V
abc8746eb9 ext4: hook fiemap operation for directories
Add fiemap callback for directories

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 22:54:32 -04:00
Curt Wohlgemuth
f40339031b ext4: Make the length of the mb_history file tunable
In memory-constrained systems with many partitions, the ~68K for each
partition for the mb_history buffer can be excessive.

This patch adds a new mount option, mb_history_length, as well as a
way of setting the default via a module parameter (or via a sysfs
parameter in /sys/module/ext4/parameter/default_mb_history_length).
If the mb_history_length is set to zero, the mb_history facility is
disabled entirely.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 20:27:20 -04:00
J. Bruce Fields
3aea09dc91 nfsd4: track recall retries in nfs4_delegation
Move this out of a local variable into the nfs4_delegation object in
preparation for making this an async rpc call (at which point we'll need
any state like this in a common object that's preserved across function
calls).

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-01 20:11:12 -04:00
J. Bruce Fields
6707bd3d42 nfsd4: remove unused dl_trunc
There's no point in keeping this field around--it's always zero.

(Background: the protocol allows you to tell the client that the file is
about to be truncated, as an optimization to save the client from
writing back dirty pages that will just be discarded.  We don't
implement this hint.  If we do some day, adding this field back in will
be the least of the work involved.)

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-01 19:57:46 -04:00
J. Bruce Fields
b53d40c507 nfsd4: eliminate struct nfs4_cb_recall
The nfs4_cb_recall struct is used only in nfs4_delegation, so its
pointer to the containing delegation is unnecessary--we could just use
container_of().

But there's no real reason to have this a separate struct at all--just
move these fields to nfs4_delegation.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-01 19:50:00 -04:00
Theodore Ts'o
bb23c20a85 ext4: Move fs/ext4/group.h into ext4.h
Move the function prototypes in group.h into ext4.h so they are all
defined in one place.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 19:44:44 -04:00
J. Bruce Fields
c237dc0303 nfsd4: rename callback struct to cb_conn
I want to use the name for a struct that actually does represent a
single callback.

(Actually, I've never been sure it helps to a separate struct for the
callback information.  Some day maybe those fields could just be dumped
into struct nfs4_client.  I don't know.)

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-01 17:31:44 -04:00
Theodore Ts'o
596397b77c ext4: Move fs/ext4/namei.h into ext4.h
The fs/ext4/namei.h header file had only a single function
declaration, and should have never been a standalone file.  Move it
into ext4.h, where should have been from the beginning.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 13:49:15 -04:00
Theodore Ts'o
ca0faba0e8 ext4: Move the ext4_sb.h header file into ext4.h
There is no longer a reason for a separate ext4_sb.h header file, so
move it into ext4.h just to make life easier for developers to find
the relevant data structures and typedefs.  Should also speed up
compiles slightly, too.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-03 16:33:44 -04:00
Theodore Ts'o
d444c3c381 ext4: Move the ext4_i.h header file into ext4.h
There is no longer a reason for a separate ext4_i.h header file, so
move it into ext4.h just to make life easier for developers to find
the relevant data structures and typedefs.  Should also speed up
compiles slightly, too.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 13:44:33 -04:00
Theodore Ts'o
75507efb13 ext4: Don't avoid using BLOCK_UNINIT block groups in mballoc
By avoiding the use of not-yet-used block groups (i.e., block groups
with the BLOCK_UNINIT flag), mballoc had a tendency to create large
files with large non-contiguous gaps.  In addition avoiding the use of
new block groups had a tendency to push regular file data into the
first block group in a flex_bg group, which slows down the speed of
e2fsck pass 2, since it has a tendency to seek much more.  For
example:

               Before Patch                       After Patch
              Time in seconds                   Time in seconds
            Real /  User/  Sys   MB/s      Real /  User/  Sys    MB/s
Pass 1      8.52 / 2.21 / 0.46  20.43      8.84 / 4.97 / 1.11   19.68
Pass 2     21.16 / 1.02 / 1.86  11.30      6.54 / 1.77 / 1.78   36.39
Pass 3      0.01 / 0.00 / 0.00 139.00      0.01 / 0.01 / 0.00  128.90
Pass 4      0.16 / 0.15 / 0.00   0.00      0.17 / 0.17 / 0.00    0.00
Pass 5      2.52 / 1.99 / 0.09   0.79      2.31 / 1.78 / 0.06    0.86
Total      32.40 / 5.11 / 2.49  12.81     17.99 / 8.75 / 2.98   23.01

This was on a sample 80 gig root filesystem which was approximately
50% full.  Note the improved e2fsck pass 2 performance, by over a
factor of 3, due to a decreased number of seeks.  (The total amount of
I/O in pass 2 was unchanged; the layout of the directory blocks was
simply much better from e2fsck's's perspective.)

Other changes as a result of this patch on this sample filesystem:

                             Before Patch    After Patch
# of non-contig files           762             779
# of non-contig directories     571             570
# of BLOCK_UNINIT bg's          307             293
# of INODE_UNINIT bg's          503             503

Out of 640 block groups, of which 333 were in use, this patch caused
an extra 14 block groups to be utilized.  The number of non-contiguous
files did go up slightly, but when measured against the 99.9% of the
files (603,154) which were contiguously allocated, this is pretty
insignificant.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andreas Dilger <adilger@sun.com>
2009-05-01 12:58:36 -04:00
Steve French
051a2a0d32 [CIFS] Fix endian conversion of vcnum field
When multiply mounting from the same client to the same server, with
different userids, we create a vcnum which should be unique if
possible (this is not the same as the smb uid, which is the handle
to the security context).  We were not endian converting additional
(beyond the first which is zero) vcnum properly.

CC: Stable <stable@kernel.org>
Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 16:25:15 +00:00
Steve French
e836f015b5 [CIFS] Remove trailing whitespace
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 16:20:35 +00:00
Steve French
0e0d2cf327 [CIFS] Remove sparse endian warnings
Removes two sparse CHECK_ENDIAN warnings from Jeffs earlier patch,
and removes the dead readlink code (after noting where in
findfirst we will need to add something like that in the future
to handle the newly discovered unexpected error on FindFirst of NTFS symlinks.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 05:27:32 +00:00
Steve French
e14b2fe1e6 [CIFS] Add remaining ntlmssp flags and standardize field names
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 04:37:43 +00:00
Steve French
cf398e3a11 [CIFS] Fix build warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 03:50:42 +00:00
Jeff Layton
18295796a3 cifs: fix length handling in cifs_get_name_from_search_buf
The earlier patch to move this code to use the new unicode helpers
assumed that the filename strings would be null terminated. That's not
always the case.

Instead of passing "max_len" to the string converter, pass "min(len,
max_len)", which makes it do the right thing while still keeping the
parser confined to the response. Also fix up the prototypes of this
function and the callers so that max_len is unsigned (like len is).

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 00:49:23 +00:00
Steve French
9e39b0ae8a [CIFS] Remove unneeded QuerySymlink call and fix mapping for unmapped status
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 21:31:15 +00:00
Joel Becker
dfa13f39b7 ocfs2: Fix a missing credit when deleting from indexed directories.
The ocfs2 directory index updates two blocks when we remove an entry -
the dx root and the dx leaf.  OCFS2_DELETE_INODE_CREDITS was only
accounting for the dx leaf.  This shows up when ocfs2_delete_inode()
runs out of credits in jbd2_journal_dirty_metadata() at
"J_ASSERT_JH(jh, handle->h_buffer_credits > 0);".

The test that caught this was running dirop_file_racer from the
ocfs2-test suite with a 250-character filename PREFIX.  Run on a 512B
blocksize, it forces the orphan dir index to grow large enough to
trigger.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-30 13:21:56 -07:00
Thomas Gleixner
3c56999eec Merge branch 'core/signal' into perfcounters/core
This is necessary to avoid the conflict of syscall numbers.

Conflicts:
	arch/x86/ia32/ia32entry.S
	arch/x86/include/asm/unistd_32.h
	arch/x86/include/asm/unistd_64.h

Fixes up the borked syscall numbers of perfcounters versus
preadv/pwritev as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-30 21:16:49 +02:00
Louis Rilling
420118caa3 configfs: Rework configfs_depend_item() locking and make lockdep happy
configfs_depend_item() recursively locks all inodes mutex from configfs root to
the target item, which makes lockdep unhappy. The purpose of this recursive
locking is to ensure that the item tree can be safely parsed and that the target
item, if found, is not about to leave.

This patch reworks configfs_depend_item() locking using configfs_dirent_lock.
Since configfs_dirent_lock protects all changes to the configfs_dirent tree, and
protects tagging of items to be removed, this lock can be used instead of the
inodes mutex lock chain.
This needs that the check for dependents be done atomically with
CONFIGFS_USET_DROPPING tagging.

Now lockdep looks happy with configfs.

[ Lifted the setting of s_type into configfs_new_dirent() to satisfy the
  atomic setting of CONFIGFS_USET_CREATING  -- Joel ]

Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-30 10:48:26 -07:00
Louis Rilling
e74cc06df3 configfs: Silence lockdep on mkdir() and rmdir()
When attaching default groups (subdirs) of a new group (in mkdir() or
in configfs_register()), configfs recursively takes inode's mutexes
along the path from the parent of the new group to the default
subdirs. This is needed to ensure that the VFS will not race with
operations on these sub-dirs. This is safe for the following reasons:

- the VFS allows one to lock first an inode and second one of its
  children (The lock subclasses for this pattern are respectively
  I_MUTEX_PARENT and I_MUTEX_CHILD);
- from this rule any inode path can be recursively locked in
  descending order as long as it stays under a single mountpoint and
  does not follow symlinks.

Unfortunately lockdep does not know (yet?) how to handle such
recursion.

I've tried to use Peter Zijlstra's lock_set_subclass() helper to
upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
that we might recursively lock some of their descendant, but this
usage does not seem to fit the purpose of lock_set_subclass() because
it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
the same task.

>From inside configfs it is not possible to serialize those recursive
locking with a top-level one, because mkdir() and rmdir() are already
called with inodes locked by the VFS. So using some
mutex_lock_nest_lock() is not an option.

I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
   lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
   i_mutexes recursively locked in different classes based on their
   depth from the top-level config_group created. This
   induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
   nesting of configfs default groups whenever lockdep is activated
   but this limit looks reasonably high. Unfortunately, this also
   isolates VFS operations on configfs default groups from the others
   and thus lowers the chances to detect locking issues.

Nobody likes solution 1), which I can understand.

This patch implements solution 2). However lockdep is still not happy with
configfs_depend_item(). Next patch reworks the locking of
configfs_depend_item() and finally makes lockdep happy.

[ Note: This hides a few locking interactions with the VFS from lockdep.
  That was my big concern, because we like lockdep's protection.  However,
  the current state always dumps a spurious warning.  The locking is
  correct, so I tell people to ignore the warning and that we'll keep
  our eyes on the locking to make sure it stays correct.  With this patch,
  we eliminate the warning.  We do lose some of the lockdep protections,
  but this only means that we still have to keep our eyes on the locking.
  We're going to do that anyway.  -- Joel ]

Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-30 10:48:23 -07:00
Steve French
d185cda771 [CIFS] rename cifs_strndup to cifs_strndup_from_ucs
In most cases, cifs_strndup is converting from Unicode (UCS2 / UTF-32) to
the configured local code page for the Linux mount (usually UTF8), so
Jeff suggested that to make it more clear that cifs_strndup is doing
a conversion not just memory allocation and copy, rename the function
to including "from_ucs" (ie Unicode)

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 17:45:10 +00:00
Igor Mammedov
5c2503a8e3 Added loop check when mounting DFS tree.
Added loop check when mounting DFS tree. mount will fail with
ELOOP if referral walks exceed MAX_NESTED_LINK count.

Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 17:24:09 +00:00
Igor Mammedov
1af28ceb92 Enable dfs submounts to handle remote referrals.
Having remote dfs root support in cifs_mount, we can
afford to pass into it UNC that is remote.

Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 17:24:00 +00:00
Steve French
20418acd68 [CIFS] Remove older session setup implementation
Two years ago, when the session setup code in cifs was rewritten and moved
to fs/cifs/sess.c, we were asked to keep the old code for a release or so
(which could be reenabled at runtime) since it was such a large change and
because the asn (SPNEGO) and NTLMSSP code was not rewritten and needed to
be. This was useful to avoid regressions, but is long overdue to be removed.
Now that the Kerberos (asn/spnego) code is working in fs/cifs/sess.c,
and the NTLMSSP code moved (NTLMSSP blob setup be rewritten with the
next patch in this series) quite a bit of dead code from fs/cifs/connect.c
now can be removed.

This old code should have been removed last year, but the earlier krb5
patches did not move/remove the NTLMSSP code which we had asked to
be done first.  Since no one else volunteered, I am doing it now.

It is extremely important that we continue to examine the documentation
for this area, to make sure our code continues to be uptodate with
changes since Windows 2003.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 16:13:32 +00:00
Jeff Layton
f58841666b cifs: change cifs_get_name_from_search_buf to use new unicode helper
...and remove cifs_convertUCSpath. There are no more callers. Also add a
#define for the buffer used in the readdir path so that we don't have so
many magic numbers floating around.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:01 +00:00
Jeff Layton
460b96960d cifs: change CIFSSMBUnixQuerySymLink to use new helpers
Change CIFSSMBUnixQuerySymLink to use the new unicode helper functions.
Also change the calling conventions so that the allocation of the target
name buffer is done in CIFSSMBUnixQuerySymLink rather than by the caller.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
59140797c5 cifs: fix session setup unicode string saving to use new unicode helpers
...and change decode_unicode_ssetup to be a void function. It never
returns an actual error anyway.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
cc20c031bb cifs: convert CIFSTCon to use new unicode helper functions
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
066ce68994 cifs: rename cifs_strlcpy_to_host and make it use new functions
Rename cifs_strlcpy_to_host to cifs_strndup since that better describes
what this function really does. Then, convert it to use the new string
conversion and measurement functions that work in units of bytes rather
than wide chars.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
69f801fcaa cifs: add new function to get unicode string length in bytes
Working in units of words means we do a lot of unnecessary conversion back
and forth. Standardize on bytes instead since that's more useful for
allocating buffers and such. Also, remove hostlen_fromUCS since the new
function has a similar purpose.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
7fabf0c947 cifs: add replacement for cifs_strtoUCS_le called cifs_from_ucs2
Add a replacement function for cifs_strtoUCS_le. cifs_from_ucs2
takes args for the source and destination length so that we can ensure
that the function is confined within the intended buffers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:44:59 +00:00
Jeff Layton
66345f50f0 cifs: move #defines for mapchars into cifs_unicode.h
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:44:59 +00:00
Steve French
912bc6ae3d Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-04-30 15:36:52 +00:00
Christoph Hellwig
28e211700a xfs: fix getbmap vs mmap deadlock
xfs_getbmap (or rather the formatters called by it) copy out the getbmap
structures under the ilock, which can deadlock against mmap.  This has
been reported via bugzilla a while ago (#717) and has recently also
shown up via lockdep.

So allocate a temporary buffer to format the kernel getbmap structures
into and then copy them out after dropping the locks.

A little problem with this is that we limit the number of extents we
can copy out by the maximum allocation size, but I see no real way
around that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:29:02 -05:00
Christoph Hellwig
5f79ed685f xfs: a couple getbmap cleanups
- reshuffle various conditionals for data vs attr fork to make the code
   more readable
 - do fine-grainded goto-based error handling
 - exit early from conditionals instead of keeping a long else branch around
 - allow kmem_alloc to fail

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:28:31 -05:00
Olaf Weber
b9ec9068d7 xfs: add more checks to superblock validation
There had been reports where xfs filesystem was randomly
corrupted with fsfuzzer, and xfs failed to handle it
gracefully. This patch fixes couple of reported problem
by providing additional checks in the superblock
validation routine.

Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:26:14 -05:00
Lachlan McIlroy
def6b3ba56 xfs_file_last_byte() needs to acquire ilock
We had some systems crash with this stack:

[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60

The problem here is that xfs_file_last_byte() does not acquire the
inode lock and can therefore race with another thread that is modifying
the extext list.  While xfs_bmap_last_offset() is trying to lookup
what was the last extent some extents were merged and the extent list
shrunk so the index we lookup is now beyond the end of the extent list
and potentially in a freed buffer.

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:25:25 -05:00
J. Bruce Fields
e300a63ce4 nfsd4: replace callback thread by asynchronous rpc
We don't really need a synchronous rpc, and moving to an asynchronous
rpc allows us to do without this extra kthread.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 17:10:53 -04:00
J. Bruce Fields
3cef9ab266 nfsd4: lookup up callback cred only once
Lookup the callback cred once and then use it for all subsequent
callbacks.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:45:03 -04:00
J. Bruce Fields
ecdd03b791 nfsd4: create rpc callback client from server thread
The code is a little simpler, and it should be easier to avoid races, if
we just do all rpc client creation/destruction from nfsd or laundromat
threads and do only the rpc calls themselves asynchronously.  The rpc
creation doesn't involve any significant waiting (it doesn't call the
client, for example), so there's no reason not to do this.

Also don't bother destroying the client on failure of the rpc null
probe.  We may want to retry the probe later anyway.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:44:53 -04:00
J. Bruce Fields
e1cab5a589 nfsd4: set cb_client inside setup_callback_client
This is just a minor code simplification.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:44:47 -04:00
J. Bruce Fields
595947acaa nfsd4: set shorter timeout
We tried to do something overly complicated with the callback rpc
timeouts here.  And they're wrong--the result is that by the time a
single callback times out, it's already too late to tell the client
(using the cb_path_down return to RENEW) that the callback is down.

Use a much shorter, simpler timeout.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:44:40 -04:00
J. Bruce Fields
f64f79ea5f nfsd4: setclientid_confirm callback-change fixes
This setclientid_confirm case should allow the client to change
callbacks, but it currently has a dummy implementation that just turns
off callbacks completely.  That dummy implementation isn't completely
correct either, though:

	- There's no need to remove any client recovery directory in
	  this case.
	- New clientid confirm verifiers should be generated (and
	  returned) in setclientid; there's no need to generate a new
	  one here.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 16:44:34 -04:00
Christoph Hellwig
6321e3ed2a xfs: fix getbmap vs mmap deadlock
xfs_getbmap (or rather the formatters called by it) copy out the getbmap
structures under the ilock, which can deadlock against mmap.  This has
been reported via bugzilla a while ago (#717) and has recently also
shown up via lockdep.

So allocate a temporary buffer to format the kernel getbmap structures
into and then copy them out after dropping the locks.

A little problem with this is that we limit the number of extents we
can copy out by the maximum allocation size, but I see no real way
around that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 13:25:29 -05:00
Tao Ma
7e31a966ad ocfs2/trivial: Remove unused variable in ocfs2_rename.
With indexed dir enabled, now we use ocfs2_dir_lookup_result to
wrap all the bh used for dir. So remove the 2 unused variables.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-29 10:57:18 -07:00
J. Bruce Fields
b8fd47aefa nfsd: quiet compile warning
Stephen Rothwell said:
"Today's linux-next build (powerpc ppc64_defconfig) produced this new
warning:

fs/nfsd/nfs4state.c: In function 'EXPIRED_STATEID':
fs/nfsd/nfs4state.c:2757: warning: comparison of distinct pointer types lacks a cast

Caused by commit 78155ed75f ("nfsd4:
distinguish expired from stale stateids")."

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
2009-04-29 11:36:17 -04:00
J. Bruce Fields
c654b8a9cb nfsd: support ext4 i_version
ext4 supports a real NFSv4 change attribute, which is bumped whenever
the ctime would be updated, including times when two updates arrive
within a jiffy of each other.  (Note that although ext4 has space for
nanosecond-precision ctime, the real resolution is lower: it actually
uses jiffies as the time-source.)  This ensures clients will invalidate
their caches when they need to.

There is some fear that keeping the i_version up-to-date could have
performance drawbacks, so for now it's turned on only by a mount option.
We hope to do something better eventually.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Theodore Tso <tytso@mit.edu>
2009-04-29 11:35:49 -04:00
J. Bruce Fields
3352d2c2d0 nfsd4: delete obsolete xdr comments
We don't need comments to tell us these macros are ugly.  And we're long
past trying to share any of this code with the BSD's.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 11:35:49 -04:00
J. Bruce Fields
bc749ca4c4 nfsd: eliminate ENCODE_HEAD macro
This macro doesn't serve any useful purpose.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-29 11:35:49 -04:00
Christoph Hellwig
4be4a00fb5 xfs: a couple getbmap cleanups
- reshuffle various conditionals for data vs attr fork to make the code
   more readable
 - do fine-grainded goto-based error handling
 - exit early from conditionals instead of keeping a long else branch around
 - allow kmem_alloc to fail

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 10:00:01 -05:00
Olaf Weber
2ac00af7a6 xfs: add more checks to superblock validation
There had been reports where xfs filesystem was randomly
corrupted with fsfuzzer, and xfs failed to handle it
gracefully. This patch fixes couple of reported problem
by providing additional checks in the superblock
validation routine.

Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 09:24:29 -05:00
Lachlan McIlroy
f25181f598 xfs_file_last_byte() needs to acquire ilock
We had some systems crash with this stack:

[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60

The problem here is that xfs_file_last_byte() does not acquire the
inode lock and can therefore race with another thread that is modifying
the extext list.  While xfs_bmap_last_offset() is trying to lookup
what was the last extent some extents were merged and the extent list
shrunk so the index we lookup is now beyond the end of the extent list
and potentially in a freed buffer.

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 09:14:10 -05:00
Ingo Molnar
e7fd5d4b3d Merge branch 'linus' into perfcounters/core
Merge reason: This brach was on -rc1, refresh it to almost-rc4 to pick up
              the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:47:05 +02:00
FUJITA Tomonori
69838727bc bio: fix memcpy corruption in bio_copy_user_iov()
st driver uses blk_rq_map_user() in order to just build a request out
of page frames. In this case, map_data->offset is a non zero value and
iov[0].iov_base is NULL. We need to increase nr_pages for that.

Cc: stable@kernel.org
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 20:24:29 +02:00
Chuck Lever
e06b64050e NFSD: Stricter buffer size checking in fs/nfsd/nfsctl.c
Clean up: For consistency, handle output buffer size checking in a
other nfsctl functions the same way it's done for write_versions().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:30 -04:00
Chuck Lever
261758b5c3 NFSD: Stricter buffer size checking in write_versions()
While it's not likely today that there are enough NFS versions to
overflow the output buffer in write_versions(), we should be more
careful about detecting the end of the buffer.

The number of NFS versions will only increase as NFSv4 minor versions
are added.

Note that this API doesn't behave the same as portlist.  Here we
attempt to display as many versions as will fit in the buffer, and do
not provide any indication that an overflow would have occurred.  I
don't have any good rationale for that.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:30 -04:00
Chuck Lever
3d72ab8fdd NFSD: Stricter buffer size checking in write_recoverydir()
While it's not likely a pathname will be longer than
SIMPLE_TRANSACTION_SIZE, we should be more careful about just
plopping it into the output buffer without bounds checking.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:30 -04:00
Chuck Lever
8435d34dbb SUNRPC: pass buffer size to svc_sock_names()
Adjust the synopsis of svc_sock_names() to pass in the size of the
output buffer.  Add a documenting comment.

This is a cosmetic change for now.  A subsequent patch will make sure
the buffer length is passed to one_sock_name(), where the length will
actually be useful.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:28 -04:00
Chuck Lever
bfba9ab4c6 SUNRPC: pass buffer size to svc_addsock()
Adjust the synopsis of svc_addsock() to pass in the size of the output
buffer.  Add a documenting comment.

This is a cosmetic change for now.  A subsequent patch will make sure
the buffer length is passed to one_sock_name(), where the length will
actually be useful.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-28 13:54:28 -04:00