Commit Graph

15010 Commits

Author SHA1 Message Date
Guillaume Knispel
b2add73dbf poll/select: initialize triggered field of struct poll_wqueues
The triggered field of struct poll_wqueues introduced in commit
5f820f648c ("poll: allow f_op->poll to
sleep").

It was first set to 1 in pollwake() (now __pollwake() ), tested and
later set to 0 in poll_schedule_timeout(), but not initialized before.

As a result when the process needs to sleep, triggered was likely to be
non-zero even if pollwake() is not called before the first
poll_schedule_timeout(), meaning schedule_hrtimeout_range() would not be
called and an extra loop calling all ->poll() would be done.

This patch initialize triggered to 0 in poll_initwait() so the ->poll()
are not called twice before the process goes to sleep when it needs to.

Signed-off-by: Guillaume Knispel <gknispel@proformatique.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-15 18:40:11 -07:00
Benny Halevy
cccddf4f55 nfs: nfs4xdr: optimize low level decoding
do not increment decoding ptr if not needed.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 14:02:26 -04:00
Benny Halevy
c0eae66ece nfs: nfs4xdr: get rid of READ_BUF
Use xdr_inline_decode instead.
Open code debug printout and error return.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 14:02:23 -04:00
Benny Halevy
2460ba57c4 nfs: nfs4xdr: simplify decode_exchange_id by reusing decode_opaque_inline
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 14:02:20 -04:00
Benny Halevy
99398d0655 nfs: nfs4xdr: get rid of COPYMEM
Just directly call memcpy.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 14:02:17 -04:00
Benny Halevy
e78291e4e0 nfs: nfs4xdr: introduce decode_sessionid helper
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 14:02:14 -04:00
Benny Halevy
db942bbd09 nfs: nfs4xdr: introduce decode_verifier helper
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[Trond: Fixed up an 'uninitialised variable' issue in decode_readdir]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:57:58 -04:00
Benny Halevy
07d30434cf nfs: nfs4xdr: introduce decode_opaque_fixed and decode_stateid helpers
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:26:27 -04:00
Benny Halevy
686841b3cc nfs: nfs4xdr: introduce print_overflow_msg
Part fo the nfs4xdr cleanup.  READ_BUF will go away.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:24:38 -04:00
Benny Halevy
c816fd3406 nfs: nfs4xdr: get rid of READTIME
It has no users.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:24:32 -04:00
Benny Halevy
3ceb4dbb99 nfs: nfs4xdr: get rid of READ64
s/READ64\(\*(.*)\)/p = xdr_decode_hyper(p, \1)/
s/READ64\((.*)\)/p = xdr_decode_hyper(p, &\1)/

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:24:13 -04:00
Benny Halevy
6f723f7710 nfs: nfs4xdr: get rid of READ32
s/READ32\((.*)\)/\1 = be32_to_cpup(p++)/

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:23:58 -04:00
Benny Halevy
811652bd6e nfs: nfs4xdr: merge xdr_encode_int+xdr_encode_opaque_fixed into xdr_encode_opaque
use encode_string where appropriate.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:19:24 -04:00
Benny Halevy
345585132a nfs: nfs4xdr: optimize low level encoding
do not increment encoding ptr if not needed.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:18:03 -04:00
Benny Halevy
13c65ce900 nfs: nfs4xdr: change RESERVE_SPACE macro into a static helper
In order to open code and expose the result pointer assignment.

Alternatively, we can open code the call to xdr_reserve_space
and do the BUG_ON an the error case at the call site.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:17:17 -04:00
Benny Halevy
2220f13a8b nfs: nfs4xdr: encode_compound_hdr does not have to round up reserved bytes
This is already done by xdr_reserve_space and since encode_compound_hdr
is adding a byte count to "12" which is already word aligned, the xdr
level rounding will work just as well.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:16:00 -04:00
Benny Halevy
42edd69812 nfs: nfs4xdr: optimize RESERVE_SPACE in encode_create_session and encode_sequence
Coalesce multilpe constant RESERVE_SPACEs into one

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:15:20 -04:00
Benny Halevy
93f0cf2594 nfs: nfs4xdr: get rid of WRITEMEM
s/WRITEMEM(/p = xdr_encode_opaque_fixed(p, /

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:13:58 -04:00
Benny Halevy
b95be5a976 nfs: nfs4xdr: get rid of WRITE64
s/WRITE64/p = xdr_encode_hyper(p, /

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:13:15 -04:00
Benny Halevy
e75bc1c89e nfs: nfs4xdr: get rid of WRITE32
s/WRITE32/*p++ = cpu_to_be32/

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-14 13:12:55 -04:00
Steven Whitehouse
d7e623da1a GFS2: Fix permissions on "recover" file
Although this file is only ever written and not read by
userspace, it seems that the utils are opening this
file O_RDWR, so we need to allow that.

Also fixes the whitespace which seemed to be broken.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Teigland <teigland@redhat.com>
2009-08-14 14:04:46 +01:00
Linus Torvalds
bc7af9ba15 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (22 commits)
  ocfs2: Fix possible deadlock when extending quota file
  ocfs2: keep index within status_map[]
  ocfs2: Initialize the cluster we're writing to in a non-sparse extend
  ocfs2: Remove redundant BUG_ON in __dlm_queue_ast()
  ocfs2/quota: Release lock for error in ocfs2_quota_write.
  ocfs2: Define credit counts for quota operations
  ocfs2: Remove syncjiff field from quota info
  ocfs2: Fix initialization of blockcheck stats
  ocfs2: Zero out padding of on disk dquot structure
  ocfs2: Initialize blocks allocated to local quota file
  ocfs2: Mark buffer uptodate before calling ocfs2_journal_access_dq()
  ocfs2: Make global quota files blocksize aligned
  ocfs2: Use ocfs2_rec_clusters in ocfs2_adjust_adjacent_records.
  ocfs2: Fix deadlock on umount
  ocfs2: Add extra credits and access the modified bh in update_edge_lengths.
  ocfs2: Fail ocfs2_get_block() immediately when a block needs allocation
  ocfs2: Fix error return in ocfs2_write_cluster()
  ocfs2: Fix compilation warning for fs/ocfs2/xattr.c
  ocfs2: Initialize count in aio_write before generic_write_checks
  ocfs2: log the actual return value of ocfs2_file_aio_write()
  ...
2009-08-13 11:17:40 -07:00
Linus Torvalds
78efd1ddd9 Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix spin_is_locked assert on uni-processor builds
  xfs: check for dinode realtime flag corruption
  use XFS_CORRUPTION_ERROR in xfs_btree_check_sblock
  xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_get
  xfs: switch to NOFS allocation under i_lock in xfs_readlink_bmap
  xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_set
  xfs: switch to NOFS allocation under i_lock in xfs_buf_associate_memory
  xfs: switch to NOFS allocation under i_lock in xfs_dir_cilookup_result
  xfs: switch to NOFS allocation under i_lock in xfs_da_buf_make
  xfs: switch to NOFS allocation under i_lock in xfs_da_state_alloc
  xfs: switch to NOFS allocation under i_lock in xfs_getbmap
  xfs: avoid memory allocation under m_peraglock in growfs code
2009-08-12 08:49:35 -07:00
Trond Myklebust
1ae88b2e44 NFS: Fix an O_DIRECT Oops...
We can't call nfs_readdata_release()/nfs_writedata_release() without
first initialising and referencing args.context. Doing so inside
nfs_direct_read_schedule_segment()/nfs_direct_write_schedule_segment()
causes an Oops.

We should rather be calling nfs_readdata_free()/nfs_writedata_free() in
those cases.

Looking at the O_DIRECT code, the "struct nfs_direct_req" is already
referencing the nfs_open_context for us. Since the readdata and writedata
structures carry a reference to that, we can simplify things by getting rid
of the extra nfs_open_context references, so that we can replace all
instances of nfs_readdata_release()/nfs_writedata_release().

Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-12 08:21:39 -07:00
Christoph Hellwig
a8914f3a6d xfs: fix spin_is_locked assert on uni-processor builds
Without SMP or preemption spin_is_locked always returns false,
so we can't do an assert with it.  Instead use assert_spin_locked,
which does the right thing on all builds.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Johannes Engel <jcnengel@googlemail.com>
Tested-by: Johannes Engel <jcnengel@googlemail.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:08:27 -05:00
Christoph Hellwig
b89d4208de xfs: check for dinode realtime flag corruption
Ramon tested XFS with a modified version of fsfuzzer and hit a NULL
pointer dereference in __xfs_get_blocks due to the RT device target
pointer being NULL.

To fix this reject inode with the realtime bit set on a a filesystem
without an RT subvolume during inode read.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reported-by: Ramon de Carvalho Valle <ramon@risesecurity.org>
Tested-by: Ramon de Carvalho Valle <ramon@risesecurity.org>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:08:21 -05:00
Eric Sandeen
e0c222c411 use XFS_CORRUPTION_ERROR in xfs_btree_check_sblock
In Red Hat Bug 512552
 - Can't write to XFS mount during raid5 resync

a user ran into corruption while resyncing a raid, and we failed
a consistency test, but didn't get much more info; it'd be nice
to call XFS_CORRUPTION_ERROR here so we can see the buffer
contents.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:08:10 -05:00
Christoph Hellwig
ddd3a14e0f xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_get
xfs_attr_rmtval_get is always called with i_lock held, but i_lock is taken
in reclaim context so all allocations under it must avoid recursions into
the filesystem.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:08:01 -05:00
Christoph Hellwig
7b02ecb303 xfs: switch to NOFS allocation under i_lock in xfs_readlink_bmap
xfs_readlink_bmap is called with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into
the filesystem.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:07:53 -05:00
Christoph Hellwig
10746e47e7 xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_set
xfs_attr_rmtval_set is always called with i_lock held, and i_lock is taken
in reclaim context so all allocations under it must avoid recursions into
the filesystem.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:07:44 -05:00
Christoph Hellwig
36fae17a64 xfs: switch to NOFS allocation under i_lock in xfs_buf_associate_memory
xfs_buf_associate_memory is used for setting up the spare buffer for the
log wrap case in xlog_sync which can happen under i_lock when called from
xfs_fsync. The i_lock mutex is taken in reclaim context so all allocations
under it must avoid recursions into the filesystem.  There are a couple
more uses of xfs_buf_associate_memory in the log recovery code that are
also affected by this, but I'd rather keep the code simple than passing on
a gfp_mask argument.  Longer term we should just stop requiring the memoery
allocation in xlog_sync by some smaller rework of the buffer layer.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:07:38 -05:00
Christoph Hellwig
3f52c2f0a0 xfs: switch to NOFS allocation under i_lock in xfs_dir_cilookup_result
xfs_dir_cilookup_result is always called with i_lock held, but i_lock is taken
in reclaim context so all allocations under it must avoid recursions into the
filesystem.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:07:23 -05:00
Christoph Hellwig
73195ed786 xfs: switch to NOFS allocation under i_lock in xfs_da_buf_make
i_lock is taken in the reclaim context so all allocations under it
must avoid recursions into the filesystem.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:07:14 -05:00
Christoph Hellwig
f41d7fb9da xfs: switch to NOFS allocation under i_lock in xfs_da_state_alloc
xfs_da_state_alloc is always called with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into the
filesystem.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:07:07 -05:00
Christoph Hellwig
ca35dcd6ca xfs: switch to NOFS allocation under i_lock in xfs_getbmap
xfs_getbmap allocates memory with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into
the filesystem.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:06:59 -05:00
Christoph Hellwig
0cc6eee130 xfs: avoid memory allocation under m_peraglock in growfs code
Allocate the memory for the larger m_perag array before taking the
per-AG lock as the per-AG lock can be taken under the i_lock which
can be taken from reclaim context.

Reported by the new reclaim context tracing in lockdep.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-08-12 01:06:51 -05:00
James Morris
8b4bfc7feb Merge branch 'master' into next 2009-08-11 08:33:01 +10:00
Trond Myklebust
f884dcaead Merge branch 'sunrpc_cache-for-2.6.32' into nfs-for-2.6.32 2009-08-10 17:45:58 -04:00
Trond Myklebust
976a6f921c Merge branch 'patches_cel-for-2.6.32' into nfs-for-2.6.32 2009-08-10 17:45:50 -04:00
Jan Kara
b409d7a0ab ocfs2: Fix possible deadlock when extending quota file
In OCFS2, allocator locks rank above transaction start. Thus we
cannot extend quota file from inside a transaction less we could
deadlock.

We solve the problem by starting transaction not already in
ocfs2_acquire_dquot() but only in ocfs2_local_read_dquot() and
ocfs2_global_read_dquot() and we allocate blocks to quota files before starting
the transaction.  In case we crash, quota files will just have a few blocks
more but that's no problem since we just use them next time we extend the
quota file.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-08-10 12:20:22 -07:00
Bartlomiej Zolnierkiewicz
e576e05a73 nfs: remove superfluous BUG_ON()s
Subject: [PATCH] nfs: remove superfluous BUG_ON()s

Remove duplicated BUG_ON()s from nfs[4]_create_server()
(we make the same checks earlier in both functions).

This takes care of the following entries from Dan's list:

fs/nfs/client.c +1078 nfs_create_server(47) warning: variable derefenced before check 'server->nfs_client'
fs/nfs/client.c +1079 nfs_create_server(48) warning: variable derefenced before check 'server->nfs_client->rpc_ops'
fs/nfs/client.c +1363 nfs4_create_server(43) warning: variable derefenced before check 'server->nfs_client'
fs/nfs/client.c +1364 nfs4_create_server(44) warning: variable derefenced before check 'server->nfs_

Reported-by: Dan Carpenter <error27@gmail.com>
Cc: corbet@lwn.net
Cc: eteo@redhat.com
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 08:54:16 -04:00
Peter Staubach
38c73044f5 NFS: read-modify-write page updating
Hi.

I have a proposal for possibly resolving this issue.

I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.

The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page.  The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.

When a read call comes in for a portion of the page, the
contents of the page must be read in the from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in the from server.
And, since the writing and reading can not be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes.  This means either a
FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.

This algorithm could be described as modify-write-read.  It
is most efficient when the application only updates pages
and does not read them.

My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path.  The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.

If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data.  If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.

This would optimize for applications which randomly access
and update portions of files.  The linkage editor for the
C compiler is an example of such a thing.

I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel.  The kernel without the
patch generated about 269,500 WRITE requests.  The modified
kernel containing the patch generated about 261,000 WRITE
requests.  Thus, about 8,500 fewer WRITE requests were
generated.  I suspect that many of these additional
WRITE requests were probably FILE_SYNC requests to WRITE
a single page, but I didn't test this theory.

The difference between this patch and the previous one was
to remove the unneeded PageDirty() test.  I then retested to
ensure that the resulting system continued to behave as
desired.

	Thanx...

		ps

Signed-off-by: Peter Staubach <staubach@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 08:54:16 -04:00
Trond Myklebust
074cc1deec NFS: Add a ->migratepage() aop for NFS
Make NFS a bit more friendly to NUMA and memory hot removal...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 08:54:13 -04:00
Tejun Heo
1905b1bfc0 chrdev: implement __[un]register_chrdev()
[un]register_chrdev() assume minor range 0-255.  This patch adds __
prefixed versions which take @minorbase and @count explicitly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2009-08-10 13:59:12 +02:00
Oleg Nesterov
704b836cbf mm_for_maps: take ->cred_guard_mutex to fix the race with exec
The problem is minor, but without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.

Now we do not need to re-check task->mm after ptrace_may_access(), it
can't be changed to the new mm under us.

Strictly speaking, this also fixes another very minor problem. Unless
security check fails or the task exits mm_for_maps() should never
return NULL, the caller should get either old or new ->mm.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-10 20:49:26 +10:00
Oleg Nesterov
00f89d2185 mm_for_maps: shift down_read(mmap_sem) to the caller
mm_for_maps() takes ->mmap_sem after security checks, this looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-10 20:48:32 +10:00
Oleg Nesterov
13f0feafa6 mm_for_maps: simplify, use ptrace_may_access()
It would be nice to kill __ptrace_may_access(). It requires task_lock(),
but this lock is only needed to read mm->flags in the middle.

Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
the code a little bit.

Also, we do not need to take ->mmap_sem in advance. In fact I think
mm_for_maps() should not play with ->mmap_sem at all, the caller should
take this lock.

With or without this patch, without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-10 20:47:42 +10:00
Oleg Nesterov
896a6de40e mm_for_maps: take ->cred_guard_mutex to fix the race with exec
The problem is minor, but without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.

Now we do not need to re-check task->mm after ptrace_may_access(), it
can't be changed to the new mm under us.

Strictly speaking, this also fixes another very minor problem. Unless
security check fails or the task exits mm_for_maps() should never
return NULL, the caller should get either old or new ->mm.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-10 12:21:08 +10:00
Oleg Nesterov
d3c8660233 mm_for_maps: shift down_read(mmap_sem) to the caller
mm_for_maps() takes ->mmap_sem after security checks, this looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-08-10 12:21:06 +10:00
Trond Myklebust
bc74b4f5e6 SUNRPC: Allow the cache_detail to specify alternative upcall mechanisms
For events that are rare, such as referral DNS lookups, it makes limited
sense to have a daemon constantly listening for upcalls on a channel. An
alternative in those cases might simply be to run the app that fills the
cache using call_usermodehelper_exec() and friends.

The following patch allows the cache_detail to specify alternative upcall
mechanisms for these particular cases.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:14:29 -04:00
Trond Myklebust
2da8ca26c6 NFSD: Clean up the idmapper warning...
What part of 'internal use' is so hard to understand?

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:14:26 -04:00
Trond Myklebust
7d217caca5 SUNRPC: Replace rpc_client->cl_dentry and cl_mnt, with a cl_path
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:14:24 -04:00
Trond Myklebust
b693ba4a33 SUNRPC: Constify rpc_pipe_ops...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:14:15 -04:00
Chuck Lever
4116092b92 NFSD: Support IPv6 addresses in write_failover_ip()
In write_failover_ip(), replace the sscanf() with a call to the common
sunrpc.ko presentation address parser.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:40 -04:00
Chuck Lever
c15128c5e4 lockd: Replace nsm_display_address() with rpc_ntop()
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:39 -04:00
Chuck Lever
b97a56747e lockd: Replace nlm_clear_port()
Clean up: Use shared rpc_set_port() function instead of nlm_clear_port().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:38 -04:00
Chuck Lever
ec6ee61250 NFS: Replace nfs_set_port() with rpc_set_port()
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:37 -04:00
Chuck Lever
53a0b9c4c9 NFS: Replace nfs_parse_ip_address() with rpc_pton()
Clean up: Use the common routine now provided in sunrpc.ko for parsing mount
addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:36 -04:00
Chuck Lever
a02d692611 SUNRPC: Provide functions for managing universal addresses
Introduce a set of functions in the kernel's RPC implementation for
converting between a socket address and either a standard
presentation address string or an RPC universal address.

The universal address functions will be used to encode and decode
RPCB_FOO and NFSv4 SETCLIENTID arguments.  The other functions are
part of a previous promise to deliver shared functions that can be
used by upper-layer protocols to display and manipulate IP
addresses.

The kernel's current address printf formatters were designed
specifically for kernel to user-space APIs that require a particular
string format for socket addresses, thus are somewhat limited for the
purposes of sunrpc.ko.  The formatter for IPv6 addresses, %pI6, does
not support short-handing or scope IDs.  Also, these printf formatters
are unique per address family, so a separate formatter string is
required for printing AF_INET and AF_INET6 addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:34 -04:00
Chuck Lever
ec88f28d1e NFS: Use the authentication flavor list returned by mountd
Commit a14017db added support in the kernel's NFS mount client to
decode the authentication flavor list returned by mountd.

The NFS client can now use this list to determine whether the
authentication flavor requested by the user is actually supported
by the server.

Note we don't actually negotiate the security flavor if none was
specified by the user.  Instead, we try to use AUTH_SYS, and fail if
the server does not support it.  This prevents us from negotiating
an inappropriate security flavor (some servers list AUTH_NULL first).

If the server does not support AUTH_SYS, the user must provide an
appropriate security flavor by specifying the "sec=" mount option.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:32 -04:00
Chuck Lever
059f90b323 NFS: Fix auth flavor len accounting
Previous logic in the NFS mount parsing code path assumed
auth_flavor_len was set to zero for simple authentication flavors
(like AUTH_UNIX), and 1 for compound flavors (like AUTH_GSS).

At some earlier point (maybe even before the option parsers were
merged?) specific checks for auth_flavor_len being zero were removed
from the functions that validate the mount option that sets the mount
point's authentication flavor.

Since we are populating an array for authentication flavors, the
auth_flavor_len should always be set to the number of flavors.  Let's
eliminate some cleverness here, and prepare for new logic that needs
to know the number of flavors in the auth_flavors[] array.

(auth_flavors[] is an array because at some point we want to allow a
list of acceptable authentication flavors to be specified via the sec=
mount option.  For now it remains a single element array).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:31 -04:00
Chuck Lever
0b524123c9 NFS: Add ability to send MOUNTPROC_UMNT to the kernel's mountd client
After certain failure modes of an NFS mount, an NFS client should send
a MOUNTPROC_UMNT request to remove the just-added mount entry from the
server's mount table.  While no-one should rely on the accuracy of the
server's mount table, sending a UMNT is simply being a good internet
neighbor.

Since NFS mount processing is handled in the kernel now, we will need
a function in the kernel's mountd client that can post a MOUNTRPC_UMNT
request, in order to handle these failure modes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:30 -04:00
Chuck Lever
f3f4f4ed26 NFS: Fix up new minorversion= option
The new minorversion= mount option (commit 3fd5be9e) was merged at
the same time as the recent sloppy parser fixes (commit a5a16bae),
so minorversion= still uses the old value parsing logic.

If the minorversion= option specifies a bogus value, it should fail
with "bad value" not "bad option."

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:09:29 -04:00
Trond Myklebust
c140aa9135 NFSv4: Clean up the nfs.callback_tcpport option
Tighten up the validity checking in param_set_port: check for NULL pointers.
Ensure that the option shows up on 'modinfo' output.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:06:19 -04:00
Trond Myklebust
80e52aced1 NFSv4: Don't do idmapper upcalls for asynchronous RPC calls
We don't want to cause rpciod to hang...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:06:19 -04:00
Trond Myklebust
62ab460cf5 NFSv4: Add 'server capability' flags for NFSv4 recommended attributes
If the NFSv4 server doesn't support a POSIX attribute, the generic NFS code
needs to know that, so that it don't keep trying to poll for it.

However, by the same count, if the NFSv4 server does support that
attribute, then we should ensure that the inode metadata is appropriately
labelled as being untrusted. For instance, if we don't know the correct
value of the file's uid, we should certainly not be caching ACLs or ACCESS
results.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:06:19 -04:00
Trond Myklebust
a78cb57a10 NFSv4: Don't loop forever on state recovery failure...
If the server is broken, then retrying forever won't fix it. We
should just give up after a while, and return an error to the user.
We set the number of retries to 10 for now...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:06:19 -04:00
Roel Kluin
dd8ac1da41 nfs: Keep index within mnt_errtbl[]
Ensure that index i remains within array mnt_errtbl[] and mnt3_errtbl[].

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-09 15:06:19 -04:00
Linus Torvalds
d6a0967c90 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix balancing oops when invalidate_inode_pages2 returns EBUSY
  Btrfs: correct error-handling zlib error handling
  Btrfs: remove superfluous NULL pointer check in btrfs_rename()
  Btrfs: make sure the async caching thread advances the key
  Btrfs: fix btrfs_remove_from_free_space corner case
2009-08-07 19:03:09 -07:00
Linus Torvalds
fb385003c4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/hch/xfs-icache-races
* git://git.kernel.org/pub/scm/linux/kernel/git/hch/xfs-icache-races:
  xfs: fix freeing of inodes not yet added to the inode cache
  vfs: add __destroy_inode
  vfs: fix inode_init_always calling convention
2009-08-07 18:53:44 -07:00
Roel Kluin
8a57a9dda7 ocfs2: keep index within status_map[]
Do not exceed array status_map[]

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-08-07 13:16:50 -07:00
Sunil Mushran
e7432675f8 ocfs2: Initialize the cluster we're writing to in a non-sparse extend
In a non-sparse extend, we correctly allocate (and zero) the clusters between
the old_i_size and pos, but we don't zero the portions of the cluster we're
writing to outside of pos<->len.

It handles clustersize > pagesize and blocksize < pagesize.

[Cleaned up by Joel Becker.]

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-08-07 13:16:23 -07:00
Yan Zheng
ceab36edd3 Btrfs: fix balancing oops when invalidate_inode_pages2 returns EBUSY
invalidate_inode_pages2_range may return -EBUSY occasionally
which results Oops. This patch fixes the issue by moving
invalidate_inode_pages2_range into a loop and keeping calling
it until the return value is not -EBUSY.

The EBUSY return is temporary, and can happen when the btrfs release page
function is unable to release a page because the EXTENT_LOCK
bit is set.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-08-07 13:51:33 -04:00
Julia Lawall
60f2e8f8a0 Btrfs: correct error-handling zlib error handling
find_zlib_workspace returns an ERR_PTR value in an error case instead of NULL.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@match exists@
expression x, E;
statement S1, S2;
@@

x = find_zlib_workspace(...)
... when != x = E
(
*  if (x == NULL || ...) S1 else S2
|
*  if (x == NULL && ...) S1 else S2
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-08-07 13:51:33 -04:00
Bartlomiej Zolnierkiewicz
4baf8c9201 Btrfs: remove superfluous NULL pointer check in btrfs_rename()
This takes care of the following entry from Dan's list:

fs/btrfs/inode.c +4788 btrfs_rename(36) warning: variable derefenced before check 'old_inode'

Reported-by: Dan Carpenter <error27@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-08-07 13:47:08 -04:00
Linus Torvalds
389623fef0 Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6:
  jffs2: Fix return value from jffs2_do_readpage_nolock()
  mtd: mtdblock: introduce mtdblks_lock
  mtd: remove 'SBC8240 Wind River' Device Driver Code
  mtd: OneNAND: OMAP2/3: free GPMC CS on module removal
  mtd: OneNAND: fix incorrect bufferram offset
  mtd: blkdevs: do not forget to get MTD devices
  mtd: fix the conversion from dev to mtd_info
  mtd: let include/linux/mtd/partitions.h stand on its own
2009-08-07 10:42:31 -07:00
Linus Torvalds
3440625d78 flat: fix uninitialized ptr with shared libs
The new credentials code broke load_flat_shared_library() as it now uses
an uninitialized cred pointer.

Reported-by: Bernd Schmidt <bernds_cb1@t-online.de>
Tested-by: Bernd Schmidt <bernds_cb1@t-online.de>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-07 10:39:57 -07:00
OGAWA Hirofumi
2d8dd38a5a vfs: mnt_want_write_file(): fix special file handling
I suspect that mnt_want_write_file() may have wrong assumption.  I think
mnt_want_write_file() is assuming it increments ->mnt_writers if
(file->f_mode & FMODE_WRITE).  But, if it's special_file(), it is false?

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-07 10:39:56 -07:00
Eric Sandeen
69130c7cf9 compat_ioctl: hook up compat handler for FIEMAP ioctl
The FIEMAP_IOC_FIEMAP mapping ioctl was missing a 32-bit compat handler,
which means that 32-bit suerspace on 64-bit kernels cannot use this ioctl
command.

The structure is nicely aligned, padded, and sized, so it is just this
simple.

Tested w/ 32-bit ioctl tester (from Josef) on a 64-bit kernel on ext4.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Mark Lord <lkml@rtr.ca>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Josef Bacik <josef@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-07 10:39:56 -07:00
Christoph Hellwig
b36ec0428a xfs: fix freeing of inodes not yet added to the inode cache
When freeing an inode that lost race getting added to the inode cache we
must not call into ->destroy_inode, because that would delete the inode
that won the race from the inode cache radix tree.

This patch uses splits a new xfs_inode_free helper out of xfs_ireclaim
and uses that plus __destroy_inode to make sure we really only free
the memory allocted for the inode that lost the race, and not mess with
the inode cache state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Alex Samad <alex@samad.com.au>
Reported-by: Andrew Randrianasulu <randrik@mail.ru>
Reported-by: Stephane <sharnois@max-t.com>
Reported-by: Tommy <tommy@news-service.com>
Reported-by: Miah Gregory <mace@darksilence.net>
Reported-by: Gabriel Barazer <gabriel@oxeva.fr>
Reported-by: Leandro Lucarella <llucax@gmail.com>
Reported-by: Daniel Burr <dburr@fami.com.au>
Reported-by: Nickolay <newmail@spaces.ru>
Reported-by: Michael Guntsche <mike@it-loops.com>
Reported-by: Dan Carley <dan.carley+linuxkern-bugs@gmail.com>
Reported-by: Michael Ole Olsen <gnu@gmx.net>
Reported-by: Michael Weissenbacher <mw@dermichi.com>
Reported-by: Martin Spott <Martin.Spott@mgras.net>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Michael Guntsche <mike@it-loops.com>
Tested-by: Dan Carley <dan.carley+linuxkern-bugs@gmail.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
2009-08-07 14:38:34 -03:00
Christoph Hellwig
2e00c97e2c vfs: add __destroy_inode
When we want to tear down an inode that lost the add to the cache race
in XFS we must not call into ->destroy_inode because that would delete
the inode that won the race from the inode cache radix tree.

This patch provides the __destroy_inode helper needed to fix this,
the actual fix will be in th next patch.  As XFS was the only reason
destroy_inode was exported we shift the export to the new __destroy_inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-08-07 14:38:29 -03:00
Christoph Hellwig
54e346215e vfs: fix inode_init_always calling convention
Currently inode_init_always calls into ->destroy_inode if the additional
initialization fails.  That's not only counter-intuitive because
inode_init_always did not allocate the inode structure, but in case of
XFS it's actively harmful as ->destroy_inode might delete the inode from
a radix-tree that has never been added.  This in turn might end up
deleting the inode for the same inum that has been instanciated by
another process and cause lots of cause subtile problems.

Also in the case of re-initializing a reclaimable inode in XFS it would
free an inode we still want to keep alive.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-08-07 14:38:25 -03:00
James Morris
012a5299a2 Merge branch 'master' into next 2009-08-06 08:55:03 +10:00
Linus Torvalds
624720e09c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix missing unlock in error path of nilfs_mdt_write_page
  nilfs2: fix oops due to inconsistent state in page with discrete b-tree nodes
2009-08-04 15:28:23 -07:00
Linus Torvalds
849c9caa60 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] Update readme to reflect forceuid mount parms
  cifs: Read buffer overflow
  cifs: show noforceuid/noforcegid mount options (try #2)
  cifs: reinstate original behavior when uid=/gid= options are specified
  [CIFS] Updates fs/cifs/CHANGES
  cifs: fix error handling in mount-time DFS referral chasing code
2009-08-04 15:27:56 -07:00
Anders Grafström
57ca7deb06 jffs2: Fix return value from jffs2_do_readpage_nolock()
This fixes "kernel BUG at fs/jffs2/file.c:251!".
This pseudocode hopefully illustrates the scenario that triggers it:

jffs2_write_begin {
     jffs2_do_readpage_nolock {
         jffs2_read_inode_range {
             jffs2_read_dnode {
                 Data CRC 33c102e9 != calculated CRC 0ef77e7b for node at 005d42e4
                 return -EIO;
             }
         }
         ClearPageUptodate(pg);
         return 0;
     }
}

jffs2_write_end {
     BUG_ON(!PageUptodate(pg));
}

Signed-off-by: Anders Grafström <grfstrm@users.sourceforge.net>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-08-04 12:13:06 +01:00
Steve French
d098564f3b [CIFS] Update readme to reflect forceuid mount parms
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-08-04 03:53:28 +00:00
Roel Kluin
24e2fb615f cifs: Read buffer overflow
Check whether index is within bounds before testing the element.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-08-03 21:01:32 +00:00
Jeff Layton
4486d6ede1 cifs: show noforceuid/noforcegid mount options (try #2)
Since forceuid is the default, we now need to show when it's disabled.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-08-03 21:01:17 +00:00
Ryusuke Konishi
01a261e09a nilfs2: fix missing unlock in error path of nilfs_mdt_write_page
This adds a missing unlock of nilfs->ns_writer_mutex in
nilfs_mdt_write_page() function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-08-02 22:24:15 +09:00
Ingo Molnar
8e9ed8b024 Merge branch 'sched/urgent' into sched/core
Merge reason: avoid upcoming patch conflict.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-02 14:23:57 +02:00
Jeff Layton
9b9d6b2434 cifs: reinstate original behavior when uid=/gid= options are specified
This patch fixes the regression reported here:

http://bugzilla.kernel.org/show_bug.cgi?id=13861

commit 4ae1507f6d changed the default
behavior when the uid= or gid= option was specified for a mount. The
existing behavior was to always clobber the ownership information
provided by the server when these options were specified. The above
commit changed this behavior so that these options simply provided
defaults when the server did not provide this information (unless
"forceuid" or "forcegid" were specified)

This patch reverts this change so that the default behavior is restored.
It also adds "noforceuid" and "noforcegid" options to make it so that
ownership information from the server is preserved, even when the mount
has uid= or gid= options specified.

It also adds a couple of printk notices that pop up when forceuid or
forcegid options are specified without a uid= or gid= option.

Reported-by: Tom Chiverton <bugzilla.kernel.org@falkensweb.com>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-08-02 03:47:25 +00:00
Ryusuke Konishi
a97778457f nilfs2: fix oops due to inconsistent state in page with discrete b-tree nodes
Andrea Gelmini gave me a report that a kernel oops hit on a nilfs
filesystem with a 1KB block size when doing rsync.

This turned out to be caused by an inconsistency of dirty state
between a page and its buffers storing b-tree node blocks.

If the page had multiple buffers split over multiple logs, and if the
logs were written at a time, a dirty flag remained in the page even
every dirty flag in the buffers was cleared.

This will fix the failure by dropping the dirty flag properly for
pages with the discrete multiple b-tree nodes.

Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: stable@kernel.org
2009-08-01 22:48:32 +09:00
Goldwyn Rodrigues
ab57a40827 ocfs2: Remove redundant BUG_ON in __dlm_queue_ast()
We BUG_ON() the same thing twice.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-31 13:43:44 -07:00
Linus Torvalds
f5266cbd2f Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: bump up nr_to_write in xfs_vm_writepage
  xfs: reduce bmv_count in xfs_vn_fiemap
2009-07-31 12:17:37 -07:00
Chris Mason
013f1b12f4 Btrfs: make sure the async caching thread advances the key
The async caching thread can end up looping forever if a given
search puts it at the last key in a leaf.  It will end up calling
btrfs_next_leaf and then checking if it needs to politely drop
the read semaphore.

Most of the time this looping isn't noticed because it is able to
make progress the next time around.  But, during log replay,
we wait on the async caching thread to finish, and the async thread
is waiting on the commit, and no progress is really made.

The fix used here is to copy the key out of the next leaf,
that way our search lands there properly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-31 14:57:55 -04:00
Josef Bacik
6606bb97e1 Btrfs: fix btrfs_remove_from_free_space corner case
Yan Zheng hit a problem where we tried to remove some free space but failed
because we couldn't find the free space entry.  This is because the free space
was held within a bitmap that had a starting offset well before the actual
offset of the free space, and there were free space extents that were in the
same range as that offset, so tree_search_offset returned with NULL because we
couldn't find a free space extent that had that offset.  This is fixed by
making sure that if we fail to find the entry, we re-search again with
bitmap_only set to 1 and do an offset_to_bitmap so we can get the appropriate
bitmap.  A similar problem happens in btrfs_alloc_from_bitmap for the
clustering code, but that is not as bad since we will just go and redo our
cluster allocation.

Also this adds some debugging checks to make sure that the free space we are
trying to remove from the bitmap is in fact there.  This can probably go away
after a while, but since this code is only used by the tree-logging stuff it
would be nice to run with it for a while to make sure there are no problems.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-31 11:03:58 -04:00
Eric Sandeen
c8a4051c37 xfs: bump up nr_to_write in xfs_vm_writepage
VM calculation for nr_to_write seems off.  Bump it way
up, this gets simple streaming writes zippy again.
To be reviewed again after Jens' writeback changes.

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Cc: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-07-31 00:57:11 -05:00
Eric Sandeen
97db39a1f6 xfs: reduce bmv_count in xfs_vn_fiemap
commit 6321e3ed2a caused
the full bmv_count's worth of getbmapx structures to get
allocated; telling it to do MAXEXTNUM was a bit insane,
resulting in ENOMEM every time.

Chop it down to something reasonable, the number of slots
in the caller's input buffer.  If this is too large the
caller may get ENOMEM but the reason should not be a
mystery, and they can try again with something smaller.

We add 1 to the value because in the normal getbmap
world, bmv_count includes the header and xfs_getbmap does:

        nex = bmv->bmv_count - 1;
        if (nex <= 0)
                return XFS_ERROR(EINVAL);

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-07-31 00:56:58 -05:00
Linus Torvalds
ec6a8679fa Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: be more polite in the async caching threads
  Btrfs: preserve commit_root for async caching
2009-07-30 16:46:48 -07:00
Linus Torvalds
784b1d6b21 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  udf: Fix loading of VAT inode when drive wrongly reports number of recorded blocks
2009-07-30 16:46:17 -07:00
Linus Torvalds
691c5f7374 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6:
  quota: Silence lockdep on quota_on
2009-07-30 16:46:06 -07:00
Linus Torvalds
79af313317 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: remove dcache entries for remote deleted inodes
  GFS2: Fix incorrent statfs consistency check
  GFS2: Don't put unlikely reclaim candidates on the reclaim list.
  GFS2: Don't try and dealloc own inode
  GFS2: Fix panic in glock memory shrinker
  GFS2: keep statfs info in sync on grows
  GFS2: Shrink the shrinker
2009-07-30 16:45:37 -07:00
Tao Ma
e9956fae7d ocfs2/quota: Release lock for error in ocfs2_quota_write.
ocfs2_quota_write needs to release the lock if it fails to
read quota block. So use "goto out" instead of "return err".

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-30 11:06:06 -07:00
Jan Kara
dee865656f quota: Silence lockdep on quota_on
Commit d01730d74d didn't completely fix
the problem since we still take dqio_mutex and i_mutex in the wrong
order. Move taking of i_mutex further down (luckily it's needed only
for updating inode flags) below where dqio_mutex is taken.

Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-30 17:31:23 +02:00
Jan Kara
4bf17af0db udf: Fix loading of VAT inode when drive wrongly reports number of recorded blocks
VAT inode is located in the last block recorded block of the medium. When the
drive errorneously reports number of recorded blocks, we failed to load the VAT
inode and thus mount the medium. This patch makes kernel try to read VAT inode
from the last block of the device if it is different from the last recorded
block.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-07-30 17:28:26 +02:00
Chris Mason
f36f3042ea Btrfs: be more polite in the async caching threads
The semaphore used by the async caching threads can prevent a
transaction commit, which can make the FS appear to stall.  This
releases the semaphore more often when a transaction commit is
in progress.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-30 10:14:46 -04:00
Yan Zheng
276e680d19 Btrfs: preserve commit_root for async caching
The async block group caching code uses the commit_root pointer
to get a stable version of the extent allocation tree for scanning.
This copy of the tree root isn't going to change and it significantly
reduces the complexity of the scanning code.

During a commit, we have a loop where we update the extent allocation
tree root.  We need to loop because updating the root pointer in
the tree of tree roots may allocate blocks which may change the
extent allocation tree.

Right now the commit_root pointer is changed inside this loop.  It
is more correct to change the commit_root pointer only after all the
looping is done.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-30 09:40:40 -04:00
Benjamin Marzinski
b94a170e96 GFS2: remove dcache entries for remote deleted inodes
When a file is deleted from a gfs2 filesystem on one node, a dcache
entry for it may still exist on other nodes in the cluster. If this
happens, gfs2 will be unable to free this file on disk. Because of this,
it's possible to have a gfs2 filesystem with no files on it and no free
space. With this patch, when a node receives a callback notifying it
that the file is being deleted on another node, it schedules a new
workqueue thread to remove the file's dcache entry.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 11:01:03 +01:00
Benjamin Marzinski
6b94617024 GFS2: Fix incorrent statfs consistency check
Since both linked and unlinked inodes are counted by rgd->rd_dinodes, It
makes no sense to count them with the used data blocks (first check that
I changed), it makes sense to count them with the linked inodes (second
check), and it makes no sense to care if there are more unlinked inodes
than linked ones. This fixes these errors.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 11:00:28 +01:00
Benjamin Marzinski
8ff22a6f9b GFS2: Don't put unlikely reclaim candidates on the reclaim list.
GFS2 was placing far too many glocks on the reclaim list that were not good
candidates for freeing up from cache.  These locks would sit there and
repeatedly get scanned to see if they could be reclaimed, wasting a lot
of time when there was memory pressure. This fix does more checks on the
locks to see if they are actually likely to be removable from cache.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 11:00:09 +01:00
Steven Whitehouse
1e19a19584 GFS2: Don't try and dealloc own inode
When searching for unlinked, but still allocated inodes during block
allocation, avoid the block relating to the inode that is doing the
allocation. This fixes a hang caused when an unlinked, but still
open, inode tries to allocate some more blocks and lands up
finding itself during the search for deallocatable inodes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 10:59:50 +01:00
Benjamin Marzinski
a51b56fff3 GFS2: Fix panic in glock memory shrinker
It is possible for gfs2_shrink_glock_memory() to check a glock for
demotion
that's in the process of being freed by gfs2_glock_put().  In this case,
gfs2_shrink_glock_memory() will acquire a new reference to this glock,
and
then try to free the glock itself when it drops the refernce.  To solve
this, gfs2_shrink_glock_memory() just needs to check if the glock is in
the process of being freed, and if so skip it without ever unlocking the
lru_lock.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 10:59:28 +01:00
Benjamin Marzinski
1946f70ab5 GFS2: keep statfs info in sync on grows
GFS2 wasn't syncing its statfs info on grows.  This causes a problem
when you grow the filesystem on multiple nodes.  GFS2 would calculate
the new space based on the resource groups (which are always current),
and then assume that the filesystem had grown the from the existing
statfs size.  If you grew the filesystem on two different nodes in a
short time, the second node wouldn't see the statfs size change from the
first node, and would assume that it was grown by a larger amount than
it was.  When all these changes were synced out, the total fileystem
size would be incorrect (the first grow would be counted twice).

This patch syncs makes GFS2 read in the statfs changes from disk before
a grow, and write them out after the grow, while the master statfs inode
is locked.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 10:52:33 +01:00
Steven Whitehouse
2163b1e616 GFS2: Shrink the shrinker
This patch removes some of the special cases that the shrinker
was trying to deal with. As a result we leave fewer items on
the list and none at all which cannot be demoted. This makes
the list scanning more efficient and solves some issues seen
with large numbers of inodes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-07-30 10:52:14 +01:00
Steve French
5bd9052d79 [CIFS] Updates fs/cifs/CHANGES
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-30 02:26:14 +00:00
Linus Torvalds
91a5698d1f Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
  PM / Hibernate: Replace bdget call with simple atomic_inc of i_count
  PM / ACPI: HP G7000 Notebook needs a SCI_EN resume quirk
2009-07-29 19:15:18 -07:00
Catalin Marinas
5c80536523 fs/ramfs/file-nommu.c needs include/linux/sched.h
This file makes use of various macros defined in files like asm/current.h
or asm-generic/resource.h.  All these files can be included via sched.h.
The building of the !MMU ARM kernel (with additional patches) fails
without this change.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-29 19:10:36 -07:00
Linus Torvalds
ccf5675a82 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  driver core: documentation: make it clear that sysfs is optional
  driver core: sysdev: do not send KOBJ_ADD uevent if kobject_init_and_add fails
  Dynamic debug: fix typo: -/->
  driver core: firmware_class:fix memory leak of page pointers array
  sysfs: fix hardlink count on device_move
2009-07-29 12:29:08 -07:00
Alan Jenkins
dddac6a7b4 PM / Hibernate: Replace bdget call with simple atomic_inc of i_count
Create bdgrab().  This function copies an existing reference to a
block_device.  It is safe to call from any context.

Hibernation code wishes to copy a reference to the active swap device.
Right now it calls bdget() under a spinlock, but this is wrong because
bdget() can sleep.  It doesn't need a full bdget() because we already
hold a reference to active swap devices (and the spinlock protects
against swapoff).

Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827

Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2009-07-29 21:07:55 +02:00
Linus Torvalds
655c5d8fc1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (22 commits)
  Btrfs: Fix async caching interaction with unmount
  Btrfs: change how we unpin extents
  Btrfs: Correct redundant test in add_inode_ref
  Btrfs: find smallest available device extent during chunk allocation
  Btrfs: clear all space_info->full after removing a block group
  Btrfs: make flushoncommit mount option correctly wait on ordered_extents
  Btrfs: Avoid delayed reference update looping
  Btrfs: Fix ordering of key field checks in btrfs_previous_item
  Btrfs: find_free_dev_extent doesn't handle holes at the start of the device
  Btrfs: Remove code duplication in comp_keys
  Btrfs: async block group caching
  Btrfs: use hybrid extents+bitmap rb tree for free space
  Btrfs: Fix crash on read failures at mount
  Btrfs: remove of redundant btrfs_header_level
  Btrfs: adjust NULL test
  Btrfs: Remove broken sanity check from btrfs_rmap_block()
  Btrfs: convert nested spin_lock_irqsave to spin_lock
  Btrfs: make sure all dirty blocks are written at commit time
  Btrfs: fix locking issue in btrfs_find_next_key
  Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf
  ...
2009-07-28 14:27:06 -07:00
Ramon de Carvalho Valle
f151cd2c54 eCryptfs: parse_tag_3_packet check tag 3 packet encrypted key size
The parse_tag_3_packet function does not check if the tag 3 packet contains a
encrypted key size larger than ECRYPTFS_MAX_ENCRYPTED_KEY_BYTES.

Signed-off-by: Ramon de Carvalho Valle <ramon@risesecurity.org>
[tyhicks@linux.vnet.ibm.com: Added printk newline and changed goto to out_free]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: stable@kernel.org (2.6.27 and 30)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-28 14:26:06 -07:00
Tyler Hicks
6352a29305 eCryptfs: Check Tag 11 literal data buffer size
Tag 11 packets are stored in the metadata section of an eCryptfs file to
store the key signature(s) used to encrypt the file encryption key.
After extracting the packet length field to determine the key signature
length, a check is not performed to see if the length would exceed the
key signature buffer size that was passed into parse_tag_11_packet().

Thanks to Ramon de Carvalho Valle for finding this bug using fsfuzzer.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: stable@kernel.org (2.6.27 and 30)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-28 14:26:06 -07:00
Peter Oberparleiter
0f58b44582 sysfs: fix hardlink count on device_move
Update directory hardlink count when moving kobjects to a new parent.
Fixes the following problem which occurs when several devices are
moved to the same parent and then unregistered:

> ls -laF /sys/devices/css0/defunct/
> total 0
> drwxr-xr-x 4294967295 root root    0 2009-07-14 17:02 ./
> drwxr-xr-x        114 root root    0 2009-07-14 17:02 ../
> drwxr-xr-x          2 root root    0 2009-07-14 17:01 power/
> -rw-r--r--          1 root root 4096 2009-07-14 17:01 uevent

Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-28 13:45:21 -07:00
Yan Zheng
f25784b35f Btrfs: Fix async caching interaction with unmount
- don't stop the caching thread until btrfs_commit_super return.

- if caching is interrupted by umount, set last to (u64)-1.
  otherwise the un-scanned range of block group will be considered
  as free extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-28 08:41:57 -04:00
Jeff Layton
7b91e2661a cifs: fix error handling in mount-time DFS referral chasing code
If the referral is malformed or the hostname can't be resolved, then
the current code generates an oops. Fix it to handle these errors
gracefully.

Reported-by: Sandro Mathys <sm@sandro-mathys.ch>
Acked-by: Igor Mammedov <niallain@gmail.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-07-28 00:51:59 +00:00
Linus Torvalds
fc013a5885 Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
  inotify: use GFP_NOFS under potential memory pressure
  fsnotify: fix inotify tail drop check with path entries
  inotify: check filename before dropping repeat events
  fsnotify: use def_bool in kconfig instead of letting the user choose
  inotify: fix error paths in inotify_update_watch
  inotify: do not leak inode marks in inotify_add_watch
  inotify: drop user watch count when a watch is removed
2009-07-27 15:54:10 -07:00
Linus Torvalds
a9355cf8e6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6:
  jfs: Fix early release of acl in jfs_get_acl
2009-07-27 12:15:56 -07:00
Linus Torvalds
2bc20d09b0 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  jbd: fix race between write_metadata_buffer and get_write_access
  ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle()
  jbd: Fix a race between checkpointing code and journal_get_write_access()
  ext3: Fix truncation of symlinks after failed write
  jbd: Fail to load a journal if it is too short
2009-07-27 12:12:10 -07:00
Linus Torvalds
c7425eb481 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] fix sparse warning
  cifs: fix sb->s_maxbytes so that it casts properly to a signed value
  cifs: disable serverino if server doesn't support it
2009-07-27 12:11:43 -07:00
Josef Bacik
68b38550dd Btrfs: change how we unpin extents
We are racy with async block caching and unpinning extents.  This patch makes
things much less complicated by only unpinning the extent if the block group is
cached.  We check the block_group->cached var under the block_group->lock spin
lock.  If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back.  If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent.  This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.

One thing that needed to be changed was btrfs_free_super_mirror_extents.  Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory.  So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-27 13:57:01 -04:00
Julia Lawall
631c07c8d1 Btrfs: Correct redundant test in add_inode_ref
dir has already been tested.  It seems that this test should be on the
recently returned value inode.

A simplified version of the semantic match that finds this problem is as
follows: (http://www.emn.fr/x-info/coccinelle/)

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-27 13:57:00 -04:00
Chris Mason
9779b72f05 Btrfs: find smallest available device extent during chunk allocation
Allocating new block group is easy when the disk has plenty of space.
But things get difficult as the disk fills up, especially if
the FS has been run through btrfs-vol -b.  The balance operation
is likely to make the total bytes available on the device greater
than the largest extent we'll actually be able to allocate.

But the device extent allocation code incorrectly assumes that a device
with 5G free will be able to allocate a 5G extent.  It isn't normally a
problem because device extents don't get freed unless btrfs-vol -b
is run.

This fixes the device extent allocator to remember the largest free
extent it can find, and then uses that value as a fallback.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 16:41:41 -04:00
Chris Mason
283bb1979f Btrfs: clear all space_info->full after removing a block group
Btrfs allocates individual extents from block groups, and each
block group has a specific type.  It may hold metadata, data
mirrored or striped etc.

When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
we free block groups.  Once a block group is freed, the space it was
using on the device may be available for use by new block groups.

btrfs_remove_block_group was clearing the flag that said
'our devices are full, don't even try to allocate new block groups',
but it was only clearing that flag for a specific type of block group.

This commit clears the full flag for all of the types of block groups,
making it much more likely that we'll be able to balance space when
the drive is close to full.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 16:30:55 -04:00
Sage Weil
ebecd3d9d2 Btrfs: make flushoncommit mount option correctly wait on ordered_extents
The commit_transaction call to wait_ordered_extents when snap_pending
passes nocow_only=1 to process only NOCOW or PREALLOC extents.  This isn't
correct for the 'flushoncommit' mode, as it skips extents we just started
IO on in start_delalloc_inodes.

So, in the flushoncommit case, wait on all ordered extents.  Otherwise,
only pass the nocow_only flag to wait_ordered_extents if snap_pending.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 13:17:44 -04:00
Yan Zheng
d717aa1d31 Btrfs: Avoid delayed reference update looping
btrfs_split_leaf and btrfs_del_items can end up in a loop
where one is constantly spliting a given leaf and the other
is constantly merging it back with the adjacent nodes.

There is a better fix for this, but in the interest of something
small, this patch just changes btrfs_del_items back to balancing less
often.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 12:42:46 -04:00
Yan Zheng
0a4eefbb74 Btrfs: Fix ordering of key field checks in btrfs_previous_item
Check objectid of item before checking the item type, otherwise we may return
zero for a key that is actually too low.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:47 -04:00
Yan Zheng
1fcbac581b Btrfs: find_free_dev_extent doesn't handle holes at the start of the device
find_free_dev_extent does not properly handle the case where
the device is not complete free, and there is a free extent
at the beginning of the device.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:47 -04:00
Diego Calleja
20736abaa3 Btrfs: Remove code duplication in comp_keys
comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
call it.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 11:22:46 -04:00
Josef Bacik
817d52f8db Btrfs: async block group caching
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner.  Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation.  If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching.  This is how I tested
the speedup from this

mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo

Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.

Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group.  This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.

I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached.  This drastically reduces the amount of time it takes to
cache the block groups.  Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.

This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 09:23:39 -04:00
Josef Bacik
9630308170 Btrfs: use hybrid extents+bitmap rb tree for free space
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space.  As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area.  Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.

This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache.  Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram.  The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.

Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space.  This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap.  This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.

I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree.  I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree.  This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.

I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff.  I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues.  I've not seen
any performance regressions in any of my tests.

Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems.  Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-07-24 09:23:30 -04:00
Jan Kara
0584974a77 ocfs2: Define credit counts for quota operations
Numbers of needed credits for some quota operations were written
as raw numbers. Create appropriate defines instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:31 -07:00
Jan Kara
4539f1df25 ocfs2: Remove syncjiff field from quota info
syncjiff is just a converted value of syncms. Some places which
are updating syncms forgot to update syncjiff as well. Since the
conversion is just a simple division / multiplication and it does
not happen frequently, just remove the syncjiff field to avoid
forgotten conversions.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:27 -07:00
Jan Kara
1c1d9793ff ocfs2: Fix initialization of blockcheck stats
We just set blockcheck stats to zeros but we should also
properly initialize the spinlock there.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:24 -07:00
Jan Kara
7669f54c55 ocfs2: Zero out padding of on disk dquot structure
Padding fields of on-disk dquot structure were not zeroed. Zero them
so that it's easier to use them later.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:17 -07:00
Jan Kara
0e7f387bf3 ocfs2: Initialize blocks allocated to local quota file
When we extend local quota file, we should initialize data
in newly allocated block. Firstly because on recovery we could
parse bogus data, secondly so that block checksums are properly
computed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:11 -07:00
Jan Kara
4b3fa1904c ocfs2: Mark buffer uptodate before calling ocfs2_journal_access_dq()
In a code path extending local quota files we marked new header
buffer uptodate only after calling ocfs2_journal_access_dq() which
triggers a bug. Fix it and also call ocfs2 variant of the function
marking buffer uptodate.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:59:05 -07:00
Jan Kara
b57ac2c43e ocfs2: Make global quota files blocksize aligned
Change i_size of global quota files so that it always remains aligned to block
size. This is mainly because the end of quota block may contain checksum (if
checksumming is enabled) and it's a bit awkward for it to be "outside" of quota
file (and it makes life harder for ocfs2-tools).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:58:59 -07:00
Tao Ma
82e12644cf ocfs2: Use ocfs2_rec_clusters in ocfs2_adjust_adjacent_records.
In ocfs2_adjust_adjacent_records, we will adjust adjacent records
according to the extent_list in the lower level. But actually
the lower level tree will either be a leaf or a branch. If we only
use ocfs2_is_empty_extent we will meet with some problem if the lower
tree is a branch (tree_depth > 1). So use !ocfs2_rec_clusters instead.
And actually only the leaf record can have holes. So add a BUG_ON
for non-leaf branch.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-23 10:58:46 -07:00
Stefan Bader
4a19fb11a9 jfs: Fix early release of acl in jfs_get_acl
BugLink: http://bugs.launchpad.net/ubuntu/+bug/396780

Commit 073aaa1b14 "helpers for acl
caching + switch to those" introduced new helper functions for
acl handling but seems to have introduced a regression for jfs as
the acl is released before returning it to the caller, instead of
leaving this for the caller to do.
This causes the acl object to be used after freeing it, leading
to kernel panics in completely different places.

Thanks to Christophe Dumez for reporting and bisecting into this.

Reported-by: Christophe Dumez <dchris@gmail.com>
Tested-by: Christophe Dumez <dchris@gmail.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
2009-07-23 11:08:36 -05:00