Commit Graph

16972 Commits

Author SHA1 Message Date
Joel Becker
399ff3a748 ocfs2: Handle errors while setting external xattr values.
ocfs2 can store extended attribute values as large as a single file.  It
does this using a standard ocfs2 btree for the large value.  However,
the previous code did not handle all error cases cleanly.

There are multiple problems to have.

1) We have trouble allocating space for a new xattr.  This leaves us
   with an empty xattr.
2) We overwrote an existing local xattr with a value root, and now we
   have an error allocating the storage.  This leaves us an empty xattr.
   where there used to be a value.  The value is lost.
3) We have trouble truncating a reused value.  This leaves us with the
   original entry pointing to the truncated original value.  The value
   is lost.
4) We have trouble extending the storage on a reused value.  This leaves
   us with the original value safely in place, but with more storage
   allocated when needed.

This doesn't consider storing local xattrs (values that don't require a
btree).  Those only fail when the journal fails.

Case (1) is easy.  We just remove the xattr we added.  We leak the
storage because we can't safely remove it, but otherwise everything is
happy.  We'll print a warning about the leak.

Case (4) is easy.  We still have the original value in place.  We can
just leave the extra storage attached to this xattr.  We return the
error, but the old value is untouched.  We print a warning about the
storage.

Case (2) and (3) are hard because we've lost the original values.  In
the old code, we ended up with values that could be partially read.
That's not good.  Instead, we just wipe the xattr entry and leak the
storage.  It stinks that the original value is lost, but now there isn't
a partial value to be read.  We'll print a big fat warning.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
139ffacebf ocfs2: Set inline xattr entries with ocfs2_xa_set()
ocfs2_xattr_ibody_set() is the only remaining user of
ocfs2_xattr_set_entry().  ocfs2_xattr_set_entry() actually does two
things: it calls ocfs2_xa_set(), and it initializes the inline xattrs.
Initializing the inline space really belongs in its own call.

We lift the initialization to ocfs2_xattr_ibody_init(), called from
ocfs2_xattr_ibody_set() only when necessary.  Now
ocfs2_xattr_ibody_set() can call ocfs2_xa_set() directly.
ocfs2_xattr_set_entry() goes away.

Another nice fact is that ocfs2_init_dinode_xa_loc() can trust
i_xattr_inline_size.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
d3981544d7 ocfs2: Set xattr block entries with ocfs2_xa_set()
ocfs2_xattr_block_set() calls into ocfs2_xattr_set_entry() with just the
HAS_XATTR flag.  Most of the machinery of ocfs2_xattr_set_entry() is
skipped.  All that really happens other than the call to ocfs2_xa_set()
is making sure the HAS_XATTR flag is set on the inode.

But HAS_XATTR should be set when we also set di->i_xattr_loc.  And
that's done in ocfs2_create_xattr_block().  So let's move it there, and
then ocfs2_xattr_block_set() can just call ocfs2_xa_set().

While we're there, ocfs2_create_xattr_block() can take the set_ctxt for
a smaller argument list.  It also learns to set HAS_XATTR_FL, because it
knows for sure.  ocfs2_create_empty_xatttr_block() in the reflink path
fakes a set_ctxt to call ocfs2_create_xattr_block().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:13 -08:00
Joel Becker
c5d95df5f7 ocfs2: Let ocfs2_xa_prepare_entry() do space checks.
ocfs2_xattr_set_in_bucket() doesn't need to do its own hacky space
checking.  Let's let ocfs2_xa_prepare_entry() (via ocfs2_xa_set()) do
the more accurate work.  Whenever it doesn't have space,
ocfs2_xattr_set_in_bucket() can try to get more space.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:12 -08:00
Joel Becker
bca5e9bd1e ocfs2: Gell into ocfs2_xa_set()
ocfs2_xa_set() wraps the ocfs2_xa_prepare_entry()/ocfs2_xa_store_value()
logic.  Both callers can now use the same routine.  ocfs2_xa_remove()
moves directly into ocfs2_xa_set().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
73857ee0b5 ocfs2: Allocation in ocfs2_xa_prepare_entry(), values in ocfs2_xa_store_value()
ocfs2_xa_prepare_entry() gets all the logic to add, remove, or modify
external value trees.  Now, when it exits, the entry is ready to receive
a value of any size.

ocfs2_xa_remove() is added to handle the complete removal of an entry.
It truncates the external value tree before calling
ocfs2_xa_remove_entry().

ocfs2_xa_store_inline_value() becomes ocfs2_xa_store_value().  It can
store any value.

ocfs2_xattr_set_entry() loses all the allocation logic and just uses
these functions.  ocfs2_xattr_set_value_outside() disappears.

ocfs2_xattr_set_in_bucket() uses these functions and makes
ocfs2_xattr_set_entry_in_bucket() obsolete.  That goes away, as does
ocfs2_xattr_bucket_set_value_outside() and
ocfs2_xattr_bucket_value_truncate().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
cf2bc80940 ocfs2: Teach ocfs2_xa_loc how to do its own journal work
We're going to want to make sure our buffers get accessed and dirtied
correctly.  So have the xa_loc do the work.  This includes storing the
inode on ocfs2_xa_loc.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
3fc12afa0c ocfs2: Provide ocfs2_xa_fill_value_buf() for external value processing
We use the ocfs2_xattr_value_buf structure to manage external values.
It lets the value tree code do its work regardless of the containing
storage.  ocfs2_xa_fill_value_buf() initializes a value buf from an
ocfs2_xa_loc entry.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:11 -08:00
Joel Becker
9dc474005d ocfs2: Handle value tree roots in ocfs2_xa_set_inline_value()
Previously the xattr code would send in a fake value, containing a tree
root, to the function that installed name+value pairs.  Instead, we pass
the real value to ocfs2_xa_set_inline_value(), and it notices that the
value cannot fit.  Thus, it installs a tree root.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:10 -08:00
Joel Becker
69a3e539d0 ocfs2: Set the xattr name+value pair in one place
We create two new functions on ocfs2_xa_loc, ocfs2_xa_prepare_entry()
and ocfs2_xa_store_inline_value().

ocfs2_xa_prepare_entry() makes sure that the xl_entry field of
ocfs2_xa_loc is ready to receive an xattr.  The entry will point to an
appropriately sized name+value region in storage.  If an existing entry
can be reused, it will be.  If no entry already exists, it will be
allocated.  If there isn't space to allocate it, -ENOSPC will be
returned.

ocfs2_xa_store_inline_value() stores the data that goes into the 'value'
part of the name+value pair.  For values that don't fit directly, this
stores the value tree root.

A number of operations are added to ocfs2_xa_loc_operations to support
these functions.  This reflects the disparate behaviors of xattr blocks
and buckets.

With these functions, the overlapping ocfs2_xattr_set_entry_local() and
ocfs2_xattr_set_entry_normal() can be replaced with a single call
scheme.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:10 -08:00
Joel Becker
199799a360 ocfs2: Wrap calculation of name+value pair size.
An ocfs2 xattr entry stores the text name and value as a pair in the
storage area.  Obviously names and values can be variable-sized.  If a
value is too large for the entry storage, a tree root is stored instead.
The name+value pair is also padded.

Because of this, there are a million places in the code that do:

	if (needs_external_tree(value_size)
		namevalue_size = pad(name_size) + tree_root_size;
	else
		namevalue_size = pad(name_size) + pad(value_size);

Let's create some convenience functions to make the code more readable.
There are three forms.  The first takes the raw sizes.  The second takes
an ocfs2_xattr_info structure.  The third takes an existing
ocfs2_xattr_entry.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:10 -08:00
Joel Becker
18853b95d1 ocfs2: Add a name_len field to ocfs2_xattr_info.
Rather than calculating strlen all over the place, let's store the
name length directly on ocfs2_xattr_info.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:09 -08:00
Joel Becker
6b240ff63c ocfs2: Prefix the member fields of struct ocfs2_xattr_info.
struct ocfs2_xattr_info is a useful structure describing an xattr
you'd like to set.  Let's put prefixes on the member fields so it's
easier to read and use.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:09 -08:00
Joel Becker
bde1e5400a ocfs2: Remove xattrs via ocfs2_xa_loc
Add ocfs2_xa_remove_entry(), which will remove an xattr entry from its
storage via the ocfs2_xa_loc descriptor.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:09 -08:00
Joel Becker
11179f2c92 ocfs2: Introduce ocfs2_xa_loc
The ocfs2 extended attribute (xattr) code is very flexible.  It can
store xattrs in the inode itself, in an external block, or in a tree of
data structures.  This allows the number of xattrs to be bounded by the
filesystem size.

However, the code that manages each possible storage location is
different.  Maintaining the ocfs2 xattr code requires changing each hunk
separately.

This patch is the start of a series introducing the ocfs2_xa_loc
structure.  This structure wraps the on-disk details of an xattr
entry.  The goal is that the generic xattr routines can use
ocfs2_xa_loc without knowing the underlying storage location.

This first pass merely implements the basic structure, initializing it,
and wiping the name+value pair of the entry.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:08 -08:00
Sunil Mushran
8545e03d82 ocfs2: Add current->comm in trace output
Add current->comm to the standard mlog() output to help with debugging.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:08 -08:00
Wengang Wang
96a1cc731a ocfs2: Clean up the checks for CoW and direct I/O.
When ocfs2 has to do CoW for refcounted extents, we disable direct I/O
and go through the buffered I/O path.  This makes the combined check
easier to read.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:07 -08:00
Tiger Yang
b89c54282d ocfs2: add extent block stealing for ocfs2 v5
This patch add extent block (metadata) stealing mechanism for
extent allocation. This mechanism is same as the inode stealing.
if no room in slot specific extent_alloc, we will try to
allocate extent block from the next slot.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:07 -08:00
Alex Elder
398007f863 Merge branch 'linux-2.6.33' 2010-02-26 14:34:02 -06:00
David Teigland
b6fa8796b2 dlm: use bastmode in debugfs output
The bast mode that appears in the debugfs output should be
useful on both master and process nodes.  lkb_highbast is
currently printed, and is only useful on the master node.
lkb_bastmode is only useful on the process node.  This
patch sets lkb_bastmode on the master node as well, and
uses that value in the debugfs print.

Signed-off-by: David Teigland <teigland@redhat.com>
2010-02-26 12:15:54 -06:00
Steven Whitehouse
b4a5d4bc37 dlm: Send lockspace name with uevents
Although it is possible to get this information from the path,
its much easier to provide the lockspace as a seperate env
variable.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2010-02-26 12:14:25 -06:00
David Teigland
cf6620acc0 dlm: send reply before bast
When the lock master processes a successful operation (request,
convert, cancel, or unlock), it will process the effects of the
change before sending the reply for the operation.  The "effects"
of the operation are:

- blocking callbacks (basts) for any newly granted locks
- waiting or converting locks that can now be granted

The cast is queued on the local node when the reply from the lock
master is received.  This means that a lock holder can receive a
bast for a lock mode that is doesn't yet know has been granted.

Signed-off-by: David Teigland <teigland@redhat.com>
2010-02-26 11:57:37 -06:00
Martin K. Petersen
8a78362c4e block: Consolidate phys_segment and hw_segment limits
Except for SCSI no device drivers distinguish between physical and
hardware segment limits.  Consolidate the two into a single segment
limit.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-26 13:58:08 +01:00
Linus Torvalds
6ebdc661b6 Merge branch 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (41 commits)
  of: remove undefined request_OF_resource & release_OF_resource
  of/sparc: Remove sparc-local declaration of allnodes and devtree_lock
  of: move definition of of_chosen into common code.
  of: remove unused extern reference to devtree_lock
  of: put default string compare and #a/s-cell values into common header
  of/flattree: Don't assume HAVE_LMB
  of: protect linux/of.h with CONFIG_OF
  proc_devtree: fix THIS_MODULE without module.h
  of: Remove old and misplaced function declarations
  of/flattree: Make the kernel accept ePAPR style phandle information
  of/flattree: endian-convert members of boot_param_header
  of: assume big-endian properties, adding conversions where necessary
  of: use __be32 for cell value accessors
  of/flattree: use OF_ROOT_NODE_{SIZE,ADDR}_CELLS DEFAULT for fdt parsing
  of/flattree: use callback to setup initrd from /chosen
  proc_devtree: include linux/of.h
  of: make set_node_proc_entry private to proc_devtree.c
  of: include linux/proc_fs.h
  of/flattree: merge early_init_dt_scan_memory() common code
  of: add 'of_' prefix to machine_is_compatible()
  ...
2010-02-25 15:38:37 -08:00
Paul E. McKenney
7dc5215798 vfs: Apply lockdep-based checking to rcu_dereference() uses
Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
and fcheck_files().

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25 10:34:48 +01:00
Jens Axboe
7f03292ee1 Merge branch 'master' into for-2.6.34
Conflicts:
	include/linux/blkdev.h

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-02-25 08:48:05 +01:00
Steve French
d7b619cf56 [CIFS] pSesInfo->sesSem is used as mutex. Rename it to session_mutex and
convert it to a real mutex.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-25 05:36:46 +00:00
Steve French
122ca0076e [CIFS] Use unsigned ea length for clarity
Jeff correctly noted that using unsigned ea length is more intuitive.
CC: Jeff Lyaton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-24 21:56:48 +00:00
David Teigland
7fe2b3190b dlm: fix ordering of bast and cast
When both blocking and completion callbacks are queued for lock,
the dlm would always deliver the completion callback (cast) first.
In some cases the blocking callback (bast) is queued before the
cast, though, and should be delivered first.  This patch keeps
track of the order in which they were queued and delivers them
in that order.

This patch also keeps track of the granted mode in the last cast
and eliminates the following bast if the bast mode is compatible
with the preceding cast mode.  This happens when a remotely mastered
lock is demoted, e.g. EX->NL, in which case the local node queues
a cast immediately after sending the demote message.  In this way
a cast can be queued for a mode, e.g. NL, that makes an in-transit
bast extraneous.

Signed-off-by: David Teigland <teigland@redhat.com>
2010-02-24 11:46:53 -06:00
dingdinghua
23e2af3518 jbd2: clean up an assertion in jbd2_journal_commit_transaction()
commit_transaction has the same value as journal->j_running_transaction,
so we can simplify the assert statement.

Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 12:11:20 -05:00
Dmitry Monakhov
56c50f11f4 ext4: trivial quota cleanup
The patch is aimed to reorganize and simplify quota code a bit.
Quota code is itself complex enough, but we can make it more readable
in some places:
- Move quota option parsing to separate functions.
- Simplify old-quota and journaled-quota mix check.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-01 23:28:41 -05:00
Dmitry Monakhov
482a74258f ext4: mount flags manipulation cleanup
Replace intermediate EXT4_MOUNT_XXX flags manipulation to
corresponding macro.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 11:35:32 -05:00
Jiaying Zhang
c8d46e41bc ext4: Add flag to files with blocks intentionally past EOF
fallocate() may potentially instantiate blocks past EOF, depending
on the flags used when it is called.

e2fsck currently has a test for blocks past i_size, and it
sometimes trips up - noticeably on xfstests 013 which runs fsstress.

This patch from Jiayang does fix it up - it (along with
e2fsprogs updates and other patches recently from Aneesh) has
survived many fsstress runs in a row.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-24 09:52:53 -05:00
Jeff Layton
835a36ca4a cifs: set server_eof in cifs_fattr_to_inode
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 23:28:43 +00:00
Steve French
96c03bccc7 [CIFS] Minor cleanup to EA patch
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:51:43 +00:00
Jeff Layton
31c0519f7a cifs: merge CIFSSMBQueryEA with CIFSSMBQAllEAs
Add an "ea_name" parameter to CIFSSMBQAllEAs. When it's set make it
behave like CIFSSMBQueryEA does now. The current callers of
CIFSSMBQueryEA are converted to use CIFSSMBQAllEAs, and the old
CIFSSMBQueryEA function is removed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:47:32 +00:00
Jeff Layton
0cd126b504 cifs: verify lengths of QueryAllEAs reply
Make sure the lengths in a QUERY_ALL_EAS reply don't make the parser walk
off the end of the SMB.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:47:11 +00:00
Jeff Layton
e529614ad0 cifs: increase maximum buffer size in CIFSSMBQAllEAs
It's 4000 now, but there's no reason to limit it to that. We should be
able to handle a response up to CIFSMaxBufSize.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:46:49 +00:00
Jeff Layton
6e462b9f2c cifs: rename name_len to list_len in CIFSSMBQAllEAs
...for clarity and so we can reuse the name for the real name_len.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:46:27 +00:00
Jeff Layton
f0d3868b78 cifs: clean up indentation in CIFSSMBQAllEAs
Add a label that we can goto on error, and reduce some of the
if/then/else indentation in this function.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:45:52 +00:00
Jeff Layton
370b41911c cifs: add parens around smb_var in BCC macros
...to remove ambiguity about how these values are interpreted when
passing in more complex values as arguments.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-23 20:45:21 +00:00
Michael Neuling
a17e18790a fs/exec.c: fix initial stack reservation
803bf5ec25 ("fs/exec.c: restrict initial
stack space expansion to rlimit") attempts to limit the initial stack to
20*PAGE_SIZE.  Unfortunately, in attempting ensure the stack is not
reduced in size, we ended up not changing the stack at all.

This size reduction check is not necessary as the expand_stack call does
this already.

This caused a regression in UML resulting in most guest processes being
killed.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jouni Malinen <j@w1.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-22 19:50:34 -08:00
stephen hemminger
1cc523271e seq_file: add RCU versions of new hlist/list iterators (v3)
Many usages of seq_file use RCU protected lists, so non RCU
iterators will not work safely.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-22 15:45:54 -08:00
Jens Axboe
f11cbd74c5 Merge branch 'master' into for-2.6.34 2010-02-22 13:48:51 +01:00
David Howells
8f9941aecc CacheFiles: Fix a race in cachefiles_delete_object() vs rename
cachefiles_delete_object() can race with rename.  It gets the parent directory
of the object it's asked to delete, then locks it - but rename may have changed
the object's parent between the get and the completion of the lock.

However, if such a circumstance is detected, we abandon our attempt to delete
the object - since it's no longer in the index key path, it won't be seen
again by lookups of that key.  The assumption is that cachefilesd may have
culled it by renaming it to the graveyard for later destruction.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-20 10:06:35 -05:00
Jiro SEKIBA
0d561f12b4 nilfs2: add reader's lock for cno in nilfs_ioctl_sync
This adds reader's lock for the_nilfs->cno in nilfs_ioctl_sync,
for the_nilfs->cno should be proctected by segctor_sem when reading.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-20 21:18:19 +09:00
Chuck Ebbert
aeaa5ccd64 vfs: don't call ima_file_check() unconditionally in nfsd_open()
commit 1e41568d73 ("Take ima_path_check()
in nfsd past dentry_open() in nfsd_open()") moved this code back to its
original location but missed the "else".

Signed-off-by: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-20 00:47:31 -05:00
Al Viro
7fee4868be Switch proc/self to nd_set_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-19 10:25:41 -05:00
Al Viro
ac278a9c50 fix LOOKUP_FOLLOW on automount "symlinks"
Make sure that automount "symlinks" are followed regardless of LOOKUP_FOLLOW;
it should have no effect on them.

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-19 03:56:42 -05:00
Al Viro
c44dcc56d2 switch inotify_user to anon_inode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-19 03:35:12 -05:00
Jiro SEKIBA
03f29365e8 nilfs2: delete unnecessary condition in load_segment_summary
This is a trivial patch to remove unnecessary condition.

load_segment_summary() checks crc of segment_summary OR crc of whole
log data blocks based on boolean argument full_check.  However,
callers of the function pass only 1 as full_check, which means only
whole log data blocks checking code is running all the time.

This patch deletes the condition and full_check argument and also
deletes enum 'NILFS_SEG_FAIL_CHECKSUM_SEGSUM' and corresponding case
clause, for it is nolonger used anymore.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-18 20:09:03 +09:00
David S. Miller
2bb4646fce Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 2010-02-16 22:09:29 -08:00
Tejun Heo
003cb608a2 percpu: add __percpu sparse annotations to fs
Add __percpu sparse annotations to fs.

These annotations are to make sparse consider percpu variables to be
in a different address space and warn if accessed without going
through percpu accessors.  This patch doesn't affect normal builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
2010-02-17 11:17:38 +09:00
Eric W. Biederman
7c0ff870d1 sysfs: sysfs_sd_setattr set iattrs unconditionally
There is currently a bug in sysfs_sd_setattr inherited from
sysfs_setattr in 2.6.32 where the first time we set the attributes
on a sysfs file we allocate backing store but do not set the
backing store attributes.  Resulting in overly restrictive
permissions on sysfs files.

The fix is to simply modify the code so that it always executes
when we update the sysfs attributes, as we did in 2.6.31 and earlier.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Tested-by: Jean Delvare <khali@linux-fr.org>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-16 15:42:42 -08:00
Curt Wohlgemuth
73b50c1c92 ext4: Fix BUG_ON at fs/buffer.c:652 in no journal mode
Calls to ext4_handle_dirty_metadata should only pass in an inode
pointer for inode-specific metadata, and not for shared metadata
blocks such as inode table blocks, block group descriptors, the
superblock, etc.

The BUG_ON can get tripped when updating a special device (such as a
block device) that is opened (so that i_mapping is set in
fs/block_dev.c) and the file system is mounted in no journal mode.

Addresses-Google-Bug: #2404870

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-16 15:06:29 -05:00
Linus Torvalds
0813e22d4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: btrfs_mark_extent_written uses the wrong slot
2010-02-15 19:56:21 -08:00
Chuck Lever
65d269538a NFS: Too many GETATTR and ACCESS calls after direct I/O
The cached read and write paths initialize fattr->time_start in their
setup procedures.  The value of fattr->time_start is propagated to
read_cache_jiffies by nfs_update_inode().  Subsequent calls to
nfs_attribute_timeout() will then use a good time stamp when
computing the attribute cache timeout, and squelch unneeded GETATTR
calls.

Since the direct I/O paths erroneously leave the inode's
fattr->time_start field set to zero, read_cache_jiffies for that inode
is set to zero after any direct read or write operation.  This
triggers an otw GETATTR or ACCESS call to update the file's attribute
and access caches properly, even when the NFS READ or WRITE replies
have usable post-op attributes.

Make sure the direct read and write setup code performs the same fattr
initialization as the cached I/O paths to prevent unnecessary GETATTR
calls.

This was likely introduced by commit 0e574af1 in 2.6.15, which appears
to add new nfs_fattr_init() call sites in the cached read and write
paths, but not in the equivalent places in fs/nfs/direct.c.  A
subsequent commit in the same series, 33801147, introduces the
fattr->time_start field.

Interestingly, the direct write reschedule path already has a call to
nfs_fattr_init() in the right place.

Reported-by: Quentin Barnes <qbarnes@yahoo-inc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-15 19:53:43 -08:00
Linus Torvalds
0aa2ca9ae1 Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  reiserfs: Fix softlockup while waiting on an inode
2010-02-15 19:51:45 -08:00
dingdinghua
ba869023ea jbd2: delay discarding buffers in journal_unmap_buffer
Delay discarding buffers in journal_unmap_buffer until
we know that "add to orphan" operation has definitely been
committed, otherwise the log space of committing transation
may be freed and reused before truncate get committed, updates
may get lost if crash happens.

Signed-off-by: dingdinghua <dingdinghua@nrchpc.ac.cn>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 16:35:42 -05:00
Leonard Michlmayr
aca92ff6f5 ext4: correctly calculate number of blocks for fiemap
ext4_fiemap() rounds the length of the requested range down to
blocksize, which is is not the true number of blocks that cover the
requested region.  This problem is especially impressive if the user
requests only the first byte of a file: not a single extent will be
reported.

We fix this by calculating the last block of the region and then
subtract to find the number of blocks in the extents.

Signed-off-by: Leonard Michlmayr <leonard.michlmayr@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-03-04 17:07:28 -05:00
Roel Kluin
9aaab0589b ext4: add missing error checking to ext4_expand_extra_isize_ea()
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
2010-02-15 14:26:16 -05:00
Eric Sandeen
12062dddda ext4: move __func__ into a macro for ext4_warning, ext4_error
Just a pet peeve of mine; we had a mishash of calls with either __func__
or "function_name" and the latter tends to get out of sync.

I think it's easier to just hide the __func__ in a macro, and it'll
be consistent from then on.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-02-15 14:19:27 -05:00
Frederic Weisbecker
175359f89d reiserfs: Fix softlockup while waiting on an inode
When we wait for an inode through reiserfs_iget(), we hold
the reiserfs lock. And waiting for an inode may imply waiting
for its writeback. But the inode writeback path may also require
the reiserfs lock, which leads to a deadlock.

We just need to release the reiserfs lock from reiserfs_iget()
to fix this.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
Cc: Chris Mason <chris.mason@oracle.com>
2010-02-14 19:07:56 +01:00
Jeremy Kerr
7c540d9e3d proc_devtree: fix THIS_MODULE without module.h
Commit e22f628395 introduced a build
breakage for ARM devtree work: the THIS_MODULE macro was added, but we
don't have module.h

This change adds the necessary #include to get THIS_MODULE defined.
While we could just replace it with NULL (PROC_FS is a bool, not a
tristate), using THIS_MODULE will prevent unexpected breakage if we
ever do compile this as a module.

Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Michal Simek <monstr@monstr.eu>
2010-02-14 07:13:41 -07:00
Julia Lawall
d67b1b0325 fs/xfs: Correct NULL test
Test the value that was just allocated rather than the previously tested one.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r@
expression *x;
expression e;
identifier l;
@@

if (x == NULL || ...) {
    ... when forall
    return ...; }
... when != goto l;
    when != x = e
    when != &x
*x == NULL
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-02-13 13:22:53 -06:00
Ryusuke Konishi
d1c6b72a72 nilfs2: move iterator to write log into segment buffer
This moves iterator to submit write requests for a series of logs into
segbuf.c, and hides nilfs_segbuf_write() and nilfs_segbuf_wait() in
the file.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:03 +09:00
Ryusuke Konishi
e605f0a724 nilfs2: get rid of s_dirt flag use
This replaces s_dirt flag use in nilfs with a new flag added on the
nilfs object.  The s_dirt flag was used to indicate if
sop->write_super() should be called, however the current version of
nilfs does not use the callback.  Thus, it can be replaced with the
own flag.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Jiro SEKIBA <jir@unicus.jp>
2010-02-13 12:26:03 +09:00
Ryusuke Konishi
dcd7618695 nilfs2: get rid of nilfs_segctor_req struct
This will clean up nilfs_segctor_req struct and the obscure request
argument passed among private methods of segment constructor.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:03 +09:00
Jiro SEKIBA
086d1764b2 nilfs2: delete unnecessary condition in nilfs_dat_translate
This is a trivial patch to delete unnecessary condition in nilfs_dat_translate.

nilfs_dat_translate() will asign translated address to *blocknrp if blocknrp
is not NULL.  However the condition is unneeded, because all callers of
nilfs_dat_translate() pass blocknrp properly.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:03 +09:00
Ryusuke Konishi
fe5f171bb2 nilfs2: fix potential hang in nilfs_error on errors=remount-ro
nilfs_error() calls nilfs_detach_segment_constructor() if
errors=remount-ro option is specified, and this may lead to a hang due
to recursive locking of, for instance, nilfs->ns_segctor_sem and
others.

In this case, detaching segment constructor is not necessary because
read-only flag is set to the filesystem and further writes are
blocked.

This fixes the potential hang issue by removing the
nilfs_detach_segment_constructor() call from nilfs_error.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:03 +09:00
Ryusuke Konishi
7512487e6d nilfs2: use mnt_want_write in ioctls where write access is needed
A few nilfs2 ioctls need to ask for and then later release write
access to the mount in order to avoid potential write to read-only
mounts.

This adds the missing mnt_want_write and mnt_drop_write in
nilfs_ioctl_change_cpmode, nilfs_ioctl_delete_checkpoint, and
nilfs_ioctl_clean_segments.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:02 +09:00
Jiro SEKIBA
e902ec9906 nilfs2: issue discard request after cleaning segments
This adds a function to send discard requests for given array of
segment numbers, and calls the function when garbage collection
succeeded.

Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-02-13 12:26:02 +09:00
Shaohua Li
3f6fae9559 Btrfs: btrfs_mark_extent_written uses the wrong slot
My test do: fallocate a big file and do write. The file is 512M, but
after file write is done btrfs-debug-tree shows:
item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53
                extent data disk byte 1103101952 nr 536870912
                extent data offset 0 nr 399634432 ram 536870912
                extent compression 0
Looks like a regression introducted by
6c7d54ac87, where we set wrong slot.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-12 16:47:19 -05:00
Christoph Hellwig
180040b89e xfs: optimize log flushing in xfs_fsync
If we have a pinned inode it must have a log item attached to it.
Usually that log item will have ili_last_lsn already set, in which
case we only need to flush the log up to that LSN instead of doing a
full log force.  This gives speedups of about 5% in some fsync heavy
workloads.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-02-12 13:45:14 -06:00
Christoph Hellwig
87185517de xfs: only clear the suid bit once in xfs_write
file_remove_suid already calls into ->setattr to clear the suid and
sgid bits if needed, no need to start a second transaction to do it
ourselves.

Note that xfs_write_clear_setuid issues a sync transaction while the
path through ->setattr doesn't, but that is consistant with the
other filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-02-12 13:43:57 -06:00
Steven Whitehouse
07ccb7bf2c GFS2: Fix bmap allocation corner-case bug
This patch solves a corner case during allocation which occurs if both
metadata (indirect) and data blocks are required but there is an
obstacle in the filesystem (e.g. a resource group header or another
allocated block) such that when the allocation is requested only
enough blocks for the metadata are returned.

By changing the exit condition of this loop, we ensure that a
minimum of one data block will always be returned.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-12 10:16:14 +00:00
Abhijith Das
0e5a9fb042 GFS2: Fix error code
We need this one-liner to signal the mount helper of the 'insufficient journals' condition.

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-12 10:15:51 +00:00
Linus Torvalds
efa82bab8e Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Fix the mapping of the NFSERR_SERVERFAULT error
  NFS: Remove a redundant check for PageFsCache in nfs_migrate_page()
  NFS: Fix a bug in nfs_fscache_release_page()
2010-02-11 14:06:28 -08:00
Linus Torvalds
fd48d6c888 Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6:
  [SCSI] qla2xxx: Obtain proper host structure during response-queue processing.
  [SCSI] compat_ioct: fix bsg SG_IO
  [SCSI] qla2xxx: make msix interrupt handler safe for irq
  [SCSI] zfcp: Report FC BSG errors in correct field
  [SCSI] mptfusion : mptscsih_abort return value should be SUCCESS instead of value 0.
2010-02-11 14:05:55 -08:00
Michael Neuling
803bf5ec25 fs/exec.c: restrict initial stack space expansion to rlimit
When reserving stack space for a new process, make sure we're not
attempting to expand the stack by more than rlimit allows.

This fixes a bug caused by b6a2fea393 ("mm:
variable length argument support") and unmasked by
fc63cf2370 ("exec: setup_arg_pages() fails
to return errors").

This bug means that when limiting the stack to less the 20*PAGE_SIZE (eg.
80K on 4K pages or 'ulimit -s 79') all processes will be killed before
they start.  This is particularly bad with 64K pages, where a ulimit below
1280K will kill every process.

To test, do:

  'ulimit -s 15; ls'

before and after the patch is applied.  Before it's applied, 'ls' should
be killed.  After the patch is applied, 'ls' should no longer be killed.

A stack limit of 15KB since it's small enough to trigger 20*PAGE_SIZE.
Also 15KB not a multiple of PAGE_SIZE, which is a trickier case to handle
correctly with this code.

4K pages should be fine to test with.

[kosaki.motohiro@jp.fujitsu.com: cleanup]
[akpm@linux-foundation.org: cleanup cleanup]
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-11 13:59:43 -08:00
Andreas Schwab
4cfbafd33f compat_ioctl: add compat handler for TIOCGSID ioctl
This is used by tcgetsid(3).

Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-11 13:59:42 -08:00
Li Zefan
66655de6d1 seq_file: Add helpers for iteration over a hlist
Some places in kernel need to iterate over a hlist in seq_file,
so provide some common helpers.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-10 11:12:06 -08:00
Arnd Bergmann
f79f118528 compat_ioctl: ignore RAID_VERSION ioctl
md ioctls are now handled by the md driver itself, but mdadm
may call RAID_VERSION on other devices as well. Mark the command
as IGNORE_IOCTL so this fails silently rather than printing
an annoying message.

Reported-by: "Michael S. Tsirkin" <m.s.tsirkin@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-10 07:36:16 -08:00
Linus Torvalds
5551638acb Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix dentry hash calculation for case-insensitive mounts
  [CIFS] Don't cache timestamps on utimes due to coarse granularity
  [CIFS] Maximum username length check in session setup does not match
  cifs: fix length calculation for converted unicode readdir names
  [CIFS] Add support for TCP_NODELAY
2010-02-10 07:16:44 -08:00
Chuck Lever
f895c53f8a NFS: Make close(2) asynchronous when closing NFS O_DIRECT files
For NFSv2 and v3:

O_DIRECT writes are always synchronous, and aren't cached, so nothing
should be flushed when closing an NFS O_DIRECT file descriptor.  Thus
there are no write errors to report on close(2).

In addition, there's no cached data to verify on the next open(2),
so we don't need clean GETATTR results at close time to compare with.

Thus, there's no need for the nfs_revalidate_inode() call when closing
an NFS O_DIRECT file.  This reduces the number of synchronous
on-the-wire requests for a simple open-write-close of an NFS O_DIRECT
file by roughly 20%.

For NFSv4:

Call nfs4_do_close() with wait set to zero when closing an NFS
O_DIRECT file.  The CLOSE will go on the wire, but the application
won't wait for it to complete.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:31:05 -05:00
Chuck Lever
7e381172cf NFS: Improve NFS iostat byte count accuracy for writes
The bytes counted by the performance counters for NFS writes should
reflect write and sync errors.  If the write(2) system call reports
an error, the bytes should not be counted.  And, if the write is
short, the actual number of bytes that was written should be counted,
not the number of bytes that was requested.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:31:04 -05:00
Chuck Lever
aa2f1ef10e NFS: Account for NFS bytes read via the splice API
Bytes read via the splice API should be accounted for in the NFS
performance statistics.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:31:03 -05:00
Chuck Lever
4184dcf2db NFS: Fix byte accounting for generic NFS reads
Currently, the NFS I/O counters count the number of bytes requested
by applications, rather than the number of bytes actually read by the
system calls.

The number of bytes requested for reads is actually not that useful,
because the value is usually a buffer size for reads.  That is, that
requested number is usually a maximum, and frequently doesn't reflect
the actual number of bytes read.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:31:03 -05:00
Chuck Lever
c2459dc462 NFS: Proper accounting for NFS VFS calls
Nit: The VFSOPEN and VFSFLUSH counters are function call counters.
Count every call to these routines.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:31:02 -05:00
Andy Adamson
9733f0d928 nfs41: cleanup callback code to use __be32 type
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:31:01 -05:00
Andy Adamson
41f54a5548 nfs41: clear NFS4CLNT_RECALL_SLOT bit on session reset
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:31:00 -05:00
Andy Adamson
bae0ac0ee1 nfs41: fix nfs4_callback_recallslot
Return NFS4_OK if target high slotid equals enforced high slotid.
Fix nfs_client reference leak.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:31:00 -05:00
Andy Adamson
104aeba484 nfs41: resize slot table in reset
When session is reset, client can renegotiate slot table size.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:59 -05:00
Andy Adamson
b9efa1b27e nfs41: implement cb_recall_slot
Drain the fore channel and reset the max_slots to the new value.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:59 -05:00
Andy Adamson
4911096f1a nfs41: back channel drc minimal implementation
For now the back channel ca_maxresponsesize_cached is 0 and there is no
backchannel DRC. Return NFS4ERR_REP_TOO_BIG_TO_CACHE when a cb_sequence
cachethis is true.  When it is false, return NFS4ERR_RETRY_UNCACHED_REP as the
next operation error.

Remember the replay error accross compound operation processing.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:58 -05:00
Andy Adamson
b2f28bd783 nfs41: prepare for back channel drc
Make all cb_sequence arguments available to verify_seqid which will make
replay decisions.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:58 -05:00
Andy Adamson
e95e60daee nfs41: remove uneeded checks in callback processing
All callback operations have arguments to decode and require processing.
The preprocess_nfs4X_op functions catch unsupported or illegal ops so
decode_args and process_op pointers are always non NULL.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:57 -05:00
Andy Adamson
b92b301900 nfs41: directly encode back channel error
Skip all other processing when error is encountered.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:56 -05:00
Andy Adamson
31d2b4356b nfs41: fix wrong error on callback header xdr overflow
Set NFS4ERR_RESOURCE as CB_COMPOUND status and do not return an op on
decode_op_hdr or encode_op_hdr buffer overflow.

NFS4ERR_RESOURCE is correct for v4.0. Will fix the return for v4.1 along with
all the other NFS4ERR_RESOURCE errors in a later patch.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:56 -05:00
Mike Sager
72ce2b3c06 nfs41: Process callback's referring call list
If a CB_SEQUENCE referring call triple matches a slot table entry, the
client is still waiting for a response to the original request.  In this
case, return NFS4ERR_DELAY as the response to the callback.

Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:55 -05:00
Mike Sager
a7989c3e47 nfs41: Check slot table for referring calls
Traverse a list of referring calls and look for a session/slot/seq number
match.

Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:55 -05:00
Mike Sager
8e0d46e138 nfs41: Adjust max cache response size value
For the CREATE_SESSION attribute ca_maxresponsesize_cached, calculate
the value based on the rpc reply header size plus the maximum nfs compound
reply size.

Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:54 -05:00
Jeff Layton
97cefcc6d0 nfs: handle NFSv2 -EKEYEXPIRED returns from RPC layer appropriately
Add a wrapper around rpc_call_sync that handles -EKEYEXPIRED errors from
the RPC layer as it would an -EJUKEBOX error if NFSv2 had such a thing.
Also, add a handler for that error for async calls that makes it
resubmit the RPC on -EKEYEXPIRED.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:52 -05:00
Jeff Layton
b68d69b8c6 nfs: handle NFSv3 -EKEYEXPIRED errors as we would -EJUKEBOX
We're using -EKEYEXPIRED to indicate that a krb5 credcache contains an
expired ticket and that we should have the NFS layer retry the RPC call
instead of returning an error back to the caller. Handle this as we
would an -EJUKEBOX error return.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:51 -05:00
Jeff Layton
2c6434888c nfs4: handle -EKEYEXPIRED errors from RPC layer
If a KRB5 TGT ticket expires, we don't want to return an error
immediatel. If someone has a long running job and just forgets to run
"kinit" in time then this will make it fail.

Instead, we want to treat this situation as we would NFS4ERR_DELAY and
retry the upcall after delaying a bit with an exponential backoff.

This patch just makes any place that would handle NFS4ERR_DELAY also
handle -EKEYEXPIRED the same way. In the future, we may want to be more
sophisticated however and handle hard vs. soft mounts differently, or
specify some upper limit on how long we'll wait for a new TGT to be
acquired.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-10 08:30:50 -05:00
Trond Myklebust
fdcb45777a NFS: Fix the mapping of the NFSERR_SERVERFAULT error
It was recently pointed out that the NFSERR_SERVERFAULT error, which is
designed to inform the user of a serious internal error on the server, was
being mapped to an error value that is internal to the kernel.

This patch maps it to the error EREMOTEIO, which is exported to userland
through errno.h.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-02-09 14:29:29 -05:00
Trond Myklebust
7549ad5f9b NFS: Remove a redundant check for PageFsCache in nfs_migrate_page()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: David Howells <dhowells@redhat.com>
2010-02-09 14:29:21 -05:00
Trond Myklebust
2c1740098c NFS: Fix a bug in nfs_fscache_release_page()
Not having an fscache cookie is perfectly valid if the user didn't mount
with the fscache option.

This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=15234

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: stable@kernel.org
2010-02-09 14:29:10 -05:00
Linus Torvalds
3af9cf11b6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9p: fix p9_client_destroy unconditional calling v9fs_put_trans
  9p: fix memory leak in v9fs_parse_options()
  9p: Fix the kernel crash on a failed mount
  9p: fix option parsing
  9p: Include fsync support for 9p client
  net/9p: fix statsize inside twstat
  net/9p: fail when user specifies a transport which we can't find
  net/9p: fix virtio transport to correctly update status on connect
2010-02-09 11:19:06 -08:00
Jeremy Kerr
50ab2fe147 proc_devtree: include linux/of.h
Currenly, proc_devtree.c depends on asm/prom.h to include linux/of.h, to
provide some device-tree definitions (eg, struct property).

Instead, include linux/of.h directly. We still need asm/prom.h for
HAVE_ARCH_DEVTREE_FIXUPS.

Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-02-09 08:34:10 -07:00
Jeremy Kerr
8cfb3343f7 of: make set_node_proc_entry private to proc_devtree.c
We only need set_node_proc_entry in proc_devtree.c, so move it there.

This fixes the !HAVE_ARCH_DEVTREE_FIXUPS build, as we can't make make
the definition in linux/of.h conditional on this #define (definitions in
asm/prom.h can't be exposed to linux/of.h, due to the enforced #include
ordering).

Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-02-09 08:34:10 -07:00
Linus Torvalds
deb0c98c7f Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.33' of git://linux-nfs.org/~bfields/linux:
  Revert "nfsd4: fix error return when pseudoroot missing"
2010-02-08 17:08:01 -08:00
Linus Torvalds
a5f28ae4df Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2/cluster: Make o2net connect messages KERN_NOTICE
  ocfs2/dlm: Fix printing of lockname
  ocfs2: Fix contiguousness check in ocfs2_try_to_merge_extent_map()
  ocfs2/dlm: Remove BUG_ON in dlm recovery when freeing locks of a dead node
  ocfs2: Plugs race between the dc thread and an unlock ast message
  ocfs2: Remove overzealous BUG_ON during blocked lock processing
  ocfs2: Do not downconvert if the lock level is already compatible
  ocfs2: Prevent a livelock in dlmglue
  ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast
  ocfs2: Use compat_ptr in reflink_arguments.
  ocfs2/dlm: Handle EAGAIN for compatibility - v2
  ocfs2: Add parenthesis to wrap the check for O_DIRECT.
  ocfs2: Only bug out when page size is larger than cluster size.
  ocfs2: Fix memory overflow in cow_by_page.
  ocfs2/dlm: Print more messages during lock migration
  ocfs2/dlm: Ignore LVBs of locks in the Blocked list
  ocfs2/trivial: Remove trailing whitespaces
  ocfs2: fix a misleading variable name
  ocfs2: Sync max_inline_data_with_xattr from tools.
  ocfs2: Fix refcnt leak on ocfs2_fast_follow_link() error path
2010-02-08 16:05:50 -08:00
Eric Van Hensbergen
bf2d29c64d 9p: fix memory leak in v9fs_parse_options()
If match_strdup() fail this function exits without freeing the options string.

Signed-off-by: Venkateswararao Jujjuri <jvrao@us.ibm.com>
Sigend-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-02-08 17:59:34 -06:00
Eric Van Hensbergen
d8c8a9e365 9p: fix option parsing
Options pointer is being moved before calling kfree() which seems
to cause problems.  This uses a separate pointer to track and free
original allocation.

Signed-off-by: Venkateswararao Jujjuri <jvrao@us.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>w
2010-02-08 16:23:23 -06:00
M. Mohan Kumar
7a4439c406 9p: Include fsync support for 9p client
Implement the fsync in the client side by marking stat field values to 'don't touch' so that server may 
interpret it as a request to guarantee that the contents of the associated file are committed to stable 
storage before the Rwstat message is returned.

Without this patch, calling fsync on a 9p file results in "Invalid argument" error. Please check the attached 
C program.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> 
Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com> 
Acked-by: Venkateswararao Jujjuri (JV) <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-02-08 15:36:48 -06:00
Sunil Mushran
6efd806634 ocfs2/cluster: Make o2net connect messages KERN_NOTICE
Connect and disconnect messages are more than informational as they are required
during root cause analysis for failures. This patch changes them from KERN_INFO
to KERN_NOTICE.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Faseh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-08 13:02:28 -08:00
Sunil Mushran
86a06abab0 ocfs2/dlm: Fix printing of lockname
The debug call printing the name of the lock resource was chopping
off the last character. This patch fixes the problem.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-08 13:01:31 -08:00
J. Bruce Fields
260c64d235 Revert "nfsd4: fix error return when pseudoroot missing"
Commit f39bde24b2 fixed the error return from PUTROOTFH in the
case where there is no pseudofilesystem.

This is really a case we shouldn't hit on a correctly configured server:
in the absence of a root filehandle, there's no point accepting version
4 NFS rpc calls at all.

But the shared responsibility between kernel and userspace here means
the kernel on its own can't eliminate the possiblity of this happening.
And we have indeed gotten this wrong in distro's, so new client-side
mount code that attempts to negotiate v4 by default first has to work
around this case.

Therefore when commit f39bde24b2 arrived at roughly the same
time as the new v4-default mount code, which explicitly checked only for
the previous error, the result was previously fine mounts suddenly
failing.

We'll fix both sides for now: revert the error change, and make the
client-side mount workaround more robust.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-02-08 15:25:23 -05:00
FUJITA Tomonori
84eb8fb42c [SCSI] compat_ioct: fix bsg SG_IO
bsg's SG_IO doesn't work on 32-bit userspace and 64-bit kernelspace.

The problem is that both sg and bsg drivers use SG_IO
ioctl. sg_ioctl_trans() does 32/64-bit conversion even against bsg
header. It messes up bsg header. bsg driver gets garbage.

This patch fixes sg_ioctl_trans to handle only sg header (struct
sg_io_hdr).

Reported-by: Giridhar Malavali <giridhar.malavali@qlogic.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2010-02-08 13:43:18 -06:00
Al Viro
cccc6bba3f Lose the first argument of audit_inode_child()
it's always equal to ->d_name.name of the second argument

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-08 14:38:36 -05:00
Al Viro
123df2944c Lose the new_name argument of fsnotify_move()
it's always new_dentry->d_name.name

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-08 14:38:36 -05:00
Jeff Layton
05507fa2ac cifs: fix dentry hash calculation for case-insensitive mounts
case-insensitive mounts shouldn't use full_name_hash(). Make sure we
use the parent dentry's d_hash routine when one is set.

Reported-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-08 17:52:34 +00:00
Steve French
ccd4bb1beb [CIFS] Don't cache timestamps on utimes due to coarse granularity
force revalidate of the file when any of the timestamps are set since
some filesytem types do not have finer granularity timestamps and
we can not always detect which file systems round timestamps down
to determine whether we can cache the mtime on setattr
samba bugzilla 3775

Acked-by: Shirish Pargaonkar <sharishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-08 17:39:58 +00:00
Linus Torvalds
6339204ecc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  Take ima_file_free() to proper place.
  ima: rename PATH_CHECK to FILE_CHECK
  ima: rename ima_path_check to ima_file_check
  ima: initialize ima before inodes can be allocated
  fix ima breakage
  Take ima_path_check() in nfsd past dentry_open() in nfsd_open()
  freeze_bdev: don't deactivate successfully frozen MS_RDONLY sb
  befs: fix leak
2010-02-07 11:18:28 -08:00
Linus Torvalds
80e1e82398 Fix race in tty_fasync() properly
This reverts commit 7036251180 ("tty: fix race in tty_fasync") and
commit b04da8bfdf ("fnctl: f_modown should call write_lock_irqsave/
restore") that tried to fix up some of the fallout but was incomplete.

It turns out that we really cannot hold 'tty->ctrl_lock' over calling
__f_setown, because not only did that cause problems with interrupt
disables (which the second commit fixed), it also causes a potential
ABBA deadlock due to lock ordering.

Thanks to Tetsuo Handa for following up on the issue, and running
lockdep to show the problem.  It goes roughly like this:

 - f_getown gets filp->f_owner.lock for reading without interrupts
   disabled, so an interrupt that happens while that lock is held can
   cause a lockdep chain from f_owner.lock -> sighand->siglock.

 - at the same time, the tty->ctrl_lock -> f_owner.lock chain that
   commit 7036251180 introduced, together with the pre-existing
   sighand->siglock -> tty->ctrl_lock chain means that we have a lock
   dependency the other way too.

So instead of extending tty->ctrl_lock over the whole __f_setown() call,
we now just take a reference to the 'pid' structure while holding the
lock, and then release it after having done the __f_setown.  That still
guarantees that 'struct pid' won't go away from under us, which is all
we really ever needed.

Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Américo Wang <xiyou.wangcong@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-07 10:26:01 -08:00
Al Viro
89068c576b Take ima_file_free() to proper place.
Hooks: Just Say No.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-07 03:07:29 -05:00
Mimi Zohar
9bbb6cad01 ima: rename ima_path_check to ima_file_check
ima_path_check actually deals with files!  call it ima_file_check instead.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-07 03:06:22 -05:00
Mimi Zohar
8eb988c70e fix ima breakage
The "Untangling ima mess, part 2 with counters" patch messed
up the counters.  Based on conversations with Al Viro, this patch
streamlines ima_path_check() by removing the counter maintaince.
The counters are now updated independently, from measuring the file,
in __dentry_open() and alloc_file() by calling ima_counts_get().
ima_path_check() is called from nfsd and do_filp_open().
It also did not measure all files that should have been measured.
Reason: ima_path_check() got bogus value passed as mask.
[AV: mea culpa]
[AV: add missing nfsd bits]

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-07 03:06:22 -05:00
Al Viro
1e41568d73 Take ima_path_check() in nfsd past dentry_open() in nfsd_open()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-07 03:06:22 -05:00
Jun'ichi Nomura
4b06e5b9ad freeze_bdev: don't deactivate successfully frozen MS_RDONLY sb
Thanks Thomas and Christoph for testing and review.
I removed 'smp_wmb()' before up_write from the previous patch,
since up_write() should have necessary ordering constraints.
(I.e. the change of s_frozen is visible to others after up_write)
I'm quite sure the change is harmless but if you are uncomfortable
with Tested-by/Reviewed-by on the modified patch, please remove them.

If MS_RDONLY, freeze_bdev should just up_write(s_umount) instead of
deactivate_locked_super().
Also, keep sb->s_frozen consistent so that remount can check the frozen state.

Otherwise a crash reported here can happen:
http://lkml.org/lkml/2010/1/16/37
http://lkml.org/lkml/2010/1/28/53

This patch should be applied for 2.6.32 stable series, too.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Thomas Backlund <tmb@mandriva.org>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-07 03:06:21 -05:00
Al Viro
8dd5ca532c befs: fix leak
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-07 03:06:21 -05:00
Steve French
301a6a3177 [CIFS] Maximum username length check in session setup does not match
Fix length check reported by D. Binderman (see below)

d binderman <dcb314@hotmail.com> wrote:
>
> I just ran the sourceforge tool cppcheck over the source code of the
> new Linux kernel 2.6.33-rc6
>
> It said
>
> [./cifs/sess.c:250]: (error) Buffer access out-of-bounds

May turn out to be harmless, but best to be safe. Note max
username length is defined to 32 due to Linux (Windows
maximum is 20).

Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-06 07:08:53 +00:00
Jeff Layton
f12f98dba6 cifs: fix length calculation for converted unicode readdir names
cifs_from_ucs2 returns the length of the converted name, including the
length of the NULL terminator. We don't want to include the NULL
terminator in the dentry name length however since that'll throw off the
hash calculation for the dentry cache.

I believe that this is the root cause of several problems that have
cropped up recently that seem to be papered over with the "noserverino"
mount option. More confirmation of that would be good, but this is
clearly a bug and it fixes at least one reproducible problem that
was reported.

This patch fixes at least this reproducer in this kernel.org bug:

    http://bugzilla.kernel.org/show_bug.cgi?id=15088#c12

Reported-by: Bjorn Tore Sund <bjorn.sund@it.uib.no>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-02-06 06:25:16 +00:00
Roel Kluin
bd6b0bf87d ocfs2: Fix contiguousness check in ocfs2_try_to_merge_extent_map()
The wrong member was compared in the continguousness check.

Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-05 15:06:21 -08:00
James Bottomley
73c77e2ccc xfs: fix xfs to work with Virtually Indexed architectures
xfs_buf.c includes what is essentially a hand rolled version of
blk_rq_map_kern().  In order to work properly with the vmalloc buffers
that xfs uses, this hand rolled routine must also implement the flushing
API for vmap/vmalloc areas.

[style updates from hch@lst.de]
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
2010-02-05 12:32:35 -06:00
Linus Torvalds
adbfbcd12a Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: apply updated fallocate i_size fix
  Btrfs: do not try and lookup the file extent when finishing ordered io
  Btrfs: Fix oopsen when dropping empty tree.
  Btrfs: remove BUG_ON() due to mounting bad filesystem
  Btrfs: make error return negative in btrfs_sync_file()
  Btrfs: fix race between allocate and release extent buffer.
2010-02-05 07:23:03 -08:00
Fang Wenqi
b2d82ee3c8 fuse: fix large stack use
gcc 4.4 warns about:
  fs/fuse/dev.c: In function ‘fuse_notify_inval_entry’:
  fs/fuse/dev.c:925: warning: the frame size of 1060 bytes is larger than 1024 bytes

The problem is we declare two structures and a large array on the stack,
I move the array alway from the stack and allocate memory for it dynamically.

Signed-off-by: Fang Wenqi <antonf@turbolinux.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-02-05 12:08:31 +01:00
Miklos Szeredi
b21dda438b fuse: cleanup in fuse_notify_inval_...()
Small cleanup in fuse_notify_inval_inode() and
fuse_notify_inval_entry().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-02-05 12:08:31 +01:00
Linus Torvalds
a9861b5037 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Don't clobber the attribute type in nfs_update_inode()
  NFS: Fix a umount race
  NFS: Fix an Oops when truncating a file
  NFS: Ensure that we handle NFS4ERR_STALE_STATEID correctly
  NFSv4.1: Don't call nfs4_schedule_state_recovery() unnecessarily
  NFSv4: Don't allow posix locking against servers that don't support it
  NFSv4: Ensure that the NFSv4 locking can recover from stateid errors
  NFS: Avoid warnings when CONFIG_NFS_V4=n
  NFS: Make nfs_commitdata_release static
  NFS: Try to commit unstable writes in nfs_release_page()
  NFS: Fix a reference leak in nfs_wb_cancel_page()
2010-02-04 16:08:15 -08:00
Linus Torvalds
a3a71ca9a7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Extend umount wait coverage to full glock lifetime
  GFS2: Wait for unlock completion on umount
2010-02-04 16:06:48 -08:00
Aneesh Kumar K.V
23b5c50945 Btrfs: apply updated fallocate i_size fix
This version of the i_size fix for fallocate makes sure we only update
the i_size when the current fallocate is really operating outside of
i_size.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:33:03 -05:00
Josef Bacik
efd049fb26 Btrfs: do not try and lookup the file extent when finishing ordered io
When running the following fio job

[torrent]
filename=torrent-test
rw=randwrite
size=4g
filesize=4g
bs=4k
ioengine=sync

you would see long stalls where no work was being done.  That is because we were
doing all this extra work to read in the file extent outside of the transaction,
however in the random io case this ends up hurting us because the file extents
are not there to begin with.  So axe this logic, since we end up reading in the
file extent when we go to update it anyway.  This took the fio job from 11 mb/s
with several ~10 second stalls to 24 mb/s to a couple of 1-2 second stalls.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:45 -05:00
Yan, Zheng
7a7965f83e Btrfs: Fix oopsen when dropping empty tree.
When dropping a empty tree, walk_down_tree() skips checking
extent information for the tree root. This will triggers a
BUG_ON in walk_up_proc().

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:45 -05:00
Miao Xie
d7ce5843bb Btrfs: remove BUG_ON() due to mounting bad filesystem
Mounting a bad filesystem caused a BUG_ON(). The following is steps to
reproduce it.
 # mkfs.btrfs /dev/sda2
 # mount /dev/sda2 /mnt
 # mkfs.btrfs /dev/sda1 /dev/sda2
 (the program says that /dev/sda2 was mounted, and then exits. )
 # umount /mnt
 # mount /dev/sda1 /mnt

At the third step, mkfs.btrfs exited in the way of make filesystem. So the
initialization of the filesystem didn't finish. So the filesystem was bad, and
it caused BUG_ON() when mounting it. But BUG_ON() should be called by the wrong
code, not user's operation, so I think it is a bug of btrfs.

This patch fixes it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:44 -05:00
Roel Kluin
014e4ac4f7 Btrfs: make error return negative in btrfs_sync_file()
It appears the error return should be negative

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:44 -05:00
Yan, Zheng
f044ba7835 Btrfs: fix race between allocate and release extent buffer.
Increase extent buffer's reference count while holding the lock.
Otherwise it can race with try_release_extent_buffer.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-02-04 11:31:44 -05:00
Kees Cook
d78ca3cd73 syslog: use defined constants instead of raw numbers
Right now the syslog "type" action are just raw numbers which makes
the source difficult to follow.  This patch replaces the raw numbers
with defined constants for some level of sanity.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-02-04 14:20:41 +11:00
Kees Cook
002345925e syslog: distinguish between /proc/kmsg and syscalls
This allows the LSM to distinguish between syslog functions originating
from /proc/kmsg access and direct syscalls.  By default, the commoncaps
will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg
file descriptor.  For example the kernel syslog reader can now drop
privileges after opening /proc/kmsg, instead of staying privileged with
CAP_SYS_ADMIN.  MAC systems that implement security_syslog have unchanged
behavior.

Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: James Morris <jmorris@namei.org>
2010-02-04 14:20:12 +11:00
Sunil Mushran
cda70ba8c0 ocfs2/dlm: Remove BUG_ON in dlm recovery when freeing locks of a dead node
During recovery, the dlm frees the locks for the dead node. If it finds a
lock in a resource for the dead node, it expects that node to also have a
ref in that lock resource. If not, it BUGs.

ossbz#1175 was filed with the above BUG. Now, while it is correct that we
should be expecting the ref, I see no reason why we have to BUG. After all,
we are freeing up the lock and clearing the ref.

This patch replaces the BUG_ON with a printk(). Hopefully, that will give
us more clues next time this happens.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1175

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-03 17:51:41 -08:00
Sunil Mushran
079b805782 ocfs2: Plugs race between the dc thread and an unlock ast message
This patch plugs a race between the downconvert thread and an unlock ast message.
Specifically, after the downconvert worker has done its task, the dc thread needs
to check whether an unlock ast made the downconvert moot.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@sus.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-03 17:26:03 -08:00
Dave Chinner
5322892d86 xfs: kill xfs_bawrite
There are no more users of this function left in the XFS code
now that we've switched everything to delayed write flushing.
Remove it.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-02-04 10:09:14 +11:00
Christoph Hellwig
07fec73625 xfs: log changed inodes instead of writing them synchronously
When an inode has already be flushed delayed write,
xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
return on a synchronous inode write without having written the
inode. Currently these sycnhronous writes only come sync(1),
unmount, a sycnhronous NFS export and cachefiles so should be
relatively rare and out of common performance paths.

Realistically, a synchronous inode write is not necessary here; we
can avoid writing the inode by logging any non-transactional changes
that are pending.  This needs to be done with synchronous
transactions, but it avoids seeking between the log and inode
clusters as we do now. We don't force the log if the inode is
pinned, though, so this differs from the fsync case.  For normal
sys_sync and unmount behaviour this is fine because we do a
synchronous log force in xfs_sync_data which is called from the
->sync_fs code.

It does however break the NFS synchronous export guarantees for now,
but work is under way to fix this at a higher level or for the
higher level to provide an additional flag in the writeback control
to tell us that a log force is needed.

Portions of this patch are based on work from Dave Chinner.

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
2010-02-09 11:43:49 +11:00
Linus Torvalds
c1c0cbb878 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix potential leak of dirty data on umount
2010-02-03 08:47:15 -08:00
Trond Myklebust
9b4b351346 NFS: Don't clobber the attribute type in nfs_update_inode()
If the NFS_ATTR_FATTR_TYPE field isn't set in fattr->valid, then we should
not set the S_IFMT part of inode->i_mode.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-02-03 08:27:35 -05:00
Trond Myklebust
387c149b54 NFS: Fix a umount race
Ensure that we unregister the bdi before kill_anon_super() calls
ida_remove() on our device name.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-02-03 08:27:35 -05:00
Trond Myklebust
9f557cd807 NFS: Fix an Oops when truncating a file
The VM/VFS does not allow mapping->a_ops->invalidatepage() to fail.
Unfortunately, nfs_wb_page_cancel() may fail if a fatal signal occurs.
Since the NFS code assumes that the page stays mapped for as long as the
writeback is active, we can end up Oopsing (among other things).

The only safe fix here is to convert nfs_wait_on_request(), so as to make
it uninterruptible (as is already the case with wait_on_page_writeback()).


Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-02-03 08:27:22 -05:00
Steven Whitehouse
8f05228ee7 GFS2: Extend umount wait coverage to full glock lifetime
Although all glocks are, by the time of the umount glock wait,
scheduled for demotion, some of them haven't made it far
enough through the process for the original set of waiting
code to wait for them.

This extends the ref count to the whole glock lifetime in order
to ensure that the waiting does catch all glocks. It does make
it a bit more invasive, but it seems the only sensible solution
at the moment.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-03 09:56:21 +00:00
Steven Whitehouse
e402746a94 GFS2: Wait for unlock completion on umount
This patch adds a wait on umount between the point at which we
dispose of all glocks and the point at which we unmount the
lock protocol. This ensures that we've received all the replies
to our unlock requests before we stop the locking.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Fabio M. Di Nitto <fdinitto@redhat.com>
2010-02-03 09:47:04 +00:00
Sunil Mushran
db0f6ce697 ocfs2: Remove overzealous BUG_ON during blocked lock processing
During blocked lock processing, we should consider the possibility that the
lock is no longer blocking.

Joel Becker <joel.becker@oracle.com> assisted in fixing this issue.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:51:16 -08:00
Sunil Mushran
0d74125a6a ocfs2: Do not downconvert if the lock level is already compatible
During upconvert, if the master were to send a BAST, dlmglue will detect the
upconversion in process and send a cancel convert to the master. Upon receiving
the AST for the cancel convert, it will re-process the lock resource to determine
whether it needs downconverting. Say, the up was from PR to EX and the BAST was
for EX. After the cancel convert, it will need to downconvert to NL.

However, if the node was originally upconverting from NL to EX, then there would
be no reason to downconvert (assuming the same message sequence).

This patch makes dlmglue consider the possibility that the current lock level
is already compatible and that downconverting is not required.

Joel Becker <joel.becker@oracle.com> assisted in fixing this issue.

Fixes ossbz#1178
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1178

Reported-by: Coly Li <coly.li@suse.de>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:51:14 -08:00
Sunil Mushran
a191282601 ocfs2: Prevent a livelock in dlmglue
There is possibility of a livelock in __ocfs2_cluster_lock(). If a node were
to get an ast for an upconvert request, followed immediately by a bast,
there is a small window where the fs may downconvert the lock before the
process requesting the upconvert is able to take the lock.

This patch adds a new flag to indicate that the upconvert is still in
progress and that the dc thread should not downconvert it right now.

Wengang Wang <wen.gang.wang@oracle.com> and Joel Becker
<joel.becker@oracle.com> contributed heavily to this patch.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:51:13 -08:00
Wengang Wang
0b94a909eb ocfs2: Fix setting of OCFS2_LOCK_BLOCKED during bast
During bast, set the OCFS2_LOCK_BLOCKED flag only if the lock needs to
downconverted.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Sunil Mushran <sunil.mushran@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:50:55 -08:00
Tao Ma
34e6c59af0 ocfs2: Use compat_ptr in reflink_arguments.
Although we use u64 to pass userspace pointers to the kernel
to avoid compat_ioctl, it doesn't work in some ppc platform.
So wrap them with compat_ptr and add compat_ioctl.

The detailed discussion about compat_ptr can be found in thread
http://lkml.org/lkml/2009/10/27/423.

We indeed met with a bug when testing on ppc(-EFAULT is returned
when using old_path). This patch try to fix this.
I have tested in ppc64(with 32 bit reflink) and x86_64(with i686
reflink), both works.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:56:37 -08:00
Sunil Mushran
cd34edd8cf ocfs2/dlm: Handle EAGAIN for compatibility - v2
Mainline commit aad1b15310 made the
dlm_begin_reco_handler() return -EAGAIN instead of EAGAIN.

As this error is transmitted over the wire, we want the receiver,
dlm_send_begin_reco_message(), to understand both the older EAGAIN and
the newer -EAGAIN, to allow rolling upgrade of the cluster nodes.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:56:34 -08:00
Tao Ma
60c486744c ocfs2: Add parenthesis to wrap the check for O_DIRECT.
Add parenthesis to wrap the check for O_DIRECT.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:15:37 -08:00
Tao Ma
0a1ea437d8 ocfs2: Only bug out when page size is larger than cluster size.
In CoW, we have to make sure that the page is already written
out to the disk. So we have a BUG_ON(PageDirty(page)).

In ppc platform we have pagesize=64K, so if the cs=4K, if the
file have fragmented clusters, we will map the page many times.
See this file as an example.
Tree Depth: 0   Count: 19   Next Free Rec: 14
	## Offset        Clusters       Block#          Flags
	0  0             4              2164864         0x2 Refcounted
	1  4             2              9302792         0x2 Refcounted
...

We have to replace the extent recs one by one, so the page with index 0
will be mapped and dirtied twice.

I'd like to leave the BUG_ON there while adding a check so that in
case we meet with an error in other platforms, we can find it easily.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:15:35 -08:00
Tao Ma
d622b89a2f ocfs2: Fix memory overflow in cow_by_page.
In ocfs2_duplicate_clusters_by_page, we calculate map_end
by shifting page_index. But actually in case we meet with
a large offset(say in a i686 box, poff_t is only 32 bits
and page_index=2056240), we will overflow. So change the
type of page_index to loff_t.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 18:14:20 -08:00
anfei zhou
931e80e4b3 mm: flush dcache before writing into page to avoid alias
The cache alias problem will happen if the changes of user shared mapping
is not flushed before copying, then user and kernel mapping may be mapped
into two different cache line, it is impossible to guarantee the coherence
after iov_iter_copy_from_user_atomic.  So the right steps should be:

	flush_dcache_page(page);
	kmap_atomic(page);
	write to page;
	kunmap_atomic(page);
	flush_dcache_page(page);

More precisely, we might create two new APIs flush_dcache_user_page and
flush_dcache_kern_page to replace the two flush_dcache_page accordingly.

Here is a snippet tested on omap2430 with VIPT cache, and I think it is
not ARM-specific:

	int val = 0x11111111;
	fd = open("abc", O_RDWR);
	addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	*(addr+0) = 0x44444444;
	tmp = *(addr+0);
	*(addr+1) = 0x77777777;
	write(fd, &val, sizeof(int));
	close(fd);

The results are not always 0x11111111 0x77777777 at the beginning as expected.  Sometimes we see 0x44444444 0x77777777.

Signed-off-by: Anfei <anfei.zhou@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-arch@vger.kernel.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02 18:11:21 -08:00
Linus Torvalds
1a45dcfe25 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: Do not idle on async queues
  blk-cgroup: Fix potential deadlock in blk-cgroup
  block: fix bugs in bio-integrity mempool usage
  block: fix bio_add_page for non trivial merge_bvec_fn case
  drbd: null dereference bug
  drbd: fix max_segment_size initialization
2010-02-02 12:54:37 -08:00
Linus Torvalds
4dab75ec3e Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Use GFP_NOFS for alloc structure
  GFS2: Fix previous patch
  GFS2: Don't withdraw on partial rindex entries
  GFS2: Fix refcnt leak on gfs2_follow_link() error path
2010-02-02 12:48:26 -08:00
Linus Torvalds
7ab02af428 Fix 'flush_old_exec()/setup_new_exec()' split
Commit 221af7f87b ("Split 'flush_old_exec' into two functions") split
the function at the point of no return - ie right where there were no
more error cases to check.  That made sense from a technical standpoint,
but when we then also combined it with the actual personality setting
going in between flush_old_exec() and setup_new_exec(), it needs to be a
bit more careful.

In particular, we need to make sure that we really flush the old
personality bits in the 'flush' stage, rather than later in the 'setup'
stage, since otherwise we might be flushing the _new_ personality state
that we're just setting up.

So this moves the flags and personality flushing (and 'flush_thread()',
which is the arch-specific function that generally resets lazy FP state
etc) of the old process into flush_old_exec(), so that it doesn't affect
any state that execve() is setting up for the new process environment.

This was reported by Michal Simek as breaking his Microblaze qemu
environment.

Reported-and-tested-by: Michal Simek <michal.simek@petalogix.com>
Cc: Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-02 12:37:44 -08:00
Christoph Hellwig
e8b217e753 xfs: remove invalid barrier optimization from xfs_fsync
We always need to flush the disk write cache and can't skip it just because
the no inode attributes have changed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-02-02 10:16:26 +11:00
Dave Chinner
20026d9201 xfs: kill the unused XFS_QMOPT_* flush flags V2
dquots are never flushed asynchronously. Remove the flag and the
async write support from the flush function. Make the default flush
a delwri flush to make the inode flush code, which leaves the
XFS_QMOPT_SYNC the only flag remaining.  Convert that to use
SYNC_WAIT instead, just like the inode flush code.

V2:
- just pass flush flags straight through

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-02-04 09:48:58 +11:00
Linus Torvalds
13af75740f Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing:
  reiserfs: Fix vmalloc call under reiserfs lock
2010-02-01 10:46:18 -08:00
Steven Whitehouse
ea8d62dadd GFS2: Use GFP_NOFS for alloc structure
This is called under a glock, so its a good plan to use GFP_NOFS

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-01 10:01:34 +00:00
Steven Whitehouse
7fe3ec6fe5 GFS2: Fix previous patch
The do_div() call needs to remain.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-01 10:00:23 +00:00
Benjamin Marzinski
55f0b4c546 GFS2: Don't withdraw on partial rindex entries
ince gfs2 writes the rindex file a block at a time, and releases the
exclusive lock after each block, it is possible that another process
will grab the lock in the middle of the write.  Since rindex entries are
not an even divisor of blocks, that other process may see partial
entries.  On grows, this is fine.  The process can simply ignore the the
partial entires. Previously, the code withdrew when it saw partial
entries. Now it simply ignores them.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-02-01 09:59:54 +00:00
Ryusuke Konishi
3256a05531 nilfs2: fix potential leak of dirty data on umount
This fixes incorrect usage of nilfs_segctor_confirm() test function in
nilfs_segctor_destroy(); nilfs_segctor_confirm() returns zero if the
filesystem is not clean, so its use in nilfs_segctor_destroy() needs
inversion.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-01-31 14:57:31 +09:00
Chuck Ebbert
9e9432c267 block: fix bugs in bio-integrity mempool usage
Fix two bugs in the bio integrity code:

 use_bip_pool() always returns 0 because it checks against the wrong limit,
 causing the mempool to be used only when regular allocation fails.

 When the mempool is used as a fallback we don't free the data properly.

Signed-Off-By: Chuck Ebbert <cebbert@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-01-30 20:28:19 +01:00
Linus Torvalds
67f15b06c1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check total number of devices when removing missing
  Btrfs: check return value of open_bdev_exclusive properly
  Btrfs: do not mark the chunk as readonly if in degraded mode
  Btrfs: run orphan cleanup on default fs root
  Btrfs: fix a memory leak in btrfs_init_acl
  Btrfs: Use correct values when updating inode i_size on fallocate
  Btrfs: remove tree_search() in extent_map.c
  Btrfs: Add mount -o compress-force
2010-01-29 10:27:37 -08:00
Linus Torvalds
221af7f87b Split 'flush_old_exec' into two functions
'flush_old_exec()' is the point of no return when doing an execve(), and
it is pretty badly misnamed.  It doesn't just flush the old executable
environment, it also starts up the new one.

Which is very inconvenient for things like setting up the new
personality, because we want the new personality to affect the starting
of the new environment, but at the same time we do _not_ want the new
personality to take effect if flushing the old one fails.

As a result, the x86-64 '32-bit' personality is actually done using this
insane "I'm going to change the ABI, but I haven't done it yet" bit
(TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
personality, but just the "pending" bit, so that "flush_thread()" can do
the actual personality magic.

This patch in no way changes any of that insanity, but it does split the
'flush_old_exec()' function up into a preparatory part that can fail
(still called flush_old_exec()), and a new part that will actually set
up the new exec environment (setup_new_exec()).  All callers are changed
to trivially comply with the new world order.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-29 08:22:01 -08:00
Josef Bacik
035fe03a7a Btrfs: check total number of devices when removing missing
If you have a disk failure in RAID1 and then add a new disk to the
array, and then try to remove the missing volume, it will fail.  The
reason is the sanity check only looks at the total number of rw devices,
which is just 2 because we have 2 good disks and 1 bad one.  Instead
check the total number of devices in the array to make sure we can
actually remove the device.  Tested this with a failed disk setup and
with this test we can now run

btrfs-vol -r missing /mount/point

and it works fine.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Josef Bacik
7f59203abe Btrfs: check return value of open_bdev_exclusive properly
Hit this problem while testing RAID1 failure stuff.  open_bdev_exclusive
returns ERR_PTR(), not NULL.  So change the return value properly.  This
is important if you accidently specify a device that doesn't exist when
trying to add a new device to an array, you will panic the box
dereferencing bdev.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Josef Bacik
f48b90756b Btrfs: do not mark the chunk as readonly if in degraded mode
If a RAID setup has chunks that span multiple disks, and one of those
disks has failed, btrfs_chunk_readonly will return 1 since one of the
disks in that chunk's stripes is dead and therefore not writeable.  So
instead if we are in degraded mode, return 0 so we can go ahead and
allocate stuff.  Without this patch all of the block groups in a RAID1
setup will end up read-only, which will mean we can't add new disks to
the array since we won't be able to make allocations.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Josef Bacik
e3acc2a685 Btrfs: run orphan cleanup on default fs root
This patch revert's commit

6c090a11e1

Since it introduces this problem where we can run orphan cleanup on a
volume that can have orphan entries re-added.  Instead of my original
fix, Yan Zheng pointed out that we can just revert my original fix and
then run the orphan cleanup in open_ctree after we look up the fs_root.
I have tested this with all the tests that gave me problems and this
patch fixes both problems.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Yang Hongyang
f858153c36 Btrfs: fix a memory leak in btrfs_init_acl
In btrfs_init_acl() cloned acl is not released

Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:39 -05:00
Aneesh Kumar K.V
d1ea6a6145 Btrfs: Use correct values when updating inode i_size on fallocate
commit f2bc9dd07e3424c4ec5f3949961fe053d47bc825
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Jan 20 12:57:53 2010 +0530

    Btrfs: Use correct values when updating inode i_size on fallocate

    Even though we allocate more, we should be updating inode i_size
    as per the arguments passed

    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:38 -05:00
Miao Xie
b8d9bfeb18 Btrfs: remove tree_search() in extent_map.c
This patch removes tree_search() in extent_map.c because it is not called by
anything.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:20:38 -05:00
Chris Mason
a555f810af Btrfs: Add mount -o compress-force
The default btrfs mount -o compress mode will quickly back off
compressing a file if it notices that compression does not reduce the
size of the data being written.  This can save considerable CPU because
all future writes to the file go through uncompressed.

But some files are both very large and have mixed data stored in
them.  In that case, we want to add the ability to always try
compressing data before writing it.

This commit adds mount -o compress-force.  A later commit will add
a new inode flag that does the same thing.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-01-28 16:18:15 -05:00
Dmitry Monakhov
1d6165851c block: fix bio_add_page for non trivial merge_bvec_fn case
We have to properly decrease bi_size in order to merge_bvec_fn return
right result.  Otherwise this result in false merge rejects for two
absolutely valid bio_vecs.  This may cause significant performance
penalty for example fs_block_size == 1k and block device is raid0 with
small chunk_size = 8k. Then it is impossible to merge 7-th fs-block in
to bio which already has 6 fs-blocks.

Cc: <stable@kernel.org>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-01-28 15:08:29 +01:00
Frederic Weisbecker
bbec919150 reiserfs: Fix vmalloc call under reiserfs lock
Vmalloc is called to allocate journal->j_cnode_free_list but
we hold the reiserfs lock at this time, which raises a
{RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} lock inversion.

Just drop the reiserfs lock at this time, as it's not even
needed but kept for paranoid reasons.

This fixes:

[ INFO: inconsistent lock state ]
2.6.33-rc5 #1
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/313 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&REISERFS_SB(s)->lock){+.+.?.}, at: [<c11118c8>]
reiserfs_write_lock_once+0x28/0x50
{RECLAIM_FS-ON-W} state was registered at:
  [<c104ee32>] mark_held_locks+0x62/0x90
  [<c104eefa>] lockdep_trace_alloc+0x9a/0xc0
  [<c108f7b6>] kmem_cache_alloc+0x26/0xf0
  [<c108621c>] __get_vm_area_node+0x6c/0xf0
  [<c108690e>] __vmalloc_node+0x7e/0xa0
  [<c1086aab>] vmalloc+0x2b/0x30
  [<c110e1fb>] journal_init+0x6cb/0xa10
  [<c10f90a2>] reiserfs_fill_super+0x342/0xb80
  [<c1095665>] get_sb_bdev+0x145/0x180
  [<c10f68e1>] get_super_block+0x21/0x30
  [<c1094520>] vfs_kern_mount+0x40/0xd0
  [<c1094609>] do_kern_mount+0x39/0xd0
  [<c10aaa97>] do_mount+0x2c7/0x6d0
  [<c10aaf06>] sys_mount+0x66/0xa0
  [<c16198a7>] mount_block_root+0xc4/0x245
  [<c1619a81>] mount_root+0x59/0x5f
  [<c1619b98>] prepare_namespace+0x111/0x14b
  [<c1619269>] kernel_init+0xcf/0xdb
  [<c100303a>] kernel_thread_helper+0x6/0x1c
irq event stamp: 63236801
hardirqs last  enabled at (63236801): [<c134e7fa>]
__mutex_unlock_slowpath+0x9a/0x120
hardirqs last disabled at (63236800): [<c134e799>]
__mutex_unlock_slowpath+0x39/0x120
softirqs last  enabled at (63218800): [<c102f451>] __do_softirq+0xc1/0x110
softirqs last disabled at (63218789): [<c102f4ed>] do_softirq+0x4d/0x60

other info that might help us debug this:
2 locks held by kswapd0/313:
 #0:  (shrinker_rwsem){++++..}, at: [<c1074bb4>] shrink_slab+0x24/0x170
 #1:  (&type->s_umount_key#19){++++..}, at: [<c10a2edd>]
shrink_dcache_memory+0xfd/0x1a0

stack backtrace:
Pid: 313, comm: kswapd0 Not tainted 2.6.33-rc5 #1
Call Trace:
 [<c134db2c>] ? printk+0x18/0x1c
 [<c104e7ef>] print_usage_bug+0x15f/0x1a0
 [<c104ebcf>] mark_lock+0x39f/0x5a0
 [<c104d66b>] ? trace_hardirqs_off+0xb/0x10
 [<c1052c50>] ? check_usage_forwards+0x0/0xf0
 [<c1050c24>] __lock_acquire+0x214/0xa70
 [<c10438c5>] ? sched_clock_cpu+0x95/0x110
 [<c10514fa>] lock_acquire+0x7a/0xa0
 [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
 [<c134f03f>] mutex_lock_nested+0x5f/0x2b0
 [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
 [<c11118c8>] ? reiserfs_write_lock_once+0x28/0x50
 [<c11118c8>] reiserfs_write_lock_once+0x28/0x50
 [<c10f05b0>] reiserfs_delete_inode+0x50/0x140
 [<c10a653f>] ? generic_delete_inode+0x5f/0x150
 [<c10f0560>] ? reiserfs_delete_inode+0x0/0x140
 [<c10a657c>] generic_delete_inode+0x9c/0x150
 [<c10a666d>] generic_drop_inode+0x3d/0x60
 [<c10a5597>] iput+0x47/0x50
 [<c10a2a4f>] dentry_iput+0x6f/0xf0
 [<c10a2af4>] d_kill+0x24/0x50
 [<c10a2d3d>] __shrink_dcache_sb+0x21d/0x2b0
 [<c10a2f0f>] shrink_dcache_memory+0x12f/0x1a0
 [<c1074c9e>] shrink_slab+0x10e/0x170
 [<c1075177>] kswapd+0x477/0x6a0
 [<c1072d10>] ? isolate_pages_global+0x0/0x1b0
 [<c103e160>] ? autoremove_wake_function+0x0/0x40
 [<c1074d00>] ? kswapd+0x0/0x6a0
 [<c103de6c>] kthread+0x6c/0x80
 [<c103de00>] ? kthread+0x0/0x80
 [<c100303a>] kernel_thread_helper+0x6/0x1c

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Chris Mason <chris.mason@oracle.com>
2010-01-28 13:43:50 +01:00
Al Viro
083c73c253 fix oops in fs/9p late mount failure
if 9P ->get_sb() fails late (at root inode or root dentry
allocation), we'll hit its ->kill_sb() with NULL ->s_root

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:27 -05:00
Al Viro
7e32b7bb73 fix leak in romfs_fill_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:26 -05:00
Al Viro
ef52c75e4b get rid of pointless checks after simple_pin_fs()
if we'd just got success from it, vfsmount won't be NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:26 -05:00
Al Viro
5998649f77 Fix failure exits in bfs_fill_super()
double iput(), leaks...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:25 -05:00
Al Viro
217686e983 fix affs parse_options()
Error handling in that sucker got broken back in 2003.  If function
returns 0 on failure, it's not nice to add return -EINVAL into it.
Adding return 1 on other failure exits is also not a good thing (and
yes, original success exits with 1 and some of failure exits with 0
are still there; so's the original logics in callers).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:25 -05:00
Al Viro
29333920a5 Fix remount races with symlink handling in affs
A couple of fields in affs_sb_info is used in follow_link() and
symlink() for handling AFFS "absolute" symlinks.  Need locking
against affs_remount() updates.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:24 -05:00
Al Viro
afc70ed05a Fix a leak in affs_fill_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-26 22:22:24 -05:00
Greg Kroah-Hartman
b04da8bfdf fnctl: f_modown should call write_lock_irqsave/restore
Commit 7036251180 exposed that f_modown()
should call write_lock_irqsave instead of just write_lock_irq so that
because a caller could have a spinlock held and it would not be good to
renable interrupts.

Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Tavis Ormandy <taviso@google.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-26 17:25:38 -08:00
Trond Myklebust
a2c0b9e291 NFS: Ensure that we handle NFS4ERR_STALE_STATEID correctly
Even if the server is crazy, we should be able to mark the stateid as being
bad, to ensure it gets recovered.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:47 -05:00
Trond Myklebust
03391693a9 NFSv4.1: Don't call nfs4_schedule_state_recovery() unnecessarily
Currently, nfs4_handle_exception() will call it twice if called with an
error of -NFS4ERR_STALE_CLIENTID, -NFS4ERR_STALE_STATEID or
-NFS4ERR_EXPIRED.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:38 -05:00
Trond Myklebust
8e469ebd6d NFSv4: Don't allow posix locking against servers that don't support it
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:30 -05:00
Trond Myklebust
2bee72a6aa NFSv4: Ensure that the NFSv4 locking can recover from stateid errors
In most cases, we just want to mark the lock_stateid sequence id as being
uninitialised.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:21 -05:00
David Howells
b0706ca415 NFS: Avoid warnings when CONFIG_NFS_V4=n
Avoid the following warnings when CONFIG_NFS_V4=n:

	fs/nfs/sysctl.c:19: warning: unused variable `nfs_set_port_max'
	fs/nfs/sysctl.c:18: warning: unused variable `nfs_set_port_min'

by making those variables contingent on NFSv4 being configured.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:11 -05:00
H Hartley Sweeten
0aa05887af NFS: Make nfs_commitdata_release static
The symbol nfs_commitdata_release is only used locally
in this file. Make it static to prevent the following sparse warning:

warning: symbol 'nfs_commitdata_release' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:42:03 -05:00
Trond Myklebust
82be934a59 NFS: Try to commit unstable writes in nfs_release_page()
If someone calls nfs_release_page(), we presumably already know that the
page is clean, however it may be holding an unstable write.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:41:53 -05:00
Trond Myklebust
c9edda7140 NFS: Fix a reference leak in nfs_wb_cancel_page()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
2010-01-26 15:41:34 -05:00
Dave Chinner
7d6a7bde52 xfs: Use delay write promotion for dquot flushing
xfs_qm_dqflock_pushbuf_wait() does a very similar trick to item
pushing used to do to flush out delayed write dquot buffers. Change
it to use the new promotion method rather than an async flush.

Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock
held, yet the callers make the assumption that after this call the
flush lock is held. Always return with the flush lock held.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-01-26 15:13:41 +11:00
Dave Chinner
089716aa14 xfs: Sort delayed write buffers before dispatch
Currently when the xfsbufd writes delayed write buffers, it pushes
them to disk in the order they come off the delayed write list. If
there are lots of buffers ѕpread widely over the disk, this results
in overwhelming the elevator sort queues in the block layer and we
end up losing the posibility of merging adjacent buffers to minimise
the number of IOs.

Use the new generic list_sort function to sort the delwri dispatch
queue before issue to ensure that the buffers are pushed in the most
friendly order possible to the lower layers.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-01-26 15:13:25 +11:00
Dave Chinner
d808f617ad xfs: Don't issue buffer IO direct from AIL push V2
All buffers logged into the AIL are marked as delayed write.
When the AIL needs to push the buffer out, it issues an async write of the
buffer. This means that IO patterns are dependent on the order of
buffers in the AIL.

Instead of flushing the buffer, promote the buffer in the delayed
write list so that the next time the xfsbufd is run the buffer will
be flushed by the xfsbufd. Return the state to the xfsaild that the
buffer was promoted so that the xfsaild knows that it needs to cause
the xfsbufd to run to flush the buffers that were promoted.

Using the xfsbufd for issuing the IO allows us to dispatch all
buffer IO from the one queue. This means that we can make much more
enlightened decisions on what order to flush buffers to disk as
we don't have multiple places issuing IO. Optimisations to xfsbufd
will be in a future patch.

Version 2
- kill XFS_ITEM_FLUSHING as it is now unused.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-02-02 10:13:42 +11:00
Dave Chinner
c854363e80 xfs: Use delayed write for inodes rather than async V2
We currently do background inode flush asynchronously, resulting in
inodes being written in whatever order the background writeback
issues them. Not only that, there are also blocking and non-blocking
asynchronous inode flushes, depending on where the flush comes from.

This patch completely removes asynchronous inode writeback. It
removes all the strange writeback modes and replaces them with
either a synchronous flush or a non-blocking delayed write flush.
That is, inode flushes will only issue IO directly if they are
synchronous, and background flushing may do nothing if the operation
would block (e.g. on a pinned inode or buffer lock).

Delayed write flushes will now result in the inode buffer sitting in
the delwri queue of the buffer cache to be flushed by either an AIL
push or by the xfsbufd timing out the buffer. This will allow
accumulation of dirty inode buffers in memory and allow optimisation
of inode cluster writeback at the xfsbufd level where we have much
greater queue depths than the block layer elevators. We will also
get adjacent inode cluster buffer IO merging for free when a later
patch in the series allows sorting of the delayed write buffers
before dispatch.

This effectively means that any inode that is written back by
background writeback will be seen as flush locked during AIL
pushing, and will result in the buffers being pushed from there.
This writeback path is currently non-optimal, but the next patch
in the series will fix that problem.

A side effect of this delayed write mechanism is that background
inode reclaim will no longer directly flush inodes, nor can it wait
on the flush lock. The result is that inode reclaim must leave the
inode in the reclaimable state until it is clean. Hence attempts to
reclaim a dirty inode in the background will simply skip the inode
until it is clean and this allows other mechanisms (i.e. xfsbufd) to
do more optimal writeback of the dirty buffers. As a result, the
inode reclaim code has been rewritten so that it no longer relies on
the ambiguous return values of xfs_iflush() to determine whether it
is safe to reclaim an inode.

Portions of this patch are derived from patches by Christoph
Hellwig.

Version 2:
- cleanup reclaim code as suggested by Christoph
- log background reclaim inode flush errors
- just pass sync flags to xfs_iflush

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-02-06 12:39:36 +11:00
Dave Chinner
777df5afdb xfs: Make inode reclaim states explicit
A.K.A.: don't rely on xfs_iflush() return value in reclaim

We have gradually been moving checks out of the reclaim code because
they are duplicated in xfs_iflush(). We've had a history of problems
in this area, and many of them stem from the overloading of the
return values from xfs_iflush() and interaction with inode flush
locking to determine if the inode is safe to reclaim.

With the desire to move to delayed write flushing of inodes and
non-blocking inode tree reclaim walks, the overloading of the
return value of xfs_iflush makes it very difficult to determine
the correct thing to do next.

This patch explicitly re-adds the checks to the inode reclaim code,
removing the reliance on the return value of xfs_iflush() to
determine what to do next. It also means that we can clearly
document all the inode states that reclaim must handle and hence
we can easily see that we handled all the necessary cases.

This also removes the need for the xfs_inode_clean() check in
xfs_iflush() as all callers now check this first (safely).

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-02-06 12:37:26 +11:00
Eric Sandeen
d5db0f97fb xfs: more reserved blocks fixups
This mangles the reserved blocks counts a little more.

1) add a helper function for the default reserved count
2) add helper functions to save/restore counts on ro/rw
3) save/restore reserved blocks on freeze/thaw
4) disallow changing reserved count while readonly

V2: changed field name to match Dave's changes

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-02-08 17:41:48 -06:00
Dave Chinner
388f1f0c34 xfs: turn off sign warnings
Because they cause warnings in static inline functions conditionally
compiled into XFS from the VFS (e.g. fsnotify).

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-01-26 15:10:15 +11:00
Dave Chinner
cbe132a8bd xfs: don't hold onto reserved blocks on remount,ro
If we hold onto reserved blocks when doing a remount,ro we end
up writing the blocks used count to disk that includes the reserved
blocks. Reserved blocks are not actually used, so this results in
the values in the superblock being incorrect.

Hence if we run xfs_check or xfs_repair -n while the filesystem is
mounted remount,ro we end up with an inconsistent filesystem being
reported. Also, running xfs_copy on the remount,ro filesystem will
result in an inconsistent image being generated.

To fix this, unreserve the blocks when doing the remount,ro, and
reserved them again on remount,rw. This way a remount,ro filesystem
will appear consistent on disk to all utilities.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-01-26 15:08:49 +11:00
Sunil Mushran
26636bf6b2 ocfs2/dlm: Print more messages during lock migration
When a lock resource is migrated, the dlm compares the migrated
locks with that that was already existing on the new node. If the
comparison fails, it BUGs. This patch prints more messages when the
comparison fails inorder to help with the root cause analyis.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1206
This does not fix bz1206. However, if we run into it again, we will
have more information to chew on.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:21:09 -08:00
Sunil Mushran
71656fa6ec ocfs2/dlm: Ignore LVBs of locks in the Blocked list
During lock resource migration, o2dlm fills the packet with a LVB from the
first valid lock. For sanity, it ensures that the other valid locks have the
same LVB. If not, it BUGs.

The valid locks are ones that have granted EX or PR lock levels and are either
on the Granted or Converting lists. Locks in the Blocked list cannot have a
valid LVB.

This patch ensures that we skip the locks in the Blocked list.

Fixes oss bugzilla#1202
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1202

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:57 -08:00
Sunil Mushran
2bd632165c ocfs2/trivial: Remove trailing whitespaces
Patch removes trailing whitespaces.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:51 -08:00
Wengang Wang
e5f2cb2b1a ocfs2: fix a misleading variable name
a local variable "dlm_version" is used as a fs locking version.
rename it fs_version.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:48 -08:00
Tao Ma
1097df3ffe ocfs2: Sync max_inline_data_with_xattr from tools.
In ocfs2-tools, we have added ocfs2_max_inline_data_with_xattr,
so add it in the kernel's ocfs2_fs.h.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-01-25 19:20:45 -08:00
Linus Torvalds
9a3cbe3265 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Drop EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE flag
  ext4: Fix quota accounting error with fallocate
  ext4: Handle -EDQUOT error on write
2010-01-25 19:05:06 -08:00
Davide Libenzi
cb289d6244 eventfd - allow atomic read and waitqueue remove
KVM needs a wait to atomically remove themselves from the eventfd ->poll()
wait queue head, in order to handle correctly their IRQfd deassign
operation.

This patch introduces such API, plus a way to read an eventfd from its
context.

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-01-25 12:26:38 -02:00
Theodore Ts'o
f710b4b96b ext4: Reserve INCOMPAT_EA_INODE and INCOMPAT_DIRDATA feature codepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-25 03:31:32 -05:00
Theodore Ts'o
19f5fb7ad6 ext4: Use bitops to read/modify EXT4_I(inode)->i_state
At several places we modify EXT4_I(inode)->i_state without holding
i_mutex (ext4_release_file, ext4_bmap, ext4_journalled_writepage,
ext4_do_update_inode, ...). These modifications are racy and we can
lose updates to i_state. So convert handling of i_state to use bitops
which are atomic.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-01-24 14:34:07 -05:00
Christoph Hellwig
9b00f30762 xfs: quota limit statvfs available blocks
A "df" run on an NFS client of an exported XFS file system reports
the wrong information for "available" blocks.  When a block quota is
enforced, the amount reported as free is limited by the quota, but
the amount reported available is not (and should be).

Reported-by: Guk-Bong, Kwon <gbkwon@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-21 16:34:23 -06:00
Christoph Hellwig
bdfb04301f xfs: replace KM_LARGE with explicit vmalloc use
We use the KM_LARGE flag to make kmem_alloc and friends use vmalloc
if necessary.  As we only need this for a few boot/mount time
allocations just switch to explicit vmalloc calls there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-21 13:44:56 -06:00
Christoph Hellwig
a14a348bff xfs: cleanup up xfs_log_force calling conventions
Remove the XFS_LOG_FORCE argument which was always set, and the
XFS_LOG_URGE define, which was never used.

Split xfs_log_force into a two helpers - xfs_log_force which forces
the whole log, and xfs_log_force_lsn which forces up to the
specified LSN.  The underlying implementations already were entirely
separate, as were the users.

Also re-indent the new _xfs_log_force/_xfs_log_force which
previously had a weird coding style.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-21 13:44:49 -06:00
Christoph Hellwig
4139b3b337 xfs: kill XLOG_VEC_SET_TYPE
This macro only obsfucates the log item type assignments, so kill it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-21 13:44:43 -06:00
Christoph Hellwig
0cadda1c5f xfs: remove duplicate buffer flags
Currently we define aliases for the buffer flags in various
namespaces, which only adds confusion.  Remove all but the XBF_
flags to clean this up a bit.

Note that we still abuse XFS_B_ASYNC/XBF_ASYNC for some non-buffer
uses, but I'll clean that up later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-21 13:44:36 -06:00
Christoph Hellwig
a210c1aa7f xfs: implement quota warnings via netlink
Wire up quota_send_warning to send quota warnings over netlink.
This is used by various desktops to show user quota warnings.

Tested by running the quota_nld daemon while running the xfstest
quota tests and observing the warnings.  I'll see how I can get a
more formal testcase for it written.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-21 13:44:28 -06:00
Christoph Hellwig
4d1f88d75b xfs: clean up error handling in xfs_trans_dqresv
Move the error code selection after the goto label and fold the
xfs_quota_error helper into it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-21 13:44:22 -06:00
Christoph Hellwig
512dd1abd9 xfs: kill XFS_QMOPT_ASYNC
The option is unused and one of the few remaining users of
xfs_bawrite, so let's get rid of it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-21 13:44:10 -06:00
Linus Torvalds
bdeef61cd0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6:
  tty: fix race in tty_fasync
  serial: serial_cs: oxsemi quirk breaks resume
  serial: imx: bit &/| confusion
  serial: Fix crash if the minimum rate of the device is > 9600 baud
  serial-core: resume serial hardware with no_console_suspend
  serial: 8250_pnp: use wildcard for serial Wacom tablets
  nozomi: quick fix for the close/close bug
  compat_ioctl: Supress "unknown cmd" message on serial /dev/console
2010-01-21 07:37:20 -08:00
Linus Torvalds
456eac9478 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  fs/bio.c: fix shadows sparse warning
  drbd: The kernel code is now equivalent to out of tree release 8.3.7
  drbd: Allow online resizing of DRBD devices while peer not reachable (needs to be explicitly forced)
  drbd: Don't go into StandAlone mode when authentification failes because of network error
  drivers/block/drbd/drbd_receiver.c: correct NULL test
  cfq-iosched: Respect ioprio_class when preempting
  genhd: overlapping variable definition
  block: removed unused as_io_context
  DM: Fix device mapper topology stacking
  block: bdev_stack_limits wrapper
  block: Fix discard alignment calculation and printing
  block: Correct handling of bottom device misaligment
  drbd: check on CONFIG_LBDAF, not LBD
  drivers/block/drbd: Correct NULL test
  drbd: Silenced an assert that could triggered after changing write ordering method
  drbd: Kconfig fix
  drbd: Fix for a race between IO and a detach operation [Bugz 262]
  drbd: Use drbd_crypto_is_hash() instead of an open coded check
2010-01-21 07:32:11 -08:00
Linus Torvalds
15e551e52b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  ecryptfs: use after free
  ecryptfs: Eliminate useless code
  ecryptfs: fix interpose/interpolate typos in comments
  ecryptfs: pass matching flags to interpose as defined and used there
  ecryptfs: remove unnecessary d_drop calls in ecryptfs_link
  ecryptfs: don't ignore return value from lock_rename
  ecryptfs: initialize private persistent file before dereferencing pointer
  eCryptfs: Remove mmap from directory operations
  eCryptfs: Add getattr function
  eCryptfs: Use notify_change for truncating lower inodes
2010-01-21 07:28:54 -08:00
Linus Torvalds
30a0f5e1fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix possible panic on unmount
  Btrfs: deal with NULL acl sent to btrfs_set_acl
  Btrfs: fix regression in orphan cleanup
  Btrfs: Fix race in btrfs_mark_extent_written
  Btrfs, fix memory leaks in error paths
  Btrfs: align offsets for btrfs_ordered_update_i_size
  btrfs: fix missing last-entry in readdir(3)
2010-01-21 07:28:05 -08:00
Atsushi Nemoto
3f00171125 compat_ioctl: Supress "unknown cmd" message on serial /dev/console
After the commit fb07a5f8 ("compat_ioctl: remove all VT ioctl
handling"), I got this error message on 64-bit mips kernel with 32-bit
busybox userland:

ioctl32(init:1): Unknown cmd fd(0) cmd(00005600){t:'V';sz:0} arg(7fd76480) on /dev/console

The cmd 5600 is VT_OPENQRY.  The busybox's init issues this ioctl to
know vt-console or serial-console.  If the console was serial console,
VT ioctls are not handled by the serial driver.

And by quick search, I found some programs using VT_GETMODE to check
vt-console is available or not.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-20 15:03:26 -08:00
Phillip Lougher
dc3256782f Squashfs: add decompressor entries for lzma and lzo
Add knowledge of lzma/lzo compression formats to the decompressor
framework.  For now these are added as unsupported.  Without
these entries lzma/lzo compressed filesystems will be flagged as
having unknown compression which is undesirable.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2010-01-20 21:47:48 +00:00
Phillip Lougher
4c0f0bb235 Squashfs: add a decompressor framework
This adds a decompressor framework which allows multiple compression
algorithms to be cleanly supported.

Also update zlib wrapper and other code to use the new framework.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2010-01-20 21:47:47 +00:00
Phillip Lougher
f1a40359f8 Squashfs: factor out remaining zlib dependencies into separate wrapper file
Move zlib buffer init/destroy code into separate wrapper file.  Also
make zlib z_stream field a void * removing the need to include zlib.h
for most files.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2010-01-20 21:47:47 +00:00
Phillip Lougher
e6a6d37955 Squashfs: move zlib decompression wrapper code into a separate file
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2010-01-20 21:47:46 +00:00
Dan Carpenter
ece550f51b ecryptfs: use after free
The "full_alg_name" variable is used on a couple error paths, so we
shouldn't free it until the end.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:06 -06:00
Julia Lawall
4aa25bcb7d ecryptfs: Eliminate useless code
The variable lower_dentry is initialized twice to the same (side effect-free)
expression.  Drop one initialization.

A simplified version of the semantic match that finds this problem is:
(http://coccinelle.lip6.fr/)

// <smpl>
@forall@
idexpression *x;
identifier f!=ERR_PTR;
@@

x = f(...)
... when != x
(
x = f(...,<+...x...+>,...)
|
* x = f(...)
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:05 -06:00
Erez Zadok
fe0fc013cd ecryptfs: fix interpose/interpolate typos in comments
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Acked-by: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:03 -06:00
Erez Zadok
3469b57329 ecryptfs: pass matching flags to interpose as defined and used there
ecryptfs_interpose checks if one of the flags passed is
ECRYPTFS_INTERPOSE_FLAG_D_ADD, defined as 0x00000001 in ecryptfs_kernel.h.
But the only user of ecryptfs_interpose to pass a non-zero flag to it, has
hard-coded the value as "1". This could spell trouble if any of these values
changes in the future.

Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:02 -06:00
Erez Zadok
c44a66d674 ecryptfs: remove unnecessary d_drop calls in ecryptfs_link
Unnecessary because it would unhash perfectly valid dentries, causing them
to have to be re-looked up the next time they're needed, which presumably is
right after.

Signed-off-by: Aseem Rastogi <arastogi@cs.sunysb.edu>
Signed-off-by: Shrikar archak <shrikar84@gmail.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Saumitra Bhanage <sbhanage@cs.sunysb.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:36:00 -06:00
Erez Zadok
0d132f7364 ecryptfs: don't ignore return value from lock_rename
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:35:59 -06:00
Erez Zadok
e27759d7a3 ecryptfs: initialize private persistent file before dereferencing pointer
Ecryptfs_open dereferences a pointer to the private lower file (the one
stored in the ecryptfs inode), without checking if the pointer is NULL.
Right afterward, it initializes that pointer if it is NULL.  Swap order of
statements to first initialize.  Bug discovered by Duckjin Kang.

Signed-off-by: Duckjin Kang <fromdj2k@gmail.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:54 -06:00
Tyler Hicks
38e3eaeedc eCryptfs: Remove mmap from directory operations
Adrian reported that mkfontscale didn't work inside of eCryptfs mounts.
Strace revealed the following:

open("./", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3
fcntl64(3, F_GETFD) = 0x1 (flags FD_CLOEXEC)
open("./fonts.scale", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
getdents(3, /* 80 entries */, 32768) = 2304
open("./.", O_RDONLY) = 5
fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
fstat64(5, {st_mode=S_IFDIR|0755, st_size=16384, ...}) = 0
mmap2(NULL, 16384, PROT_READ, MAP_PRIVATE, 5, 0) = 0xb7fcf000
close(5) = 0
--- SIGBUS (Bus error) @ 0 (0) ---
+++ killed by SIGBUS +++

The mmap2() on a directory was successful, resulting in a SIGBUS
signal later.  This patch removes mmap() from the list of possible
ecryptfs_dir_fops so that mmap() isn't possible on eCryptfs directory
files.

https://bugs.launchpad.net/ecryptfs/+bug/400443

Reported-by: Adrian C. <anrxc@sysphere.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2010-01-19 22:32:11 -06:00