Commit Graph

30567 Commits

Author SHA1 Message Date
Jan Kara
091e26dfc1 ext4: fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: stable@vger.kernel.org
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-29 22:48:17 -05:00
Linus Torvalds
f96736e1ba xfs: bugfixes for 3.8-rc6
- fix return value when filesystem probe finds no XFS magic, a
   regression introduced in 9802182.
 
 - fix stack switch in __xfs_bmapi_allocate by moving the check for stack
   switch up into xfs_bmapi_write.
 
 - fix oops in _xfs_buf_find by validating that the requested block is
   within the filesystem bounds.
 
 - limit speculative preallocation near ENOSPC.
 
 - fix an unmount hang in xfs_wait_buftarg by freeing the
   xfs_buf_log_item in xfs_buf_item_unlock.
 
 - fix a possible use after free with AIO.
 
 - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
   regression introduced in fb59581404.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRBvmgAAoJENaLyazVq6ZOOacP/RilCPsi41NkJqRx1Rs5aRGE
 UvinrHfAL/tBupS2JVo1niIilBNJG/cI+lLcpV/P5omLBJfpEu0trzZUSxU7S1Vc
 a2M8J0qhmKfBcl70fuCALAxPY52895y+44gufxaH0O5HQDN6tB8n4MMqYGPmS8hz
 Ul/q3MO601hVyBHaoYa7BNGS3YG0TCdFGtWcC5tQaR3v7upTLR2ouZrGQ8CV0BBa
 Ek1xdxLh4D0fRybSL7lUw64W957iyldoLsEg+zQrE9NSfTE8DSqUG+NPWB0wjPce
 ICtmO6TbE5c6q1ScOL3YCC2cmYvjR9mlAHnPy73SqWSIsTqUsVzdibNo+tUJJZ5r
 RZf3u6Uri6uKC6Hl4XEtg4LVnnquKosTXfoiHmn+eh0dhYL7sZG0Ya5we5pH5Tmi
 P6B2DlfUA1fj4Ne4Asx2d7mwOJaZcLHDZoeCs/Haz2Z6kGVEm7ImyAb1h76uNOZo
 l0NFhXJGcOQLyjPtQjl81SGjQmntiIN0Poia3528zjxxGXlBNwAwalkOtdJnk5iN
 IaYRKtvIcrdjFvunasiKZIsV/O9w3/mguXlrSqBDgUsKPUc/cq5vLfsa70jYGc2j
 M6ldJRRqTvSjkVXc/7SXv4GLt/qbUWa92ESzhZQXEABIjZJUnOZpZuLmSVdRUyZk
 +SaGvbMphE0U/4ps3pO/
 =wViN
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:
 "Here are fixes for returning EFSCORRUPTED on probe of a non-xfs
  filesystem, the stack switch in xfs_bmapi_allocate, a crash in
  _xfs_buf_find, speculative preallocation as the filesystem nears
  ENOSPC, an unmount hang, a race with AIO, and a regression with
  xfs_fsr:

   - fix return value when filesystem probe finds no XFS magic, a
     regression introduced in 9802182.

   - fix stack switch in __xfs_bmapi_allocate by moving the check for
     stack switch up into xfs_bmapi_write.

   - fix oops in _xfs_buf_find by validating that the requested block is
     within the filesystem bounds.

   - limit speculative preallocation near ENOSPC.

   - fix an unmount hang in xfs_wait_buftarg by freeing the
     xfs_buf_log_item in xfs_buf_item_unlock.

   - fix a possible use after free with AIO.

   - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
     regression introduced in commit fb59581404a."

* tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs:
  xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
  xfs: Fix possible use-after-free with AIO
  xfs: fix shutdown hang on invalid inode during create
  xfs: limit speculative prealloc near ENOSPC thresholds
  xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
  xfs: pull up stack_switch check into xfs_bmapi_write
  xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
2013-01-30 11:59:37 +11:00
Steven Whitehouse
4513899092 GFS2: Use ->writepages for ordered writes
Instead of using a list of buffers to write ahead of the journal
flush, this now uses a list of inodes and calls ->writepages
via filemap_fdatawrite() in order to achieve the same thing. For
most use cases this results in a shorter ordered write list,
as well as much larger i/os being issued.

The ordered write list is sorted by inode number before writing
in order to retain the disk block ordering between inodes as
per the previous code.

The previous ordered write code used to conflict in its assumptions
about how to write out the disk blocks with mpage_writepages()
so that with this updated version we can also use mpage_writepages()
for GFS2's ordered write, writepages implementation. So we will
also send larger i/os from writeback too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:29:17 +00:00
Steven Whitehouse
d564053f07 GFS2: Clean up freeze code
The freeze code has not been looked at a lot recently. Upstream has
moved on, and this is an attempt to catch us back up again. There
is a vfs level interface for the freeze code which can be called
from our (obsolete, but kept for backward compatibility purposes)
sysfs freeze interface. This means freezing this way vs. doing it
from the ioctl should now work in identical fashion.

As a result of this, the freeze function is only called once
and we can drop our own special purpose code for counting the
number of freezes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:29:05 +00:00
Steven Whitehouse
c76c4d96bd GFS2: Merge gfs2_attach_bufdata() into trans.c
The locking in gfs2_attach_bufdata() was type specific (data/meta)
which made the function rather confusing. This patch moves the core
of gfs2_attach_bufdata() into trans.c renaming it gfs2_alloc_bufdata()
and moving the locking into gfs2_trans_add_data()/gfs2_trans_add_meta()

As a result all of the locking related to adding data and metadata to
the journal is now in these two functions. This should help to clarify
what is going on, and give us some opportunities to simplify in
some cases.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:28:44 +00:00
Steven Whitehouse
767f433f34 GFS2: Copy gfs2_trans_add_bh into new data/meta functions
This patch copies the body of gfs2_trans_add_bh into the two newly
added gfs2_trans_add_data and gfs2_trans_add_meta functions. We can
then move the .lo_add functions from lops.c into trans.c and call
them directly.

As a result of this, we no longer need to use the .lo_add functions
at all, so that is removed from the log operations structure.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:28:28 +00:00
Steven Whitehouse
350a9b0a72 GFS2: Split gfs2_trans_add_bh() into two
There is little common content in gfs2_trans_add_bh() between the data
and meta classes by the time that the functions which it calls are
taken into account. The intent here is to split this into two
separate functions. Stage one is to introduce gfs2_trans_add_data()
and gfs2_trans_add_meta() and update the callers accordingly.

Later patches will then pull in the content of gfs2_trans_add_bh()
and its dependent functions in order to clean up the code in this
area.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:28:04 +00:00
Steven Whitehouse
75f2b879ae GFS2: Merge revoke adding functions
This moves the lo_add function for revokes into trans.c, removing
a function call and making the code easier to read.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:27:46 +00:00
Steven Whitehouse
2a00585593 GFS2: Separate LRU scanning from shrinker
This breaks out the LRU scanning function from the shrinker in
preparation for adding other callers to the LRU scanner.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-29 10:27:28 +00:00
Jiri Kosina
617677295b Merge branch 'master' into for-next
Conflicts:
	drivers/devfreq/exynos4_bus.c

Sync with Linus' tree to be able to apply patches that are
against newer code (mvneta).
2013-01-29 10:48:30 +01:00
Guo Chao
b1deefc99e ext4: remove unnecessary NULL pointer check
brelse() and ext4_journal_force_commit() are both inlined and able
to handle NULL.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:41:02 -05:00
Guo Chao
41be871f74 ext4: remove useless assignment in dx_probe()
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:33:28 -05:00
Guo Chao
2bbbee2a68 ext4: remove unused variable in add_dirent_to_buf()
After commit 978fef9 (create __ext4_insert_dentry for dir entry
insertion), 'reclen' is not used anymore.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-01-28 21:26:44 -05:00
Guo Chao
d5ac777305 ext4: release buffer when checksum failed
Commit b0336e8d (ext4: calculate and verify checksums of directory
leaf blocks) and commit dbe89444 (ext4: Calculate and verify checksums
for htree nodes) forget to release buffer when checksum failed, at
some places.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
2013-01-28 21:23:24 -05:00
Lukas Czerner
b06acd38a4 ext4: remove explicit WARN_ON when ext4_map_blocks() fails
In two places we call WARN_ON() before we print out the debug message,
however we agreed that the WARN_ON() is unnecessary at those places so
remove them.

Also use ext4_warning() instead of ext4_msg() and printk().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:21:12 -05:00
Lukas Czerner
cfa7275482 ext4: remove unused variable flags
Remove unused variable flags from dump_completed_IO(). The code is
only exercised when EXT4FS_DEBUG is defined.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-01-28 21:14:11 -05:00
Jan Kara
fe386132f6 ext4: fix ext4_writepage() to achieve data=ordered guarantees
So far ext4_writepage() skipped writing pages that had any delayed or
unwritten buffers attached. When blocksize < pagesize this breaks
data=ordered mode guarantees as we can have a page with one freshly
allocated buffer whose allocation is part of the committing
transaction and another buffer in the page which is delayed or
unwritten. So fix this problem by calling ext4_bio_writepage()
anyway. It will submit mapped buffers and leave others alone.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 21:06:42 -05:00
Jan Kara
8a850c3fb8 ext4: Make ext4_bio_writepage() handle unprepared buffers
So far ext4_bio_writepage() unconditionally cleared dirty bit on all
buffers underlying the page. That implicitely assumes we can write all
buffers. So far that is true because callers call into
ext4_bio_writepage() make sure all buffers in the page are mapped but:

a) it's a data corruption bug waiting to happen
b) in data=ordered mode when blocksize < pagesize we do need to write
   pages that may have only some of dirty buffers mapped.

So change ext4_bio_writepage() to skip buffers that cannot be written without
clearing their dirty bit.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 20:53:28 -05:00
Torsten Kaiser
65e3aa77f1 xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
Commit fb59581404 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and  truncate_pagecache_range() directly.

But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'

Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 16:05:10 -06:00
Torsten Kaiser
2729423cf2 xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
Commit fb59581404 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and  truncate_pagecache_range() directly.

But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'

Signed-off-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 13:50:10 -06:00
Jan Kara
4b05d09c18 xfs: Fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:51:22 -06:00
Dave Chinner
9f87832a82 xfs: fix shutdown hang on invalid inode during create
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:

[   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck

The trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:

xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock

And that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.

The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.

The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:51:12 -06:00
Dave Chinner
f2a459565b xfs: limit speculative prealloc near ENOSPC thresholds
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.

Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:50:50 -06:00
Dave Chinner
eb178619f9 xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.

In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported.  Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:49:21 -06:00
Brian Foster
d26978dd86 xfs: pull up stack_switch check into xfs_bmapi_write
The stack_switch check currently occurs in __xfs_bmapi_allocate,
which means the stack switch only occurs when xfs_bmapi_allocate()
is called in a loop. Pull the check up before the loop in
xfs_bmapi_write() such that the first iteration of the loop has
consistent behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:48:55 -06:00
Eric Sandeen
1bee12b8c4 xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
9802182 changed the return value from EWRONGFS (aka EINVAL)
to EFSCORRUPTED which doesn't seem to be handled properly by
the root filesystem probe.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-28 12:48:21 -06:00
Jan Kara
b6a8e62f8b ext4: simplify mpage_add_bh_to_extent()
The argument b_size of mpage_add_bh_to_extent() was bogus since it was
always == blocksize (which we can easily derive from inode->i_blkbits).
Also second branch of condition:
	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
	} else if ((nrblocks + (b_size >> mpd->inode->i_blkbits)) >
						EXT4_MAX_TRANS_DATA) {
	}
was never taken because (b_size >> mpd->inode->i_blkbits) == 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 13:06:48 -05:00
Jan Kara
f8bec37037 ext4: dirty page has always buffers attached
ext4_writepage(), write_cache_pages_da(), and mpage_da_submit_io()
doesn't have to deal with the case when page doesn't have buffers. We
attach buffers to a page in ->write_begin() and ->page_mkwrite() which
covers all places where a page can become dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 12:55:08 -05:00
Jan Kara
002bd7fa3a ext4: simplify list handling in ext4_do_flush_completed_IO()
The function splices i_completed_io_list to its private list
first.  From that moment on we don't need any lock for working with
io_end structures because all io_end structure on the list are only
our own. So we can remove the other two lists in the function and free
io_end immediately after we are done with it.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:49:15 -05:00
Jan Kara
84c17543ab ext4: move work from io_end to inode
It does not make much sense to have struct work in ext4_io_end_t
because we always use it for only one ext4_io_end_t per inode (the
first one in the i_completed_io list). So just move the structure to
inode itself.  This also allows for a small simplification in
processing io_end structures.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:43:46 -05:00
Jan Kara
fe089c77f1 ext4: remove __ext4_journalled_writepage() from mpage_da_submit_io()
We don't support delayed allocation in data=journal mode. So checking for it in
mpage_da_submit_io() doesn't make really sence. If we ever decide to extend
delayed allocation support to data=journal mode, adding
__ext4_journalled_writepage() call will be the least of problems we have to
solve. Most likely we'd have to implement separate writepages call anyways
because we don't have transaction credits for writing more than a single page
so mapping of page buffers would have to be done differently.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:38:49 -05:00
Jan Kara
1ae48a6354 ext4: use redirty_page_for_writepage() in ext4_bio_write_page()
When we cannot write a page we should use redirty_page_for_writepage()
instead of plain set_page_dirty(). That tells writeback code we have
problems, redirties only the page (redirtying buffers is not needed),
and updates mm accounting of failed page writes.

Also move clearing of buffer dirty flag after io_submit_add_bh(). At that
moment we are sure buffer will be going to disk.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:32:54 -05:00
Jan Kara
36ade451a5 ext4: Always use ext4_bio_write_page() for writeout
Currently we sometimes used block_write_full_page() and sometimes
ext4_bio_write_page() for writeback (depending on mount options and call
path). Let's always use ext4_bio_write_page() to simplify things a bit.

Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:30:52 -05:00
Zheng Liu
8bad6fc813 ext4: add punching hole support for non-extent-mapped files
This patch add supports for indirect file support punching hole.  It
is almost the same as ext4_ext_punch_hole.  First, we invalidate all
pages between this hole, and then we try to deallocate all blocks of
this hole.

A recursive function is used to handle deallocation of blocks.  In
this function, it iterates over the entries in inode's i_blocks or
indirect blocks, and try to free the block for each one of them.

After applying this patch, xfstest #255 will not pass w/o extent because
indirect-based file doesn't support unwritten extents.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-28 09:21:37 -05:00
David Teigland
d4e0bfec9b GFS2: fix skip unlock condition
The recent commit fb6791d100
included the wrong logic.  The lvbptr check was incorrectly
added after the patch was tested.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-28 09:49:15 +00:00
Trond Myklebust
65436ec0c8 NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery
We do need to start the lease recovery thread prior to waiting for the
client initialisation to complete in NFSv4.1.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: stable@vger.kernel.org [>=3.7]
2013-01-27 15:51:41 -05:00
Trond Myklebust
202c312dba NFSv4: Fix NFSv4 trunking discovery
If walking the list in nfs4[01]_walk_client_list fails, then the most
likely explanation is that the server dropped the clientid before we
actually managed to confirm it. As long as our nfs_client is the very
last one in the list to be tested, the caller can be assured that this
is the case when the final return value is NFS4ERR_STALE_CLIENTID.

Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org [>=3.7]
Tested-by: Ben Greear <greearb@candelatech.com>
2013-01-27 15:51:28 -05:00
Trond Myklebust
4ae19c2dd7 NFSv4: Fix NFSv4 reference counting for trunked sessions
The reference counting in nfs4_init_client assumes wongly that it
is safe for nfs4_discover_server_trunking() to return a pointer to a
nfs_client prior to bumping the reference count.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: stable@vger.kernel.org [>=3.7]
2013-01-27 15:51:15 -05:00
Trond Myklebust
dee972b967 NFS: Fix error reporting in nfs_xdev_mount
Currently, nfs_xdev_mount converts all errors from clone_server() to
ENOMEM, which can then leak to userspace (for instance to 'mount'). Fix that.
Also ensure that if nfs_fs_mount_common() returns an error, we
don't dprintk(0)...

The regression originated in commit 3d176e3fe4
(NFS: Use nfs_fs_mount_common() for xdev mounts)

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.5]
2013-01-27 15:51:15 -05:00
Frederic Weisbecker
6fac4829ce cputime: Use accessors to read task cputime stats
This is in preparation for the full dynticks feature. While
remotely reading the cputime of a task running in a full
dynticks CPU, we'll need to do some extra-computation. This
way we can account the time it spent tickless in userspace
since its last cputime snapshot.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-01-27 19:23:31 +01:00
Eric W. Biederman
b3c6761d9b userns: Allow the userns root to mount ramfs.
There is no backing store to ramfs and file creation
rules are the same as for any other filesystem so
it is semantically safe to allow unprivileged users
to mount it.

The memory control group successfully limits how much
memory ramfs can consume on any system that cares about
a user namespace root using ramfs to exhaust memory
the memory control group can be deployed.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-01-26 22:22:38 -08:00
Eric W. Biederman
ec2aa8e8dd userns: Allow the userns root to mount of devpts
- The context in which devpts is mounted has no effect on the creation
  of ptys as the /dev/ptmx interface has been used by unprivileged
  users for many years.

- Only support unprivileged mounts in combination with the newinstance
  option to ensure that mounting of /dev/pts in a user namespace will
  not allow the options of an existing mount of devpts to be modified.

- Create /dev/pts/ptmx as the root user in the user namespace that
  mounts devpts so that it's permissions to be changed.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-01-26 22:22:21 -08:00
Jan Kara
ced55f38d6 xfs: Fix possible use-after-free with AIO
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.

CC: xfs@oss.sgi.com
CC: Ben Myers <bpm@sgi.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-26 09:43:58 -06:00
Dave Chinner
3b19034d4f xfs: fix shutdown hang on invalid inode during create
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:

[   22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[   23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck

The trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:

xfs_buf_init:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get:         bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read:        bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest:   bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror:     bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone:      bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf:  bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse:    bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse:  bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock:     bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find:        bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get:         bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read:        bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold:        bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf:  bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf:   bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock:      bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele:        bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock

And that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.

The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.

The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-26 09:34:38 -06:00
Greg Kroah-Hartman
422d26b6ec Merge 3.8-rc5 into driver-core-next
This resolves a gpio driver merge issue pointed out in linux-next.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-25 21:06:30 -08:00
Greg Kroah-Hartman
9f9cba810f Merge 3.8-rc5 into tty-next
This resolves a number of tty driver merge issues found in linux-next

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-25 13:27:36 -08:00
Rafael J. Wysocki
0bb8f3d6ae sysfs: Functions for adding/removing symlinks to/from attribute groups
The most convenient way to expose ACPI power resources lists of a
device is to put symbolic links to sysfs directories representing
those resources into special attribute groups in the device's sysfs
directory.  For this purpose, it is necessary to be able to add
symbolic links to attribute groups.

For this reason, add sysfs helper functions for adding/removing
symbolic links to/from attribute groups, sysfs_add_link_to_group()
and sysfs_remove_link_from_group(), respectively.

This change set includes a build fix from David Rientjes.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-25 21:51:13 +01:00
Linus Torvalds
d7df025eb4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "It turns out that we had two crc bugs when running fsx-linux in a
  loop.  Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
  all down.  Miao also has a new OOM fix in this v2 pull as well.

  Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
  and resuming a running balance across drives.

  Josef's orphan truncate patch fixes an obscure corruption we'd see
  during xfstests.

  Arne's patches address problems with subvolume quotas.  If the user
  destroys quota groups incorrectly the FS will refuse to mount.

  The rest are smaller fixes and plugs for memory leaks."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
  Btrfs: fix repeated delalloc work allocation
  Btrfs: fix wrong max device number for single profile
  Btrfs: fix missed transaction->aborted check
  Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
  Btrfs: put csums on the right ordered extent
  Btrfs: use right range to find checksum for compressed extents
  Btrfs: fix panic when recovering tree log
  Btrfs: do not allow logged extents to be merged or removed
  Btrfs: fix a regression in balance usage filter
  Btrfs: prevent qgroup destroy when there are still relations
  Btrfs: ignore orphan qgroup relations
  Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
  Btrfs: fix unlock order in btrfs_ioctl_rm_dev
  Btrfs: fix unlock order in btrfs_ioctl_resize
  Btrfs: fix "mutually exclusive op is running" error code
  Btrfs: bring back balance pause/resume logic
  btrfs: update timestamps on truncate()
  btrfs: fix btrfs_cont_expand() freeing IS_ERR em
  Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
  Btrfs: fix off-by-one in lseek
  ...
2013-01-25 10:55:21 -08:00
Chen Gang
03dafb5f59 ext4: fix memory leak when quota options are specified multiple times
When usrjquota or grpjquota mount options are specified several times,
we leak memory storing the names. Free the memory correctly.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-01-24 23:24:58 -05:00
Theodore Ts'o
72ba74508b ext4: release sysfs kobject when failing to enable quotas on mount
In addition, print the error returned from ext4_enable_quotas()

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Cc: stable@vger.kernel.org
2013-01-24 23:24:54 -05:00
Linus Torvalds
66e2d3e8c2 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Two small cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  fs/cifs/cifs_dfs_ref.c: fix potential memory leakage
  cifs: fix srcip_matches() for ipv6
2013-01-24 19:15:43 -08:00
Miao Xie
1eafa6c737 Btrfs: fix repeated delalloc work allocation
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.

Fix this problem by splice this delalloc list.

Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:27 -05:00
Miao Xie
c9f01bfe0c Btrfs: fix wrong max device number for single profile
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:26 -05:00
Miao Xie
2cba30f172 Btrfs: fix missed transaction->aborted check
First, though the current transaction->aborted check can stop the commit early
and avoid unnecessary operations, it is too early, and some transaction handles
don't end, those handles may set transaction->aborted after the check.

Second, when we commit the transaction, we will wake up some worker threads to
flush the space cache and inode cache. Those threads also allocate some transaction
handles and may set transaction->aborted if some serious error happens.

So we need more check for ->aborted when committing the transaction. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:25 -05:00
Miao Xie
8d25a086eb Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
We may access and update transaction->aborted on the different CPUs without
lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating
unsolicited accesses and make sure we can get the right value.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:23 -05:00
Josef Bacik
e58dd74bcc Btrfs: put csums on the right ordered extent
I noticed a WARN_ON going off when adding csums because we were going over
the amount of csum bytes that should have been allowed for an ordered
extent.  This is a leftover from when we used to hold the csums privately
for direct io, but now we use the normal ordered sum stuff so we need to
make sure and check if we've moved on to another extent so that the csums
are added to the right extent.  Without this we could end up with csums for
bytenrs that don't have extents to cover them yet.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:22 -05:00
Liu Bo
192000dda2 Btrfs: use right range to find checksum for compressed extents
For compressed extents, the range of checksum is covered by disk length,
and the disk length is different with ram length, so we need to use disk
length instead to get us the right checksum.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:17 -05:00
Josef Bacik
b0175117b9 Btrfs: fix panic when recovering tree log
A user reported a BUG_ON(ret) that occured during tree log replay.  Ret was
-EAGAIN, so what I think happened is that we removed an extent that covered
a bitmap entry and an extent entry.  We remove the part from the bitmap and
return -EAGAIN and then search for the next piece we want to remove, which
happens to be an entire extent entry, so we just free the sucker and return.
The problem is ret is still set to -EAGAIN so we trip the BUG_ON().  The
user used btrfs-zero-log so I'm not 100% sure this is what happened so I've
added a WARN_ON() to catch the other possibility.  Thanks,

Reported-by: Jan Steffens <jan.steffens@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:49:49 -05:00
Josef Bacik
201a903894 Btrfs: do not allow logged extents to be merged or removed
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:49:48 -05:00
Dave Chinner
4d559a3bcb xfs: limit speculative prealloc near ENOSPC thresholds
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.

Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-24 11:08:55 -06:00
Dave Chinner
10616b806d xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.

In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported.  Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-24 11:06:41 -06:00
Miklos Szeredi
fb05f41f5f fuse: cleanup fuse_direct_io()
Fix the following sparse warnings:

fs/fuse/file.c:1216:43: warning: cast removes address space of expression
fs/fuse/file.c:1216:43: warning: incorrect type in initializer (different address spaces)
fs/fuse/file.c:1216:43:    expected void [noderef] <asn:1>*iov_base
fs/fuse/file.c:1216:43:    got void *<noident>
fs/fuse/file.c:1241:43: warning: cast removes address space of expression
fs/fuse/file.c:1241:43: warning: incorrect type in initializer (different address spaces)
fs/fuse/file.c:1241:43:    expected void [noderef] <asn:1>*iov_base
fs/fuse/file.c:1241:43:    got void *<noident>
fs/fuse/file.c:1267:43: warning: cast removes address space of expression
fs/fuse/file.c:1267:43: warning: incorrect type in initializer (different address spaces)
fs/fuse/file.c:1267:43:    expected void [noderef] <asn:1>*iov_base
fs/fuse/file.c:1267:43:    got void *<noident>

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:28 +01:00
Maxim Patlasov
5565a9d884 fuse: optimize __fuse_direct_io()
__fuse_direct_io() allocates fuse-requests by calling fuse_get_req(fc, n). The
patch calculates 'n' based on iov[] array. This is useful because allocating
FUSE_MAX_PAGES_PER_REQ page pointers and descriptors for each fuse request
would be waste of memory in case of iov-s of smaller size.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:28 +01:00
Maxim Patlasov
7c190c8b9c fuse: optimize fuse_get_user_pages()
Let fuse_get_user_pages() pack as many iov-s to a single fuse_req as
possible. This is very beneficial in case of iov[] consisting of many
iov-s of relatively small sizes (e.g. PAGE_SIZE).

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov
b98d023a24 fuse: pass iov[] to fuse_get_user_pages()
The patch makes preliminary work for the next patch optimizing scatter-gather
direct IO. The idea is to allow fuse_get_user_pages() to pack as many iov-s
to each fuse request as possible. So, here we only rework all related
call-paths to carry iov[] from fuse_direct_IO() to fuse_get_user_pages().

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov
85f40aec88 fuse: use req->page_descs[] for argpages cases
Previously, anyone who set flag 'argpages' only filled req->pages[] and set
per-request page_offset. This patch re-works all cases where argpages=1 to
fill req->page_descs[] properly.

Having req->page_descs[] filled properly allows to re-work fuse_copy_pages()
to copy page fragments described by req->page_descs[]. This will be useful
for next patches optimizing direct_IO.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov
b2430d7567 fuse: add per-page descriptor <offset, length> to fuse_req
The ability to save page pointers along with lengths and offsets in fuse_req
will be useful to cover several iovec-s with a single fuse_req.

Per-request page_offset is removed because anybody who need it can use
req->page_descs[0].offset instead.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov
54b966702d fuse: rework fuse_do_ioctl()
fuse_do_ioctl() already calculates the number of pages it's going to use. It is
stored in 'num_pages' variable. So the patch simply uses it for allocating
fuse_req.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
Maxim Patlasov
d07f09f509 fuse: rework fuse_perform_write()
The patch allocates as many page pointers in fuse_req as needed to cover
interval [pos .. pos+len-1]. Inline helper fuse_wr_pages() is introduced
to hide this cumbersome arithmetic.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
Maxim Patlasov
f8dbdf8182 fuse: rework fuse_readpages()
The patch uses 'nr_pages' argument of fuse_readpages() as heuristics for the
number of page pointers to allocate.

This can be improved further by taking in consideration fc->max_read and gaps
between page indices, but it's not clear whether it's worthy or not.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
Maxim Patlasov
4d53dc99ba fuse: rework fuse_retrieve()
The patch reworks fuse_retrieve() to allocate only so many page pointers
as needed. The core part of the patch is the following calculation:

	num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;

(thanks Miklos for formula). All other changes are mostly shuffling lines.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
Maxim Patlasov
b111c8c0e3 fuse: categorize fuse_get_req()
The patch categorizes all fuse_get_req() invocations into two categories:
 - fuse_get_req_nopages(fc) - when caller doesn't care about req->pages
 - fuse_get_req(fc, n) - when caller need n page pointers (n > 0)

Adding fuse_get_req_nopages() helps to avoid numerous fuse_get_req(fc, 0)
scattered over code. Now it's clear from the first glance when a caller need
fuse_req with page pointers.

The patch doesn't make any logic changes. In multi-page case, it silly
allocates array of FUSE_MAX_PAGES_PER_REQ page pointers. This will be amended
by future patches.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
Maxim Patlasov
4250c0668e fuse: general infrastructure for pages[] of variable size
The patch removes inline array of FUSE_MAX_PAGES_PER_REQ page pointers from
fuse_req. Instead of that, req->pages may now point either to small inline
array or to an array allocated dynamically.

This essentially means that all callers of fuse_request_alloc[_nofs] should
pass the number of pages needed explicitly.

The patch doesn't make any logic changes.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
Anand V. Avati
0b05b18381 fuse: implement NFS-like readdirplus support
This patch implements readdirplus support in FUSE, similar to NFS.
The payload returned in the readdirplus call contains
'fuse_entry_out' structure thereby providing all the necessary inputs
for 'faking' a lookup() operation on the spot.

If the dentry and inode already existed (for e.g. in a re-run of ls -l)
then just the inode attributes timeout and dentry timeout are refreshed.

With a simple client->network->server implementation of a FUSE based
filesystem, the following performance observations were made:

Test: Performing a filesystem crawl over 20,000 files with

sh# time ls -lR /mnt

Without readdirplus:
Run 1: 18.1s
Run 2: 16.0s
Run 3: 16.2s

With readdirplus:
Run 1: 4.1s
Run 2: 3.8s
Run 3: 3.8s

The performance improvement is significant as it avoided 20,000 upcalls
calls (lookup). Cache consistency is no worse than what already is.

Signed-off-by: Anand V. Avati <avati@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
Cong Ding
10b8c7dff5 fs/cifs/cifs_dfs_ref.c: fix potential memory leakage
When it goes to error through line 144, the memory allocated to *devname is
not freed, and the caller doesn't free it either in line 250. So we free the
memroy of *devname in function cifs_compose_mount_options() when it goes to
error.

Signed-off-by: Cong Ding <dinggnu@gmail.com>
CC: stable <stable@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-01-22 23:58:16 -06:00
Linus Torvalds
262060ea46 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "This contain a bugfix for CUSE and miscellaneous small fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: remove unused variable in fuse_try_move_page()
  fuse: make fuse_file_fallocate() static
  fuse: Move CUSE Kconfig entry from fs/Kconfig into fs/fuse/Kconfig
  cuse: fix uninitialized variable warnings
  cuse: do not register multiple devices with identical names
  cuse: use mutex as registration lock instead of spinlocks
2013-01-22 11:53:19 -08:00
Linus Torvalds
05c2cf35c3 f2fs fixes for v3.8-rc5
o Support swap file and link generic_file_remap_pages
 o Enhance the bio streaming flow and free section control
 o Major bug fix on recovery routine
 o Minor bug/warning fixes and code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ/fpIAAoJEEAUqH6CSFDS6BAP/jz/JIPoSu4NunWgVH68tzA4
 xx4lsmJQJQG+1241uA9v4MMkcTzxE0QVNTOX3BpUXBhCMc00aPOPWlWjULridLI9
 0vRE6LpG+NntkyKF+M5T1ydGuQoDlEvqoGs6c5p3yaI6PxzbzmBmipsmqXUA8fyu
 280+OWJoAcALEMJiQ8JsHcDmvM9wdQ+BV/j3BNCm4dqBUA4dYPfDzRKUJYfwqiig
 qmVRseJJaekxrQ2lHG/K/WPAXa8aRcV6khP9tv/BPGRMt+I/fli/J4sWEFT6c73B
 +qYmhrf/RLNJ1O13dePlo3URwMu083PL8QN355GAKUJbMaX/UPjEnq0DLbBOkCS2
 KBQI5O1eiFauEE6YU7p7GuvnLeVkukcXSQNVnRrnWzTUA9CMThZZ2mAgb2lz+iEP
 oZWNirRwnxdcTQRXjPyTCtpPTCgJi4GO1WS5s6HeLP8G8Muo4PzDzcEwMX0Mw5ih
 s4n4wpQ+Zp3h53cc/DxdFAK15uM3XYtUfb92kJwqaEG5VmBy6KvliXDRnNXg7WMI
 imCb08c0Fr0M8ZHOd+UCveICcndFj25jkjx8w7PoE1KBbGJkKf+cjpZ3OhOAliux
 sGtH3EZLV6jx1MjV79OpDXQGoVsvktesCbRcaxc7BbqljXYVNiap8QbzjPnmAt6z
 KKN0GU32eolGgK4Zd/iF
 =vJQc
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-3.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fixes from Jaegeuk Kim:
 o Support swap file and link generic_file_remap_pages
 o Enhance the bio streaming flow and free section control
 o Major bug fix on recovery routine
 o Minor bug/warning fixes and code cleanups

* tag 'f2fs-for-3.8-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (22 commits)
  f2fs: use _safe() version of list_for_each
  f2fs: add comments of start_bidx_of_node
  f2fs: avoid issuing small bios due to several dirty node pages
  f2fs: support swapfile
  f2fs: add remap_pages as generic_file_remap_pages
  f2fs: add __init to functions in init_f2fs_fs
  f2fs: fix the debugfs entry creation path
  f2fs: add global mutex_lock to protect f2fs_stat_list
  f2fs: remove the blk_plug usage in f2fs_write_data_pages
  f2fs: avoid redundant time update for parent directory in f2fs_delete_entry
  f2fs: remove redundant call to set_blocksize in f2fs_fill_super
  f2fs: move f2fs_balance_fs to punch_hole
  f2fs: add f2fs_balance_fs in several interfaces
  f2fs: revisit the f2fs_gc flow
  f2fs: check return value during recovery
  f2fs: avoid null dereference in f2fs_acl_from_disk
  f2fs: initialize newly allocated dnode structure
  f2fs: update f2fs partition info about SIT/NAT layout
  f2fs: update f2fs document to reflect SIT/NAT layout correctly
  f2fs: remove unneeded INIT_LIST_HEAD at few places
  ...
2013-01-22 10:33:17 -08:00
Namjae Jeon
99600051b0 udf: add extent cache support in case of file reading
This patch implements extent caching in case of file reading.
While reading a file, currently, UDF reads metadata serially
which takes a lot of time depending on the number of extents present
in the file. Caching last accessd extent improves metadata read time.
Instead of reading file metadata from start, now we read from
the cached extent.

This patch considerably improves the time spent by CPU in kernel mode.
For example, while reading a 10.9 GB file using dd:
Time before applying patch:
11677022208 bytes (10.9GB) copied, 1529.748921 seconds, 7.3MB/s
real    25m 29.85s
user    0m 12.41s
sys     15m 34.75s

Time after applying patch:
11677022208 bytes (10.9GB) copied, 1469.338231 seconds, 7.6MB/s
real    24m 29.44s
user    0m 15.73s
sys     3m 27.61s

[JK: Fix bh refcounting issues, simplify initialization]

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Bonggil Bak <bgbak@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-22 10:48:31 +01:00
Dan Carpenter
d8b79b2f94 f2fs: use _safe() version of list_for_each
This is calling list_del() inside a loop which is a problem when we try
move to the next item on the list.  I've converted it to use the _safe
version.  And also, as a cleanup, I've converted it to use
list_for_each_entry instead of list_for_each.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:49:00 +09:00
Jaegeuk Kim
9af45ef5ab f2fs: add comments of start_bidx_of_node
The caller of start_bidx_of_node() should give proper node offsets which
point only direct node blocks. Otherwise, it is a caller's bug.
This patch adds comments to make it clear.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:48:59 +09:00
Jaegeuk Kim
a7fdffbd3e f2fs: avoid issuing small bios due to several dirty node pages
If some small bios of dirty node pages are supposed to be issued during the
sequential data writes, there-in well-produced consecutive data bios are able
to be split by the small node bios, resulting in performance degradation.
So, let's collect a number of dirty node pages until reaching a threshold.
And, by default, I set the threshold as 2MB, a segment size.

This improves sequential write performance on i5, 512GB SSD (830 w/ SATA2) as
follows.
Before: 231 MB/s -> After: 255 MB/s

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Reviewed-by: Namjae Jeon <namjae.jeon@samsung.com>
2013-01-22 10:48:59 +09:00
Jaegeuk Kim
c01e54b770 f2fs: support swapfile
This patch adds f2fs_bmap operation to the data address space.
This enables f2fs to support swapfile.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:48:58 +09:00
Jaegeuk Kim
692bb55d1a f2fs: add remap_pages as generic_file_remap_pages
This was added for all the file systems before.

See the following commit.

commit id: 0b173bc4da

[PATCH] mm: kill vma flag VM_CAN_NONLINEAR

This patch moves actual ptes filling for non-linear file mappings
into special vma operation: ->remap_pages().

File system must implement this method to get non-linear mappings support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.

Now device drivers can implement this method and obtain nonlinear vma support."

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:48:58 +09:00
Namjae Jeon
6e6093a8f1 f2fs: add __init to functions in init_f2fs_fs
Add __init to functions in init_f2fs_fs for code consistency.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-22 10:48:38 +09:00
Ilya Dryomov
a105bb88f4 Btrfs: fix a regression in balance usage filter
Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which
was merged into 3.8-rc1, has introduced a regression by removing logic
that was guarding us against bad user input.  Bring it back.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:40:27 -05:00
Chris Mason
83bfccb5c0 Merge branch 'mutex-ops@next-for-chris' of git://github.com/idryomov/btrfs-unstable into linus 2013-01-21 20:39:06 -05:00
Chris Mason
daf2c08911 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into linus 2013-01-21 20:26:55 -05:00
Arne Jansen
2cf6870396 Btrfs: prevent qgroup destroy when there are still relations
Currently you can just destroy a qgroup even though it is in use by other qgroups
or has qgroups assigned to it. This patch prevents destruction of qgroups unless
they are completely unused. Otherwise destroy will return EBUSY.

Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:18:11 -05:00
Arne Jansen
ff24858c65 Btrfs: ignore orphan qgroup relations
If a qgroup that has still assignments is deleted by the user, the corresponding
relations are left in the tree. This leads to an unmountable filesystem.
With this patch, those relations are simple ignored.

Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:18:11 -05:00
Kees Cook
b7e17a10ed fs/ufs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:06 -08:00
Kees Cook
f987c90257 fs/nfsd: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook
03edef0fea fs/logfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Joern Engel <joern@logfs.org>
CC: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook
cf98c5e568 fs/jffs2: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: David Woodhouse <dwmw2@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook
adae07485e fs/hfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook
462f16a557 fs/efs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Neil Brown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook
00f3616b25 fs/cifs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Steve French <sfrench@samba.org>
CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook
38db331b57 fs/btrfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:05 -08:00
Kees Cook
d0e09c80f0 fs/bfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: "Tigran A. Aivazian" <tigran@aivazian.fsnet.co.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook
5c8e0226e7 fs/befs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook
8dbd5d6df3 fs/afs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook
6d7a19fa74 fs/affs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook
acd4fd07d4 fs/adfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

Acked-by: Stuart Swales <stuart.swales.croftnuisk@gmail.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Kees Cook
c904197533 fs/9p: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Ron Minnich <rminnich@sandia.gov>
CC: Latchesar Ionkov <lucho@ionkov.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:39:04 -08:00
Jan Kara
9734c971aa udf: Write LVID to disk after opening / closing
So far we just marked the buffer as dirty and left writing on flusher thread
but especially on opening that opens possible race window where we could write
other modified fs structures to disk before we mark filesystem as open. So sync
LVID buffer to disk after opening and closing fs.

Reported-by: Steve Nickel <snickel58@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:58 +01:00
Wang Shilong
c04e88e271 Ext3: return ENOMEM rather than EIO if sb_getblk fails
It will be better to use ENOMEM rather than EIO, because the only
reason that sb_getblk fails is that allocation fails.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:57 +01:00
Wang Shilong
ab6a773dbc Ext2: return ENOMEM rather than EIO if sb_getblk fails
As the only reason that sb_getblks fails is that allocation fails.
It will be better to use ENOMEM rather than EIO.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:57 +01:00
Wang Shilong
1b7d76e9b1 Ext3: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return
NULL value,it will be better to use unlikely to check it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:57 +01:00
Wang Shilong
2b0542a4a0 Ext2: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return
NULL value. It will be better to use unlikely to optimize it.

Signed-off-by: Wang shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:56 +01:00
Wang Shilong
61f43e6880 Ext3: add necessary check in case IO error happens
As we know io error may happen when the function 'sb_getblk'
is called.Add necessary check for it

The patch also fix a coding style problem.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:56 +01:00
Wang Shilong
8d8759eb48 Ext2: free memory allocated and forget buffer head when io error happens
Add a necessary check when an io error happens.
If io error happens,free the memory allocated and forget
buffer head.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:56 +01:00
Jan Kara
f56426ae4d ext3: Fix memory leak when quota options are specified multiple times
When usrjquota or grpjquota mount options are specified several times,
we leak memory storing the names. Free the memory correctly.

Reported-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:55 +01:00
Guo Chao
306a74920b ext3, ext4, ocfs2: remove unused macro NAMEI_RA_INDEX
This macro, initially introduced by ext2 in v0.99.15, does not
have any users from the beginning. It has been removed in later
ext2 version but still remains in the code of ext3, ext4, ocfs2.
Remove this macro there.

Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Mark Fasheh <mfasheh@suse.de>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-21 11:19:55 +01:00
Nickolai Zeldovich
e3e2775ced cifs: fix srcip_matches() for ipv6
srcip_matches() previously had code like this:

  srcip_matches(..., struct sockaddr *rhs) {
    /* ... */
    struct sockaddr_in6 *vaddr6 = (struct sockaddr_in6 *) &rhs;
    return ipv6_addr_equal(..., &vaddr6->sin6_addr);
  }

which interpreted the values on the stack after the 'rhs' pointer as an
ipv6 address.  The correct thing to do is to use 'rhs', not '&rhs'.

Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2013-01-21 01:37:26 -06:00
Ilya Dryomov
25122d15e2 Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Operation-specific check (whether subvol is readonly or not) should go
after the mutual exclusiveness check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:22 +02:00
Ilya Dryomov
4ac20c70b0 Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Fix unlock order in btrfs_ioctl_rm_dev().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:21 +02:00
Ilya Dryomov
18f39c416d Btrfs: fix unlock order in btrfs_ioctl_resize
Fix unlock order in btrfs_ioctl_resize().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:20 +02:00
Ilya Dryomov
2c0c9da02a Btrfs: fix "mutually exclusive op is running" error code
The error code that is returned in response to starting a mutually
exclusive operation when there is one already running got silently
changed from EINVAL to EINPROGRESS by 5ac00add.  Returning EINPROGRESS
to, say, add_dev, when rm_dev is running is misleading.  Furthermore,
the operation itself may want to use EINPROGRESS for other purposes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:18 +02:00
Ilya Dryomov
ed0fb78fb6 Btrfs: bring back balance pause/resume logic
Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1
as part of dev-replace merge).  Offending commit took a stab at making
mutually exclusive volume operations (add_dev, rm_dev, resize, balance,
replace_dev) not block behind volume_mutex if another such operation is
in progress and instead return an error right away.  Balancing front-end
relied on the blocking behaviour, so the fix is ugly, but short of a
complete rework, it's the best we can do.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:17 +02:00
Joe Millenbach
4f73bc4dd3 tty: Added a CONFIG_TTY option to allow removal of TTY
The option allows you to remove TTY and compile without errors. This
saves space on systems that won't support TTY interfaces anyway.
bloat-o-meter output is below.

The bulk of this patch consists of Kconfig changes adding "depends on
TTY" to various serial devices and similar drivers that require the TTY
layer.  Ideally, these dependencies would occur on a common intermediate
symbol such as SERIO, but most drivers "select SERIO" rather than
"depends on SERIO", and "select" does not respect dependencies.

bloat-o-meter output comparing our previous minimal to new minimal by
removing TTY.  The list is filtered to not show removed entries with awk
'$3 != "-"' as the list was very long.

add/remove: 0/226 grow/shrink: 2/14 up/down: 6/-35356 (-35350)
function                                     old     new   delta
chr_dev_init                                 166     170      +4
allow_signal                                  80      82      +2
static.__warned                              143     142      -1
disallow_signal                               63      62      -1
__set_special_pids                            95      94      -1
unregister_console                           126     121      -5
start_kernel                                 546     541      -5
register_console                             593     588      -5
copy_from_user                                45      40      -5
sys_setsid                                   128     120      -8
sys_vhangup                                   32      19     -13
do_exit                                     1543    1526     -17
bitmap_zero                                   60      40     -20
arch_local_irq_save                          137     117     -20
release_task                                 674     652     -22
static.spin_unlock_irqrestore                308     260     -48

Signed-off-by: Joe Millenbach <jmillenbach@gmail.com>
Reviewed-by: Jamey Sharp <jamey@minilop.net>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-18 16:15:27 -08:00
Ben Myers
003fd6c8be xfs: fix fs/xfs/xfs_log.c:1740:39: error: 'B_TRUE' undeclared
Commit 667a9291c5 "xfs: Remove boolean_t typedef completely." didn't.

Remove a stray B_TRUE that breaks CONFIG_XFS_DEBUG=y.

Signed-off-by: Ben Myers <bpm@sgi.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
2013-01-18 15:11:57 -06:00
Greg Kroah-Hartman
ed408f7c0f Merge 3.9-rc4 into driver-core-next
This is to fix up a build problem with a wireless driver due to the
dynamic-debug patches in this branch.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 19:48:18 -08:00
Brian Foster
9e96fe6df4 xfs: pull up stack_switch check into xfs_bmapi_write
The stack_switch check currently occurs in __xfs_bmapi_allocate,
which means the stack switch only occurs when xfs_bmapi_allocate()
is called in a loop. Pull the check up before the loop in
xfs_bmapi_write() such that the first iteration of the loop has
consistent behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-17 17:53:37 -06:00
Thiago Farina
667a9291c5 xfs: Remove boolean_t typedef completely.
Since we are using C99 we have one builtin defined in include/linux/types.h,
use that instead.

v2: you missed one in fs/xfs/xfs_qm_bhv.c, cleaned up. -bpm

Signed-off-by: Thiago Farina <tfarina@chromium.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-17 17:32:57 -06:00
Greg Kroah-Hartman
595e0eb067 Revert "sysfs: Convert print_symbol to %pSR"
This reverts commit 6ad58fa82d as %pSR
isn't in the tree yet.

Cc: Joe Perches <joe@perches.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 13:09:57 -08:00
Sasha Levin
1884bd4b14 debugfs: remove redundant initialization of dentry
We already initialize it to NULL when declaring it, no need to do
that twice.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 13:02:08 -08:00
Joe Perches
6ad58fa82d sysfs: Convert print_symbol to %pSR
Use the new vsprintf extension to avoid any possible
message interleaving.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 12:57:07 -08:00
Bin Wang
6b8fbde418 sysfs: Fixed a trailing white space error
This patch removes the trailing white space in fs/sysfs/mount.c.

Signed-off-by: Bin Wang <wbin00@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17 12:54:47 -08:00
Wei Yongjun
8f706111a8 fuse: remove unused variable in fuse_try_move_page()
The variables mapping,index are initialized but never used
otherwise, so remove the unused variables.

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:09:59 +01:00
Miklos Szeredi
cdadb11cef fuse: make fuse_file_fallocate() static
Fix the following sparse warning:

fs/fuse/file.c:2249:6: warning: symbol 'fuse_file_fallocate' was not declared. Should it be static?

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:09:47 +01:00
Robert P. J. Day
807185eb3e fuse: Move CUSE Kconfig entry from fs/Kconfig into fs/fuse/Kconfig
Given that CUSE depends on FUSE, it only makes sense to move its
Kconfig entry into the FUSE Kconfig file.  Also, add a few grammatical
and semantic touchups.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:08:45 +01:00
Miklos Szeredi
e2560362cc cuse: fix uninitialized variable warnings
Fix the following compiler warnings:

fs/fuse/cuse.c: In function 'cuse_process_init_reply':
fs/fuse/cuse.c:288:24: warning: 'val' may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/cuse.c:272:14: note: 'val' was declared here
fs/fuse/cuse.c:284:10: warning: 'key' may be used uninitialized in this function [-Wmaybe-uninitialized]
fs/fuse/cuse.c:272:8: note: 'key' was declared here

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:05:52 +01:00
David Herrmann
30783587b0 cuse: do not register multiple devices with identical names
Sysfs doesn't allow two devices with the same name, but we register a
sysfs entry for each cuse device without checking for name collisions.
This extends the registration to first check whether the name was already
registered.

To avoid race-conditions between the name-check and linking the device, we
need to protect the whole registration with a mutex.

Signed-off-by: David Herrmann <dh.herrmann@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:04:57 +01:00
David Herrmann
8ce03fd76d cuse: use mutex as registration lock instead of spinlocks
We need to check for name-collisions during cuse-device registration. To
avoid race-conditions, this needs to be protected during the whole device
registration. Therefore, replace the spinlocks by mutexes first so we can
safely extend the locked regions to include more expensive or sleeping
code paths.

Signed-off-by: David Herrmann <dh.herrmann@googlemail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:04:51 +01:00
Philippe De Muyter
4de80273a2 fs/jfs: Fix typo in comment : 'how may' -> 'how many'
Signed-off-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-01-17 10:18:02 +01:00
Zheng Liu
aaddea812c ext4: add tracepoint in punching hole
This patch adds a tracepoint in ext4_punch_hole.

CC: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-16 20:21:26 -05:00
Linus Torvalds
dfdebc2483 xfs: bugfixes for 3.8-rc4
- fix(es) for compound buffers
 - fix for dquot soft timer asserts due to overflow of d_blk_softlimit
 - fix for regression in dir v2 code introduced in commit 20f7e9f3
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ9zKnAAoJENaLyazVq6ZORGcP/RemqCHJEw0a89Y0tLLLAcz/
 Es97kJMESdvi3gX3JTdz3vC8LP21dSCR3k3MvVgucb8RsvGoiLixrmluIRxKb79M
 DEmz9YJ/qxFIpnM9y46VxCYV+/ezxUDEv68wA6T2wJbof26nTLlTj2gAgqjvyWiF
 R1c1OmdCsTfA257UvxfxSVixVnWv7E2io2ZXUGsrBkP4J9OMaMtn00UYOuP1YL8S
 NJ44z9QAzTqVEbAfGeaeV/QVUJzMj/IqWCwF7YKEhfmccO/tPyN0+nMG2DI0Fp5e
 cYGsi4JnaFbqE6Aa/7mu3kv8lYnPe0n3t9d3EwzxOEx+PAvuY8N0EW8Qa4c+805n
 zXFvAroLgP0jYEEuIfEGYIwDPxG0xjor6ztu8e2twcIj6cDHzSpeYaDPnYvWJlwu
 FiupnVu+3FX6mVY1jCealI47nOwzM12R7nXysqF3F6Sf95xGJtG3BoTIKioNqk1g
 dzJGMQvwg/WLvquYb9W/ZNb1T314R23wdYtmI7gWJ74z9IQqWCZBWFYyBhQ8y1Pr
 Vf3LFjzqNqqnYNzoe8Wnn9wKQ57Es7onAo34Y9HZCOkslZsn5nKriNTXNN6Q9Upc
 5RKvj1CbTpKAJYrrhWryI1HtlDKqqtMFdmRQulSu+O9ZJuWZh4XNTu4t3oewt0Ac
 5otZwOdk53V3tGxt3prs
 =gA4q
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.8-rc4' of git://oss.sgi.com/xfs/xfs

Pull xfs bugfixes from Ben Myers:

 - fix(es) for compound buffers

 - fix for dquot soft timer asserts due to overflow of d_blk_softlimit

 - fix for regression in dir v2 code introduced in commit 20f7e9f372
   ("xfs: factor dir2 block read operations")

* tag 'for-linus-v3.8-rc4' of git://oss.sgi.com/xfs/xfs:
  xfs: recalculate leaf entry pointer after compacting a dir2 block
  xfs: remove int casts from debug dquot soft limit timer asserts
  xfs: fix the multi-segment log buffer format
  xfs: fix segment in xfs_buf_item_format_segment
  xfs: rename bli_format to avoid confusion with bli_formats
  xfs: use b_maps[] for discontiguous buffers
2013-01-16 16:19:54 -08:00
Eric Sandeen
aeb4f20a02 xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
9802182 changed the return value from EWRONGFS (aka EINVAL)
to EFSCORRUPTED which doesn't seem to be handled properly by
the root filesystem probe.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 17:33:53 -06:00
Eric Sandeen
37f13561de xfs: recalculate leaf entry pointer after compacting a dir2 block
Dave Jones hit this assert when doing a compile on recent git, with
CONFIG_XFS_DEBUG enabled:

XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828

Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup)
contained "2" and not the proper offset, and I found that this value was
changed after the memmoves under "Use a stale leaf for our new entry."
in xfs_dir2_block_addname(), i.e.

                        memmove(&blp[mid + 1], &blp[mid],
                                (highstale - mid) * sizeof(*blp));

overwrote it.

What has happened is that the previous call to xfs_dir2_block_compact()
has rearranged things; it changes btp->count as well as the
blp array.  So after we make that call, we must recalculate the
proper pointer to the leaf entries by making another call to
xfs_dir2_block_leaf_p().

Dave provided a metadump image which led to a simple reproducer
(create a particular filename in the affected directory) and this
resolves the testcase as well as the bug on his live system.

Thanks also to dchinner for looking at this one with me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:08:55 -06:00
Brian Foster
ab7eac2200 xfs: remove int casts from debug dquot soft limit timer asserts
The int casts here make it easy to trigger an assert with a large
soft limit. For example, set a >4TB soft limit on an empty volume
to reproduce a (0 > -x) comparison due to an overflow of
d_blk_softlimit.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:08:40 -06:00
Mark Tinguely
91e4bac0b7 xfs: fix the multi-segment log buffer format
Per Dave Chinner suggestion, this patch:
 1) Corrects the detection of whether a multi-segment buffer is
    still tracking data.
 2) Clears all the buffer log formats for a multi-segment buffer.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:08:08 -06:00
Mark Tinguely
2d0e9df579 xfs: fix segment in xfs_buf_item_format_segment
Not every segment in a multi-segment buffer is dirty in a
transaction and they will not be outputted. The assert in
xfs_buf_item_format_segment() that checks for the at least
one chunk of data in the segment to be used is not necessary
true for multi-segmented buffers.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:07:56 -06:00
Mark Tinguely
0f22f9d0cd xfs: rename bli_format to avoid confusion with bli_formats
Rename the bli_format structure to __bli_format to avoid
accidently confusing them with the bli_formats pointer.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:07:37 -06:00
Mark Tinguely
d44d9bc68e xfs: use b_maps[] for discontiguous buffers
Commits starting at 77c1a08 introduced a multiple segment support
to xfs_buf. xfs_trans_buf_item_match() could not find a multi-segment
buffer in the transaction because it was looking at the single segment
block number rather than the multi-segment b_maps[0].bm.bn. This
results on a recursive buffer lock that can never be satisfied.

This patch:
 1) Changed the remaining b_map accesses to be b_maps[0] accesses.
 2) Renames the single segment b_map structure to __b_map to avoid
    future confusion.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-16 16:07:11 -06:00
Linus Torvalds
31db720643 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext3 and udf fixes from Jan Kara:
 "One ext3 performance regression fix and one udf regression fix (oops
  on interrupted mount)."

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  UDF: Fix a null pointer dereference in udf_sb_free_partitions
  jbd: don't wake kjournald unnecessarily
2013-01-16 10:55:10 -08:00
Kees Cook
1e817fb62c time: create __getnstimeofday for WARNless calls
The pstore RAM backend can get called during resume, and must be defensive
against a suspended time source. Expose getnstimeofday logic that returns
an error instead of a WARN. This can be detected and the timestamp can
be zeroed out.

Reported-by: Doug Anderson <dianders@chromium.org>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2013-01-15 18:16:02 -08:00
Akinobu Mita
3d251a5b9e UBIFS: rename random32() to prandom_u32()
Use more preferable function name which implies using a pseudo-random
number generator.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2013-01-15 15:45:27 +02:00
Namjae Jeon
4589d25d01 f2fs: fix the debugfs entry creation path
As the "status" debugfs entry will be maintained for entire F2FS filesystem
irrespective of the number of partitions.
So, we can move the initialization to the init part of the f2fs and destroy will
be done from exit part. After making changes, for individual partition mount -
entry creation code will not be executed.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-15 20:19:15 +09:00
majianpeng
66af62ce75 f2fs: add global mutex_lock to protect f2fs_stat_list
There is an race condition between umounting f2fs and reading f2fs/status, which
results in oops.

Fox example:
Thread A			Thread B
umount f2fs 			cat f2fs/status

f2fs_destroy_stats() {		stat_show() {
				 list_for_each_entry_safe(&f2fs_stat_list)
 list_del(&si->stat_list);
 mutex_lock(&si->stat_lock);
 si->sbi = NULL;
 mutex_unlock(&si->stat_lock);
 kfree(sbi->stat_info);
} 				 mutex_lock(&si->stat_lock) <- si is gone.
				 ...
				}

Solution with a global lock: f2fs_stat_mutex:
Thread A			Thread B
umount f2fs 			cat f2fs/status

f2fs_destroy_stats() {		stat_show() {
 mutex_lock(&f2fs_stat_mutex);
 list_del(&si->stat_list);
 mutex_unlock(&f2fs_stat_mutex);
 kfree(sbi->stat_info);		 mutex_lock(&f2fs_stat_mutex);
}				 list_for_each_entry_safe(&f2fs_stat_list)
				 ...
				 mutex_unlock(&f2fs_stat_mutex);
				}

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
[jaegeuk.kim@samsung.com: fix typos, description, and remove the existing lock]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-15 20:18:29 +09:00
Namjae Jeon
fa9150a84c f2fs: remove the blk_plug usage in f2fs_write_data_pages
Let's consider the usage of blk_plug in f2fs_write_data_pages().
We can come up with the two issues: lock contention and task awareness.

1. Merging bios prior to grabing "queue lock"
 The f2fs merges consecutive IOs in the file system level before
 submitting any bios, which is similar with the back merge by the
 plugging mechanism in attempt_plug_merge(). Both of them need to acquire
 no queue lock.

2. Merging policy with respect to tasks
 The f2fs merges IOs as much as possible regardless of tasks, while
 blk-plugging is conducted on a basis of tasks. As we can understand
 there are trade-offs, f2fs tries to maximize the write performance with
 well-merged bios.

As a result, if f2fs produces many consecutive but separated bios in
writepages(), it would be good to use blk-plugging since f2fs would be
able to avoid queue lock contention in the block layer by merging them.
But, f2fs merges IOs and submit one bio, which means that there are not
much chances to merge bios by attempt_plug_merge().

However, f2fs has already been used blk_plug by triggering generic_writepages()
in f2fs_write_data_pages().
So to make the overall code consistency, I'd like to remove blk_plug there.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-15 20:18:16 +09:00
Namjae Jeon
1b1baff6e5 UDF: Fix a null pointer dereference in udf_sb_free_partitions
This patch fixes a regression caused by commit bff943af6f "udf: Fix memory
leak when mounting" due to which it was triggering a kernel null point
dereference in case of interrupted mount OR when allocating memory to
sbi->s_partmaps failed in function udf_sb_alloc_partition_maps.

Reported-and-tested-by: James Hogan <james@albanarts.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-14 22:53:47 +01:00
Eric Sandeen
7e2fb2d7e6 jbd: don't wake kjournald unnecessarily
Don't send an extra wakeup to kjournald in the case where we
already have the proper target in j_commit_request, i.e. that
commit has already been requested for commit.

commit d9b0193 "jbd: fix fsync() tid wraparound bug" changed
the logic leading to a wakeup, but it caused some extra wakeups
which were found to lead to a measurable performance regression.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-01-14 22:50:45 +01:00
Linus Torvalds
6d283dba37 vfs: add missing virtual cache flush after editing partial pages
Andrew Morton pointed this out a month ago, and then I completely forgot
about it.

If we read a partial last page of a block device, we will zero out the
end of the page, but since that page can then be mapped into user space,
we should also make sure to flush the cache on architectures that have
virtual caches.  We have the flush_dcache_page() function for this, so
use it.

Now, in practice this really never matters, because nobody sane uses
virtual caches to begin with, and they largely exist on old broken RISC
arhitectures.

And even if you did run on one of those obsolete CPU's, the whole "mmap
and access the last partial page of a block device" behavior probably
doesn't actually exist.  The normal IO functions (read/write) will never
see the zeroed-out part of the page that migth not be coherent in the
cache, because they honor the size of the device.

So I'm marking this for stable (3.7 only), but I'm not sure anybody will
ever care.

Pointed-out-by: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org  # 3.7
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-14 13:17:50 -08:00
Eric Sandeen
3972f2603d btrfs: update timestamps on truncate()
truncate() vs. ftruncate() differ in the VFS; truncate()
doesn't set (ATTR_CTIME | ATTR_MTIME), and it's up to the
fs to do the timestamp updates if the size changes.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
2013-01-14 13:53:37 -05:00
Zach Brown
f276795627 btrfs: fix btrfs_cont_expand() freeing IS_ERR em
btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from
btrfs_get_extent() and breaks out of its loop.

An instance of -EEXIST was reported in the wild:

  https://bugzilla.redhat.com/show_bug.cgi?id=874407

I have no idea if that -EEXIST is surprising, or not.  Regardless, this
error handling should be cleaned up to handle other reasonable errors
(ENOMEM, EIO; whatever).

This seemed to be the only buggy freeing of the relatively rare IS_ERR
em so I opted to fix the caller rather than teach free_extent_map() to
use IS_ERR_OR_NULL().

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
2013-01-14 13:53:23 -05:00
Liu Bo
f9e4fb5393 Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
xfstests case 285 complains.

It it because btrfs did not try to find unwritten delalloc
bytes(only dirty pages, not yet writeback) behind prealloc
extents, it ends up finding nothing while we're with SEEK_DATA.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:53:22 -05:00
Liu Bo
1214b53f90 Btrfs: fix off-by-one in lseek
Lock end is inclusive.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:53:22 -05:00
Liu Bo
3268a2468e Btrfs: reset path lock state to zero
We forgot to reset the path lock state to zero after we unlock the path block,
and this can lead to the ASSERT checker in tree unlock API.

Reported-by: Slava Barinov <rayslava@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:53 -05:00
Liu Bo
ac5c93005b Btrfs: let allocation start from the right raid type
This'd avoid us empty looping.

Say we have only one disk and the metadata raid type will be defaultly DUP,
and we do not need to start from index=0(RAID10) and get over two empty
loops to index=2(DUP).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:52 -05:00
Josef Bacik
f3fe820c20 Btrfs: add orphan before truncating pagecache
Running xfstests 83 in a loop would sometimes fail the fsck.  This happens
because if we invalidate a page that already has an ordered extent setup for
it we will complete the ordered extent ourselves, assuming that the truncate
will clean everything up.  The problem with this is there is plenty of time
for the truncate to fail after we've done this work.  So to fix this we need
to add the orphan item first to make sure the cleanup gets done properly,
and then we can truncate the pagecache and all that stuff and be safe.  This
fixes the btrfsck failures I was seeing while running 83 in a loop.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:52 -05:00
Josef Bacik
72bcd99d45 Btrfs: set flushing if we're limited flushing
We still need to say we're flushing if we're limit flushing to keep somebody
from coming in and stealing our reservation.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:51 -05:00
Miao Xie
9754767657 Btrfs: fix missing write access release in btrfs_ioctl_resize()
We forget to give up the write access after we find some device operation
is going on. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:51 -05:00
Miao Xie
dba60f3f5d Btrfs: fix resize a readonly device
We should not resize a readonly device, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-14 13:52:49 -05:00
Miao Xie
5c39da5b6c Btrfs: do not delete a subvolume which is in a R/O subvolume
Step to reproduce:
 # mkfs.btrfs <disk>
 # mount <disk> <mnt>
 # btrfs sub create <mnt>/subv0
 # btrfs sub snap <mnt> <mnt>/subv0/snap0
 # change <mnt>/subv0 from R/W to R/O
 # btrfs sub del <mnt>/subv0/snap0

We deleted the snapshot successfully. I think we should not be able to delete
the snapshot since the parent subvolume is R/O.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-01-14 13:52:32 -05:00
Miao Xie
d86e56cf7d Btrfs: disable qgroup id 0
Qgroup id 0 is a special number, we should set the id of a qgroup to 0.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2013-01-14 13:52:31 -05:00
Lukas Czerner
cc975eb460 btrfs: get the device in write mode when deleting it
When we're deleting the device we should get it in write mode since
we're going to re-write the super block magic on that device. And it
should fail if the device is read-only.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-01-14 13:52:31 -05:00
Tsutomu Itoh
cfa7a9ccda Btrfs: fix memory leak in name_cache_insert()
We should free name_cache_entry before returning from the
error handling code.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2013-01-14 13:52:30 -05:00
Linus Torvalds
3441f0d26d Driver core fixes for 3.8-rc3
Here are two patches for 3.8-rc3.
 
 One removes the __dev* defines from init.h now that all usages of it are gone
 from your tree.  The other fix is for debugfs's paramater that was using the
 wrong base for the option.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlDzjcAACgkQMUfUDdst+ykJVwCcDqiKrO9p0dcH9WXN5aukBWX/
 N8EAoK786v7PjtiVyNOJ/cPUDU8OHUpg
 =U4nL
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg Kroah-Hartman:
 "Here are two patches for 3.8-rc3.

  One removes the __dev* defines from init.h now that all usages of it
  are gone from your tree.  The other fix is for debugfs's paramater
  that was using the wrong base for the option.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

* tag 'driver-core-3.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  debugfs: convert gid= argument from decimal, not octal
  Remove __dev* markings from init.h
2013-01-14 09:07:11 -08:00
Namjae Jeon
163799872b f2fs: avoid redundant time update for parent directory in f2fs_delete_entry
In call to f2fs_delete_entry, 'dir' time modification code is put
at two places.
So, remove the redundant code for timing update.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-14 09:43:27 +09:00
Namjae Jeon
ff9234ad4e f2fs: remove redundant call to set_blocksize in f2fs_fill_super
Since, f2fs supports only 4KB blocksize, which is set at the beginning in
f2fs_fill_super. So, we do not need to again check this blocksize setting
in such case.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-14 09:41:30 +09:00
Abhijit Pawar
a17164e54b fs/xfs remove obsolete simple_strto<foo>
This patch replaces usages of obsolete simple_strtoul with kstrtoint in
xfs_args and suffix_strtoul.

Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-13 14:42:07 -06:00
Eric Sandeen
d4608632ec xfs: recalculate leaf entry pointer after compacting a dir2 block
Dave Jones hit this assert when doing a compile on recent git, with
CONFIG_XFS_DEBUG enabled:

XFS: Assertion failed: (char *)dup - (char *)hdr == be16_to_cpu(*xfs_dir2_data_unused_tag_p(dup)), file: fs/xfs/xfs_dir2_data.c, line: 828

Upon further digging, the tag found by xfs_dir2_data_unused_tag_p(dup)
contained "2" and not the proper offset, and I found that this value was
changed after the memmoves under "Use a stale leaf for our new entry."
in xfs_dir2_block_addname(), i.e.

                        memmove(&blp[mid + 1], &blp[mid],
                                (highstale - mid) * sizeof(*blp));

overwrote it.

What has happened is that the previous call to xfs_dir2_block_compact()
has rearranged things; it changes btp->count as well as the
blp array.  So after we make that call, we must recalculate the
proper pointer to the leaf entries by making another call to
xfs_dir2_block_leaf_p().

Dave provided a metadump image which led to a simple reproducer
(create a particular filename in the affected directory) and this
resolves the testcase as well as the bug on his live system.

Thanks also to dchinner for looking at this one with me.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-13 14:36:17 -06:00
Theodore Ts'o
7f5118629f ext4: trigger the lazy inode table initialization after resize
After we have finished extending the file system, we need to trigger a
the lazy inode table thread to zero out the inode tables.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-13 08:41:45 -05:00
Eryu Guan
15b49132fc ext4: check bh in ext4_read_block_bitmap()
Validate the bh pointer before using it, since
ext4_read_block_bitmap_nowait() might return NULL.

I've seen this in fsfuzz testing.

 EXT4-fs error (device loop0): ext4_read_block_bitmap_nowait:385: comm touch: Cannot get buffer for block bitmap - block_group = 0, block_bitmap = 3925999616
 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffff8121de25>] ext4_wait_block_bitmap+0x25/0xe0
 ...
 Call Trace:
  [<ffffffff8121e1e5>] ext4_read_block_bitmap+0x35/0x60
  [<ffffffff8125e9c6>] ext4_free_blocks+0x236/0xb80
  [<ffffffff811d0d36>] ? __getblk+0x36/0x70
  [<ffffffff811d0a5f>] ? __find_get_block+0x8f/0x210
  [<ffffffff81191ef3>] ? kmem_cache_free+0x33/0x140
  [<ffffffff812678e5>] ext4_xattr_release_block+0x1b5/0x1d0
  [<ffffffff812679be>] ext4_xattr_delete_inode+0xbe/0x100
  [<ffffffff81222a7c>] ext4_free_inode+0x7c/0x4d0
  [<ffffffff812277b8>] ? ext4_mark_inode_dirty+0x88/0x230
  [<ffffffff8122993c>] ext4_evict_inode+0x32c/0x490
  [<ffffffff811b8cd7>] evict+0xa7/0x1c0
  [<ffffffff811b8ed3>] iput_final+0xe3/0x170
  [<ffffffff811b8f9e>] iput+0x3e/0x50
  [<ffffffff812316fd>] ext4_add_nondir+0x4d/0x90
  [<ffffffff81231d0b>] ext4_create+0xeb/0x170
  [<ffffffff811aae9c>] vfs_create+0xac/0xd0
  [<ffffffff811ac845>] lookup_open+0x185/0x1c0
  [<ffffffff8129e3b9>] ? selinux_inode_permission+0xa9/0x170
  [<ffffffff811acb54>] do_last+0x2d4/0x7a0
  [<ffffffff811af743>] path_openat+0xb3/0x480
  [<ffffffff8116a8a1>] ? handle_mm_fault+0x251/0x3b0
  [<ffffffff811afc49>] do_filp_open+0x49/0xa0
  [<ffffffff811bbaad>] ? __alloc_fd+0xdd/0x150
  [<ffffffff8119da28>] do_sys_open+0x108/0x1f0
  [<ffffffff8119db51>] sys_open+0x21/0x30
  [<ffffffff81618959>] system_call_fastpath+0x16/0x1b

Also fix comment for ext4_read_block_bitmap_nowait()

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:33:25 -05:00
Wang Shilong
aebf02430d ext4: use unlikely to improve the efficiency of the kernel
Because the function 'sb_getblk' seldomly fails to return NULL
value,it will be better to use 'unlikely' to optimize it.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-01-12 16:28:47 -05:00
Theodore Ts'o
860d21e2c5 ext4: return ENOMEM if sb_getblk() fails
The only reason for sb_getblk() failing is if it can't allocate the
buffer_head.  So ENOMEM is more appropriate than EIO.  In addition,
make sure that the file system is marked as being inconsistent if
sb_getblk() fails.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-01-12 16:19:36 -05:00
Xi Wang
6d92d4f6a7 fs/exec.c: work around icc miscompilation
The tricky problem is this check:

	if (i++ >= max)

icc (mis)optimizes this check as:

	if (++i > max)

The check now becomes a no-op since max is MAX_ARG_STRINGS (0x7FFFFFFF).

This is "allowed" by the C standard, assuming i++ never overflows,
because signed integer overflow is undefined behavior.  This
optimization effectively reverts the previous commit 362e6663ef
("exec.c, compat.c: fix count(), compat_count() bounds checking") that
tries to fix the check.

This patch simply moves ++ after the check.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11 14:54:55 -08:00
Kees Cook
d9777b8de4 fs/xfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ben Myers <bpm@sgi.com>
2013-01-11 11:39:04 -08:00
Kees Cook
f11cb2271f fs/nilfs2: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2013-01-11 11:39:04 -08:00
Kees Cook
336d6d0323 fs/ecryptfs: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Tyler Hicks <tyhicks@canonical.com>
CC: Dustin Kirkland <dustin.kirkland@gazzang.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Tyler Hicks <tyhicks@canonical.com>
2013-01-11 11:39:04 -08:00
Kees Cook
1b6a78a522 fs/ceph: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Sage Weil <sage@inktank.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Sage Weil <sage@inktank.com>
2013-01-11 11:39:04 -08:00
Seiji Aguchi
9f244e9cfd pstore: Avoid deadlock in panic and emergency-restart path
[Issue]

When pstore is in panic and emergency-restart paths, it may be blocked
in those paths because it simply takes spin_lock.

This is an example scenario which pstore may hang up in a panic path:

 - cpuA grabs psinfo->buf_lock
 - cpuB panics and calls smp_send_stop
 - smp_send_stop sends IRQ to cpuA
 - after 1 second, cpuB gives up on cpuA and sends an NMI instead
 - cpuA is now in an NMI handler while still holding buf_lock
 - cpuB is deadlocked

This case may happen if a firmware has a bug and
cpuA is stuck talking with it more than one second.

Also, this is a similar scenario in an emergency-restart path:

 - cpuA grabs psinfo->buf_lock and stucks in a firmware
 - cpuB kicks emergency-restart via either sysrq-b or hangcheck timer.
   And then, cpuB is deadlocked by taking psinfo->buf_lock again.

[Solution]

This patch avoids the deadlocking issues in both panic and emergency_restart
paths by introducing a function, is_non_blocking_path(), to check if a cpu
can be blocked in current path.

With this patch, pstore is not blocked even if another cpu has
taken a spin_lock, in those paths by changing from spin_lock_irqsave
to spin_trylock_irqsave.

In addition, according to a comment of emergency_restart() in kernel/sys.c,
spin_lock shouldn't be taken in an emergency_restart path to avoid
deadlock. This patch fits the comment below.

<snip>
/**
 *      emergency_restart - reboot the system
 *
 *      Without shutting down any hardware or taking any locks
 *      reboot the system.  This is called when we know we are in
 *      trouble so this is our best effort to reboot.  This is
 *      safe to call in interrupt context.
 */
void emergency_restart(void)
<snip>

Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-01-11 10:20:50 -08:00
Dave Reisner
f1688e0431 debugfs: convert gid= argument from decimal, not octal
This patch technically breaks userspace, but I suspect that anyone who
actually used this flag would have encountered this brokenness, declared
it lunacy, and already sent a patch.

Signed-off-by: Dave Reisner <dreisner@archlinux.org>
Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11 05:56:01 -08:00
Jaegeuk Kim
9eaeba7013 f2fs: move f2fs_balance_fs to punch_hole
The f2fs_fallocate() has two operations: punch_hole and expand_size.

Only in the case of punch_hole, dirty node pages can be produced, so let's
trigger f2fs_balance_fs() in this case only.
Furthermore, let's trigger it at every data truncation routine.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-11 15:09:23 +09:00
Jaegeuk Kim
7d82db8316 f2fs: add f2fs_balance_fs in several interfaces
The f2fs_balance_fs() is to check the number of free sections and decide whether
it needs to conduct cleaning or not. If there are not enough free sections, the
cleaning job should be started.

In order to control an amount of free sections even under high utilization, f2fs
should call f2fs_balance_fs at all the VFS interfaces that are able to produce
dirty pages.
This patch adds the function calls in the missing interfaces as follows.

1. f2fs_setxattr()
The f2fs_setxattr() produces dirty node pages so that we should call
f2fs_balance_fs() either likewise doing in other VFS interfaces such as
f2fs_lookup(), f2fs_mkdir(), and so on.

2. f2fs_sync_file()
We should guarantee serving free sections for syncing metadata during fsync.
Previously, there is no space check before triggering checkpoint and
sync_node_pages.
Therefore, if a bunch of fsync calls are triggered under 100% of FS utilization,
f2fs is able to be faced with no free sections, resulting in BUG_ON().

3. f2fs_sync_fs()
Before calling write_checkpoint(), we should guarantee that there are minimum
free sections.

4. f2fs_write_inode()
f2fs_write_inode() is also able to produce dirty node pages.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-11 15:09:17 +09:00
Randy Dunlap
254adaa465 seq_file: fix new kernel-doc warnings
Fix kernel-doc warnings in fs/seq_file.c:

  Warning(fs/seq_file.c:304): No description found for parameter 'whence'
  Warning(fs/seq_file.c:304): Excess function parameter 'origin' description in 'seq_lseek'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-10 14:35:24 -08:00
Jaegeuk Kim
408e937561 f2fs: revisit the f2fs_gc flow
I'd like to revisit the f2fs_gc flow and rewrite as follows.

1. In practical, the nGC parameter of f2fs_gc is meaningless. So, let's
  remove it.
2. Background GC marks victim blocks as dirty one at a time.
3. Foreground GC should do cleaning job until acquiring enough free
  sections. Afterwards, it needs to do checkpoint.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-10 07:42:59 +09:00
Wang Sheng-Hui
210b907acf btrfs: remove unnecessary cur_trans set before goto loop in join_transaction
In the big loop, cur_trans will be set fs_info->running_transaction
before it's used. And after kmem_cache_free it and goto loop, it will
be setup again. No need to setup it immediately after freed.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-01-09 11:48:33 +01:00
Masanari Iida
8a168ca707 treewide: Fix typo in various drivers
Correct spelling typo in printk within various drivers.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-01-09 11:43:32 +01:00
Liu Bo
2c016dc2cb btrfs: fix comment typos
Convert 'hepler' to 'helper'.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-01-09 11:40:52 +01:00
Linus Torvalds
5c33d9b248 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) New sysctl ndisc_notify needs some documentation, from Hanns
    Frederic Sowa.

 2) Netfilter REJECT target doesn't set transport header of SKB
    correctly, from Mukund Jampala.

 3) Forcedeth driver needs to check for DMA mapping failures, from Larry
    Finger.

 4) brcmsmac driver can't use usleep_range while holding locks, use
    udelay instead.  From Niels Ole Salscheider.

 5) Fix unregister of netlink bridge multicast database handlers, from
    Vlad Yasevich and Rami Rosen.

 6) Fix checksum calculations in netfilter's ipv6 network prefix
    translation module.

 7) Fix high order page allocation failures in netfilter xt_recent, from
    Eric Dumazet.

 8) mac802154 needs to use netif_rx_ni() instead of netif_rx() because
    mac802154_process_data() can execute in process rather than
    interrupt context.  From Alexander Aring.

 9) Fix splice handling of MSG_SENDPAGE_NOTLAST, otherwise we elide one
    tcp_push() too many.  From Eric Dumazet and Willy Tarreau.

10) Fix skb->truesize tracking in XEN netfront driver, from Ian
    Campbell.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
  xen/netfront: improve truesize tracking
  ipv4: fix NULL checking in devinet_ioctl()
  tcp: fix MSG_SENDPAGE_NOTLAST logic
  net/ipv4/ipconfig: really display the BOOTP/DHCP server's address.
  ip-sysctl: fix spelling errors
  mac802154: fix NOHZ local_softirq_pending 08 warning
  ipv6: document ndisc_notify in networking/ip-sysctl.txt
  ath9k: Fix Kconfig for ATH9K_HTC
  netfilter: xt_recent: avoid high order page allocations
  netfilter: fix missing dependencies for the NOTRACK target
  netfilter: ip6t_NPT: fix IPv6 NTP checksum calculation
  bridge: add empty br_mdb_init() and br_mdb_uninit() definitions.
  vxlan: allow live mac address change
  bridge: Correctly unregister MDB rtnetlink handlers
  brcmfmac: fix parsing rsn ie for ap mode.
  brcmsmac: add copyright information for Canonical
  rtlwifi: rtl8723ae: Fix warning for unchecked pci_map_single() call
  rtlwifi: rtl8192se: Fix warning for unchecked pci_map_single() call
  rtlwifi: rtl8192de: Fix warning for unchecked pci_map_single() call
  rtlwifi: rtl8192ce: Fix warning for unchecked pci_map_single() call
  ...
2013-01-08 07:31:49 -08:00
Linus Torvalds
127aa93066 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Misc small cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Don't let read only caching for mandatory byte-range locked files
  CIFS: Fix write after setting a read lock for read oplock files
  Revert "CIFS: Fix write after setting a read lock for read oplock files"
  cifs: adjust sequence number downward after signing NT_CANCEL request
  cifs: move check for NULL socket into smb_send_rqst
2013-01-07 13:21:55 -08:00
David Teigland
f117228346 dlm: avoid scanning unchanged toss lists
Keep track of whether a toss list contains any
shrinkable rsbs.  If not, dlm_scand can avoid
scanning the list for rsbs to shrink.  Unnecessary
scanning can otherwise waste a lot of time because
the toss lists can contain a large number of rsbs
that are non-shrinkable (directory records).

Signed-off-by: David Teigland <teigland@redhat.com>
2013-01-07 12:02:49 -06:00
Linus Torvalds
f77637206d Bug fixes, including two regressions introduced in v3.8. The most
serious of these regressions is a buffer cache leak.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQ6rvpAAoJENNvdpvBGATwRrIP/1bokaspEpaHVKFAqgJzRNew
 cPdINNqc85m5MmVGmvPq4EMx7+86Z459sZAKafaXV/qVR/m7vmtfIiWi8bWPZUkv
 MrblMSPAQHERypMWA8eJkNfkyw39HJb4GMAIgh1TBUXO2ocb46cGpl0Fcum0twT4
 pv2C2JqNcLtIsekJsrqmvdqrNW+bMoMJZtzjFwHuIknmbo7eSFtgV17EFlcfWhJ7
 CZoJwvk2s/dnS6l4icwKZBNbKnap6oR9SptoymvH+ATPIEO4qJisRbID6XVhTZZ8
 3lsOWfGlGIVEy6wsQKwfeWdCzAyAjyQza7eeMdVkqm97YynFSUD0pMZ9tQ1FpQ9E
 JGAWshLFyA8+atY/JZ8xGQisY2R57WoJd2m0Bf3ockhB963iu+vnIxZ4BaKW3K9l
 WMRqkouA1ijaeeUNHV4FulJ0cG6ioIYTF6nM/jGTJXhF8ZXjJHb4ZiwjjWS8fZv1
 Ooe6GkHz29txi+hOET0vtwDUqkihGFlfNDbZ48/JjlS4sCy5ntcwt321Tn7olbo5
 O72k+oQLtMLJshrZwTuSQZZtiv/9638gtNGC6EUy7p5LTmo5CgaueEh2qhXweVGR
 f8X25RjNWREvE78JiJw6SzYfNeaLO6I0f+Hs4z8PLVc8K02IO0tK8mw8LNgvFHRF
 xCFhD2w916Dsh9cwyiRT
 =pnxL
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 regression fixes from Ted Ts'o:
 "Bug fixes, including two regressions introduced in v3.8.  The most
  serious of these regressions is a buffer cache leak."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove duplicate call to ext4_bread() in ext4_init_new_dir()
  ext4: release buffer in failed path in dx_probe()
  ext4: fix configuration dependencies for ext4 ACLs and security labels
2013-01-07 09:22:50 -08:00
Linus Torvalds
4c9014f2ca NFS client bugfixes for Linux 3.8
- Fix a permissions problem when opening NFSv4 files that only have the
   exec bit set.
 - Fix a couple of typos in pNFS (inverted logic), and the mount parsing
   (missing pointer dereference).
 - Work around a series of deadlock issues due to workqueues using
   struct work_struct pointer address comparisons in the re-entrancy
   tests. Ensure that we don't free struct work_struct prematurely if
   our work function involves waiting for completion of other work
   items (e.g. by calling rpc_shutdown_client).
 - Revert the part of commit 168e4b3 that is causing unnecessary warnings
   to be issued in the nfsd callback code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ6uzfAAoJEGcL54qWCgDyRvIP/06F4xZChZxH3prNz2cU2EDJ
 2RO3lWoZf8aBk2+bhU5Fm9cuUcCm1raoRFsdWoJC/dnlXL+A7ZeuYDvXycNfTJW3
 vCIRvZTU8TAxwR9szNkPhRIB09FAacioP/K0q6pBfwx8my5NC1yIpOMDVmND+40Z
 oWm7ICip4vblxNVQMYp6/JrDcc7LCDcOG5j0EKO5aSxRE0Ki2TYqN0nk20v9caDe
 43jOA2HWGPQ+Zg3sty2o728u57RB70OBqRJ//2pzW2fEwSut+pf5pC1MoVJvtgST
 Utwbt/tMEFAuvo0WGHqjaYHoAezcLkjSvYdrh7Vz6WJ0gsonCB8V8UvgKsFt0XSM
 YWtXa/PzkDfXDVNwkzpjmDCUDJwzAiTllQP5cJzAhsz7GG0xzGypyqmIMcJ6UGxy
 sADGXAhpJR8sWn7oxh4znMh+JlxW3MLA+jSgkrlzGM7wMinncnp5RSQIlY2/8qhZ
 GiO6b6wiUYYHbZtV456S4M+PwA0kGqkSDP/PlrquXAW8jPFCBlKLf4+SFK2kqs25
 yyJQXD0QycbtfbCGxaFJkt1qUEgIGkTpU8UnyxHKccIXhR4ZDi/IZ3OQSLexE+8V
 L8HZXeT32+fGgPz479bplXJ2VnLQo6VXOrA5ofylqesWW5klDPMTz7KIEJSyyHwG
 kiSLyjhDx/2vAG+Nwusr
 =yxcq
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Fix a permissions problem when opening NFSv4 files that only have the
   exec bit set.

 - Fix a couple of typos in pNFS (inverted logic), and the mount parsing
   (missing pointer dereference).

 - Work around a series of deadlock issues due to workqueues using
   struct work_struct pointer address comparisons in the re-entrancy
   tests.  Ensure that we don't free struct work_struct prematurely if
   our work function involves waiting for completion of other work items
   (e.g. by calling rpc_shutdown_client).

 - Revert the part of commit 168e4b3 that is causing unnecessary
   warnings to be issued in the nfsd callback code.

* tag 'nfs-for-3.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: avoid dereferencing null pointer in initiate_bulk_draining
  SUNRPC: Partial revert of commit 168e4b39d1
  NFS: Ensure that we free the rpc_task after read and write cleanups are done
  SUNRPC: Ensure that we free the rpc_task after cleanups are done
  nfs: fix null checking in nfs_get_option_str()
  pnfs: Increase the refcount when LAYOUTGET fails the first time
  NFS: Fix access to suid/sgid executables
2013-01-07 08:36:45 -08:00
Eric Dumazet
ae62ca7b03 tcp: fix MSG_SENDPAGE_NOTLAST logic
commit 35f9c09fe9 (tcp: tcp_sendpages() should call tcp_push() once)
added an internal flag : MSG_SENDPAGE_NOTLAST meant to be set on all
frags but the last one for a splice() call.

The condition used to set the flag in pipe_to_sendpage() relied on
splice() user passing the exact number of bytes present in the pipe,
or a smaller one.

But some programs pass an arbitrary high value, and the test fails.

The effect of this bug is a lack of tcp_push() at the end of a
splice(pipe -> socket) call, and possibly very slow or erratic TCP
sessions.

We should both test sd->total_len and fact that another fragment
is in the pipe (pipe->nrbufs > 1)

Many thanks to Willy for providing very clear bug report, bisection
and test programs.

Reported-by: Willy Tarreau <w@1wt.eu>
Bisected-by: Willy Tarreau <w@1wt.eu>
Tested-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-01-06 20:58:13 -08:00
Guo Chao
fef0ebdb22 ext4: remove duplicate call to ext4_bread() in ext4_init_new_dir()
This fixes a buffer cache leak when creating a directory, introduced
in commit a774f9c20.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-01-06 23:40:25 -05:00
Guo Chao
0ecaef0644 ext4: release buffer in failed path in dx_probe()
If checksum fails, we should also release the buffer
read from previous iteration.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>-
Cc: stable@vger.kernel.org
--
 fs/ext4/namei.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
2013-01-06 23:38:47 -05:00
Valerie Aurora
96465efee1 ext4: fix configuration dependencies for ext4 ACLs and security labels
Commit "ext4: Remove CONFIG_EXT4_FS_XATTR" removed the configuration
dependencies for ext4 xattrs from the ext4 ACLs and security labels
configuration options, but did not replace them with a dependency on
ext4 itself.  Add back the dependency on ext4 so the options only show
up if ext4 is enabled.

Signed-off-by: Valerie Aurora <val@vaaconsulting.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Tao Ma <boyu.mt@taobao.com>
2013-01-06 23:38:44 -05:00
Nickolai Zeldovich
ecf0eb9edb nfs: avoid dereferencing null pointer in initiate_bulk_draining
Fix an inverted null pointer check in initiate_bulk_draining().

Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.7]
2013-01-05 14:26:51 -05:00
Trond Myklebust
6db6dd7d3f NFS: Ensure that we free the rpc_task after read and write cleanups are done
This patch ensures that we free the rpc_task after the cleanup callbacks
are done in order to avoid a deadlock problem that can be triggered if
the callback needs to wait for another workqueue item to complete.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Weston Andros Adamson <dros@netapp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Bruce Fields <bfields@fieldses.org>
Cc: stable@vger.kernel.org [>= 3.5]
2013-01-04 12:59:10 -05:00
Xi Wang
e25fbe380c nfs: fix null checking in nfs_get_option_str()
The following null pointer check is broken.

	*option = match_strdup(args);
	return !option;

The pointer `option' must be non-null, and thus `!option' is always false.
Use `!*option' instead.

The bug was introduced in commit c5cb09b6f8 ("Cleanup: Factor out some
cut-and-paste code.").

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-01-04 10:54:43 -05:00
Yanchuan Nian
39e88fcfb1 pnfs: Increase the refcount when LAYOUTGET fails the first time
The layout will be set unusable if LAYOUTGET fails. Is it reasonable to
increase the refcount iff LAYOUTGET fails the first time?

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.7]
2013-01-04 10:50:42 -05:00
Jaegeuk Kim
c335a86930 f2fs: check return value during recovery
This patch resolves Coverity #753102:

>>> No check of the return value of "f2fs_add_link(&dent, inode)".

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:46:27 +09:00
Jaegeuk Kim
c1b75eabec f2fs: avoid null dereference in f2fs_acl_from_disk
This patch resolves Coverity #751303:

>>> CID 753103: Explicit null dereferenced (FORWARD_NULL) Passing null
>>> pointer "value" to function "f2fs_acl_from_disk(char const *, size_t)",
	which dereferences it.

[Error path]
- value = NULL;
- retval = 0 by f2fs_getxattr();
- f2fs_acl_from_disk(value:NULL, ...);

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:46:27 +09:00
Jaegeuk Kim
d66d1f7687 f2fs: initialize newly allocated dnode structure
This patch resolves Coverity #753112.

In practical, the existing code flow does not fall into the reported errorneous
path. But, anyway, let's avoid this for future.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:46:27 +09:00
Huajun Li
7880ceedec f2fs: update f2fs partition info about SIT/NAT layout
Update partition info output under debug FS to reflect segment layout correctly.

Signed-off-by: Huajun Li <huajun.li.lee@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:42:59 +09:00
Namjae Jeon
24c366a9ea f2fs: remove unneeded INIT_LIST_HEAD at few places
While creating a new entry for addition to the list(orphan inode list
and fsync inode entry list), there is no need to call HEAD initialization
for these entries. So, remove that init part.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:42:59 +09:00
Namjae Jeon
3af60a49fd f2fs: fix time update in case of f2fs fallocate
After doing a punch hole or expanding inode doing fallocation.
The change and modification time are not update for the file.
So, update time after no issue is observed in fallocate.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:42:59 +09:00
Namjae Jeon
a07ef78435 f2fs: introduce f2fs_msg to ease adding information prints
Introduced f2fs_msg function to differentiate f2fs specific messages in
the log. And, added few informative prints in the mount path, to convey
proper error in case of mount failure.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-01-04 09:42:59 +09:00
Linus Torvalds
49569646b2 Driver core __dev* removal patches
Here are the remaining __dev* removal patches against the 3.8-rc2 tree.
 All of these patches were previously sent to the subsystem maintainers,
 most of them were picked up and pushed to you, but there were a number
 that fell through the cracks, and new drivers were added during the
 merge window, so this series cleans up the rest of the instances of
 these markings.
 
 Third time's the charm...
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlDmHOIACgkQMUfUDdst+ykTZgCePgK84Im3FFooEXJwaPbaf4ls
 lO4AoMEDoWK+BHWOsjQwFPOwFFPEN2Xh
 =6oAQ
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core __dev* removal patches - take 3 - from Greg Kroah-Hartman:
 "Here are the remaining __dev* removal patches against the 3.8-rc2
  tree.  All of these patches were previously sent to the subsystem
  maintainers, most of them were picked up and pushed to you, but there
  were a number that fell through the cracks, and new drivers were added
  during the merge window, so this series cleans up the rest of the
  instances of these markings.

  Third time's the charm...

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fixed up trivial conflict with the pinctrl pull in pinctrl-sirf.c.

* tag 'driver-core-3.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits)
  misc: remove __dev* attributes.
  include: remove __dev* attributes.
  Documentation: remove __dev* attributes.
  Drivers: misc: remove __dev* attributes.
  Drivers: block: remove __dev* attributes.
  Drivers: bcma: remove __dev* attributes.
  Drivers: char: remove __dev* attributes.
  Drivers: clocksource: remove __dev* attributes.
  Drivers: ssb: remove __dev* attributes.
  Drivers: dma: remove __dev* attributes.
  Drivers: gpu: remove __dev* attributes.
  Drivers: infinband: remove __dev* attributes.
  Drivers: memory: remove __dev* attributes.
  Drivers: mmc: remove __dev* attributes.
  Drivers: iommu: remove __dev* attributes.
  Drivers: power: remove __dev* attributes.
  Drivers: message: remove __dev* attributes.
  Drivers: macintosh: remove __dev* attributes.
  Drivers: mfd: remove __dev* attributes.
  pstore: remove __dev* attributes.
  ...
2013-01-03 16:17:50 -08:00
Greg Kroah-Hartman
6ae141718e misc: remove __dev* attributes.
CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
markings need to be removed.

This change removes the last of the __dev* markings from the kernel from
a variety of different, tiny, places.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-03 15:57:16 -08:00
Greg Kroah-Hartman
f568f6ca81 pstore: remove __dev* attributes.
CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
markings need to be removed.

This change removes the use of __devinit from the pstore filesystem.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Anton Vorontsov <cbouatmailru@gmail.com>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-03 15:57:14 -08:00
Weston Andros Adamson
f8d9a897d4 NFS: Fix access to suid/sgid executables
nfs_open_permission_mask() should only check MAY_EXEC for files that
are opened with __FMODE_EXEC.

Also fix NFSv4 access-in-open path in a similar way -- openflags must be
used because fmode will not always have FMODE_EXEC set.

This patch fixes https://bugzilla.kernel.org/show_bug.cgi?id=49101

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-01-03 17:06:27 -05:00
Eric Sandeen
83a9ba0057 xfs: don't zero structure members after a memset(0)
Commit 408cc4e97a
added memset(0, ...) to allocation args structures,
so there is no need to explicitly set any of the fields
to 0 after that.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-03 16:00:07 -06:00
Brian Foster
f755503206 xfs: remove int casts from debug dquot soft limit timer asserts
The int casts here make it easy to trigger an assert with a large
soft limit. For example, set a >4TB soft limit on an empty volume
to reproduce a (0 > -x) comparison due to an overflow of
d_blk_softlimit.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-01-03 15:58:39 -06:00
Linus Torvalds
2318aa2720 f2fs: bug fixes for 3.8-rc2
This patch-set includes two major bug fixes:
 - incorrect IUsed provided by *df -i*, and
 - lookup failure of parent inodes in corner cases.
 
 [Other Bug Fixes]
 - Fix error handling routines
 - Trigger recovery process correctly
 - Resolve build failures due to missing header files
 
 [Etc]
 - Add a MAINTAINERS entry for f2fs
 - Fix and clean up variables, functions, and equations
 - Avoid warnings during compilation
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ3QYaAAoJEEAUqH6CSFDSZwYP/jyx4ilQC5pT85NCBs744JGZ
 zGKcNLlWJcVpSwoepmjEJiMfvYE63imwmsG0gHcnUosvndTivrOkLUPReBE4bLLO
 6KR0J/HWxfRY+FAM6nGfoVDWAG/2mU/cCwDKCwGgAZp//5YmxQZTcp3Xcak6dHUQ
 z64O2XC/roK4w827lwpp7lFBnhY3snpaA+EFGe2Wwm+9r9BrJYP3FyFG9VVRR1Cw
 SQphbZrC2yo+03IN1sJV83QLfKt3+tONhctizAtMzVOVgM2ToVLNbz32SVj8pbnD
 WSMwVWYxVQwC8R9ZhUc2Z3hV+m9m9MswBMyK+U9QF5r/avFK4vKHJYLOcXmpzRcX
 voH5tl7OfVERqfMsle+7X6/Edz5xd4abF4b5yM+3h9pz4LRJYWkp3daeWPY++ZUR
 esM81f0+W55I/mk1STlI7N+KdVn3/Zpqi0UVkcZQ8y15NpOR+5zeLF4+8Skk4NGL
 emYJiho4GB2cpKw7gIQQtnGKSTIPnwRK6Lart7qJ1b2FfNwtJqAgALm7HDxbqSTN
 r3XXJv2E1u9HBu0gEKs5g223Poj7nBLwOQPkmIyx1ozD7bx19vQGnfsHza8a9L1y
 pXQ3dKU3o93+dHjMGky2cpe6B52nuvN+MBicqBSCJAIbSafmp8/JrBiN7hgQFdpj
 x/nXXZd+WPakvjHz03e5
 =dH3T
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs bug fixes from Jaegeuk Kim:
 "This patch-set includes two major bug fixes:
   - incorrect IUsed provided by *df -i*, and
   - lookup failure of parent inodes in corner cases.

  [Other Bug Fixes]
   - Fix error handling routines
   - Trigger recovery process correctly
   - Resolve build failures due to missing header files

  [Etc]
   - Add a MAINTAINERS entry for f2fs
   - Fix and clean up variables, functions, and equations
   - Avoid warnings during compilation"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: unify string length declarations and usage
  f2fs: clean up unused variables and return values
  f2fs: clean up the start_bidx_of_node function
  f2fs: remove unneeded variable from f2fs_sync_fs
  f2fs: fix fsync_inode list addition logic and avoid invalid access to memory
  f2fs: remove unneeded initialization of nr_dirty in dirty_seglist_info
  f2fs: handle error from f2fs_iget_nowait
  f2fs: fix equation of has_not_enough_free_secs()
  f2fs: add MAINTAINERS entry
  f2fs: return a default value for non-void function
  f2fs: invalidate the node page if allocation is failed
  f2fs: add missing #include <linux/prefetch.h>
  f2fs: do f2fs_balance_fs in front of dir operations
  f2fs: should recover orphan and fsync data
  f2fs: fix handling errors got by f2fs_write_inode
  f2fs: fix up f2fs_get_parent issue to retrieve correct parent inode number
  f2fs: fix wrong calculation on f_files in statfs
  f2fs: remove set_page_dirty for atomic f2fs_end_io_write
2013-01-03 11:41:43 -08:00
Linus Torvalds
ed4e6a94d3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes
Pull GFS2 fixes from Steven Whitehouse:
 "Here are four small bug fixes for GFS2.  There is no common theme here
  really, just a few items that were fixed recently.

  The first fixes lock name generation when the glock number is 0.  The
  second fixes a race allocating reservation structures and the final
  two fix a performance issue by making small changes in the allocation
  code."

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: Reset rd_last_alloc when it reaches the end of the rgrp
  GFS2: Stop looking for free blocks at end of rgrp
  GFS2: Fix race in gfs2_rs_alloc
  GFS2: Initialize hex string to '0'
2013-01-03 11:38:14 -08:00
Linus Torvalds
007f6c3a63 Two self-explanatory fixes and a third patch which improves performance. When
overwriting a full page in the eCryptfs page cache, skip reading in and
 decrypting the corresponding lower page.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABCgAGBQJQ5HR+AAoJENaSAD2qAscKg/gQAJSGpz9Frh3QqV30smvbKASI
 vBcHpbEBMhpExzkcLF3Gqdj7KqcwpN3Nh+oAD1vNyvermeczazEebr5wFfNTv4eE
 TetUfa2e92RS0c0yxgS+9k1Fhxi8BCovNxmFfiq5iPFHSNwjixPBHLLZVFPCdp9N
 il/dV8Y7wg1exDikZQc8lqiVULZxvkBc+R/dgXFhAnwFxDMT2jiInXbBU4Onct0P
 +YX4FwrKnDCOg7bk8Mk/lW6mwAuhoelnuF3dy9v/soBeclOeTfmUmO44dv0D3IPY
 iGpGofhs+cDSKxOZ0XXocAdFdmY7fbcijppoF00XyZiuqcd59zc0l+LDRuCBcXD7
 SFSTzR0uFf8C0rM4Mjfz6WGbwW7Ae0KqLbFIVg03MJDCquOtDBr0Xdpviy1GYNo3
 H0Z3400olyGqp/3ZoEjefOoz9DbzqHtzhcMtGBN/ihyaolPJzS81pLTYCsja2SJM
 pHUjId3abWOVRgtrAk+XUO9Sn6W8Or5bug4+idYwD6LfUILz9OpHin/mplnHoF9F
 8lEjhzNHyvU3HQPyR4v/TidExyx7IBeP0tOLk4X2N+fmH45ukl/pPDNfpF/2lxpd
 mN7HK2H2cYtGrYSwSmwuG0q9W365vmk8mvu2Xz5aIMe9r5SeucgPjzZ3zg+kHgRE
 OqJljwln6TaSB/7o0MQ5
 =JNeQ
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.8-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull ecryptfs fixes from Tyler Hicks:
 "Two self-explanatory fixes and a third patch which improves
  performance: when overwriting a full page in the eCryptfs page cache,
  skip reading in and decrypting the corresponding lower page."

* tag 'ecryptfs-3.8-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  fs/ecryptfs/crypto.c: make ecryptfs_encode_for_filename() static
  eCryptfs: fix to use list_for_each_entry_safe() when delete items
  eCryptfs: Avoid unnecessary disk read and data decryption during writing
2013-01-02 17:33:50 -08:00
Linus Torvalds
5439ca6b8f Various bug fixes for ext4. Perhaps the most serious bug fixed is one
which could cause file system corruptions when performing file punch
 operations.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQ374OAAoJENNvdpvBGATwEGAP/jKUwjQhBZiF0k9dg1kQ5eTz
 bdli4fy1vxrEMIOym8IZa4nBQJVCkArwRgjc28gCBD6k9u6X3GPa26vUydsoPfP6
 odPdc9c9HtsbYQGuaq1SohID5HfjxHewTcUmCs4X4SpGcSurUcT7eQYWqSuIxFHR
 0nKk8NO4EcWh2uqIoGPrc8QpSdor0DXXYYjZmHCeVLH1n6PyoMsnrFMfO9KqMLUL
 vNR54CX9n1GRTfAfJNkNzcwfs8IfNkDUyv5hFpDh15tLltogU0TqnlAl3vSeZGSx
 vVfhwHmQTK/bJyC3YaoRZqq9CQJVk2f/OTBpJDFY/USaapuitJd6vqbmh7NiRNAN
 LaKmFt99MPfwyjEhIA7+J0LCTraAxc536q43oWWK5dAJhWI7DW0lbHARVeQTixNy
 KJ1Lp0pmmz1mX8/lugOnK1SPBF525kTaoiz2bWqg4oQgn7mBzUlgj+EV22/6Rq83
 TpKOKstl4BiZi8t5AhmFiwqtknCDiT5vUKQNy2kuM/oXtPJID/lM/TJbR5viYD3l
 AH3Ef7xj61CynFZ0oBeraGwtXc2BHJpJdWz+8uj0/VhFfC+uNUYapSLFwyiAVZKO
 xxaItT3ylfKpa0AWK6HBc2SLuL72SCHAPks06YKFtSyHtr5C8SCcafxU2DSOSi7K
 VrhkcH6STa77Br7a1ORt
 =9R/D
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bug fixes from Ted Ts'o:
 "Various bug fixes for ext4.  Perhaps the most serious bug fixed is one
  which could cause file system corruptions when performing file punch
  operations."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid hang when mounting non-journal filesystems with orphan list
  ext4: lock i_mutex when truncating orphan inodes
  ext4: do not try to write superblock on ro remount w/o journal
  ext4: include journal blocks in df overhead calcs
  ext4: remove unaligned AIO warning printk
  ext4: fix an incorrect comment about i_mutex
  ext4: fix deadlock in journal_unmap_buffer()
  ext4: split off ext4_journalled_invalidatepage()
  jbd2: fix assertion failure in jbd2_journal_flush()
  ext4: check dioread_nolock on remount
  ext4: fix extent tree corruption caused by hole punch
2013-01-02 09:57:34 -08:00
Hugh Dickins
a7a88b2373 mempolicy: remove arg from mpol_parse_str, mpol_to_str
Remove the unused argument (formerly no_context) from mpol_parse_str()
and from mpol_to_str().

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02 09:27:10 -08:00
Eric Wong
128dd1759d epoll: prevent missed events on EPOLL_CTL_MOD
EPOLL_CTL_MOD sets the interest mask before calling f_op->poll() to
ensure events are not missed.  Since the modifications to the interest
mask are not protected by the same lock as ep_poll_callback, we need to
ensure the change is visible to other CPUs calling ep_poll_callback.

We also need to ensure f_op->poll() has an up-to-date view of past
events which occured before we modified the interest mask.  So this
barrier also pairs with the barrier in wq_has_sleeper().

This should guarantee either ep_poll_callback or f_op->poll() (or both)
will notice the readiness of a recently-ready/modified item.

This issue was encountered by Andreas Voellmy and Junchang(Jason) Wang in:
http://thread.gmane.org/gmane.linux.kernel/1408782/

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Cc: Hans Verkuil <hans.verkuil@cisco.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Voellmy <andreas.voellmy@yale.edu>
Tested-by: "Junchang(Jason) Wang" <junchang.wang@yale.edu>
Cc: netdev@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02 09:16:43 -08:00
Bob Peterson
13d2eb0129 GFS2: Reset rd_last_alloc when it reaches the end of the rgrp
In function rg_mblk_search, it's searching for multiple blocks in
a given state (e.g. "free"). If there's an active block reservation
its goal is the next free block of that. If the resource group
contains the dinode's goal block, that's used for the search. But
if neither is the case, it uses the rgrp's last allocated block.
That way, consecutive allocations appear after one another on media.
The problem comes in when you hit the end of the rgrp; it would never
start over and search from the beginning. This became a problem,
since if you deleted all the files and data from the rgrp, it would
never start over and find free blocks. So it had to keep searching
further out on the media to allocate blocks. This patch resets the
rd_last_alloc after it does an unsuccessful search at the end of
the rgrp.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-02 10:05:27 +00:00
Bob Peterson
15bd50ad82 GFS2: Stop looking for free blocks at end of rgrp
This patch adds a return code check after calling function
gfs2_rbm_from_block while determining the free extent size.
That way, when the end of an rgrp is reached, it won't try
to process unaligned blocks after the end.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-02 10:05:10 +00:00
Abhijith Das
f1213cacc7 GFS2: Fix race in gfs2_rs_alloc
QE aio tests uncovered a race condition in gfs2_rs_alloc where it's possible
to come out of the function with a valid ip->i_res allocation but it gets
freed before use resulting in a NULL ptr dereference.

This patch envelopes the initial short-circuit check for non-NULL ip->i_res
into the mutex lock. With this patch, I was able to successfully run the
reproducer test multiple times.

Resolves: rhbz#878476
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-02 10:04:53 +00:00
Nathan Straz
ec1487528b GFS2: Initialize hex string to '0'
When generating the DLM lock name, a value of 0 would skip
the loop and leave the string unchanged.  This left locks with
a value of 0 unlabeled.  Initializing the string to '0' fixes this.

Signed-off-by: Nathan Straz <nstraz@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-01-02 10:04:00 +00:00
Pavel Shilovsky
63b7d3a41c CIFS: Don't let read only caching for mandatory byte-range locked files
If we have mandatory byte-range locks on a file we can't cache reads
because pagereading may have conflicts with these locks on the server.
That's why we should allow level2 oplocks for files without mandatory
locks only.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-01-01 23:04:30 -06:00
Pavel Shilovsky
88cf75aaaf CIFS: Fix write after setting a read lock for read oplock files
If we have a read oplock and set a read lock in it, we can't write to the
locked area - so, filemap_fdatawrite may fail with a no information for a
userspace application even if we request a write to non-locked area. Fix
this by writing directly to the server and then breaking oplock level from
level2 to None.

Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS
and SMB2 protocols.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-01-01 23:04:14 -06:00
Pavel Shilovsky
ca8aa29c60 Revert "CIFS: Fix write after setting a read lock for read oplock files"
that solution has data races and can end up two identical writes to the
server: when clientCanCacheAll value can be changed during the execution
of __generic_file_aio_write.

Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-01-01 22:59:55 -06:00
Ben Myers
56431cd194 Merge branch 'xfs-for-3.9' 2012-12-31 10:41:39 -06:00
Jeff Layton
31efee60f4 cifs: adjust sequence number downward after signing NT_CANCEL request
When a call goes out, the signing code adjusts the sequence number
upward by two to account for the request and the response. An NT_CANCEL
however doesn't get a response of its own, it just hurries the server
along to get it to respond to the original request more quickly.
Therefore, we must adjust the sequence number back down by one after
signing a NT_CANCEL request.

Cc: <stable@vger.kernel.org>
Reported-by: Tim Perry <tdparmor-sambabugs@yahoo.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-30 11:43:51 -06:00
Jeff Layton
ea702b80e0 cifs: move check for NULL socket into smb_send_rqst
Cai reported this oops:

[90701.616664] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[90701.625438] IP: [<ffffffff814a343e>] kernel_setsockopt+0x2e/0x60
[90701.632167] PGD fea319067 PUD 103fda4067 PMD 0
[90701.637255] Oops: 0000 [#1] SMP
[90701.640878] Modules linked in: des_generic md4 nls_utf8 cifs dns_resolver binfmt_misc tun sg igb iTCO_wdt iTCO_vendor_support lpc_ich pcspkr i2c_i801 i2c_core i7core_edac edac_core ioatdma dca mfd_core coretemp kvm_intel kvm crc32c_intel microcode sr_mod cdrom ata_generic sd_mod pata_acpi crc_t10dif ata_piix libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[90701.677655] CPU 10
[90701.679808] Pid: 9627, comm: ls Tainted: G        W    3.7.1+ #10 QCI QSSC-S4R/QSSC-S4R
[90701.688950] RIP: 0010:[<ffffffff814a343e>]  [<ffffffff814a343e>] kernel_setsockopt+0x2e/0x60
[90701.698383] RSP: 0018:ffff88177b431bb8  EFLAGS: 00010206
[90701.704309] RAX: ffff88177b431fd8 RBX: 00007ffffffff000 RCX: ffff88177b431bec
[90701.712271] RDX: 0000000000000003 RSI: 0000000000000006 RDI: 0000000000000000
[90701.720223] RBP: ffff88177b431bc8 R08: 0000000000000004 R09: 0000000000000000
[90701.728185] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
[90701.736147] R13: ffff88184ef92000 R14: 0000000000000023 R15: ffff88177b431c88
[90701.744109] FS:  00007fd56a1a47c0(0000) GS:ffff88105fc40000(0000) knlGS:0000000000000000
[90701.753137] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[90701.759550] CR2: 0000000000000028 CR3: 000000104f15f000 CR4: 00000000000007e0
[90701.767512] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[90701.775465] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[90701.783428] Process ls (pid: 9627, threadinfo ffff88177b430000, task ffff88185ca4cb60)
[90701.792261] Stack:
[90701.794505]  0000000000000023 ffff88177b431c50 ffff88177b431c38 ffffffffa014fcb1
[90701.802809]  ffff88184ef921bc 0000000000000000 00000001ffffffff ffff88184ef921c0
[90701.811123]  ffff88177b431c08 ffffffff815ca3d9 ffff88177b431c18 ffff880857758000
[90701.819433] Call Trace:
[90701.822183]  [<ffffffffa014fcb1>] smb_send_rqst+0x71/0x1f0 [cifs]
[90701.828991]  [<ffffffff815ca3d9>] ? schedule+0x29/0x70
[90701.834736]  [<ffffffffa014fe6d>] smb_sendv+0x3d/0x40 [cifs]
[90701.841062]  [<ffffffffa014fe96>] smb_send+0x26/0x30 [cifs]
[90701.847291]  [<ffffffffa015801f>] send_nt_cancel+0x6f/0xd0 [cifs]
[90701.854102]  [<ffffffffa015075e>] SendReceive+0x18e/0x360 [cifs]
[90701.860814]  [<ffffffffa0134a78>] CIFSFindFirst+0x1a8/0x3f0 [cifs]
[90701.867724]  [<ffffffffa013f731>] ? build_path_from_dentry+0xf1/0x260 [cifs]
[90701.875601]  [<ffffffffa013f731>] ? build_path_from_dentry+0xf1/0x260 [cifs]
[90701.883477]  [<ffffffffa01578e6>] cifs_query_dir_first+0x26/0x30 [cifs]
[90701.890869]  [<ffffffffa015480d>] initiate_cifs_search+0xed/0x250 [cifs]
[90701.898354]  [<ffffffff81195970>] ? fillonedir+0x100/0x100
[90701.904486]  [<ffffffffa01554cb>] cifs_readdir+0x45b/0x8f0 [cifs]
[90701.911288]  [<ffffffff81195970>] ? fillonedir+0x100/0x100
[90701.917410]  [<ffffffff81195970>] ? fillonedir+0x100/0x100
[90701.923533]  [<ffffffff81195970>] ? fillonedir+0x100/0x100
[90701.929657]  [<ffffffff81195848>] vfs_readdir+0xb8/0xe0
[90701.935490]  [<ffffffff81195b9f>] sys_getdents+0x8f/0x110
[90701.941521]  [<ffffffff815d3b99>] system_call_fastpath+0x16/0x1b
[90701.948222] Code: 66 90 55 65 48 8b 04 25 f0 c6 00 00 48 89 e5 53 48 83 ec 08 83 fe 01 48 8b 98 48 e0 ff ff 48 c7 80 48 e0 ff ff ff ff ff ff 74 22 <48> 8b 47 28 ff 50 68 65 48 8b 14 25 f0 c6 00 00 48 89 9a 48 e0
[90701.970313] RIP  [<ffffffff814a343e>] kernel_setsockopt+0x2e/0x60
[90701.977125]  RSP <ffff88177b431bb8>
[90701.981018] CR2: 0000000000000028
[90701.984809] ---[ end trace 24bd602971110a43 ]---

This is likely due to a race vs. a reconnection event.

The current code checks for a NULL socket in smb_send_kvec, but that's
too late. By the time that check is done, the socket will already have
been passed to kernel_setsockopt. Move the check into smb_send_rqst, so
that it's checked earlier.

In truth, this is a bit of a half-assed fix. The -ENOTSOCK error
return here looks like it could bubble back up to userspace. The locking
rules around the ssocket pointer are really unclear as well. There are
cases where the ssocket pointer is changed without holding the srv_mutex,
but I'm not clear whether there's a potential race here yet or not.

This code seems like it could benefit from some fundamental re-think of
how the socket handling should behave. Until then though, this patch
should at least fix the above oops in most cases.

Cc: <stable@vger.kernel.org> # 3.7+
Reported-and-Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-30 11:38:58 -06:00
Leon Romanovsky
9836b8b949 f2fs: unify string length declarations and usage
This patch is intended to unify string length declarations and usage.
There are number of calls to strlen which return size_t object.
The size of this object depends on compiler if it will be bigger,
equal or even smaller than an unsigned int

Signed-off-by: Leon Romanovsky <leon@leon.nu>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:53 +09:00
Jaegeuk Kim
2b50638dec f2fs: clean up unused variables and return values
This patch cleans up a couple of unnecessary codes related to unused variables
and return values.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:52 +09:00
Jaegeuk Kim
ce19a5d432 f2fs: clean up the start_bidx_of_node function
This patch also resolves the following warning reported by kbuild test robot.

fs/f2fs/gc.c: In function 'start_bidx_of_node':
fs/f2fs/gc.c:453:21: warning: 'bidx' may be used uninitialized in this function

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:52 +09:00
Namjae Jeon
64c576fe51 f2fs: remove unneeded variable from f2fs_sync_fs
We can directly return '0' from the function, instead of introducing a
'ret' variable.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:45 +09:00
Namjae Jeon
fd8bb65f79 f2fs: fix fsync_inode list addition logic and avoid invalid access to memory
In function find_fsync_dnodes() - the fsync inodes gets added to the list, but
in one path suppose f2fs_iget results in error, in such case - error gets added
to the fsync inode list.
In next call to recover_data()->get_fsync_inode()
entry = list_entry(this, struct fsync_inode_entry, list);
                if (entry->inode->i_ino == ino)
This can result in "invalid access to memory" when it encounters 'error' as
entry in the fsync inode list.
So, add the fsync inode entry to the list only in case of no errors.
And, free the object at that point itself in case of issue.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:36 +09:00
Namjae Jeon
344324f10f f2fs: remove unneeded initialization of nr_dirty in dirty_seglist_info
Since, the memory for the object of dirty_seglist_info is allocated
using kzalloc - which returns zeroed out memory. So, there is no need
to initialize the nr_dirty values with zeroes.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:27:05 +09:00
Namjae Jeon
06025f4df8 f2fs: handle error from f2fs_iget_nowait
In case f2fs_iget_nowait returns error, it results in truncate_hole being
called with 'error' value as inode pointer. There is no check in truncate_hole
for valid inode, so it could result in crash due "invalid access to memory".
Avoid this by handling error condition properly.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:26:13 +09:00
Jaegeuk Kim
029cd28c1f f2fs: fix equation of has_not_enough_free_secs()
Practically, has_not_enough_free_secs() should calculate with the numbers of
current node and directory data blocks together.
Actually the equation was implemented in need_to_flush().

So, this patch removes need_flush() and moves the equation into
has_not_enough_free_secs().

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:24:10 +09:00
Jaegeuk Kim
12a67146e3 f2fs: return a default value for non-void function
This patch resolves a build warning reported by kbuild test robot.

"
fs/f2fs/segment.c: In function '__get_segment_type':
fs/f2fs/segment.c:806:1: warning: control reaches end of non-void
function [-Wreturn-type]
"

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:24:09 +09:00
Jaegeuk Kim
71e9fec548 f2fs: invalidate the node page if allocation is failed
The new_node_page() is processed as the following procedure.

1. A new node page is allocated.
2. Set PageUptodate with proper footer information.
3. Check if there is a free space for allocation
 4.a. If there is no space, f2fs returns with -ENOSPC.
 4.b. Otherwise, go next.

In the case of step #4.a, f2fs remains a wrong node page in the page cache
with the uptodate flag.

Also, even though a new node page is allocated successfully, an error can be
occurred afterwards due to allocation failure of the other data structures.
In such a case, remove_inode_page() would be triggered, so that we have to
clear uptodate flag in truncate_node() too.

So, we should remove the uptodate flag, if allocation is failed.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:24:09 +09:00
Geert Uytterhoeven
690e4a3ead f2fs: add missing #include <linux/prefetch.h>
m68k allmodconfig:

fs/f2fs/data.c: In function ‘read_end_io’:
fs/f2fs/data.c:311: error: implicit declaration of function ‘prefetchw’

fs/f2fs/segment.c: In function ‘f2fs_end_io_write’:
fs/f2fs/segment.c:628: error: implicit declaration of function ‘prefetchw’

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-28 11:22:43 +09:00
Theodore Ts'o
0e9a9a1ad6 ext4: avoid hang when mounting non-journal filesystems with orphan list
When trying to mount a file system which does not contain a journal,
but which does have a orphan list containing an inode which needs to
be truncated, the mount call with hang forever in
ext4_orphan_cleanup() because ext4_orphan_del() will return
immediately without removing the inode from the orphan list, leading
to an uninterruptible loop in kernel code which will busy out one of
the CPU's on the system.

This can be trivially reproduced by trying to mount the file system
found in tests/f_orphan_extents_inode/image.gz from the e2fsprogs
source tree.  If a malicious user were to put this on a USB stick, and
mount it on a Linux desktop which has automatic mounts enabled, this
could be considered a potential denial of service attack.  (Not a big
deal in practice, but professional paranoids worry about such things,
and have even been known to allocate CVE numbers for such problems.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:50 -05:00
Theodore Ts'o
721e3eba21 ext4: lock i_mutex when truncating orphan inodes
Commit c278531d39 added a warning when ext4_flush_unwritten_io() is
called without i_mutex being taken.  It had previously not been taken
during orphan cleanup since races weren't possible at that point in
the mount process, but as a result of this c278531d39, we will now see
a kernel WARN_ON in this case.  Take the i_mutex in
ext4_orphan_cleanup() to suppress this warning.

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
2012-12-27 01:42:48 -05:00
Eric W. Biederman
48c6d1217e f2fs: Don't assign e_id in f2fs_acl_from_disk
With user namespaces enabled building f2fs fails with:

 CC      fs/f2fs/acl.o
fs/f2fs/acl.c: In function ‘f2fs_acl_from_disk’:
fs/f2fs/acl.c:85:21: error: ‘struct posix_acl_entry’ has no member named ‘e_id’
make[2]: *** [fs/f2fs/acl.o] Error 1
make[2]: Target `__build' not remade because of errors.

e_id is a backwards compatibility field only used for file systems
that haven't been converted to use kuids and kgids.  When the posix
acl tag field is neither ACL_USER nor ACL_GROUP assigning e_id is
unnecessary.  Remove the assignment so f2fs will build with user
namespaces enabled.

Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Amit Sahrawat <a.sahrawat@samsung.com>
Acked-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25 20:05:15 -08:00
Jaegeuk Kim
1efef83202 f2fs: do f2fs_balance_fs in front of dir operations
In order to conserve free sections to deal with the worst-case scenarios, f2fs
should be able to freeze all the directory operations especially when there are
not enough free sections. The f2fs_balance_fs() is for this use.

When FS utilization becomes almost 100%, directory operations can be failed due
to -ENOSPC frequently, which produces some dirty node pages occasionally.

Previously, in such a case, f2fs_balance_fs() is not able to be triggered since
it is triggered only if the directory operation ends up with success.

So, this patch triggers f2fs_balance_fs() at first before handling directory
operations.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:52 +09:00
Jaegeuk Kim
30f0c75858 f2fs: should recover orphan and fsync data
The recovery routine should do all the time regardless of normal umount action.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:52 +09:00
Jaegeuk Kim
398b1ac5a5 f2fs: fix handling errors got by f2fs_write_inode
Ruslan reported that f2fs hangs with an infinite loop in f2fs_sync_file():

	while (sync_node_pages(sbi, inode->i_ino, &wbc) == 0)
		f2fs_write_inode(inode, NULL);

The reason was revealed that the cold flag is not set even thought this inode is
a normal file. Therefore, sync_node_pages() skips to write node blocks since it
only writes cold node blocks.

The cold flag is stored to the node_footer in node block, and whenever a new
node page is allocated, it is set according to its file type, file or directory.

But, after sudden-power-off, when recovering the inode page, f2fs doesn't recover
its cold flag.

So, let's assign the cold flag in more right places.

One more thing:
If f2fs_write_inode() returns an error due to whatever situations, there would
be no dirty node pages so that sync_node_pages() returns zero.
(i.e., zero means nothing was written.)

Reported-by: Ruslan N. Marchenko <me@ruff.mobi>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:52 +09:00
Namjae Jeon
38e0abdcfb f2fs: fix up f2fs_get_parent issue to retrieve correct parent inode number
Test Case:
[NFS Client]
ls -lR .

[NFS Server]
while [ 1 ]
do
echo 3 > /proc/sys/vm/drop_caches
done

Error on NFS Client: "No such file or directory"

When cache is dropped at the server, it results in lookup failure at the
NFS client due to non-connection with the parent. The default path is it
initiates a lookup by calculating the hash value for the name, even though
the hash values stored on the disk for "." and ".." is maintained as zero,
which results in failure from find_in_block due to not matching HASH values.
Fix up, by using the correct hashing values for these entries.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:52 +09:00
Jaegeuk Kim
1362b5e347 f2fs: fix wrong calculation on f_files in statfs
In f2fs_statfs(), f_files should be the total number of available inodes
instead of the currently allocated inodes.
So, this patch should resolve the reported bug below.

Note that, showing 10% usage is not a bug, since f2fs reveals whole volume size
as much as possible and shows the space overhead as *used*.
This policy is fair enough with respect to other file systems.

<Reported Bug>
(loop0 is backed by 1GiB file)

$ mkfs.f2fs /dev/loop0

F2FS-tools: Ver: 1.1.0 (2012-12-11)
Info: sector size = 512
Info: total sectors = 2097152 (in 512bytes)
Info: zone aligned segment0 blkaddr: 512
Info: format successful

$ mount /dev/loop0 mnt/

$ df mnt/
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop0       1046528 98312    929784  10%
/home/zeta/linux-devel/mtd-bench/mnt

$ df mnt/ -i
Filesystem     Inodes   IUsed  IFree IUse% Mounted on
/dev/loop0       1 -465918 465919     - /home/zeta/linux-devel/mtd-bench/mnt

Notice IUsed is negative. Also, 10% usage on a fresh f2fs seems too
much to be correct.

Reported-and-Tested-by: Ezequiel Garcia <elezegarcia@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:51 +09:00
Jaegeuk Kim
dfb7c0ceab f2fs: remove set_page_dirty for atomic f2fs_end_io_write
We should guarantee not to do *scheduling while atomic*.
I found, in atomic f2fs_end_io_write(), there is a set_page_dirty() call
to deal with IO errors.

But, set_page_dirty() calls:
 -> f2fs_set_data_page_dirty()
   -> set_dirty_dir_page()
      -> cond_resched() which results in scheduling.

In order to avoid this, I'd like to remove simply set_page_dirty(),
since the page is already marked as ERROR and f2fs will be operated
as the read-only mode as well.
So, there is no recovery issue with this.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2012-12-26 10:39:51 +09:00
Eric W. Biederman
dfb2ea45be proc: Allow proc_free_inum to be called from any context
While testing the pid namespace code I hit this nasty warning.

[  176.262617] ------------[ cut here ]------------
[  176.263388] WARNING: at /home/eric/projects/linux/linux-userns-devel/kernel/softirq.c:160 local_bh_enable_ip+0x7a/0xa0()
[  176.265145] Hardware name: Bochs
[  176.265677] Modules linked in:
[  176.266341] Pid: 742, comm: bash Not tainted 3.7.0userns+ #18
[  176.266564] Call Trace:
[  176.266564]  [<ffffffff810a539f>] warn_slowpath_common+0x7f/0xc0
[  176.266564]  [<ffffffff810a53fa>] warn_slowpath_null+0x1a/0x20
[  176.266564]  [<ffffffff810ad9ea>] local_bh_enable_ip+0x7a/0xa0
[  176.266564]  [<ffffffff819308c9>] _raw_spin_unlock_bh+0x19/0x20
[  176.266564]  [<ffffffff8123dbda>] proc_free_inum+0x3a/0x50
[  176.266564]  [<ffffffff8111d0dc>] free_pid_ns+0x1c/0x80
[  176.266564]  [<ffffffff8111d195>] put_pid_ns+0x35/0x50
[  176.266564]  [<ffffffff810c608a>] put_pid+0x4a/0x60
[  176.266564]  [<ffffffff8146b177>] tty_ioctl+0x717/0xc10
[  176.266564]  [<ffffffff810aa4d5>] ? wait_consider_task+0x855/0xb90
[  176.266564]  [<ffffffff81086bf9>] ? default_spin_lock_flags+0x9/0x10
[  176.266564]  [<ffffffff810cab0a>] ? remove_wait_queue+0x5a/0x70
[  176.266564]  [<ffffffff811e37e8>] do_vfs_ioctl+0x98/0x550
[  176.266564]  [<ffffffff810b8a0f>] ? recalc_sigpending+0x1f/0x60
[  176.266564]  [<ffffffff810b9127>] ? __set_task_blocked+0x37/0x80
[  176.266564]  [<ffffffff810ab95b>] ? sys_wait4+0xab/0xf0
[  176.266564]  [<ffffffff811e3d31>] sys_ioctl+0x91/0xb0
[  176.266564]  [<ffffffff810a95f0>] ? task_stopped_code+0x50/0x50
[  176.266564]  [<ffffffff81939199>] system_call_fastpath+0x16/0x1b
[  176.266564] ---[ end trace 387af88219ad6143 ]---

It turns out that spin_unlock_bh(proc_inum_lock) is not safe when
put_pid is called with another spinlock held and irqs disabled.

For now take the easy path and use spin_lock_irqsave(proc_inum_lock)
in proc_free_inum and spin_loc_irq in proc_alloc_inum(proc_inum_lock).

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25 16:23:12 -08:00
Michael Tokarev
d096ad0f79 ext4: do not try to write superblock on ro remount w/o journal
When a journal-less ext4 filesystem is mounted on a read-only block
device (blockdev --setro will do), each remount (for other, unrelated,
flags, like suid=>nosuid etc) results in a series of scary messages
from kernel telling about I/O errors on the device.

This is becauese of the following code ext4_remount():

       if (sbi->s_journal == NULL)
                ext4_commit_super(sb, 1);

at the end of remount procedure, which forces writing (flushing) of
a superblock regardless whenever it is dirty or not, if the filesystem
is readonly or not, and whenever the device itself is readonly or not.

We only need call ext4_commit_super when the file system had been
previously mounted read/write.

Thanks to Eric Sandeen for help in diagnosing this issue.

Signed-off-By: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-12-25 14:08:16 -05:00
Eric Sandeen
0875a2b448 ext4: include journal blocks in df overhead calcs
To more accurately calculate overhead for "bsd" style
df reporting, we should count the journal blocks as
overhead as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Eric Whitney <enwlinux@gmail.com>
2012-12-25 13:56:01 -05:00
Eric Sandeen
a28a9178e8 ext4: remove unaligned AIO warning printk
Although I put this in, I now think it was a bad decision.  For most
users, there is very little to be done in this case.  They get the
message, once per day, with no real context or proposed action.  TBH,
it generates support calls when it probably does not need to; the
message sounds more dire than the situation really is.

Just nuke it.  Normal investigation via blktrace or whatnot can
reveal poor IO patterns if bad performance is encountered.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:33:13 -05:00
Andy Lutomirski
ad96f71155 ext4: fix an incorrect comment about i_mutex
i_mutex is not held when ->sync_file is called.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:31:52 -05:00
Jan Kara
53e872681f ext4: fix deadlock in journal_unmap_buffer()
We cannot wait for transaction commit in journal_unmap_buffer()
because we hold page lock which ranks below transaction start.  We
solve the issue by bailing out of journal_unmap_buffer() and
jbd2_journal_invalidatepage() with -EBUSY.  Caller is then responsible
for waiting for transaction commit to finish and try invalidation
again. Since the issue can happen only for page stradding i_size, it
is simple enough to manually call jbd2_journal_invalidatepage() for
such page from ext4_setattr(), check the return value and wait if
necessary.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:29:52 -05:00
Jan Kara
4520fb3c36 ext4: split off ext4_journalled_invalidatepage()
In data=journal mode we don't need delalloc or DIO handling in invalidatepage
and similarly in other modes we don't need the journal handling. So split
invalidatepage implementations.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-12-25 13:28:54 -05:00
Linus Torvalds
769cb858c2 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Misc small cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: eliminate cifsERROR variable
  cifs: don't compare uniqueids in cifs_prime_dcache unless server inode numbers are in use
  cifs: fix double-free of "string" in cifs_parse_mount_options
2012-12-21 17:09:07 -08:00
J. Bruce Fields
10532b560b Revert "nfsd: warn on odd reply state in nfsd_vfs_read"
This reverts commit 79f77bf9a4.

This is obviously wrong, and I have no idea how I missed seeing the
warning in testing: I must just not have looked at the right logs.  The
caller bumps rq_resused/rq_next_page, so it will always be hit on a
large enough read.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21 17:07:45 -08:00
Trond Myklebust
c4271c6e37 NFS: Kill fscache warnings when mounting without -ofsc
The fscache code will currently bleat a "non-unique superblock keys"
warning even if the user is mounting without the 'fsc' option.

There should be no reason to even initialise the superblock cache cookie
unless we're planning on using fscache for something, so ensure that we
check for the NFS_OPTION_FSCACHE flag before calling into the fscache
code.

Reported-by: Paweł Sikora <pawel.sikora@agmk.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21 08:32:09 -08:00
David Howells
c129c29347 NFS: Provide stub nfs_fscache_wait_on_invalidate() for when CONFIG_NFS_FSCACHE=n
Provide a stub nfs_fscache_wait_on_invalidate() function for when
CONFIG_NFS_FSCACHE=n lest the following error appear:

  fs/nfs/inode.c: In function 'nfs_invalidate_mapping':
  fs/nfs/inode.c:887:2: error: implicit declaration of function 'nfs_fscache_wait_on_invalidate' [-Werror=implicit-function-declaration]
  cc1: some warnings being treated as errors

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21 08:06:48 -08:00
Jan Kara
d7961c7fa4 jbd2: fix assertion failure in jbd2_journal_flush()
The following race is possible between start_this_handle() and someone
calling jbd2_journal_flush().

Process A                              Process B
start_this_handle().
  if (journal->j_barrier_count) # false
  if (!journal->j_running_transaction) { #true
    read_unlock(&journal->j_state_lock);
                                       jbd2_journal_lock_updates()
                                       jbd2_journal_flush()
                                         write_lock(&journal->j_state_lock);
                                         if (journal->j_running_transaction) {
                                           # false
                                         ... wait for committing trans ...
                                         write_unlock(&journal->j_state_lock);
    ...
    write_lock(&journal->j_state_lock);
    if (!journal->j_running_transaction) { # true
      jbd2_get_transaction(journal, new_transaction);
    write_unlock(&journal->j_state_lock);
    goto repeat; # eventually blocks on j_barrier_count > 0
                                         ...
                                         J_ASSERT(!journal->j_running_transaction);
                                           # fails

We fix the race by rechecking j_barrier_count after reacquiring j_state_lock
in exclusive mode.

Reported-by: yjwsignal@empal.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-12-21 00:15:51 -05:00
Linus Torvalds
96680d2b91 Merge branch 'for-next' of git://git.infradead.org/users/eparis/notify
Pull filesystem notification updates from Eric Paris:
 "This pull mostly is about locking changes in the fsnotify system.  By
  switching the group lock from a spin_lock() to a mutex() we can now
  hold the lock across things like iput().  This fixes a problem
  involving unmounting a fs and having inodes be busy, first pointed out
  by FAT, but reproducible with tmpfs.

  This also restores signal driven I/O for inotify, which has been
  broken since about 2.6.32."

Ugh.  I *hate* the timing of this.  It was rebased after the merge
window opened, and then left to sit with the pull request coming the day
before the merge window closes.  That's just crap.  But apparently the
patches themselves have been around for over a year, just gathering
dust, so now it's suddenly critical.

Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
Rothwell's fixes from -next.

* 'for-next' of git://git.infradead.org/users/eparis/notify:
  inotify: automatically restart syscalls
  inotify: dont skip removal of watch descriptor if creation of ignored event failed
  fanotify: dont merge permission events
  fsnotify: make fasync generic for both inotify and fanotify
  fsnotify: change locking order
  fsnotify: dont put marks on temporary list when clearing marks by group
  fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
  fsnotify: pass group to fsnotify_destroy_mark()
  fsnotify: use a mutex instead of a spinlock to protect a groups mark list
  fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
  fsnotify: take groups mark_lock before mark lock
  fsnotify: use reference counting for groups
  fsnotify: introduce fsnotify_get_group()
  inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()
2012-12-20 20:11:52 -08:00
Linus Torvalds
4c9a44aebe Merge branch 'akpm' (Andrew's patch-bomb)
Merge the rest of Andrew's patches for -rc1:
 "A bunch of fixes and misc missed-out-on things.

  That'll do for -rc1.  I still have a batch of IPC patches which still
  have a possible bug report which I'm chasing down."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
  keys: use keyring_alloc() to create module signing keyring
  keys: fix unreachable code
  sendfile: allows bypassing of notifier events
  SGI-XP: handle non-fatal traps
  fat: fix incorrect function comment
  Documentation: ABI: remove testing/sysfs-devices-node
  proc: fix inconsistent lock state
  linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
  memcg: don't register hotcpu notifier from ->css_alloc()
  checkpatch: warn on uapi #includes that #include <uapi/...
  revert "rtc: recycle id when unloading a rtc driver"
  mm: clean up transparent hugepage sysfs error messages
  hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
  hfsplus: rework processing of hfs_btree_write() returned error
  hfsplus: rework processing errors in hfsplus_free_extents()
  hfsplus: avoid crash on failed block map free
  kcmp: include linux/ptrace.h
  drivers/rtc/rtc-imxdi.c: must include <linux/spinlock.h>
  mm: cma: WARN if freed memory is still in use
  exec: do not leave bprm->interp on stack
  ...
2012-12-20 20:00:43 -08:00
Linus Torvalds
1f0377ff08 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS update from Al Viro:
 "fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
  misc stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
  vfs: make lremovexattr retry once on ESTALE error
  vfs: make removexattr retry once on ESTALE
  vfs: make llistxattr retry once on ESTALE error
  vfs: make listxattr retry once on ESTALE error
  vfs: make lgetxattr retry once on ESTALE
  vfs: make getxattr retry once on an ESTALE error
  vfs: allow lsetxattr() to retry once on ESTALE errors
  vfs: allow setxattr to retry once on ESTALE errors
  vfs: allow utimensat() calls to retry once on an ESTALE error
  vfs: fix user_statfs to retry once on ESTALE errors
  vfs: make fchownat retry once on ESTALE errors
  vfs: make fchmodat retry once on ESTALE errors
  vfs: have chroot retry once on ESTALE error
  vfs: have chdir retry lookup and call once on ESTALE error
  vfs: have faccessat retry once on an ESTALE error
  vfs: have do_sys_truncate retry once on an ESTALE error
  vfs: fix renameat to retry on ESTALE errors
  vfs: make do_unlinkat retry once on ESTALE errors
  vfs: make do_rmdir retry once on ESTALE errors
  vfs: add a flags argument to user_path_parent
  ...
2012-12-20 18:14:31 -08:00
Linus Torvalds
54d46ea993 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull signal handling cleanups from Al Viro:
 "sigaltstack infrastructure + conversion for x86, alpha and um,
  COMPAT_SYSCALL_DEFINE infrastructure.

  Note that there are several conflicts between "unify
  SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
  resolution is trivial - just remove definitions of SS_ONSTACK and
  SS_DISABLED from arch/*/uapi/asm/signal.h; they are all identical and
  include/uapi/linux/signal.h contains the unified variant."

Fixed up conflicts as per Al.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  alpha: switch to generic sigaltstack
  new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
  generic compat_sys_sigaltstack()
  introduce generic sys_sigaltstack(), switch x86 and um to it
  new helper: compat_user_stack_pointer()
  new helper: restore_altstack()
  unify SS_ONSTACK/SS_DISABLE definitions
  new helper: current_user_stack_pointer()
  missing user_stack_pointer() instances
  Bury the conditionals from kernel_thread/kernel_execve series
  COMPAT_SYSCALL_DEFINE: infrastructure
2012-12-20 18:05:28 -08:00
Scott Wolchok
a68c2f12b4 sendfile: allows bypassing of notifier events
do_sendfile() in fs/read_write.c does not call the fsnotify functions,
unlike its neighbors.  This manifests as a lack of inotify ACCESS events
when a file is sent using sendfile(2).

Addresses
  https://bugzilla.kernel.org/show_bug.cgi?id=12812

[akpm@linux-foundation.org: use fsnotify_modify(out.file), not fsnotify_access(), per Dave]
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Scott Wolchok <swolchok@umich.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:21 -08:00
Ravishankar N
c39540c6d1 fat: fix incorrect function comment
fat_search_long() returns 0 on success, -ENOENT/ENOMEM on failure.
Change the function comment accordingly.

While at it, fix some trivial typos.

Signed-off-by: Ravishankar N <cyberax82@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:20 -08:00
Xiaotian Feng
ee297209bf proc: fix inconsistent lock state
Lockdep found an inconsistent lock state when rcu is processing delayed
work in softirq.  Currently, kernel is using spin_lock/spin_unlock to
protect proc_inum_ida, but proc_free_inum is called by rcu in softirq
context.

Use spin_lock_bh/spin_unlock_bh fix following lockdep warning.

  =================================
  [ INFO: inconsistent lock state ]
  3.7.0 #36 Not tainted
  ---------------------------------
  inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
   (proc_inum_lock){+.?...}, at: proc_free_inum+0x1c/0x50
  {SOFTIRQ-ON-W} state was registered at:
     __lock_acquire+0x8ae/0xca0
     lock_acquire+0x199/0x200
     _raw_spin_lock+0x41/0x50
     proc_alloc_inum+0x4c/0xd0
     alloc_mnt_ns+0x49/0xc0
     create_mnt_ns+0x25/0x70
     mnt_init+0x161/0x1c7
     vfs_caches_init+0x107/0x11a
     start_kernel+0x348/0x38c
     x86_64_start_reservations+0x131/0x136
     x86_64_start_kernel+0x103/0x112
  irq event stamp: 2993422
  hardirqs last  enabled at (2993422):  _raw_spin_unlock_irqrestore+0x55/0x80
  hardirqs last disabled at (2993421):  _raw_spin_lock_irqsave+0x29/0x70
  softirqs last  enabled at (2993394):  _local_bh_enable+0x13/0x20
  softirqs last disabled at (2993395):  call_softirq+0x1c/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(proc_inum_lock);
    <Interrupt>
      lock(proc_inum_lock);

   *** DEADLOCK ***

  no locks held by swapper/1/0.

  stack backtrace:
  Pid: 0, comm: swapper/1 Not tainted 3.7.0 #36
  Call Trace:
   <IRQ>  [<ffffffff810a40f1>] ? vprintk_emit+0x471/0x510
    print_usage_bug+0x2a5/0x2c0
    mark_lock+0x33b/0x5e0
    __lock_acquire+0x813/0xca0
    lock_acquire+0x199/0x200
    _raw_spin_lock+0x41/0x50
    proc_free_inum+0x1c/0x50
    free_pid_ns+0x1c/0x50
    put_pid_ns+0x2e/0x50
    put_pid+0x4a/0x60
    delayed_put_pid+0x12/0x20
    rcu_process_callbacks+0x462/0x790
    __do_softirq+0x1b4/0x3b0
    call_softirq+0x1c/0x30
    do_softirq+0x59/0xd0
    irq_exit+0x54/0xd0
    smp_apic_timer_interrupt+0x95/0xa3
    apic_timer_interrupt+0x72/0x80
    cpuidle_enter_tk+0x10/0x20
    cpuidle_enter_state+0x17/0x50
    cpuidle_idle_call+0x287/0x520
    cpu_idle+0xba/0x130
    start_secondary+0x2b3/0x2bc

Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:20 -08:00
Vyacheslav Dubeyko
bffdd661bd hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
Add an error message for the case of failure of sync fs in
delayed_sync_fs() method.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:20 -08:00
Vyacheslav Dubeyko
81cc7fad55 hfsplus: rework processing of hfs_btree_write() returned error
Add to hfs_btree_write() a return of -EIO on failure of b-tree node
searching.  Also add logic ofor processing errors from hfs_btree_write()
in hfsplus_system_write_inode() with a message about b-tree writing
failure.

[akpm@linux-foundation.org: reduce scope of `err', print errno on error]
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Vyacheslav Dubeyko
1b243fd39b hfsplus: rework processing errors in hfsplus_free_extents()
Currently, it doesn't process error codes from the hfsplus_block_free()
call in hfsplus_free_extents() method.  Add some error code processing.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Alan Cox
5daa669c80 hfsplus: avoid crash on failed block map free
If the read fails we kmap an error code.  This doesn't end well.  Instead
print a critical error and pray.  This mirrors the rest of the fs
behaviour with critical error cases.

Acked-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Kees Cook
b66c598401 exec: do not leave bprm->interp on stack
If a series of scripts are executed, each triggering module loading via
unprintable bytes in the script header, kernel stack contents can leak
into the command line.

Normally execution of binfmt_script and binfmt_misc happens recursively.
However, when modules are enabled, and unprintable bytes exist in the
bprm->buf, execution will restart after attempting to load matching
binfmt modules.  Unfortunately, the logic in binfmt_script and
binfmt_misc does not expect to get restarted.  They leave bprm->interp
pointing to their local stack.  This means on restart bprm->interp is
left pointing into unused stack memory which can then be copied into the
userspace argv areas.

After additional study, it seems that both recursion and restart remains
the desirable way to handle exec with scripts, misc, and modules.  As
such, we need to protect the changes to interp.

This changes the logic to require allocation for any changes to the
bprm->interp.  To avoid adding a new kmalloc to every exec, the default
value is left as-is.  Only when passing through binfmt_script or
binfmt_misc does an allocation take place.

For a proof of concept, see DoTest.sh from:

   http://www.halfdog.net/Security/2012/LinuxKernelBinfmtScriptStackDataDisclosure/

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: halfdog <me@halfdog.net>
Cc: P J P <ppandit@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:19 -08:00
Jeff Layton
b729d75d19 vfs: make lremovexattr retry once on ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:11 -05:00
Jeff Layton
12f0621299 vfs: make removexattr retry once on ESTALE
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:10 -05:00
Jeff Layton
bd9bbc9842 vfs: make llistxattr retry once on ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:10 -05:00
Jeff Layton
10a90cf36e vfs: make listxattr retry once on ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:10 -05:00
Jeff Layton
3a3e159dbf vfs: make lgetxattr retry once on ESTALE
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:09 -05:00
Jeff Layton
60e66b48ca vfs: make getxattr retry once on an ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:09 -05:00
Jeff Layton
49e09e1cc5 vfs: allow lsetxattr() to retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:09 -05:00
Jeff Layton
68f1bb8bb8 vfs: allow setxattr to retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:08 -05:00
Jeff Layton
a69201d6f0 vfs: allow utimensat() calls to retry once on an ESTALE error
Clearly, we can't handle the NULL filename case, but we can deal with
the case where there's a real pathname.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:08 -05:00
Jeff Layton
96948fc606 vfs: fix user_statfs to retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:07 -05:00
Jeff Layton
99a5df37a0 vfs: make fchownat retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:07 -05:00
Jeff Layton
14ff690c0f vfs: make fchmodat retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:07 -05:00
Jeff Layton
2771261ec5 vfs: have chroot retry once on ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:06 -05:00
Jeff Layton
0291c0a551 vfs: have chdir retry lookup and call once on ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:06 -05:00
Jeff Layton
87fa55952b vfs: have faccessat retry once on an ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:05 -05:00
Jeff Layton
48f7530d3f vfs: have do_sys_truncate retry once on an ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:05 -05:00
Jeff Layton
c6a9428401 vfs: fix renameat to retry on ESTALE errors
...as always, rename is the messiest of the bunch. We have to track
whether to retry or not via a separate flag since the error handling
is already quite complex.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:05 -05:00
Jeff Layton
5d18f8133c vfs: make do_unlinkat retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:04 -05:00
Jeff Layton
c6ee920698 vfs: make do_rmdir retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:04 -05:00
Jeff Layton
9e790bd65c vfs: add a flags argument to user_path_parent
...so we can pass in LOOKUP_REVAL. For now, nothing does yet.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:04 -05:00
Jeff Layton
442e31ca5a vfs: fix linkat to retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:03 -05:00
Jeff Layton
f46d3567b2 vfs: fix symlinkat to retry on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:03 -05:00
Jeff Layton
b76d8b8226 vfs: fix mkdirat to retry once on an ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:02 -05:00
Jeff Layton
972567f14c vfs: fix mknodat to retry on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:02 -05:00
Jeff Layton
1ac12b4b6d vfs: turn is_dir argument to kern_path_create into a lookup_flags arg
Where we can pass in LOOKUP_DIRECTORY or LOOKUP_REVAL. Any other flags
passed in here are currently ignored.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:02 -05:00
Jeff Layton
7955119e02 vfs: fix readlinkat to retry on ESTALE
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:01 -05:00
Jeff Layton
836fb7e7b9 vfs: make fstatat retry on ESTALE errors from getattr call
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:01 -05:00
Al Viro
21e89c0c48 Merge branch 'fscache' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs into for-linus 2012-12-20 18:49:14 -05:00
NeilBrown
b911a6bdee vfs: d_obtain_alias() needs to use "/" as default name.
NFS appears to use d_obtain_alias() to create the root dentry rather than
d_make_root.  This can cause 'prepend_path()' to complain that the root
has a weird name if an NFS filesystem is lazily unmounted.  e.g.  if
"/mnt" is an NFS mount then

 { cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; }

will cause a WARN message like
   WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0()
   ...
   Root dentry has weird name <>

to appear in kernel logs.

So change d_obtain_alias() to use "/" rather than "" as the anonymous
name.

Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:49:10 -05:00
Marco Stornelli
d30357f2f0 vfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:46:29 -05:00
Marco Stornelli
9014da7525 ntfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:40:55 -05:00
Marco Stornelli
2d1b399b22 nilfs2: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:40:54 -05:00
Marco Stornelli
3e7a806928 ncpfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:40:54 -05:00
Marco Stornelli
7fc7cd00f6 minix: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:40:53 -05:00
Marco Stornelli
5dfc2821e8 logfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:40:53 -05:00
Marco Stornelli
d506848567 hfsplus: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:40:52 -05:00
Marco Stornelli
86dd07d66a jfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:40:52 -05:00
Marco Stornelli
70b31c4c88 hpfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:40:00 -05:00
David Howells
91c7fbbf63 FS-Cache: Clear remaining page count on retrieval cancellation
Provide fscache_cancel_op() with a pointer to a function it should invoke under
lock if it cancels an operation.

Use this to clear the remaining page count upon cancellation of a pending
retrieval operation so that fscache_release_retrieval_op() doesn't get an
assertion failure (see below).  This can happen when a signal occurs, say from
CTRL-C being pressed during data retrieval.

FS-Cache: Assertion failed
3 == 0 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/page.c:237!
invalid opcode: 0000 [#641] SMP
Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
CPU 0
Pid: 6075, comm: slurp-q Tainted: GF     D      3.7.0-rc8-fsdevel+ #411                  /DG965RY
RIP: 0010:[<ffffffffa007f328>]  [<ffffffffa007f328>] fscache_release_retrieval_op+0x75/0xff [fscache]
RSP: 0000:ffff88001c6d7988  EFLAGS: 00010296
RAX: 000000000000000f RBX: ffff880014cdfe00 RCX: ffffffff6c102000
RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6
RBP: ffff88001c6d7998 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffe00
R13: ffff88001c6d7ab4 R14: ffff88001a8638a0 R15: ffff88001552b190
FS:  00007f877aaf0700(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff11378fd2 CR3: 000000001c6c6000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process slurp-q (pid: 6075, threadinfo ffff88001c6d6000, task ffff88001c6c4080)
Stack:
 ffffffffa007ec07 ffff880014cdfe00 ffff88001c6d79c8 ffffffffa007db4d
 ffffffffa007ec07 ffff880014cdfe00 00000000fffffe00 ffff88001c6d7ab4
 ffff88001c6d7a38 ffffffffa008116d 0000000000000000 ffff88001c6c4080
Call Trace:
 [<ffffffffa007ec07>] ? fscache_cancel_op+0x194/0x1cf [fscache]
 [<ffffffffa007db4d>] fscache_put_operation+0x135/0x2ed [fscache]
 [<ffffffffa007ec07>] ? fscache_cancel_op+0x194/0x1cf [fscache]
 [<ffffffffa008116d>] __fscache_read_or_alloc_pages+0x413/0x4bc [fscache]
 [<ffffffff810ac8ae>] ? __alloc_pages_nodemask+0x195/0x75c
 [<ffffffffa00aab0f>] __nfs_readpages_from_fscache+0x86/0x13d [nfs]
 [<ffffffffa00a5fe0>] nfs_readpages+0x186/0x1bd [nfs]
 [<ffffffff810d23c8>] ? alloc_pages_current+0xc7/0xe4
 [<ffffffff810a68b5>] ? __page_cache_alloc+0x84/0x91
 [<ffffffff810af912>] ? __do_page_cache_readahead+0xa6/0x2e0
 [<ffffffff810afaa3>] __do_page_cache_readahead+0x237/0x2e0
 [<ffffffff810af912>] ? __do_page_cache_readahead+0xa6/0x2e0
 [<ffffffff810afe3e>] ra_submit+0x1c/0x20
 [<ffffffff810b019b>] ondemand_readahead+0x359/0x382
 [<ffffffff810b0279>] page_cache_sync_readahead+0x38/0x3a
 [<ffffffff810a77b5>] generic_file_aio_read+0x26b/0x637
 [<ffffffffa00f1852>] ? nfs_mark_delegation_referenced+0xb/0xb [nfsv4]
 [<ffffffffa009cc85>] nfs_file_read+0xaa/0xcf [nfs]
 [<ffffffff810db5b3>] do_sync_read+0x91/0xd1
 [<ffffffff810dbb8b>] vfs_read+0x9b/0x144
 [<ffffffff810dbc78>] sys_read+0x44/0x75
 [<ffffffff81422892>] system_call_fastpath+0x16/0x1b

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:35:15 +00:00
David Howells
1f372dff1d FS-Cache: Mark cancellation of in-progress operation
Mark as cancelled an operation that is in progress rather than pending at the
time it is cancelled, and call fscache_complete_op() to cancel an operation so
that blocked ops can be started.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:34:00 +00:00
David Howells
7ef001e937 FS-Cache: One of the write operation paths doesn't set the object state
In fscache_write_op(), if the object is determined to have become inactive or
to have lost its cookie, we don't move the operation state from in-progress,
and so an assertion in fscache_put_operation() fails with an assertion (see
below).

Instrumenting fscache_op_work_func() indicates that it called
fscache_write_op() before calling fscache_put_operation() - where the assertion
failed.  The assertion at line 433 indicates that the operation state is
IN_PROGRESS rather than being COMPLETE or CANCELLED.

Instrumenting fscache_write_op() showed that it was being called on an object
that had had its cookie removed and that this was due to relinquishment of the
cookie by the netfs.  At this point fscache no longer has access to the pages
of netfs data that were requested to be written, and so simply cancelling the
operation is the thing to do.

FS-Cache: Assertion failed
3 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:433!
invalid opcode: 0000 [#1] SMP
Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
CPU 0
Pid: 1035, comm: kworker/u:3 Tainted: GF            3.7.0-rc8-fsdevel+ #411                  /DG965RY
RIP: 0010:[<ffffffffa007db22>]  [<ffffffffa007db22>] fscache_put_operation+0x11a/0x2ed [fscache]
RSP: 0018:ffff88003e32bcf8  EFLAGS: 00010296
RAX: 000000000000000f RBX: ffff88001818eb78 RCX: ffffffff6c102000
RDX: ffffffff8102d1ad RSI: ffffffff6c102000 RDI: ffffffff8102d1d6
RBP: ffff88003e32bd18 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa00811da
R13: 0000000000000001 R14: 0000000100625d26 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff7dd31c68 CR3: 000000003d730000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:3 (pid: 1035, threadinfo ffff88003e32a000, task ffff88003bb38080)
Stack:
 ffffffff8102d1ad ffff88001818eb78 ffffffffa00811da 0000000000000001
 ffff88003e32bd48 ffffffffa007f0ad ffff88001818eb78 ffffffff819583c0
 ffff88003df24e00 ffff88003882c3e0 ffff88003e32bde8 ffffffff81042de0
Call Trace:
 [<ffffffff8102d1ad>] ? vprintk_emit+0x3c6/0x41a
 [<ffffffffa00811da>] ? __fscache_read_or_alloc_pages+0x4bc/0x4bc [fscache]
 [<ffffffffa007f0ad>] fscache_op_work_func+0xec/0x123 [fscache]
 [<ffffffff81042de0>] process_one_work+0x21c/0x3b0
 [<ffffffff81042d82>] ? process_one_work+0x1be/0x3b0
 [<ffffffffa007efc1>] ? fscache_operation_gc+0x23e/0x23e [fscache]
 [<ffffffff8104332e>] worker_thread+0x202/0x2df
 [<ffffffff8104312c>] ? rescuer_thread+0x18e/0x18e
 [<ffffffff81047c1c>] kthread+0xd0/0xd8
 [<ffffffff81421bfa>] ? _raw_spin_unlock_irq+0x29/0x3e
 [<ffffffff81047b4c>] ? __init_kthread_worker+0x55/0x55
 [<ffffffff814227ec>] ret_from_fork+0x7c/0xb0
 [<ffffffff81047b4c>] ? __init_kthread_worker+0x55/0x55

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:20:40 +00:00
David Howells
9c04caa81b FS-Cache: Fix signal handling during waits
wait_on_bit() with TASK_INTERRUPTIBLE returns 1 rather than a negative error
code, so change what we check for.  This means that the signal handling in
fscache_wait_for_retrieval_activation()  should now work properly.

Without this, the following bug can be seen if CTRL-C is pressed during
fscache read operation:

FS-Cache: Assertion failed
2 == 3 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/page.c:347!
invalid opcode: 0000 [#1] SMP
Modules linked in: cachefiles(F) nfsv4(F) nfsv3(F) nfsv2(F) nfs(F) fscache(F) auth_rpcgss(F) nfs_acl(F) lockd(F) sunrpc(F)
CPU 1
Pid: 15006, comm: slurp-q Tainted: GF            3.7.0-rc8-fsdevel+ #411                  /DG965RY
RIP: 0010:[<ffffffffa007fcb4>]  [<ffffffffa007fcb4>] fscache_wait_for_retrieval_activation+0x167/0x177 [fscache]
RSP: 0018:ffff88002a4c39a8  EFLAGS: 00010292
RAX: 000000000000001a RBX: ffff88002d3dc158 RCX: 0000000000008685
RDX: ffffffff8102ccd6 RSI: 0000000000000001 RDI: ffffffff8102d1d6
RBP: ffff88002a4c39c8 R08: 0000000000000002 R09: 0000000000000000
R10: ffffffff8163afa0 R11: ffff88003bd11900 R12: ffffffffa00868c8
R13: ffff880028306458 R14: ffff88002d3dc1b0 R15: ffff88001372e538
FS:  00007f17426a0700(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f1742494a44 CR3: 0000000031bd7000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process slurp-q (pid: 15006, threadinfo ffff88002a4c2000, task ffff880023de3040)
Stack:
 ffff88002d3dc158 ffff88001372e538 ffff88002a4c3ab4 ffff8800283064e0
 ffff88002a4c3a38 ffffffffa0080f6d 0000000000000000 ffff880023de3040
 ffff88002a4c3ac8 ffffffff810ac8ae ffff880028306458 ffff88002a4c3bc8
Call Trace:
 [<ffffffffa0080f6d>] __fscache_read_or_alloc_pages+0x24f/0x4bc [fscache]
 [<ffffffff810ac8ae>] ? __alloc_pages_nodemask+0x195/0x75c
 [<ffffffffa00aab0f>] __nfs_readpages_from_fscache+0x86/0x13d [nfs]
 [<ffffffffa00a5fe0>] nfs_readpages+0x186/0x1bd [nfs]
 [<ffffffff810d23c8>] ? alloc_pages_current+0xc7/0xe4
 [<ffffffff810a68b5>] ? __page_cache_alloc+0x84/0x91
 [<ffffffff810af912>] ? __do_page_cache_readahead+0xa6/0x2e0
 [<ffffffff810afaa3>] __do_page_cache_readahead+0x237/0x2e0
 [<ffffffff810af912>] ? __do_page_cache_readahead+0xa6/0x2e0
 [<ffffffff810afe3e>] ra_submit+0x1c/0x20
 [<ffffffff810b019b>] ondemand_readahead+0x359/0x382
 [<ffffffff810b0279>] page_cache_sync_readahead+0x38/0x3a
 [<ffffffff810a77b5>] generic_file_aio_read+0x26b/0x637
 [<ffffffffa00f1852>] ? nfs_mark_delegation_referenced+0xb/0xb [nfsv4]
 [<ffffffffa009cc85>] nfs_file_read+0xaa/0xcf [nfs]
 [<ffffffff810db5b3>] do_sync_read+0x91/0xd1
 [<ffffffff810dbb8b>] vfs_read+0x9b/0x144
 [<ffffffff810dbc78>] sys_read+0x44/0x75
 [<ffffffff81422892>] system_call_fastpath+0x16/0x1b

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:20:23 +00:00
David Howells
a4ff146881 NFS4: Open files for fscaching
nfs4_file_open() should open files for fscaching.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:19:42 +00:00
David Howells
969695215f FS-Cache: Add transition to handle invalidate immediately after lookup
Add a missing transition to the FS-Cache object state machine to handle an
invalidation event occuring between the back end completing the object lookup
by calling fscache_obtained_object() (which moves to state OBJECT_AVAILABLE)
and the backend returning to fscache_lookup_object() and thence to
fscache_object_state_machine() which then does a goto lookup_transit to handle
the transition - but lookup_transit doesn't handle EV_INVALIDATE.

Without this, the following BUG can be logged:

	FS-Cache: Unsupported event 2 [5/f7] in state OBJECT_AVAILABLE
	------------[ cut here ]------------
	kernel BUG at fs/fscache/object.c:357!

Where event 2 is EV_INVALIDATE.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:19:22 +00:00
David Howells
8c209ce721 NFS: nfs_migrate_page() does not wait for FS-Cache to finish with a page
nfs_migrate_page() does not wait for FS-Cache to finish with a page, probably
leading to the following bad-page-state:

 BUG: Bad page state in process python-bin  pfn:17d39b
 page:ffffea00053649e8 flags:004000000000100c count:0 mapcount:0 mapping:(null)
index:38686 (Tainted: G    B      ---------------- )
 Pid: 31053, comm: python-bin Tainted: G    B      ----------------
2.6.32-71.24.1.el6.x86_64 #1
 Call Trace:
 [<ffffffff8111bfe7>] bad_page+0x107/0x160
 [<ffffffff8111ee69>] free_hot_cold_page+0x1c9/0x220
 [<ffffffff8111ef19>] __pagevec_free+0x59/0xb0
 [<ffffffff8104b988>] ? flush_tlb_others_ipi+0x128/0x130
 [<ffffffff8112230c>] release_pages+0x21c/0x250
 [<ffffffff8115b92a>] ? remove_migration_pte+0x28a/0x2b0
 [<ffffffff8115f3f8>] ? mem_cgroup_get_reclaim_stat_from_page+0x18/0x70
 [<ffffffff81122687>] ____pagevec_lru_add+0x167/0x180
 [<ffffffff811226f8>] __lru_cache_add+0x58/0x70
 [<ffffffff81122731>] lru_cache_add_lru+0x21/0x40
 [<ffffffff81123f49>] putback_lru_page+0x69/0x100
 [<ffffffff8115c0bd>] migrate_pages+0x13d/0x5d0
 [<ffffffff81122687>] ? ____pagevec_lru_add+0x167/0x180
 [<ffffffff81152ab0>] ? compaction_alloc+0x0/0x370
 [<ffffffff8115255c>] compact_zone+0x4cc/0x600
 [<ffffffff8111cfac>] ? get_page_from_freelist+0x15c/0x820
 [<ffffffff810672f4>] ? check_preempt_wakeup+0x1c4/0x3c0
 [<ffffffff8115290e>] compact_zone_order+0x7e/0xb0
 [<ffffffff81152a49>] try_to_compact_pages+0x109/0x170
 [<ffffffff8111e94d>] __alloc_pages_nodemask+0x5ed/0x850
 [<ffffffff814c9136>] ? thread_return+0x4e/0x778
 [<ffffffff81150d43>] alloc_pages_vma+0x93/0x150
 [<ffffffff81167ea5>] do_huge_pmd_anonymous_page+0x135/0x340
 [<ffffffff814cb6f6>] ? rwsem_down_read_failed+0x26/0x30
 [<ffffffff81136755>] handle_mm_fault+0x245/0x2b0
 [<ffffffff814ce383>] do_page_fault+0x123/0x3a0
 [<ffffffff814cbdf5>] page_fault+0x25/0x30

nfs_migrate_page() calls nfs_fscache_release_page() which doesn't actually wait
- even if __GFP_WAIT is set.  The reason that doesn't wait is that
fscache_maybe_release_page() might deadlock the allocator as the work threads
writing to the cache may all end up sleeping on memory allocation.

However, I wonder if that is actually a problem.  There are a number of things
I can do to deal with this:

 (1) Make nfs_migrate_page() wait.

 (2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag.

 (3) Set a timeout around the wait.

 (4) Make nfs_migrate_page() return an error if the page is still busy.

For the moment, I'll select (2) and (4).

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2012-12-20 22:12:03 +00:00
David Howells
8d76349d35 FS-Cache: Exclusive op submission can BUG if there's been an I/O error
The function to submit an exclusive op (fscache_submit_exclusive_op()) can BUG
if there's been an I/O error because it may see the parent cache object in an
unexpected state.  It should only BUG if there hasn't been an I/O error.

In this case the problem was produced by remounting the cache partition to be
R/O.  The EROFS state was detected and the cache was aborted, but not
everything handled the aborting correctly.

SysRq : Emergency Remount R/O
EXT4-fs (sda6): re-mounted. Opts: (null)
Emergency Remount complete
CacheFiles: I/O Error: Failed to update xattr with error -30
FS-Cache: Cache cachefiles stopped due to I/O error
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:128!
invalid opcode: 0000 [#1] SMP 
CPU 0 
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

Pid: 6612, comm: kworker/u:2 Not tainted 3.1.0-rc8-fsdevel+ #1093                  /DG965RY
RIP: 0010:[<ffffffffa00739c0>]  [<ffffffffa00739c0>] fscache_submit_exclusive_op+0x2ad/0x2c2 [fscache]
RSP: 0018:ffff880000853d40  EFLAGS: 00010206
RAX: ffff880038ac72a8 RBX: ffff8800181f2260 RCX: ffffffff81f2b2b0
RDX: 0000000000000001 RSI: ffffffff8179a478 RDI: ffff8800181f2280
RBP: ffff880000853d60 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff880038ac7268
R13: ffff8800181f2280 R14: ffff88003a359190 R15: 000000010122b162
FS:  0000000000000000(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000034cc4a77f0 CR3: 0000000010e96000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/u:2 (pid: 6612, threadinfo ffff880000852000, task ffff880014c3c040)
Stack:
 ffff8800181f2260 ffff8800181f2310 ffff880038ac7268 ffff8800181f2260
 ffff880000853dc0 ffffffffa0072375 ffff880037ecfe00 ffff88003a359198
 ffff880000853dc0 0000000000000246 0000000000000000 ffff88000a91d308
Call Trace:
 [<ffffffffa0072375>] fscache_object_work_func+0x792/0xe65 [fscache]
 [<ffffffff81047e44>] process_one_work+0x1eb/0x37f
 [<ffffffff81047de6>] ? process_one_work+0x18d/0x37f
 [<ffffffffa0071be3>] ? fscache_enqueue_dependents+0xd8/0xd8 [fscache]
 [<ffffffff810482e4>] worker_thread+0x15a/0x21a
 [<ffffffff8104818a>] ? rescuer_thread+0x188/0x188
 [<ffffffff8104bf96>] kthread+0x7f/0x87
 [<ffffffff813ad6f4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0
 [<ffffffff813abd1d>] ? retint_restore_args+0xe/0xe
 [<ffffffff8104bf17>] ? __init_kthread_worker+0x53/0x53
 [<ffffffff813ad6f0>] ? gs_change+0xb/0xb


Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:10:58 +00:00
David Howells
75bc411388 FS-Cache: Limit the number of I/O error reports for a cache
Limit the number of I/O error reports for a cache to 1 to prevent massive
amounts of noise.  After the first I/O error the cache is taken off line
automatically, so must be restarted to resume caching.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:10:44 +00:00
David Howells
c2d35bfe4b FS-Cache: Don't mask off the object event mask when printing it
Don't mask off the object event mask when printing it.  That way it can be seen
if threre are bits set that shouldn't be.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:08:53 +00:00
David Howells
03acc4be5e FS-Cache: Initialise the object event mask with the calculated mask
Initialise the object event mask with the calculated mask rather than unmasking
undefined events also.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:08:39 +00:00
David Howells
b4cf1e08c8 CacheFiles: Add missing retrieval completions
CacheFiles is missing some calls to fscache_retrieval_complete() in the error
handling/collision paths of its reader functions.

This can be seen by the following assertion tripping in fscache_put_operation()
whereby the operation being destroyed is still in the in-progress state and has
not been cancelled or completed:

FS-Cache: Assertion failed
3 == 5 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/operation.c:408!
invalid opcode: 0000 [#1] SMP
CPU 2
Modules linked in: xfs ioatdma dca loop joydev evdev
psmouse dcdbas pcspkr serio_raw i5000_edac edac_core i5k_amb shpchp
pci_hotplug sg sr_mod]

Pid: 8062, comm: httpd Not tainted 3.1.0-rc8 #1 Dell Inc. PowerEdge 1950/0DT097
RIP: 0010:[<ffffffff81197b24>]  [<ffffffff81197b24>] fscache_put_operation+0x304/0x330
RSP: 0018:ffff880062f739d8  EFLAGS: 00010296
RAX: 0000000000000025 RBX: ffff8800c5122e84 RCX: ffffffff81ddf040
RDX: 00000000ffffffff RSI: 0000000000000082 RDI: ffffffff81ddef30
RBP: ffff880062f739f8 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000003 R12: ffff8800c5122e40
R13: ffff880037a2cd20 R14: ffff880087c7a058 R15: ffff880087c7a000
FS:  00007f63dcf636e0(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0c0a91f000 CR3: 0000000062ec2000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process httpd (pid: 8062, threadinfo ffff880062f72000, task ffff880087e58000)
Stack:
 ffff880062f73bf8 0000000000000000 ffff880062f73bf8 ffff880037a2cd20
 ffff880062f73a68 ffffffff8119aa7e ffff88006540e000 ffff880062f73ad4
 ffff88008e9a4308 ffff880037a2cd20 ffff880062f73a48 ffff8800c5122e40
Call Trace:
 [<ffffffff8119aa7e>] __fscache_read_or_alloc_pages+0x1fe/0x530
 [<ffffffff81250780>] __nfs_readpages_from_fscache+0x70/0x1c0
 [<ffffffff8123142a>] nfs_readpages+0xca/0x1e0
 [<ffffffff815f3c06>] ? rpc_do_put_task+0x36/0x50
 [<ffffffff8122755b>] ? alloc_nfs_open_context+0x4b/0x110
 [<ffffffff815ecd1a>] ? rpc_call_sync+0x5a/0x70
 [<ffffffff810e7e9a>] __do_page_cache_readahead+0x1ca/0x270
 [<ffffffff810e7f61>] ra_submit+0x21/0x30
 [<ffffffff810e818d>] ondemand_readahead+0x11d/0x250
 [<ffffffff810e83b6>] page_cache_sync_readahead+0x36/0x60
 [<ffffffff810dffa4>] generic_file_aio_read+0x454/0x770
 [<ffffffff81224ce1>] nfs_file_read+0xe1/0x130
 [<ffffffff81121bd9>] do_sync_read+0xd9/0x120
 [<ffffffff8114088f>] ? mntput+0x1f/0x40
 [<ffffffff811238cb>] ? fput+0x1cb/0x260
 [<ffffffff81122938>] vfs_read+0xc8/0x180
 [<ffffffff81122af5>] sys_read+0x55/0x90

Reported-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:07:40 +00:00
David Howells
de242c0b8b NFS: Use FS-Cache invalidation
Use the new FS-Cache invalidation facility from NFS to deal with foreign
changes being detected on the server rather than attempting to retire the old
cookie and get a new one.

The problem with the old method was that NFS did not wait for all outstanding
storage and retrieval ops on the cache to complete.  There was no automatic
wait between the calls to ->readpages() and calls to invalidate_inode_pages2()
as the latter can only wait on locked pages that have been added to the
pagecache (which they haven't yet on entry to ->readpages()).

This was leading to oopses like the one below when an outstanding read got cut
off from its cookie by a premature release.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
IP: [<ffffffffa0075118>] __fscache_read_or_alloc_pages+0x1dd/0x315 [fscache]
PGD 15889067 PUD 15890067 PMD 0
Oops: 0000 [#1] SMP
CPU 0
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

Pid: 4544, comm: tar Not tainted 3.1.0-rc4-fsdevel+ #1064                  /DG965RY
RIP: 0010:[<ffffffffa0075118>]  [<ffffffffa0075118>] __fscache_read_or_alloc_pages+0x1dd/0x315 [fscache]
RSP: 0018:ffff8800158799e8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800070d41e0 RCX: ffff8800083dc1b0
RDX: 0000000000000000 RSI: ffff880015879960 RDI: ffff88003e627b90
RBP: ffff880015879a28 R08: 0000000000000002 R09: 0000000000000002
R10: 0000000000000001 R11: ffff880015879950 R12: ffff880015879aa4
R13: 0000000000000000 R14: ffff8800083dc158 R15: ffff880015879be8
FS:  00007f671e9d87c0(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000a8 CR3: 000000001587f000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tar (pid: 4544, threadinfo ffff880015878000, task ffff880015875040)
Stack:
 ffffffffa00b1759 ffff8800070dc158 ffff8800000213da ffff88002a286508
 ffff880015879aa4 ffff880015879be8 0000000000000001 ffff88002a2866e8
 ffff880015879a88 ffffffffa00b20be 00000000000200da ffff880015875040
Call Trace:
 [<ffffffffa00b1759>] ? nfs_fscache_wait_bit+0xd/0xd [nfs]
 [<ffffffffa00b20be>] __nfs_readpages_from_fscache+0x7e/0x13f [nfs]
 [<ffffffff81095fe7>] ? __alloc_pages_nodemask+0x156/0x662
 [<ffffffffa0098763>] nfs_readpages+0xee/0x187 [nfs]
 [<ffffffff81098a5e>] __do_page_cache_readahead+0x1be/0x267
 [<ffffffff81098942>] ? __do_page_cache_readahead+0xa2/0x267
 [<ffffffff81098d7b>] ra_submit+0x1c/0x20
 [<ffffffff8109900a>] ondemand_readahead+0x28b/0x29a
 [<ffffffff810990ce>] page_cache_sync_readahead+0x38/0x3a
 [<ffffffff81091d8a>] generic_file_aio_read+0x2ab/0x67e
 [<ffffffffa008cfbe>] nfs_file_read+0xa4/0xc9 [nfs]
 [<ffffffff810c22c4>] do_sync_read+0xba/0xfa
 [<ffffffff810a62c9>] ? might_fault+0x4e/0x9e
 [<ffffffff81177a47>] ? security_file_permission+0x7b/0x84
 [<ffffffff810c25dd>] ? rw_verify_area+0xab/0xc8
 [<ffffffff810c29a4>] vfs_read+0xaa/0x13a
 [<ffffffff810c2a79>] sys_read+0x45/0x6c
 [<ffffffff813ac37b>] system_call_fastpath+0x16/0x1b

Reported-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:06:33 +00:00
David Howells
9dc8d9bfe4 CacheFiles: Implement invalidation
Implement invalidation for CacheFiles.  This is in two parts:

 (1) Provide an invalidation method (which just truncates the backing file).

 (2) Abort attempts to copy anything read from the backing file whilst
     invalidation is in progress.

Question: CacheFiles uses truncation in a couple of places.  It has been using
notify_change() rather than sys_truncate() or something similar.  This means
it bypasses a bunch of checks and suchlike that it possibly should be making
(security, file locking, lease breaking, vfsmount write).  Should it be using
vfs_truncate() as added by a preceding patch or should it use notify_write()
and assume that anyone poking around in the cache files on disk gets
everything they deserve?

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:06:08 +00:00
David Howells
a02de96085 VFS: Make more complete truncate operation available to CacheFiles
Make a more complete truncate operation available to CacheFiles (including
security checks and suchlike) so that it can use this to clear invalidated
cache files.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 22:05:41 +00:00
Linus Torvalds
982197277c Merge branch 'for-3.8' of git://linux-nfs.org/~bfields/linux
Pull nfsd update from Bruce Fields:
 "Included this time:

   - more nfsd containerization work from Stanislav Kinsbursky: we're
     not quite there yet, but should be by 3.9.

   - NFSv4.1 progress: implementation of basic backchannel security
     negotiation and the mandatory BACKCHANNEL_CTL operation.  See

       http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues

     for remaining TODO's

   - Fixes for some bugs that could be triggered by unusual compounds.
     Our xdr code wasn't designed with v4 compounds in mind, and it
     shows.  A more thorough rewrite is still a todo.

   - If you've ever seen "RPC: multiple fragments per record not
     supported" logged while using some sort of odd userland NFS client,
     that should now be fixed.

   - Further work from Jeff Layton on our mechanism for storing
     information about NFSv4 clients across reboots.

   - Further work from Bryan Schumaker on his fault-injection mechanism
     (which allows us to discard selective NFSv4 state, to excercise
     rarely-taken recovery code paths in the client.)

   - The usual mix of miscellaneous bugs and cleanup.

  Thanks to everyone who tested or contributed this cycle."

* 'for-3.8' of git://linux-nfs.org/~bfields/linux: (111 commits)
  nfsd4: don't leave freed stateid hashed
  nfsd4: free_stateid can use the current stateid
  nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
  nfsd: warn on odd reply state in nfsd_vfs_read
  nfsd4: fix oops on unusual readlike compound
  nfsd4: disable zero-copy on non-final read ops
  svcrpc: fix some printks
  NFSD: Correct the size calculation in fault_inject_write
  NFSD: Pass correct buffer size to rpc_ntop
  nfsd: pass proper net to nfsd_destroy() from NFSd kthreads
  nfsd: simplify service shutdown
  nfsd: replace boolean nfsd_up flag by users counter
  nfsd: simplify NFSv4 state init and shutdown
  nfsd: introduce helpers for generic resources init and shutdown
  nfsd: make NFSd service structure allocated per net
  nfsd: make NFSd service boot time per-net
  nfsd: per-net NFSd up flag introduced
  nfsd: move per-net startup code to separated function
  nfsd: pass net to __write_ports() and down
  nfsd: pass net to nfsd_set_nrthreads()
  ...
2012-12-20 14:04:11 -08:00
David Howells
ef778e7ae6 FS-Cache: Provide proper invalidation
Provide a proper invalidation method rather than relying on the netfs retiring
the cookie it has and getting a new one.  The problem with this is that isn't
easy for the netfs to make sure that it has completed/cancelled all its
outstanding storage and retrieval operations on the cookie it is retiring.

Instead, have the cache provide an invalidation method that will cancel or wait
for all currently outstanding operations before invalidating the cache, and
will cause new operations to queue up behind that.  Whilst invalidation is in
progress, some requests will be rejected until the cache can stack a barrier on
the operation queue to cause new operations to be deferred behind it.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:04:07 +00:00
Linus Torvalds
40889e8d9f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
 "There are a few different groups of commits here.  The largest is
  Alex's ongoing work to enable the coming RBD features (cloning,
  striping).  There is some cleanup in libceph that goes along with it.

  Cyril and David have fixed some problems with NFS reexport (leaking
  dentries and page locks), and there is a batch of patches from Yan
  fixing problems with the fs client when running against a clustered
  MDS.  There are a few bug fixes mixed in for good measure, many of
  which will be going to the stable trees once they're upstream.

  My apologies for the late pull.  There is still a gremlin in the rbd
  map/unmap code and I was hoping to include the fix for that as well,
  but we haven't been able to confirm the fix is correct yet; I'll send
  that in a separate pull once it's nailed down."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
  rbd: get rid of rbd_{get,put}_dev()
  libceph: register request before unregister linger
  libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
  libceph: init event->node in ceph_osdc_create_event()
  libceph: init osd->o_node in create_osd()
  libceph: report connection fault with warning
  libceph: socket can close in any connection state
  rbd: don't use ENOTSUPP
  rbd: remove linger unconditionally
  rbd: get rid of RBD_MAX_SEG_NAME_LEN
  libceph: avoid using freed osd in __kick_osd_requests()
  ceph: don't reference req after put
  rbd: do not allow remove of mounted-on image
  libceph: Unlock unprocessed pages in start_read() error path
  ceph: call handle_cap_grant() for cap import message
  ceph: Fix __ceph_do_pending_vmtruncate
  ceph: Don't add dirty inode to dirty list if caps is in migration
  ceph: Fix infinite loop in __wake_requests
  ceph: Don't update i_max_size when handling non-auth cap
  bdi_register: add __printf verification, fix arg mismatch
  ...
2012-12-20 14:00:13 -08:00
David Howells
9f10523f89 FS-Cache: Fix operation state management and accounting
Fix the state management of internal fscache operations and the accounting of
what operations are in what states.

This is done by:

 (1) Give struct fscache_operation a enum variable that directly represents the
     state it's currently in, rather than spreading this knowledge over a bunch
     of flags, who's processing the operation at the moment and whether it is
     queued or not.

     This makes it easier to write assertions to check the state at various
     points and to prevent invalid state transitions.

 (2) Add an 'operation complete' state and supply a function to indicate the
     completion of an operation (fscache_op_complete()) and make things call
     it.  The final call to fscache_put_operation() can then check that an op
     in the appropriate state (complete or cancelled).

 (3) Adjust the use of object->n_ops, ->n_in_progress, ->n_exclusive to better
     govern the state of an object:

	(a) The ->n_ops is now the number of extant operations on the object
	    and is now decremented by fscache_put_operation() only.

	(b) The ->n_in_progress is simply the number of objects that have been
	    taken off of the object's pending queue for the purposes of being
	    run.  This is decremented by fscache_op_complete() only.

	(c) The ->n_exclusive is the number of exclusive ops that have been
	    submitted and queued or are in progress.  It is decremented by
	    fscache_op_complete() and by fscache_cancel_op().

     fscache_put_operation() and fscache_operation_gc() now no longer try to
     clean up ->n_exclusive and ->n_in_progress.  That was leading to double
     decrements against fscache_cancel_op().

     fscache_cancel_op() now no longer decrements ->n_ops.  That was leading to
     double decrements against fscache_put_operation().

     fscache_submit_exclusive_op() now decides whether it has to queue an op
     based on ->n_in_progress being > 0 rather than ->n_ops > 0 as the latter
     will persist in being true even after all preceding operations have been
     cancelled or completed.  Furthermore, if an object is active and there are
     runnable ops against it, there must be at least one op running.

 (4) Add a remaining-pages counter (n_pages) to struct fscache_retrieval and
     provide a function to record completion of the pages as they complete.

     When n_pages reaches 0, the operation is deemed to be complete and
     fscache_op_complete() is called.

     Add calls to fscache_retrieval_complete() anywhere we've finished with a
     page we've been given to read or allocate for.  This includes places where
     we just return pages to the netfs for reading from the server and where
     accessing the cache fails and we discard the proposed netfs page.

The bugs in the unfixed state management manifest themselves as oopses like the
following where the operation completion gets out of sync with return of the
cookie by the netfs.  This is possible because the cache unlocks and returns
all the netfs pages before recording its completion - which means that there's
nothing to stop the netfs discarding them and returning the cookie.


FS-Cache: Cookie 'NFS.fh' still has outstanding reads
------------[ cut here ]------------
kernel BUG at fs/fscache/cookie.c:519!
invalid opcode: 0000 [#1] SMP
CPU 1
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

Pid: 400, comm: kswapd0 Not tainted 3.1.0-rc7-fsdevel+ #1090                  /DG965RY
RIP: 0010:[<ffffffffa007050a>]  [<ffffffffa007050a>] __fscache_relinquish_cookie+0x170/0x343 [fscache]
RSP: 0018:ffff8800368cfb00  EFLAGS: 00010282
RAX: 000000000000003c RBX: ffff880023cc8790 RCX: 0000000000000000
RDX: 0000000000002f2e RSI: 0000000000000001 RDI: ffffffff813ab86c
RBP: ffff8800368cfb50 R08: 0000000000000002 R09: 0000000000000000
R10: ffff88003a1b7890 R11: ffff88001df6e488 R12: ffff880023d8ed98
R13: ffff880023cc8798 R14: 0000000000000004 R15: ffff88003b8bf370
FS:  0000000000000000(0000) GS:ffff88003bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000008ba008 CR3: 0000000023d93000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 400, threadinfo ffff8800368ce000, task ffff88003b8bf040)
Stack:
 ffff88003b8bf040 ffff88001df6e528 ffff88001df6e528 ffffffffa00b46b0
 ffff88003b8bf040 ffff88001df6e488 ffff88001df6e620 ffffffffa00b46b0
 ffff88001ebd04c8 0000000000000004 ffff8800368cfb70 ffffffffa00b2c91
Call Trace:
 [<ffffffffa00b2c91>] nfs_fscache_release_inode_cookie+0x3b/0x47 [nfs]
 [<ffffffffa008f25f>] nfs_clear_inode+0x3c/0x41 [nfs]
 [<ffffffffa0090df1>] nfs4_evict_inode+0x2f/0x33 [nfs]
 [<ffffffff810d8d47>] evict+0xa1/0x15c
 [<ffffffff810d8e2e>] dispose_list+0x2c/0x38
 [<ffffffff810d9ebd>] prune_icache_sb+0x28c/0x29b
 [<ffffffff810c56b7>] prune_super+0xd5/0x140
 [<ffffffff8109b615>] shrink_slab+0x102/0x1ab
 [<ffffffff8109d690>] balance_pgdat+0x2f2/0x595
 [<ffffffff8103e009>] ? process_timeout+0xb/0xb
 [<ffffffff8109dba3>] kswapd+0x270/0x289
 [<ffffffff8104c5ea>] ? __init_waitqueue_head+0x46/0x46
 [<ffffffff8109d933>] ? balance_pgdat+0x595/0x595
 [<ffffffff8104bf7a>] kthread+0x7f/0x87
 [<ffffffff813ad6b4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81026b98>] ? finish_task_switch+0x45/0xc0
 [<ffffffff813abcdd>] ? retint_restore_args+0xe/0xe
 [<ffffffff8104befb>] ? __init_kthread_worker+0x53/0x53
 [<ffffffff813ad6b0>] ? gs_change+0xb/0xb

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 21:58:26 +00:00
David Howells
ef46ed888e FS-Cache: Make cookie relinquishment wait for outstanding reads
Make fscache_relinquish_cookie() log a warning and wait if there are any
outstanding reads left on the cookie it was given.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 21:58:25 +00:00
David Howells
37491a1339 CacheFiles: Make some debugging statements conditional
Downgrade some debugging statements to not unconditionally print stuff, but
rather be conditional on the appropriate module parameter setting.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 21:58:25 +00:00
David Howells
0f972b5696 FS-Cache: Check that there are no read ops when cookie relinquished
Check that the netfs isn't trying to relinquish a cookie that still has read
operations in progress upon it.  If there are, then give log a warning and BUG.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 21:58:25 +00:00
David Howells
5f4f9f4af1 CacheFiles: Downgrade the requirements passed to the allocator
Downgrade the requirements passed to the allocator in the gfp flags parameter.
FS-Cache/CacheFiles can handle OOM conditions simply by aborting the attempt to
store an object or a page in the cache.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 21:58:25 +00:00
Linus Torvalds
1ca22254b3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull two btrfs reverts from Chris Mason:
 "I had missed that for two of the patches in my last pull, we had
  included different fixes during 3.7."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "Btrfs: reorder tree mod log operations in deleting a pointer"
  Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"
2012-12-20 13:57:09 -08:00
Linus Torvalds
a13eea6bd9 Introduce a new file system, Flash-Friendly File System (F2FS), to Linux 3.8.
Highlights:
 - Add initial f2fs source codes
 - Fix an endian conversion bug
 - Fix build failures on random configs
 - Fix the power-off-recovery routine
 - Minor cleanup, coding style, and typos patches
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQxuJcAAoJEEAUqH6CSFDSq80QAI3i7NgUkx4h225MnbJdEKRb
 YX1MfSPmgE0q/15XS2qQu/s9NGJmXLV1IR9EtRSBlCQjwWhbx9Q9URktGkWslFnx
 6mBLy8EvVKDMVdwoUS8ZY6IjfKbmSnoIHTZrGaT9+9d7k8nlOQLaj3qQF4wBuw1+
 +qhJQV642v8qw7JiVVFgxcBSLpAS9cbdOA0vxfWncMwmRLaEO45W5+rob8ZN8WaS
 BUiYIiue8vlB0VDIYfpl/sSPJC/Bn1XsLKZoS2WJl8CKioE1ptLjT3acUBbabUxp
 hNLl8Ae0PylDMFpH8hrBXhleznrVqEMOTos/Z80/UyBny2sCxJFnaQ60TayUo2l2
 hYk5Wbyj8K7IBJEke23Fepild2PnGz22zf2v+tLxxVgPH5j7/l2XHfy9gPvDbd1P
 4ENiJUC3LE49Mi4TvEIFqhbrcJfD9C+v3bxpWGe8CevrpYZaB8tv/6nQXJCC/Ixp
 tMWqLKlHyXGmk5DZpiSFaj0/GbTPT0UGqZVRzzSXQpKqxJU6eTnXDa6aLUEYH8fH
 grOCriaJrd8SgL3l7RokQSQEyRHuNjMm1tlUQWOObE+y0nJjWb9Amwn1yUtJuNzx
 Np4nnlMhxwJ48P3LeeheSCuOUbxJtOzOR8MVXm7deYiGQbYaqB1/+9TbjOZBSX4O
 fpbCXrmqe1pUBukftZsL
 =iMoX
 -----END PGP SIGNATURE-----

Merge tag 'for-3.8-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull new F2FS filesystem from Jaegeuk Kim:
 "Introduce a new file system, Flash-Friendly File System (F2FS), to
  Linux 3.8.

  Highlights:
   - Add initial f2fs source codes
   - Fix an endian conversion bug
   - Fix build failures on random configs
   - Fix the power-off-recovery routine
   - Minor cleanup, coding style, and typos patches"

From the Kconfig help text:

  F2FS is based on Log-structured File System (LFS), which supports
  versatile "flash-friendly" features. The design has been focused on
  addressing the fundamental issues in LFS, which are snowball effect
  of wandering tree and high cleaning overhead.

  Since flash-based storages show different characteristics according to
  the internal geometry or flash memory management schemes aka FTL, F2FS
  and tools support various parameters not only for configuring on-disk
  layout, but also for selecting allocation and cleaning algorithms.

and there's an article by Neil Brown about it on lwn.net:

  http://lwn.net/Articles/518988/

* tag 'for-3.8-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (36 commits)
  f2fs: fix tracking parent inode number
  f2fs: cleanup the f2fs_bio_alloc routine
  f2fs: introduce accessor to retrieve number of dentry slots
  f2fs: remove redundant call to f2fs_put_page in delete entry
  f2fs: make use of GFP_F2FS_ZERO for setting gfp_mask
  f2fs: rewrite f2fs_bio_alloc to make it simpler
  f2fs: fix a typo in f2fs documentation
  f2fs: remove unused variable
  f2fs: move error condition for mkdir at proper place
  f2fs: remove unneeded initialization
  f2fs: check read only condition before beginning write out
  f2fs: remove unneeded memset from init_once
  f2fs: show error in case of invalid mount arguments
  f2fs: fix the compiler warning for uninitialized use of variable
  f2fs: resolve build failures
  f2fs: adjust kernel coding style
  f2fs: fix endian conversion bugs reported by sparse
  f2fs: remove unneeded version.h header file from f2fs.h
  f2fs: update the f2fs document
  f2fs: update Kconfig and Makefile
  ...
2012-12-20 13:54:52 -08:00
David Howells
c4d6d8dbf3 CacheFiles: Fix the marking of cached pages
Under some circumstances CacheFiles defers the marking of pages with PG_fscache
so that it can take advantage of pagevecs to reduce the number of calls to
fscache_mark_pages_cached() and the netfs's hook to keep track of this.

There are, however, two problems with this:

 (1) It can lead to the PG_fscache mark being applied _after_ the page is set
     PG_uptodate and unlocked (by the call to fscache_end_io()).

 (2) CacheFiles's ref on the page is dropped immediately following
     fscache_end_io() - and so may not still be held when the mark is applied.
     This can lead to the page being passed back to the allocator before the
     mark is applied.

Fix this by, where appropriate, marking the page before calling
fscache_end_io() and releasing the page.  This means that we can't take
advantage of pagevecs and have to make a separate call for each page to the
marking routines.

The symptoms of this are Bad Page state errors cropping up under memory
pressure, for example:

BUG: Bad page state in process tar  pfn:002da
page:ffffea0000009fb0 count:0 mapcount:0 mapping:          (null) index:0x1447
page flags: 0x1000(private_2)
Pid: 4574, comm: tar Tainted: G        W   3.1.0-rc4-fsdevel+ #1064
Call Trace:
 [<ffffffff8109583c>] ? dump_page+0xb9/0xbe
 [<ffffffff81095916>] bad_page+0xd5/0xea
 [<ffffffff81095d82>] get_page_from_freelist+0x35b/0x46a
 [<ffffffff810961f3>] __alloc_pages_nodemask+0x362/0x662
 [<ffffffff810989da>] __do_page_cache_readahead+0x13a/0x267
 [<ffffffff81098942>] ? __do_page_cache_readahead+0xa2/0x267
 [<ffffffff81098d7b>] ra_submit+0x1c/0x20
 [<ffffffff8109900a>] ondemand_readahead+0x28b/0x29a
 [<ffffffff81098ee2>] ? ondemand_readahead+0x163/0x29a
 [<ffffffff810990ce>] page_cache_sync_readahead+0x38/0x3a
 [<ffffffff81091d8a>] generic_file_aio_read+0x2ab/0x67e
 [<ffffffffa008cfbe>] nfs_file_read+0xa4/0xc9 [nfs]
 [<ffffffff810c22c4>] do_sync_read+0xba/0xfa
 [<ffffffff81177a47>] ? security_file_permission+0x7b/0x84
 [<ffffffff810c25dd>] ? rw_verify_area+0xab/0xc8
 [<ffffffff810c29a4>] vfs_read+0xaa/0x13a
 [<ffffffff810c2a79>] sys_read+0x45/0x6c
 [<ffffffff813ac37b>] system_call_fastpath+0x16/0x1b

As can be seen, PG_private_2 (== PG_fscache) is set in the page flags.

Instrumenting fscache_mark_pages_cached() to verify whether page->mapping was
set appropriately showed that sometimes it wasn't.  This led to the discovery
that sometimes the page has apparently been reclaimed by the time the marker
got to see it.

Reported-by: M. Stevens <m@tippett.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2012-12-20 21:54:30 +00:00
Marco Stornelli
c8cf464bc5 hfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
41ddaeeb9d bfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
1dc1834f42 affs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
6229518384 adfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
a6ff03771e ocfs2: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
a8f5293aac omfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
46f6955710 procfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
cfac4b47c6 reiserfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
fa4d62ae17 sysv: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Marco Stornelli
83f6e3710a ufs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Jan Kara
72651cac88 fs: Fix imbalance in freeze protection in mark_files_ro()
File descriptors (even those for writing) do not hold freeze protection.
Thus mark_files_ro() must call __mnt_drop_write() to only drop protection
against remount read-only. Calling mnt_drop_write_file() as we do now
results in:

[ BUG: bad unlock balance detected! ]
3.7.0-rc6-00028-g88e75b6 #101 Not tainted
-------------------------------------
kworker/1:2/79 is trying to release lock (sb_writers) at:
[<ffffffff811b33b4>] mnt_drop_write+0x24/0x30
but there are no more locks to release!

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 13:57:36 -05:00