bdi_start_writeback now never gets a superblock passed, so we can just remove
that case. And to further untangle the code and flatten the call stack
split it into two trivial helpers for it's two callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bdi_writeback_all only has one caller, so fold it to simplify the code and
flatten the call stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
When we call writeback_inodes_wb from writeback_inodes_sb we always have
s_umount held, which currently makes the whole operation a no-op.
But if we are called to write out inodes for a specific superblock we always
have s_umount held, so replace the incorrect logic checking for WB_SYNC_ALL
which only worked by coincidence with the proper check for an explicit
superblock argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Make sure that not only sync_filesystem but all callers of writeback_inodes_sb
have the superblock protected against remount. As-is this disables all
functionality for these callers, but the next patch relies on this locking to
fix writeback_inodes_sb for sync_filesystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
If we want to rely on s_umount in the caller we need to wait for completion
of the I/O submission before returning to the caller. Refactor
bdi_sync_writeback into a bdi_queue_work_onstack helper and use it for this
case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The code dealing with bdi_work->state and completion of a bdi_work is a
major mess currently. This patch makes sure we directly use one set of
flags to deal with it, and use it consistently, which means:
- always notify about completion from the rcu callback. We only ever
wait for it from on-stack callers, so this simplification does not
even cause a theoretical slowdown currently. It also makes sure we
don't miss out on the notification if we ever add other callers to
wait for it.
- make earlier completion notification depending on the on-stack
allocation, not the sync mode. If we introduce new callers that
want to do WB_SYNC_NONE writeback from on-stack callers this will
be nessecary.
Also rename bdi_wait_on_work_clear to bdi_wait_on_work_done and inline
a few small functions into their only caller to make the code
understandable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
As it stands this check compares the number of pages to the page size.
This makes no sense and makes the fcntl fail in almost any sane case.
Fix it by checking if nr_pages is not zero (it can become zero only if
arg is too big and round_pipe_size() overflows).
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
pipe_set_size() needs to copy pipe bufs from the old circular buffer
to the new.
The current code gets this wrong in multiple ways, resulting in oops.
Test program is available here:
http://www.kernel.org/pub/linux/kernel/people/mszeredi/piperesize/
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We do the same BUG_ON() just a line later when calling into
__bd_abort_claiming().
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
I don't like the subtle multi-context code in bd_claim (ie. detects where it
has been called based on bd_claiming). It seems clearer to just require a new
function to finish a 2-part claim.
Also improve commentary in bd_start_claiming as to how it should
be used.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
bd_start_claiming has an unbalanced module_put introduced in 6b4517a79.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.35' of git://linux-nfs.org/~bfields/linux:
nfsd4: shut down callback queue outside state lock
nfsd: nfsd_setattr needs to call commit_metadata
Now that the background flush code has been fixed, we shouldn't need to
silently multiply the wbc->nr_to_write to get good writeback. Remove
that code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reportedly causes a lockdep warning on nfsd shutdown. That looks
like a false positive to me, but there's no reason why this needs the
state lock anyway.
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* git://git.infradead.org/~dwmw2/mtd-2.6.35:
jffs2: update ctime when changing the file's permission by setfacl
jffs2: Fix NFS race by using insert_inode_locked()
jffs2: Fix in-core inode leaks on error paths
mtd: Fix NAND submenu
mtd/r852: update card detect early.
mtd/r852: Fixes in case of DMA timeout
mtd/r852: register IRQ as last step
drivers/mtd: Use memdup_user
docbook: make mtd nand module init static
jffs2 didn't update the ctime of the file when its permission was changed.
Steps to reproduce:
# touch aaa
# stat -c %Z aaa
1275289822
# setfacl -m 'u::x,g::x,o::x' aaa
# stat -c %Z aaa
1275289822 <- unchanged
But, according to the spec of the ctime, jffs2 must update it.
Port of ext3 patch by Miao Xie <miaox@cn.fujitsu.com>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
A few functions were still modifying i_flags in a racy manner.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: improve xfs_isilocked
xfs: skip writeback from reclaim context
xfs: remove done roadmap item from xfs-delayed-logging-design.txt
xfs: fix race in inode cluster freeing failing to stale inodes
xfs: fix access to upper inodes without inode64
xfs: fix might_sleep() warning when initialising per-ag tree
fs/xfs/quota: Add missing mutex_unlock
xfs: remove duplicated #include
xfs: convert more trace events to DEFINE_EVENT
xfs: xfs_trace.c: remove duplicated #include
xfs: Check new inode size is OK before preallocating
xfs: clean up xlog_align
xfs: cleanup log reservation calculactions
xfs: be more explicit if RT mount fails due to config
xfs: replace E2BIG with EFBIG where appropriate
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: remove obsolete declarations of cache constructor and destructor
nilfs2: fix style issue in nilfs_destroy_cachep
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Minix: Clean up left over label
fix truncate inode time modification breakage
fix setattr error handling in sysfs, configfs
fcntl: return -EFAULT if copy_to_user fails
wrong type for 'magic' argument in simple_fill_super()
fix the deadlock in qib_fs
mqueue doesn't need make_bad_inode()
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
block: make blk_init_free_list and elevator_init idempotent
block: avoid unconditionally freeing previously allocated request_queue
pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface
pipe: change the privilege required for growing a pipe beyond system max
pipe: adjust minimum pipe size to 1 page
block: disable preemption before using sched_clock()
cciss: call BUG() earlier
Preparing 8.3.8rc2
drbd: Reduce verbosity
drbd: use drbd specific ratelimit instead of global printk_ratelimit
drbd: fix hang on local read errors while disconnected
drbd: Removed the now empty w_io_error() function
drbd: removed duplicated #includes
drbd: improve usage of MSG_MORE
drbd: need to set socket bufsize early to take effect
drbd: improve network latency, TCP_QUICKACK
drbd: Revert "drbd: Create new current UUID as late as possible"
brd: support discard
Revert "writeback: fix WB_SYNC_NONE writeback from umount"
Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync"
...
The data chunk is mmaped with 'len' which remains unchanged, so use that
when unmapping in the error path rather than trying to recalculate (and
incorrectly so) the value used originally.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David McCullough <davidm@snapgear.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The stack and data have different alignment requirements, so don't force
them to wear the same shoe. Increase the data alignment to match that
which the elf2flt linker script has always been using: 0x20 bytes. Not
only does this bring the kernel loader in line with the toolchain, but it
also fixes a swath of gcc tests which try to force larger alignment values
but randomly fail when the FLAT loader fails to deliver.
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Woodhouse <David.Woodhouse@intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: David McCullough <davidm@snapgear.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Tested-by: Michal Simek <monstr@monstr.eu>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jie Zhang <jie@codesourcery.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A call to access_ok is missing a compat_ptr conversion. Introduced with
b83733639a "compat: factor out
compat_rw_copy_check_uvector from compat_do_readv_writev"
fs/compat.c: In function 'compat_rw_copy_check_uvector':
fs/compat.c:629: warning: passing argument 1 of '__access_ok' makes pointer from integer without a cast
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mtime and ctime should be changed only if the file size has actually
changed. Patches changing ext2 and tmpfs from vmtruncate to new truncate
sequence has caused regressions where they always update timestamps.
There is some strange cases in POSIX where truncate(2) must not update
times unless the size has acutally changed, see 6e656be89.
This area is all still rather buggy in different ways in a lot of
filesystems and needs a cleanup and audit (ideally the vfs will provide
a simple attribute or call to direct all filesystems exactly which
attributes to change). But coming up with the best solution will take a
while and is not appropriate for rc anyway.
So fix recent regression for now.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sysfs and configfs setattr functions have error cases after the generic inode's
attributes have been changed. Fix consistency by changing the generic inode
attributes only when it is guaranteed to succeed.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
copy_to_user() returns the number of bytes remaining, but we want to
return -EFAULT.
ret = fcntl(fd, F_SETOWN_EX, NULL);
With the original code ret would be 8 here.
V2: Takuya Yoshikawa pointed out a similar issue in f_getown_ex()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's used to superblock ->s_magic, which is unsigned long.
Signed-off-by: Roberto Sassu <roberto.sassu@polito.it>
Reviewed-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
CC: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sysfs and configfs setattr functions have error cases after the generic inode's
attributes have been changed. Fix consistency by changing the generic inode
attributes only when it is guaranteed to succeed.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This changes the interface to be based on bytes instead. The API
matches that of F_SETPIPE_SZ in that it rounds up the passed in
size so that the resulting page array is a power-of-2 in size.
The proc file is renamed to /proc/sys/fs/pipe-max-size to
reflect this change.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Change it to CAP_SYS_RESOURCE, as that more accurately models what
we want to control.
Suggested-by: Michael Kerrisk <mtk.manpages@googlemail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
We don't need to pages to guarantee the POSIX requirement
that upto a page size write must be atomic to an empty
pipe.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
New inodes need to be locked as we're creating them, so they don't get used
by other things (like NFSd) before they're ready.
Pointed out by Al Viro.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Use rwsem_is_locked to make the assertations for shared locks work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Allowing writeback from reclaim context causes massive problems with stack
overflows as we can call into the writeback code which tends to be a heavy
stack user both in the generic code and XFS from random contexts that
perform memory allocations.
Follow the example of btrfs (and in slightly different form ext4) and refuse
to write out data from reclaim context. This issue should really be handled
by the VM so that we can tune better for this case, but until we get it
sorted out there we have to hack around this in each filesystem with a
complex writeback path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
When an inode cluster is freed, it needs to mark all inodes in memory as
XFS_ISTALE before marking the buffer as stale. This is eeded because the inodes
have a different life cycle to the buffer, and once the buffer is torn down
during transaction completion, we must ensure none of the inodes get written
back (which is what XFS_ISTALE does).
Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being
marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these
inodes either during inode reclaim or tail pushing on the AIL. The buffer is
read back, but no longer contains inodes and so triggers assert failures and
shutdowns. This was reproducable with at run.dbench10 invocation from xfstests.
There are two main causes of xfs_ifree_cluster() failing. The first is simple -
it checks in-memory inodes it finds in the per-ag icache to see if they are
clean without holding the flush lock. if they are clean it skips them
completely. However, If an inode is flushed delwri, it will
appear clean, but is not guaranteed to be written back until the flush lock has
been dropped. Hence we may have raced on the clean check and the inode may
actually be dirty. Hence always mark inodes found in memory stale before we
check properly if they are clean.
The second is more complex, and makes the first problem easier to hit.
Basically the in-memory inode scan is done with full knowledge it can be racing
with inode flushing and AIl tail pushing, which means that inodes that it can't
get the flush lock on might not be attached to the buffer after then in-memory
inode scan due to IO completion occurring. This is actually documented in the
code as "needs better interlocking". i.e. this is a zero-day bug.
Effectively, the in-memory scan must be done while the inode buffer is locked
and Io cannot be issued on it while we do the in-memory inode scan. This
ensures that inodes we couldn't get the flush lock on are guaranteed to be
attached to the cluster buffer, so we can then catch all in-memory inodes and
mark them stale.
Now that the inode cluster buffer is locked before the in-memory scan is done,
there is no need for the two-phase update of the in-memory inodes, so simplify
the code into two loops and remove the allocation of the temporary buffer used
to hold locked inodes across the phases.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Dan Roseberg has reported a problem with the MOVE_EXT ioctl. If the
donor file is an append-only file, we should not allow the operation
to proceed, lest we end up overwriting the contents of an append-only
file.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dan Rosenberg <dan.j.rosenberg@gmail.com>
The conversion of write_inode_now calls to commit_metadata in commit
f501912a35 missed out the call in nfsd_setattr.
But without this conversion we can't guarantee that a SETATTR request
has actually been commited to disk with XFS, which causes a regression
from 2.6.32 (only for NFSv2, but anyway).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
fscache_write_op() makes unnecessary checks of the page variable to see if it
is NULL. It can't be NULL at those points as the kernel would already have
crashed a little higher up where we examined page->index.
Furthermore, unless radix_tree_gang_lookup_tag() can return 1 but no page, a
NULL pointer crash should not be encountered there as we can only get there if
r_t_g_l_t() returned 1.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 315e995c63 is causing OOM kills
when stress-testing a CIFS filesystem. The VFS readpages operation takes
a page reference. The older code just handed this reference off to the
page cache, but the new code takes an extra one. The simplest fix is to
put the new reference after add_to_page_cache_lru.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fix a possible null pointer dereference in afs_alloc_server(): the server
pointer is NULL if there was an allocation failure, and under such a
condition, we can't dereference it in the _leave() statement.
Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clear_user() returns the number of bytes that could not be copied rather than
an error code. So we should return -EFAULT rather than directly returning the
results.
Without this patch, positive values may be returned to elf_fdpic_map_file()
and the following error handlings do not function as expected.
1.
ret = elf_fdpic_map_file_constdisp_on_uclinux(params, file, mm);
if (ret < 0)
return ret;
2.
ret = elf_fdpic_map_file_by_direct_mmap(params, file, mm);
if (ret < 0)
return ret;
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com>
CC: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>