Commit Graph

21410 Commits

Author SHA1 Message Date
Theodore Ts'o
6fd7a46781 ext4: enable mblk_io_submit by default
Now that we've fixed the file corruption bug in commit d50bdd5aa5,
it's time to enable mblk_io_submit by default.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 13:53:09 -05:00
Curt Wohlgemuth
c7f5938adc ext4: fix ext4_da_block_invalidatepages() to handle page range properly
If ext4_da_block_invalidatepages() is called because of a
failure from ext4_map_blocks() in mpage_da_map_and_submit(),
it's supposed to clean up -- including unlock -- all the
pages in the mpd structure.  But these values may not match
up, even on a system in which block size == page size:

   mpd->b_blocknr != mpd->first_page
   mpd->b_size != (mpd->next_page - mpd->first_page)

ext4_da_block_invalidatepages() has been using b_blocknr and
b_size; this patch changes it to use first_page and
next_page.

Tested:  I injected a small number (5%) of failures in
ext4_map_blocks() in the case that the flags contain
EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
kernel.  Without this patch, I got hung tasks every time.
With this patch, I see no hangs in many runs of fsstress.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 12:27:52 -05:00
Curt Wohlgemuth
e0fd9b9076 ext4: mark multi-page IO complete on mapping failure
In mpage_da_map_and_submit(), if we have a delayed block
allocation failure from ext4_map_blocks(), we need to mark
the IO as complete, by setting

      mpd->io_done = 1;

Otherwise, we could end up submitting the pages in an outer
loop; since they are unlocked on mapping failure in
ext4_da_block_invalidatepages(), this will cause a bug check
in mpage_da_submit_io().

I tested this by injected failures into ext4_map_blocks().
Without this patch, a simple fsstress run will bug check;
with the patch, it works fine.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-26 12:25:52 -05:00
Coly Li
5a54b2f199 ext4: mballoc: don't replace the current preallocation group unnecessarily
In ext4_mb_check_group_pa(), the current preallocation space is
replaced with a new preallocation space when the two have the same
distance from the goal block.

This doesn't actually gain us anything, so change things so that the
function only switches to the new preallocation group if its distance
from the goal block is strictly smaller than the current preallocaiton
group's distance from the goal block.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-24 14:10:05 -05:00
Coly Li
58696f3ab2 ext4: clarify description of ac_g_ex in struct ext4_allocation_context
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
2011-02-24 14:10:00 -05:00
Coly Li
7c78605929 mballoc: add comments to ext4_mb_mark_free_simple()
This patch adds comments to ext4_mb_mark_free_simple to make it more
understandable.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
2011-02-24 13:24:25 -05:00
Coly Li
235772da3e ext4: remove unncessary call mb_find_buddy() in debugging code
In __mb_check_buddy(), look at the code below:
  591         fstart = -1;
  592         buddy = mb_find_buddy(e4b, 0, &max);
  593         for (i = 0; i < max; i++) {
  594                 if (!mb_test_bit(i, buddy)) {
  595                         MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
  596                         if (fstart == -1) {
  597                                 fragments++;
  598                                 fstart = i;
  599                         }
  600                         continue;
  601                 }
  602                 fstart = -1;
  603                 /* check used bits only */
  604                 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
  605                         buddy2 = mb_find_buddy(e4b, j, &max2);
  606                         k = i >> j;
  607                         MB_CHECK_ASSERT(k < max2);
  608                         MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
  609                 }
  610         }
  611         MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
  612         MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
  613
  614         grp = ext4_get_group_info(sb, e4b->bd_group);
  615         buddy = mb_find_buddy(e4b, 0, &max);

On line 592, buddy is fetched by mb_find_buddy() with order 0, between
line 593 to line 615, buddy is not changed, therefore there is
no need to fetch buddy again from mb_find_buddy() with order 0 again.

We can safely remove the second mb_find_buddy() on line 615.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
2011-02-24 13:24:18 -05:00
Coly Li
84b775a354 ext4: code cleanup in mb_find_buddy()
Current code calculate max no matter whether order is zero, it's
unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits
+ 3)" only when order == 0.

Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
2011-02-24 12:51:59 -05:00
Eric Sandeen
ea66333694 ext4: enable acls and user_xattr by default
There's no good reason to require the extra step of providing
a mount option for acl or user_xattr once the feature is configured
on; no other filesystem that I know of requires this.

Userspace patches have set these options in default mount options,
and this patch makes them default in the kernel.  At some point
we can start to deprecate the options, perhaps.

For now I've removed default mount option checks in show_options()
to be explicit about what's set, since it's changing the default,
but I'm open to alternatives if desired.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 17:51:51 -05:00
Lukas Czerner
5c2ed62fd4 ext4: Adjust minlen with discard_granularity in the FITRIM ioctl
Discard granularity tells us the minimum size of extent that can be
discarded by the device.  If the user supplies a minimum extent that
should be discarded (range.minlen) which is smaller than the discard
granularity, increase minlen to the discard granularity, since there's
no point submitting trim requests that the device will reject anyway.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 17:49:51 -05:00
Lukas Czerner
4143179218 ext4: check if device support discard in FITRIM ioctl
For a device that does not support discard, the FITRIM ioctl returns
-EOPNOTSUPP when blkdev_issue_discard() returns this error code, which
is how the user is informed that the device does not support discard.

If there are no suitable free extents to be trimmed, then FITRIM will
return success even though the device does not support discard, which
could confuse the user.  So check explicitly if the device supports
discard and return an error code at the beginning of the FITRIM ioctl
processing.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 12:42:32 -05:00
Lukas Czerner
0b75a84012 ext4: mark file-local functions and variables as static
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-23 12:22:49 -05:00
Alexander V. Lukyanov
5dbd571d87 ext4: allow inode_readahead_blks=0 (linux-2.6.37)
I cannot disable inode-read-ahead feature of ext4 (on 2.6.37):

# echo 0 > /sys/fs/ext4/sda2/inode_readahead_blks 
bash: echo: write error: Invalid argument

On a server with lots of small files and random access this read-ahead makes
performance worse, and I'd like to disable it. I work around this problem
by using value of 1, but it still reads an extra block.

This patch fixes the problem by checking for zero explicitly.

Signed-off-by: Alexander V. Lukyanov <lav@netis.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:33:21 -05:00
Peter Huewe
7dc576158d ext4: Fix sparse warning: Using plain integer as NULL pointer
This patch fixes the warning "Using plain integer as NULL pointer",
generated by sparse, by replacing the offending 0s with NULL.

Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 21:01:42 -05:00
Theodore Ts'o
da488945f4 ext4: fix compile warnings with EXT4FS_DEBUG enabled
Compile 2.6.38-rc1 with turning EXT4FS_DEBUG on,
we get following compile warnings. This patch fixes them.

  CC      fs/ext4/hash.o
  CC      fs/ext4/resize.o
fs/ext4/resize.c: In function 'setup_new_group_blocks':
fs/ext4/resize.c:233:2: warning: format '%#04llx' expects type 'long long
unsigned int', but argument 3 has type 'long unsigned int'
fs/ext4/resize.c:251:2: warning: format '%#04llx' expects type 'long long
unsigned int', but argument 3 has type 'long unsigned int'
  CC      fs/ext4/extents.o
  CC      fs/ext4/ext4_jbd2.o
  CC      fs/ext4/migrate.o

Reported-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-21 20:39:58 -05:00
Linus Torvalds
3abb17e82f vfs: fix BUG_ON() in fs/namei.c:1461
When Al moved the nameidata_dentry_drop_rcu_maybe() call into the
do_follow_link function in commit 844a391799 ("nothing in
do_follow_link() is going to see RCU"), he mistakenly left the

	BUG_ON(inode != path->dentry->d_inode);

behind.  Which would otherwise be ok, but that BUG_ON() really needs to
be _after_ dropping RCU, since the dentry isn't necessarily stable
otherwise.

So complete the code movement in that commit, and move the BUG_ON() into
do_follow_link() too.  This means that we need to pass in 'inode' as an
argument (just for this one use), but that's a small thing.  And
eventually we may be confident enough in our path lookup that we can
just remove the BUG_ON() and the unnecessary inode argument.

Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-16 08:56:55 -08:00
Linus Torvalds
f60c153d50 Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux:
  nfsd: break lease on unlink due to rename
  nfsd4: acquire only one lease per file
  nfsd4: modify fi_delegations under recall_lock
  nfsd4: remove unused deleg dprintk's.
  nfsd4: split lease setting into separate function
  nfsd4: fix leak on allocation error
  nfsd4: add helper function for lease setup
  nfsd4: split up nfsd_break_deleg_cb
  NFSD: memory corruption due to writing beyond the stat array
  NFSD: use nfserr for status after decode_cb_op_status
  nfsd: don't leak dentry count on mnt_want_write failure
2011-02-15 12:06:38 -08:00
Linus Torvalds
055d219441 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  get rid of nameidata_dentry_drop_rcu() calling nameidata_drop_rcu()
  drop out of RCU in return_reval
  split do_revalidate() into RCU and non-RCU cases
  in do_lookup() split RCU and non-RCU cases of need_revalidate
  nothing in do_follow_link() is going to see RCU
2011-02-15 08:06:36 -08:00
Linus Torvalds
007a14af26 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check return value of alloc_extent_map()
  Btrfs - Fix memory leak in btrfs_init_new_device()
  btrfs: prevent heap corruption in btrfs_ioctl_space_info()
  Btrfs: Fix balance panic
  Btrfs: don't release pages when we can't clear the uptodate bits
  Btrfs: fix page->private races
2011-02-15 08:00:35 -08:00
Martin Schwidefsky
261cd298a8 s390: remove task_show_regs
task_show_regs used to be a debugging aid in the early bringup days
of Linux on s390. /proc/<pid>/status is a world readable file, it
is not a good idea to show the registers of a process. The only
correct fix is to remove task_show_regs.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-15 07:34:16 -08:00
Al Viro
4e924a4f53 get rid of nameidata_dentry_drop_rcu() calling nameidata_drop_rcu()
can't happen anymore and didn't work right anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-02-15 02:26:54 -05:00
Al Viro
f60aef7ec6 drop out of RCU in return_reval
... thus killing the need to handle drop-from-RCU in d_revalidate()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-02-15 02:26:54 -05:00
Al Viro
f5e1c1c1af split do_revalidate() into RCU and non-RCU cases
fixing oopsen in lookup_one_len()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-02-15 02:26:54 -05:00
Al Viro
24643087e7 in do_lookup() split RCU and non-RCU cases of need_revalidate
and use unlikely() instead of gotos, for fsck sake...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-02-15 02:26:54 -05:00
Al Viro
844a391799 nothing in do_follow_link() is going to see RCU
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-02-15 02:26:53 -05:00
Tsutomu Itoh
c26a920373 Btrfs: check return value of alloc_extent_map()
I add the check on the return value of alloc_extent_map() to several places.
In addition, alloc_extent_map() returns only the address or NULL.
Therefore, check by IS_ERR() is unnecessary. So, I remove IS_ERR() checking.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 16:21:37 -05:00
Ilya Dryomov
67100f255d Btrfs - Fix memory leak in btrfs_init_new_device()
Memory allocated by calling kstrdup() should be freed.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 16:21:31 -05:00
Dan Rosenberg
51788b1bdd btrfs: prevent heap corruption in btrfs_ioctl_space_info()
Commit bf5fc093c5 refactored
btrfs_ioctl_space_info() and introduced several security issues.

space_args.space_slots is an unsigned 64-bit type controlled by a
possibly unprivileged caller.  The comparison as a signed int type
allows providing values that are treated as negative and cause the
subsequent allocation size calculation to wrap, or be truncated to 0.
By providing a size that's truncated to 0, kmalloc() will return
ZERO_SIZE_PTR.  It's also possible to provide a value smaller than the
slot count.  The subsequent loop ignores the allocation size when
copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR.

The fix changes the slot count type and comparison typecast to u64,
which prevents truncation or signedness errors, and also ensures that we
don't copy more data than we've allocated in the subsequent loop.  Note
that zero-size allocations are no longer possible since there is already
an explicit check for space_args.space_slots being 0 and truncation of
this value is no longer an issue.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 16:04:23 -05:00
Yan, Zheng
6848ad6461 Btrfs: Fix balance panic
Mark the cloned backref_node as checked in clone_backref_node()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 16:00:03 -05:00
Chris Mason
e3f24cc521 Btrfs: don't release pages when we can't clear the uptodate bits
Btrfs tracks uptodate state in an rbtree as well as in the
page bits.  This is supposed to enable us to use block sizes other than
the page size, but there are a few parts still missing before that
completely works.

But, our readpage routine trusts this additional range based tracking
of uptodateness, much in the same way the buffer head up to date bits
are trusted for the other filesystems.

The problem is that sometimes we need to allocate memory in order to
split records in the rbtree, even when we are just clearing bits.  This
can be difficult when our clearing function is called GFP_ATOMIC, which
can happen in the releasepage path.

So, what happens today looks like this:

releasepage called with GFP_ATOMIC
btrfs_releasepage calls clear_extent_bit
clear_extent_bit fails to allocate ram, leaving the up to date bit set
btrfs_releasepage returns success

The end result is the page being gone, but btrfs thinking the range is
up to date.   Later on if someone tries to read that same page, the
btrfs readpage code will return immediately thinking the page is already
up to date.

This commit fixes things to fail the releasepage when we can't clear the
extent state bits.  It covers both data pages and metadata tree blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 13:04:01 -05:00
Chris Mason
eb14ab8ed2 Btrfs: fix page->private races
There is a race where btrfs_releasepage can drop the
page->private contents just as alloc_extent_buffer is setting
up pages for metadata.  Because of how the Btrfs page flags work,
this results in us skipping the crc on the page during IO.

This patch sovles the race by waiting until after the extent buffer
is inserted into the radix tree before it sets page private.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-02-14 13:03:52 -05:00
J. Bruce Fields
83f6b0c182 nfsd: break lease on unlink due to rename
4795bb37ef "nfsd: break lease on unlink,
link, and rename", only broke the lease on the file that was being
renamed, and didn't handle the case where the target path refers to an
already-existing file that will be unlinked by a rename--in that case
the target file should have any leases broken as well.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:19 -05:00
J. Bruce Fields
acfdf5c383 nfsd4: acquire only one lease per file
Instead of acquiring one lease each time another client opens a file,
nfsd can acquire just one lease to represent all of them, and reference
count it to determine when to release it.

This fixes a regression introduced by
c45821d263 "locks: eliminate fl_mylease
callback": after that patch, only the struct file * is used to determine
who owns a given lease.  But since we recently converted the server to
share a single struct file per open, if we acquire multiple leases on
the same file from nfsd, it then becomes impossible on unlocking a lease
to determine which of those leases (all of whom share the same struct
file *) we meant to remove.

Thanks to Takashi Iwai <tiwai@suse.de> for catching a bug in a previous
version of this patch.

Tested-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:19 -05:00
J. Bruce Fields
5d926e8c2f nfsd4: modify fi_delegations under recall_lock
Modify fi_delegations only under the recall_lock, allowing us to use
that list on lease breaks.

Also some trivial cleanup to simplify later changes.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:19 -05:00
J. Bruce Fields
65bc58f518 nfsd4: remove unused deleg dprintk's.
These aren't all that useful, and get in the way of the next steps.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:19 -05:00
J. Bruce Fields
edab9782b5 nfsd4: split lease setting into separate function
Splitting some code into a separate function which we'll be adding some
more to.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
J. Bruce Fields
dd239cc05f nfsd4: fix leak on allocation error
Also share some common exit code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
J. Bruce Fields
22d38c4c10 nfsd4: add helper function for lease setup
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
J. Bruce Fields
6b57d9c86d nfsd4: split up nfsd_break_deleg_cb
We'll be adding some more code here soon.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
Konstantin Khorenko
3aa6e0aa8a NFSD: memory corruption due to writing beyond the stat array
If nfsd fails to find an exported via NFS file in the readahead cache, it
should increment corresponding nfsdstats counter (ra_depth[10]), but due to a
bug it may instead write to ra_depth[11], corrupting the following field.

In a kernel with NFSDv4 compiled in the corruption takes the form of an
increment of a counter of the number of NFSv4 operation 0's received; since
there is no operation 0, this is harmless.

In a kernel with NFSDv4 disabled it corrupts whatever happens to be in the
memory beyond nfsdstats.

Signed-off-by: Konstantin Khorenko <khorenko@openvz.org>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
Benny Halevy
0af3f814cc NFSD: use nfserr for status after decode_cb_op_status
Bugs introduced in 85a5648019
"NFSD: Update XDR decoders in NFSv4 callback client"

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:35:18 -05:00
J. Bruce Fields
541ce98c10 nfsd: don't leak dentry count on mnt_want_write failure
The exit cleanup isn't quite right here.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2011-02-14 10:31:08 -05:00
Linus Torvalds
c8e0b00ed1 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: call __jbd2_log_start_commit with j_state_lock write locked
  ext4: serialize unaligned asynchronous DIO
  ext4: make grpinfo slab cache names static
  ext4: Fix data corruption with multi-block writepages support
  ext4: fix up ext4 error handling
  ext4: unregister features interface on module unload
  ext4: fix panic on module unload when stopping lazyinit thread
2011-02-12 09:10:24 -08:00
Theodore Ts'o
e447183180 jbd2: call __jbd2_log_start_commit with j_state_lock write locked
On an SMP ARM system running ext4, I've received a report that the
first J_ASSERT in jbd2_journal_commit_transaction has been triggering:

	J_ASSERT(journal->j_running_transaction != NULL);

While investigating possible causes for this problem, I noticed that
__jbd2_log_start_commit() is getting called with j_state_lock only
read-locked, in spite of the fact that it's possible for it might
j_commit_request.  Fix this by grabbing the necessary information so
we can test to see if we need to start a new transaction before
dropping the read lock, and then calling jbd2_log_start_commit() which
will grab the write lock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:18:24 -05:00
Eric Sandeen
e9e3bcecf4 ext4: serialize unaligned asynchronous DIO
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.

The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and 
dio_zero_block() will zero out the unwritten portions.  When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.

Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO.  I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write().  But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.

I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex.  So that won't work.

This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment.  I've
tested a backport of this patch with qemu, and it does
avoid the corruption.  It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.

Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.

[tytso@mit.edu: Keep the mutex as a hashed array instead
 of bloating the ext4 inode]

[tytso@mit.edu: Fix up namespace issues so that global
 variables are protected with an "ext4_" prefix.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:17:34 -05:00
Eric Sandeen
2892c15ddd ext4: make grpinfo slab cache names static
In 2.6.37 I was running into oopses with repeated module
loads & unloads.  I tracked this down to:

fb1813f4 ext4: use dedicated slab caches for group_info structures

(this was in addition to the features advert unload problem)

The kstrdup & subsequent kfree of the cache name was causing
a double free.  In slub, at least, if I read it right it allocates
& frees the name itself, slab seems to do something different...
so in slub I think we were leaking -our- cachep->name, and double
freeing the one allocated by slub.

After getting lost in slab/slub/slob a bit, I just looked at other
sized-caches that get allocated.  jbd2, biovec, sgpool all do it
more or less the way jbd2 does.  Below patch follows the jbd2
method of dynamically allocating a cache at mount time from
a list of static names.

(This might also possibly fix a race creating the caches with
parallel mounts running).

[Folded in a fix from Dan Carpenter which fixed an off-by-one error in
the original patch]

Cc: stable@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-02-12 08:12:18 -05:00
Linus Torvalds
d40b0c3482 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: use single thread workqueues
2011-02-11 16:29:57 -08:00
Linus Torvalds
3aec46c1e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: don't always drop malformed replies on the floor (try #3)
  cifs: clean up checks in cifs_echo_request
  [CIFS] Do not send SMBEcho requests on new sockets until SMBNegotiate
2011-02-11 16:29:50 -08:00
Boaz Harrosh
d863b50ab0 vfs: call rcu_barrier after ->kill_sb()
In commit fa0d7e3de6 ("fs: icache RCU free inodes"), we use rcu free
inode instead of freeing the inode directly.  It causes a crash when we
rmmod immediately after we umount the volume[1].

So we need to call rcu_barrier after we kill_sb so that the inode is
freed before we do rmmod.  The idea is inspired by Aneesh Kumar.
rcu_barrier will wait for all callbacks to end before preceding.  The
original patch was done by Tao Ma, but synchronize_rcu() is not enough
here.

1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2

Tested-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-11 16:12:19 -08:00
Linus Torvalds
2dab597441 Fix possible filp_cachep memory corruption
In commit 31e6b01f41 ("fs: rcu-walk for path lookup") we started doing
path lookup using RCU, which then falls back to a careful non-RCU lookup
in case of problems (LOOKUP_REVAL).  So do_filp_open() has this "re-do
the lookup carefully" looping case.

However, that means that we must not release the open-intent file data
if we are going to loop around and use it once more!

Fix this by moving the release of the open-intent data to the function
that allocates it (do_filp_open() itself) rather than the helper
functions that can get called multiple times (finish_open() and
do_last()).  This makes the logic for the lifetime of that field much
more obvious, and avoids the possible double free.

Reported-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-11 15:53:38 -08:00