Commit Graph

33426 Commits

Author SHA1 Message Date
David Sterba
8b87dc17fb btrfs: add mount option to set commit interval
I'ts hardcoded to 30 seconds which is fine for most users. Higher values
defer data being synced to permanent storage with obvious consequences
when the system crashes. The upper bound is not forced, but a warning is
printed if it's more than 300 seconds (5 minutes).

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:51 -04:00
Josef Bacik
9ec7267751 Btrfs: stop using GFP_ATOMIC when allocating rewind ebs
There is no reason we can't just set the path to blocking and then do normal
GFP_NOFS allocations for these extent buffers.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:50 -04:00
Josef Bacik
db7f3436c1 Btrfs: deal with enomem in the rewind path
We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the
tree mod log.  Instead of BUG_ON()'ing in this case pass up ENOMEM.  I looked
back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret)
in this path.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:49 -04:00
Josef Bacik
ebdad913aa Btrfs: check our parent dir when doing a compare send
When doing a send with a parent subvol we will check to see if the file we are
acting on is being overwritten and move it if we think it may be needed further
down the line during the send.  We check this by checking its directory and
making sure it existed in the parent and making sure the file existed in the
parent.  The problem with this check is that if we create a directory and a file
in that directory, and then snapshot, and then remove and re-create that same
directory and file with different inode numbers and then try to snapshot and
send with the original parent we will try and save the original file inside of
that directory.  This is a problem because during the receive we move the
directory out of the way because it is a completely new inode, which makes us
unable to find the old file inside of the directory when we try to move that out
of the way for the overwrite.  We fix this by checking the parent directory of
the inode we think we are overwriting.  If the parent directory generation in
the send root != the parent directory generation in the parent root then we know
it is a completely new directory and we need not bother with moving the file out
of the way because it would have been completely destroyed.  This fixes bz
60673.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:48 -04:00
Josef Bacik
36cce92287 Btrfs: handle errors when doing slow caching
Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait
forever if a drive failed.  This is because we just bail out if there is an
error while trying to cache a block group, we don't update anybody who may be
waiting.  So this introduces a new enum for the cache state in case of error and
makes everybody bail out if we have an error.  Alex tested and verified this
patch fixed his problem.  This fixes bz 59431.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:47 -04:00
Filipe David Borba Manana
0f0fe8f710 Btrfs: add missing error handling to read_tree_block
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:46 -04:00
Dave Jones
eb2067f713 Fix leak in __btrfs_map_block error path
If we bail out when the stripe alloc fails, we need to undo the
earlier allocation of raid_map.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:45 -04:00
Filipe David Borba Manana
f5929cd814 Btrfs: add missing error check to find_parent_nodes
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:44 -04:00
Filipe David Borba Manana
395927a9d8 Btrfs: optimize function btrfs_read_chunk_tree
After reading all device items from the chunk tree, don't
exit the loop and then navigate down the tree again to find
the chunk items. Instead just read all device items and
chunk items with a single tree search. This is possible
because all device items are found before any chunk item in
the chunks tree.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:43 -04:00
Josef Bacik
6596a92819 Btrfs: don't bug_on when we fail when cleaning up transactions
There is no reason for this sort of jackassery.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:42 -04:00
Josef Bacik
b6c60c8018 Btrfs: change how we queue blocks for backref checking
Previously we only added blocks to the list to have their backrefs checked if
the level of the block is right above the one we are searching for.  This is
because we want to make sure we don't add the entire path up to the root to the
lists to make sure we process things one at a time.  This assumes that if any
blocks in the path to the root are going to be not checked (shared in other
words) then they will be in the level right above the current block on up.  This
isn't quite right though since we can have blocks higher up the list that are
shared because they are attached to a reloc root.  But we won't add this block
to be checked and then later on we will BUG_ON(!upper->checked).  So instead
keep track of wether or not we've queued a block to be checked in this current
search, and if we haven't go ahead and queue it to be checked.  This patch fixed
the panic I was seeing where we BUG_ON(!upper->checked).  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:41 -04:00
Josef Bacik
d062d13cf1 Btrfs: check to see if we have an inline item properly
If our item isn't big enough to have an actual inline item when we have skinny
metadata enabled just return 1 in find_inline_backref so we can move on to the
next item.  This probably wasn't causing a problem since we check the values of
ptr and end properly, but just in case this will keep us from doing extra work.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:40 -04:00
Josef Bacik
151a41bc46 Btrfs: fix what bits we clear when erroring out from delalloc
First of all we no longer set EXTENT_DIRTY when we dirty an extent so this patch
removes the clearing of EXTENT_DIRTY since it confuses me.  This patch also adds
clearing EXTENT_DEFRAG and also doing EXTENT_DO_ACCOUNTING when we have errors.
This is because if we are clearing delalloc without adding an ordered extent
then we need to make sure the enospc handling stuff is accounted for.  Also if
this range was DEFRAG we need to make sure that bit is cleared so we dont leak
it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:39 -04:00
Josef Bacik
c2790a2e2b Btrfs: cleanup arguments to extent_clear_unlock_delalloc
This patch removes the io_tree argument for extent_clear_unlock_delalloc since
we always use &BTRFS_I(inode)->io_tree, and it separates out the extent tree
operations from the page operations.  This way we just pass in the extent bits
we want to clear and then pass in the operations we want done to the pages.
This is because I'm going to fix what extent bits we clear in some cases and
rather than add a bunch of new flags we'll just use the actual extent bits we
want to clear.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:38 -04:00
Anand Jain
8068a47e2a btrfs: use BTRFS_SUPER_INFO_SIZE macro at btrfs_read_dev_super()
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:37 -04:00
Miao Xie
125bac016d Btrfs: cache the extent map struct when reading several pages
When we read several pages at once, we needn't get the extent map object
every time we deal with a page, and we can cache the extent map object.
So, we can reduce the search time of the extent map, and besides that, we
also can reduce the lock contention of the extent map tree.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:36 -04:00
Miao Xie
9974090bdd Btrfs: batch the extent state operation when reading pages
In the past, we cached the checksum value in the extent state object, so we
had to split the extent state object by the block size, or we had no space
to keep this checksum value. But it increased the lock contention of the
extent state tree.

Now we removed this limit by caching the checksum into the bio object, so
it is unnecessary to do the extent state operations by the block size, we
can do it in batches, in this way, we can reduce the lock contention of
the extent state tree.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:35 -04:00
Miao Xie
883d0de485 Btrfs: batch the extent state operation in the end io handle of the read page
Before applying this patch, we set the uptodate flag and unlock the extent
by the page size, it is unnecessary, we can do it in batches, it can reduce
the lock contention of the extent state tree.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:34 -04:00
Miao Xie
facc8a2247 Btrfs: don't cache the csum value into the extent state tree
Before applying this patch, we cached the csum value into the extent state
tree when reading some data from the disk, this operation increased the lock
contention of the state tree.

Now, we just store the csum value into the bio structure or other unshared
structure, so we can reduce the lock contention.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:33 -04:00
Miao Xie
f2a09da9d0 Btrfs: add branch prediction hints in the read page end IO function
This patch add some branch prediction hints into the end IO function
of the read page, it reduced the percentage of the branch misses from
5.5% to 4.9%.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:32 -04:00
Miao Xie
09a7f7a289 Btrfs: remove unnecessary argument of bio_readpage_error()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:31 -04:00
Wang Shilong
8507d216a4 Btrfs: add missing mounting options in btrfs_show_options()
Some options are missing in btrfs_show_options(), this patch
adds them.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:30 -04:00
Wang Shilong
1493381f2f Btrfs: use u64 for subvolid when parsing mount options
Although for most time, int is enough for subvolid, we should
ensure safety in theory.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:29 -04:00
Wang Shilong
2c334e87f3 Btrfs: add sanity checks regarding to parsing mount options
I just notice the following commands succeed:
	mount <dev> <mnt> -o thread_pool=-1

This is ridiculous, only positive thread_pool makes sense,this
patch adds sanity checks for them, and also catches the error of
ENOMEM if allocating memory fails.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:28 -04:00
Miao Xie
3cd846d1d7 Btrfs, raid56: fix memory leak when allocating pages for p/q stripes failed
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:27 -04:00
Dan Carpenter
3dc0e818af btrfs/raid56: fix and cleanup some error paths
The alloc_rbio() frees "raid_map" and "bbio" on error, so there is a
potential double free bug in raid56_parity_write().  The
raid56_parity_write() and raid56_parity_recover() functions should still
free "raid_map" and "bbio" on error if other errors occur though, so I
have added some more calls to kfree().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:26 -04:00
Josef Bacik
2112ac800d Btrfs: don't bother autodefragging if our root is going away
We can end up with inodes on the auto defrag list that exist on roots that are
going to be deleted.  This is extra work we don't need to do, so just bail if
our root has 0 root refs.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:25 -04:00
Josef Bacik
b37b39cd6b Btrfs: cleanup reloc roots properly on error
I was hitting the BUG_ON() at the end of merge_reloc_roots() because we were
aborting the transaction at some point previously and then getting an error when
we tried to drop the reloc root.  I fixed btrfs_drop_snapshot to re-add us to
the dead roots list if we failed, but this isn't the right thing to do for reloc
roots since it uses root->root_list for it's own stuff in order to know what
needs to be cleaned up.  So fix btrfs_drop_snapshot to only do the re-add if we
aren't dropping for reloc, and handle errors from merge_reloc_root() by dropping
the reloc root we are processing since it won't be on the list of roots to
cleanup.  With this patch my reproducer no longer panics.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:24 -04:00
Josef Bacik
50f1319cb5 Btrfs: reset ret in record_one_backref
I was getting warnings when running find ./ -type f -exec btrfs fi defrag -f {}
\; from record_one_backref because ret was set.  Turns out it was because it was
set to 1 because the search slot didn't come out exact and we never reset it.
So reset it to 0 right after the search so we don't leak this and get
uneccessary warnings.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:23 -04:00
Anand Jain
a1b83ac52d btrfs: fix get set label blocking against balance
btrfs_ioctl_get_fslabel() and btrfs_ioctl_set_fslabel()
used root->fs_info->volume_mutex mutex which caused operations
like balance to block set/get label operation until its
completion and generally balance operation takes a long
time to complete, so it will be annoying to the user when
cli appears hung

also this patch will add a bit of optimization within
the btrfs_ioctl_get_falabel() function.

v1->v2:
   use fs_info->super_lock instead of uuid_mutex

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:04:15 -04:00
Stefan Behrens
d4c34f6bff Btrfs: Print key type in decimal everywhere
This is confusing, sometimes the key type is printed in hex (without
a leading "0x" which makes things even more complicated), sometimes
in decimal...
Change it to be in decimal everywhere.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:40 -04:00
Liu Bo
599c75ec3f Btrfs/tracepoint: update delayed ref tracepoints
This shows exactly how btrfs processes the delayed refs onto disks,
which is very helpful on understanding delayed ref mechanism and
debugging related bugs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:39 -04:00
chandan
1095cc0d92 btrfs_read_block_groups: Use enums to index
btrfs_space_info->block_groups.

The current code uses integer literals to index
btrfs_space_info->block_groups[] array. Instead use corresponding
enums from 'enum btrfs_raid_types'.

Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:38 -04:00
Qu Wenruo
3cae210fa5 btrfs: Cleanup for using BTRFS_SETGET_STACK instead of raw convert
Some codes still use the cpu_to_lexx instead of the
BTRFS_SETGET_STACK_FUNCS declared in ctree.h.

Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec
and other structures.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:37 -04:00
Wang Shilong
1e7bac1ef7 Btrfs: set qgroup_ulist to be null after calling ulist_free()
We call ulist_free(qgroup_ulist) in btrfs_free_qgroup_config(),
and btrfs_free_qgroup_config() may be called in two cases:

(1)umount filesystem
(2)disabling quota

However, if we firstly disable quota and then umount filesystem,
a double free happens. Fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:36 -04:00
Filipe David Borba Manana
647f63bd36 Btrfs: add missing error checks to add_data_references
The function relocation.c:add_data_references() was not checking
if all calls to __add_tree_block() and find_data_references() were
succeeding or not.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:35 -04:00
David Sterba
ccf39f92f3 btrfs: make errors in btrfs_num_copies less noisy
The log message level 'critical' is verbose enough, 'emergency' beeps on
all terminals.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:34 -04:00
Liu Bo
52ee28d249 Btrfs: make free space caching faster with many non-inline extent references
So to cache free space, we iterate every extent item to gather free space info.

When we have say 10,000 non-inline extent refs(such as BTRFS_EXTENT_DATA_REF),
it takes quite a long time, and since inline extent refs and non-inline ones have
same objectid in their keys, we can just re-search the tree with the next address
to skip non-inline references.

(This is found by dedup feature because dedup extents can end up with many
non-inline extent refs.)

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:24 -04:00
Jeff Mahoney
ee3441b490 btrfs: fall back to global reservation when removing subvolumes
I recently did some ENOSPC testing that involved filling the disk
while create and removing snapshots in a loop. During the test cycle,
I ran into an ENOSPC when trying to remove a snapshot, leaving the fs
stuck in ENOSPC even after a umount/mount cycle.

This patch allow subvolume removal to fall back onto the global
block reservation in order to succeed when it would have failed
otherwise.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:23 -04:00
Filipe David Borba Manana
74be951087 Btrfs: optimize btrfs_lookup_extent_info()
If we're looking for a metadata item in the tree and the
search fails with return value of 1, and the slot doesn't
point to the first item in the leaf, check if the previous
item in the leaf corresponds to an extent item for the same
object id - if it does, then don't do another tree search
to get it.

This optimization is already done by btrfs-progs.

V2: updated commit message.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:22 -04:00
Carey Underwood
d790155457 Btrfs: Release uuid_mutex for shrink during device delete
Device scanning waits on the uuid_mutex, which can result in a very long
wait if dev delete is shrinking the device.

Signed-off-by: Carey Underwood <cwillu@cwillu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:21 -04:00
Josef Bacik
b2aaaa3b8c Btrfs: set lockdep class before locking new extent buffer
We've been seeing spurious complaints out of lockdep because the lock class name
changes.  This is happening because when we drop a snapshot we will lock a block
before we've read it in, which sets the lockdep class to whatever the default
is.  Then once we read the thing in we reset the lockdep class to what it is
supposed to be, which blows lockdeps' mind.  This patch should fix the problem,
it appears to be the only place where we do this sort of thing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:20 -04:00
Stefan Agner
59516f6017 Btrfs: return -1 when lzo compression makes data bigger
With this fix the lzo code behaves like the zlib code by returning an
error
code when compression does not help reduce the size of the file.
This is currently not a bug since the compressed size is checked again
in
the calling method compress_file_range.

Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:19 -04:00
Josef Bacik
c8cc634165 Btrfs: stop using GFP_ATOMIC for the tree mod log allocations
Previously we held the tree mod lock when adding stuff because we use it to
check and see if we truly do want to track tree modifications.  This is
admirable, but GFP_ATOMIC in a critical area that is going to get hit pretty
hard and often is not nice.  So instead do our basic checks to see if we don't
need to track modifications, and if those pass then do our allocation, and then
when we go to insert the new modification check if we still care, and if we
don't just free up our mod and return.  Otherwise we're good to go and we can
carry on.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 07:57:17 -04:00
Eric W. Biederman
c7b96acf14 userns: Kill nsown_capable it makes the wrong thing easy
nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID.  For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing.  So remove nsown_capable.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-30 23:44:11 -07:00
Maxime Bizon
3bd11cf56e pstore/ram: (really) fix undefined usage of rounddown_pow_of_two
Previous attempt to fix was b042e47491

Suggested use of is_power_of_2() was bogus because is_power_of_2(0) is
false (documented behaviour).

Signed-off-by: Maxime Bizon <mbizon@freebox.fr>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-30 15:57:01 -07:00
J. Bruce Fields
248f807b47 nfsd4: nfsd4_create_clid_dir prints uninitialized data
Take the easy way out and just remove the printk.

Reported-by: David Howells <dhowells@redhat.com>
2013-08-30 17:30:52 -04:00
J. Bruce Fields
bf7bd3e98b nfsd4: fix leak of inode reference on delegation failure
This fixes a regression from 68a3396178
"nfsd4: shut down more of delegation earlier".

After that commit, nfs4_set_delegation() failures result in
nfs4_put_delegation being called, but nfs4_put_delegation doesn't free
the nfs4_file that has already been set by alloc_init_deleg().

This can result in an oops on later unmounting the exported filesystem.

Note also delaying the fi_had_conflict check we're able to return a
better error (hence give 4.1 clients a better idea why the delegation
failed; though note CONFLICT isn't an exact match here, as that's
supposed to indicate a current conflict, but all we know here is that
there was one recently).

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-08-30 17:30:52 -04:00
J. Bruce Fields
3477565e6a Revert "nfsd: nfs4_file_get_access: need to be more careful with O_RDWR"
This reverts commit df66e75395.

nfsd4_lock can get a read-only or write-only reference when only a
read-write open is available.  This is normal.

Cc: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-08-30 17:30:45 -04:00
J. Bruce Fields
b8297cec2d Linux 3.11-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJSCDSjAAoJEHm+PkMAQRiGDXMIAI7Loae0Oqb1eoeJkvjyZsBS
 OJDeeEcn+k58VbxVHyRdc7hGo4yI4tUZm172SpnOaM8sZ/ehPU7zBrwJK2lzX334
 /jAM3uvVPfxA2nu0I4paNpkED/NQ8NRRsYE1iTE8dzHXOH6dA3mgp5qfco50rQvx
 rvseXpME4KIAJEq4jnyFZF5+nuHiPueM9JftPmSSmJJ3/KY9kY1LESovyWd7ttg1
 jYSVPFal9J0E+tl2UQY5g9H16GqhhjYn+39Iei6Q5P4bL4ZubQgTRQTN9nyDc06Z
 ezQtGoqZ8kEz/2SyRlkda6PzjSEhgXlc8mCL5J7AW+dMhTHHx2IrosjiCA80kG8=
 =c0rK
 -----END PGP SIGNATURE-----

Merge tag 'v3.11-rc5' into for-3.12 branch

For testing purposes I want some nfs and nfsd bugfixes (specifically,
58cd57bfd9 and previous nfsd patches, and
Trond's 4f3cc4809a).
2013-08-30 16:42:49 -04:00
Eric Sandeen
914ed44b17 Fix wrong flag ASSERT in xfs_attr_shortform_getvalue
This ASSERT is testing an if_flags flag value against
a di_aformat enum value.  di_aformat is never assigned
XFS_IFINLINE.

This happens to work for now, because XFS_IFINLINE has
the same value as XFS_DINODE_FMT_LOCAL, and that's tested
just before we call this function.

However, I think the intention is to assert that we have
read in the data, i.e. XFS_IFINLINE on if_flags, before
we use if_data.  This is done in other places through the
code as well.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 15:20:50 -05:00
Dave Chinner
904c17e683 xfs: finish removing IOP_* macros.
In optimising the CIL operations, some of the IOP_* macros for
calling log item operations were removed. Remove the rest of them as
Christoph requested.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 14:14:35 -05:00
Dave Chinner
239567033c xfs: inode log reservations are too small
We've been seeing occasional problems with log space leaks and
transaction underruns such as this for some time:

 XFS (dm-0): xlog_write: reservation summary:
   trans type  = FSYNC_TS (36)
   unit res    = 2740 bytes
   current res = -4 bytes
   total reg   = 0 bytes (o/flow = 0 bytes)
   ophdrs      = 0 (ophdr space = 0 bytes)
   ophdr + reg = 0 bytes
   num regions = 0

Turns out that xfstests generic/311 is reliably reproducing this
problem with the test it runs at sequence 16 of it execution. It is
a 100% reliable reproducer with the mkfs configuration of "-b
size=1024 -m crc=1" on a 10GB scratch device.

The problem? Inode forks in btree format are logged in memory
format, not disk format (i.e. bmbt format, not bmdr format). That
means there is a btree block header being logged, when such a
structure is never written to the inode fork in bmdr format. The
bmdr header in the inode is only 4 bytes, while the bmbt header is
24 bytes for v4 filesystems and 72 bytes for v5 filesystems.

We currently reserve the inode size plus the rounded up overhead of
a logging a buffer, which is 128 bytes. That means the reservation
for a 512 byte inode is 640 bytes. What we can actually log is:

	inode core, data and attr fork = 512 bytes
	inode log format + log op header = 56 + 12 = 68 bytes
	data fork bmbt hdr = 24/72 bytes
	attr fork bmbt hdr = 24/72 bytes

So, for a v2 inodes we can log at least 628 bytes, but if we split that
inode over the end of the log across log buffers, we need to also
another log op header, which takes us to 640 bytes. If there's
another reservation taken out of this that I haven't taken into
account (perhaps multiple iclog splits?) or I haven't corectly
calculated the bmbt format space used (entirely possible), then
we will overun it.

For v3 inodes the maximum is actually 724 bytes, and even a
single maximally sized btree format fork can blow it (652 bytes).
And that's exactly what is happening with the FSYNC_TS transaction
in the above output - it's consumed 644 bytes of space after the CIL
context took the space reserved for it (2100 bytes).

This problem has always been present in the XFS code - the btree
format inode forks have always been logged in this manner. Hence
there has always been the possibility of an overrun with such a
transaction. The CRC code has just exposed it frequently enough to
be able to debug and understand the root cause....

So, let's fix all the inode log space reservations.

[ I'm so glad we spent the effort to clean up the transaction
  reservation code. This is an easy fix now. ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:59:30 -05:00
Brian Foster
b121099d84 xfs: check correct status variable for xfs_inobt_get_rec() call
The call to xfs_inobt_get_rec() in xfs_dialloc_ag() passes 'j' as
the output status variable. The immediately following
XFS_WANT_CORRUPTED_GOTO() checks the value of 'i,' which is from
the previous lookup call and has already been checked. Fix the
corruption check to use 'j.'

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:48:35 -05:00
Dave Chinner
d8914002a0 xfs: inode buffers may not be valid during recovery readahead
CRC enabled filesystems fail log recovery with 100% reliability on
xfstests xfs/085 with the following failure:

XFS (vdb): Mounting Filesystem
XFS (vdb): Starting recovery (logdev: internal)
XFS (vdb): Corruption detected. Unmount and run xfs_repair
XFS (vdb): bad inode magic/vsn daddr 144 #0 (magic=0)
XFS: Assertion failed: 0, file: fs/xfs/xfs_inode_buf.c, line: 95

The problem is that the inode buffer has not been recovered before
the readahead on the inode buffer is issued. The checkpoint being
recovered actually allocates the inode chunk we are doing readahead
from, so what comes from disk during readahead is essentially
random and the verifier barfs on it.

This inode buffer readahead problem affects non-crc filesystems,
too, but xfstests does not trigger it at all on such
configurations....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:45:49 -05:00
Dave Chinner
50d5c8d8e9 xfs: check LSN ordering for v5 superblocks during recovery
Log recovery has some strict ordering requirements which unordered
or reordered metadata writeback can defeat. This can occur when an
item is logged in a transaction, written back to disk, and then
logged in a new transaction before the tail of the log is moved past
the original modification.

The result of this is that when we read an object off disk for
recovery purposes, the buffer that we read may not contain the
object type that recovery is expecting and hence at the end of the
checkpoint being recovered we have an invalid object in memory.

This isn't usually a problem, as recovery will then replay all the
other checkpoints and that brings the object back to a valid and
correct state, but the issue is that while the object is in the
invalid state it can be flushed to disk. This results in the object
verifier failing and triggering a corruption shutdown of log
recover. This is correct behaviour for the verifiers - the problem
is that we are not detecting that the object we've read off disk is
newer than the transaction we are replaying.

All metadata in v5 filesystems has the LSN of it's last modification
stamped in it. This enabled log recover to read that field and
determine the age of the object on disk correctly. If the LSN of the
object on disk is older than the transaction being replayed, then we
replay the modification. If the LSN of the object matches or is more
recent than the transaction's LSN, then we should avoid overwriting
the object as that is what leads to the transient corrupt state.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:44:53 -05:00
Dave Chinner
b58fa554e9 xfs: btree block LSN escaping to disk uninitialised
When testing LSN ordering code for v5 superblocks, it was discovered
that the the LSN embedded in the generic btree blocks was
occasionally uninitialised. These values didn't get written to disk
by metadata writeback - they got written by previous transactions in
log recovery.

The issue is here that the when the block is first allocated and
initialised, the LSN field was not initialised - it gets overwritten
before IO is issued on the buffer - but the value that is logged by
transactions that modify the header before it is written to disk
(and initialised) contain garbage. Hence the first recovery of the
buffer will stamp garbage into the LSN field, and that can cause
subsequent transactions to not replay correctly.

The fix is simply to initialise the bb_lsn field to zero when we
initialise the block for the first time.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 13:43:34 -05:00
Benjamin LaHaise
77d30b14d2 aio: fix rcu sparse warnings introduced by ioctx table lookup patch
Sseveral sparse warnings were caused by missing rcu_dereference() annotations
for dereferencing mm->ioctx_table.  Thankfully, none of those were actual bugs
as the deref was protected by a spin lock in all instances.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2013-08-30 11:12:50 -04:00
Dave Chinner
3780437612 XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568
The calculation doesn't take into account the size of the dir v3
header, so overestimates the hash entries in a node. This causes
directory buffer overruns when splitting and merging nodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-30 09:48:59 -05:00
Benjamin LaHaise
79bd1bcf1a aio: remove unnecessary debugging from aio_free_ring()
The commit 36bc08cc01 ("fs/aio: Add support to aio ring pages migration")
added some debugging code that is not required and resulted in a build error
when 98474236f7 ("vfs: make the dentry cache use the lockref infrastructure")
was added to the tree.  The code is not required, so just delete it.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-08-30 10:22:04 -04:00
Trond Myklebust
d7631250b2 NFSv4: Fix a potentially Oopsable condition in __nfs_idmap_unregister
Ensure that __nfs_idmap_unregister can be called twice without
consequences.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30 09:19:38 -04:00
Trond Myklebust
c219066103 SUNRPC: Replace clnt->cl_principal
The clnt->cl_principal is being used exclusively to store the service
target name for RPCSEC_GSS/krb5 callbacks. Replace it with something that
is stored only in the RPCSEC_GSS-specific code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30 09:19:36 -04:00
Trond Myklebust
2d9db75005 NFS: Fix up two use-after-free issues with the new tracing code
We don't want to pass the context argument to trace_nfs_atomic_open_exit()
after it has been released.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-30 09:19:34 -04:00
Dave Chinner
0f0d334595 xfs: fix bad dquot buffer size in log recovery readahead
xfstests xfs/087 fails 100% reliably with this assert:

XFS (vdb): Mounting Filesystem
XFS (vdb): Starting recovery (logdev: internal)
XFS: Assertion failed: bp->b_flags & XBF_STALE, file: fs/xfs/xfs_buf.c, line: 548

while trying to read a dquot buffer in xlog_recover_dquot_ra_pass2().

The issue is that the buffer length to read that is passed to
xfs_buf_readahead is in units of filesystem blocks, not disk blocks.
(i.e. FSB, not daddr). Fix it but putting the correct conversion in
place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-29 10:51:35 -05:00
Dave Chinner
84a5b7300c xfs: don't account buffer cancellation during log recovery readahead
When doing readhaead in log recovery, we check to see if buffers are
cancelled before doing readahead. If we find a cancelled buffer,
however, we always decrement the reference count we have on it, and
that means that readahead is causing a double decrement of the
cancelled buffer reference count.

This results in log recovery *replaying cancelled buffers* as the
actual recovery pass does not find the cancelled buffer entry in the
commit phase of the second pass across a transaction. On debug
kernels, this results in an ASSERT failure like so:

XFS: Assertion failed: !(flags & XFS_BLF_CANCEL), file: fs/xfs/xfs_log_recover.c, line: 1815

xfstests generic/311 reproduces this ASSERT failure with 100%
reproducability.

Fix it by making readahead only peek at the buffer cancelled state
rather than the full accounting that xlog_check_buffer_cancelled()
does.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-29 10:37:06 -05:00
Eric W. Biederman
7dc5dbc879 sysfs: Restrict mounting sysfs
Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
over the net namespace.  The principle here is if you create or have
capabilities over it you can mount it, otherwise you get to live with
what other people have mounted.

Instead of testing this with a straight forward ns_capable call,
perform this check the long and torturous way with kobject helpers,
this keeps direct knowledge of namespaces out of sysfs, and preserves
the existing sysfs abstractions.

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-28 21:35:14 -07:00
Linus Torvalds
c95389b4cd Merge branch 'akpm' (patches from Andrew Morton)
Merge fixes from Andrew Morton:
 "Five fixes.

  err, make that six.  let me try again"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  fs/ocfs2/super.c: Use bigger nodestr to accomodate 32-bit node numbers
  memcg: check that kmem_cache has memcg_params before accessing it
  drivers/base/memory.c: fix show_mem_removable() to handle missing sections
  IPC: bugfix for msgrcv with msgtyp < 0
  Omnikey Cardman 4000: pull in ioctl.h in user header
  timer_list: correct the iterator for timer_list
2013-08-28 19:31:33 -07:00
Goldwyn Rodrigues
49fa8140e4 fs/ocfs2/super.c: Use bigger nodestr to accomodate 32-bit node numbers
While using pacemaker/corosync, the node numbers are generated using IP
address as opposed to serial node number generation.  This may not fit
in a 8-byte string.  Use a bigger string to print the complete node
number.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28 19:26:38 -07:00
Waiman Long
98474236f7 vfs: make the dentry cache use the lockref infrastructure
This just replaces the dentry count/lock combination with the lockref
structure that contains both a count and a spinlock, and does the
mechanical conversion to use the lockref infrastructure.

There are no semantic changes here, it's purely syntactic.  The
reference lockref implementation uses the spinlock exactly the same way
that the old dcache code did, and the bulk of this patch is just
expanding the internal "d_count" use in the dcache code to use
"d_lockref.count" instead.

This is purely preparation for the real change to make the reference
count updates be lockless during the 3.12 merge window.

[ As with the previous commit, this is a rewritten version of a concept
  originally from Waiman, so credit goes to him, blame for any errors
  goes to me.

  Waiman's patch had some semantic differences for taking advantage of
  the lockless update in dget_parent(), while this patch is
  intentionally a pure search-and-replace change with no semantic
  changes.     - Linus ]

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28 18:24:59 -07:00
Dan Carpenter
28d7b5684b Squashfs: sanity check information from disk
We read the size of the name from the disk, but a larger name than
expected would cause memory corruption.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2013-08-29 01:23:29 +01:00
Eric Sandeen
ad4eec6135 ext4: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.

The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.

Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do i.e.

# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...

and it'll mount even if the devnum has changed, as shown here:

# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0 
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1

Change the journal device number:

# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile 

And today it will fail:

# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal

But with this new mount option, we can specify the new path:

# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#

(which does update the encoded device number, incidentally):

# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device:	          0x0701

But best of all we can just always mount by journal-path, and
it'll always work:

# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#

So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-08-28 19:05:07 -04:00
Darrick J. Wong
bdfb6ff4a2 ext4: mark group corrupt on group descriptor checksum
If the group descriptor fails validation, mark the whole blockgroup
corrupt so that the inode/block allocators skip this group.  The
previous approach takes the risk of writing to a damaged group
descriptor; hopefully it was never the case that the [ib]bitmap fields
pointed to another valid block and got dirtied, since the memset would
fill the page with 1s.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 18:46:56 -04:00
Darrick J. Wong
87a39389be ext4: mark block group as corrupt on inode bitmap error
If we detect either a discrepancy between the inode bitmap and the
inode counts or the inode bitmap fails to pass validation checks, mark
the block group corrupt and refuse to allocate or deallocate inodes
from the group.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 18:32:58 -04:00
Darrick J. Wong
163a203ddb ext4: mark block group as corrupt on block bitmap error
When we notice a block-bitmap corruption (because of device failure or
something else), we should mark this group as corrupt and prevent
further block allocations/deallocations from it. Currently, we end up
generating one error message for every block in the bitmap. This
potentially could make the system unstable as noticed in some
bugs. With this patch, the error will be printed only the first time
and mark the entire block group as corrupted. This prevents future
access allocations/deallocations from it.

Also tested by corrupting the block
bitmap and forcefully introducing the mb_free_blocks error:
(1) create a largefile (2Gb)
$ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
(2) umount filesystem. use dumpe2fs to see which block-bitmaps
are in use by largefile and note their block numbers
(3) use dd to zero-out the used block bitmaps
$ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
(4) mount the FS and delete the largefile.
(5) recreate the largefile. verify that the new largefile does not
get any blocks from the groups marked as bad.
Without the patch, we will see mb_free_blocks error for each bit in
each zero'ed out bitmap at (4). With the patch, we only see the error
once per blockgroup:
[  309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[  309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[  309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.

Google-Bug-Id: 7258357

[darrick.wong@oracle.com]
Further modifications (by Darrick) to make more obvious that this corruption
bit applies to blocks only.  Set the corruption flag if the block group bitmap
verification fails.

Original-author: Aditya Kali <adityakali@google.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 17:35:51 -04:00
Darrick J. Wong
dbde0abed8 ext4: fix type declaration of ext4_validate_block_bitmap
The block_group parameter to ext4_validate_block_bitmap is both used
as a ext4_group_t inside the function and the same type is passed in
by all callers.  We might as well use the typedef consistently instead
of open-coding the 'unsigned int'.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 15:59:51 -04:00
Darrick J. Wong
48d9eb97dc ext4: error out if verifying the block bitmap fails
The block bitmap verification code assumes that calling ext4_error()
either panics the system or makes the fs readonly.  However, this is
not always true: when 'errors=continue' is specified, an error is
printed but we don't return any indication of error to the caller,
which is (probably) the block allocator, which pretends that the crud
we read in off the disk is a usable bitmap.  Yuck.

A block bitmap that fails the check should at least return no bitmap
to the caller.  The block allocator should be told to go look in a
different group, but that's a separate issue.

The easiest way to reproduce this is to modify bg_block_bitmap (on a
^flex_bg fs) to point to a block outside the block group; or you can
create a metadata_csum filesystem and zero out the block bitmaps.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 15:35:27 -04:00
Darrick J. Wong
18a6ea1e5c jbd2: Fix endian mixing problems in the checksumming code
In the jbd2 checksumming code, explicitly declare separate variables with
endianness information so that we don't get confused and screw things up again.
Also fixes sparse warnings.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 14:59:58 -04:00
Zheng Liu
d7b2a00c2e ext4: isolate ext4_extents.h file
After applied the commit (4a092d73), we have reduced the number of
source files that need to #include ext4_extents.h.  But we can do
better.

This commit defines ext4_zeroout_es() in extents.c and move
EXT_MAX_BLOCKS into ext4.h in order not to include ext4_extents.h in
indirect.c and ioctl.c.  Meanwhile we just need to include this file in
extent_status.c when ES_AGGRESSIVE_TEST is defined.  Otherwise, this
commit removes a duplicated declaration in trace/events/ext4.h.

After applied this patch, we just need to include ext4_extents.h file
in {super,migrate,move_extents,extents}.c, and it is easy for us to
define a new extent disk layout.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 14:47:06 -04:00
Anatol Pomozov
70261f568f ext4: Fix misspellings using 'codespell' tool
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 14:40:12 -04:00
Dmitry Monakhov
7afe5aa59e ext4: convert write_begin methods to stable_page_writes semantics
Use wait_for_stable_page() instead of wait_on_page_writeback()

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2013-08-28 14:30:47 -04:00
Andi Shyti
27b1b22882 ext4: fix use of potentially uninitialized variables in debugging code
If ext_debugging is enabled and path[depth].p_ext is NULL, len
and lblock are printed non initialized

Signed-off-by: Andi Shyti <andi@etezian.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-28 14:00:00 -04:00
Linus Torvalds
f0cc6ffb8c Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink"
This reverts commit bb2314b479.

It wasn't necessarily wrong per se, but we're still busily discussing
the exact details of this all, so I'm going to revert it for now.

It's true that you can already do flink() through /proc and that flink()
isn't new.  But as Brad Spengler points out, some secure environments do
not mount proc, and flink adds a new interface that can avoid path
lookup of the source for those kinds of environments.

We may re-do this (and even mark it for stable backporting back in 3.11
and possibly earlier) once the whole discussion about the interface is done.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28 09:18:05 -07:00
Sha Zhengju
7d6e1f5461 ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
Following we will begin to add memcg dirty page accounting around
__set_page_dirty_{buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
avoid exporting those details to filesystems.

Since vfs set_page_dirty() should be called under page lock, here we don't need elaborate
codes to handle racy anymore, and two WARN_ON() are added to detect such exceptions.
Thanks very much for Sage and Yan Zheng's coaching!

I tested it in a two server's ceph environment that one is client and the other is
mds/osd/mon, and run the following fsx test from xfstests:

  ./fsx   1MB -N 50000 -p 10000 -l 1048576
  ./fsx  10MB -N 50000 -p 10000 -l 10485760
  ./fsx 100MB -N 50000 -p 10000 -l 104857600

The fsx does lots of mmap-read/mmap-write/truncate operations and the tests completed
successfully without triggering any of WARN_ON.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-27 16:29:44 -07:00
Steven Whitehouse
9d35814355 GFS2: Merge ordered and writeback writepage
The writepages function was recently merged between writeback
and ordered mode. This completes the change by doing the same
with writepage. The remaining differences in writepage were
left over from some earlier time and not actually doing anything
useful.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-27 21:22:07 +01:00
majianpeng
ee7289bfad ceph: allow sync_read/write return partial successed size of read/write.
For sync_read/write, it may do multi stripe operations.If one of those
met erro, we return the former successed size rather than a error value.
There is a exception for write-operation met -EOLDSNAPC.If this occur,we
retry the whole write again.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-27 12:28:46 -07:00
majianpeng
02ae66d8b2 ceph: fix bugs about handling short-read for sync read mode.
cephfs . show_layout
>layyout.data_pool:     0
>layout.object_size:   4194304
>layout.stripe_unit:   4194304
>layout.stripe_count:  1

TestA:
>dd if=/dev/urandom of=test bs=1M count=2 oflag=direct
>dd if=/dev/urandom of=test bs=1M count=2 seek=4  oflag=direct
>dd if=test of=/dev/null bs=6M count=1 iflag=direct
The messages from func striped_read are:
ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
ceph:           file.c:350  : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT
ceph:           file.c:381  : zero tail 4194304
ceph:           file.c:390  : striped_read returns 6291456
The hole of file is from 2M--4M.But actualy it zero the last 4M include
the last 2M area which isn't a hole.
Using this patch, the messages are:
ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
ceph:           file.c:358  :  zero gap 2097152 to 4194304
ceph:           file.c:350  : striped_read 4194304~2097152 (read 4194304) got 2097152
ceph:           file.c:384  : striped_read returns 6291456

TestB:
>echo majianpeng > test
>dd if=test of=/dev/null bs=2M count=1 iflag=direct
The messages are:
ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
ceph:           file.c:350  : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT
ceph:           file.c:390  : striped_read returns 11
For this case,it did once more striped_read.It's no meaningless.
Using this patch, the message are:
ceph:           file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
ceph:           file.c:384  : striped_read returns 11

Big thanks to Yan Zheng for the patch.

Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-27 12:28:45 -07:00
Li Wang
e907574323 ceph: remove useless variable revoked_rdcache
Cleanup in handle_cap_grant().

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-27 12:28:44 -07:00
Sage Weil
b314a90d8f ceph: fix fallocate division
We need to use do_div to divide by a 64-bit value.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-08-27 12:26:29 -07:00
Gu Zheng
749ebfd174 f2fs: use strncasecmp() simplify the string comparison
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-27 21:50:12 +09:00
Jaegeuk Kim
8cb8268809 f2fs: fix omitting to update inode page
The f2fs_set_link updates its parent inode number, so we should sync this to
the inode block.
Otherwise, the data can be lost after sudden-power-off.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-27 21:49:04 +09:00
Linus Torvalds
83c425d222 One JFS patch to fix an incompatibility with NFSv4 resulting in the nfs
client reporting a readdir loop.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.21 (GNU/Linux)
 
 iQIcBAABAgAGBQJSG9IHAAoJEDaohF61QIxkptgP/jTEooTZ2uMmIouqj6rhrc81
 avrqtB26Ww74XBzuyiTuSBLUUJXGaLveC/i9rR2XOyA7cXJUbrqvqpZ2OExOURsf
 gHeLX20RKwKgaRQ6R9Xcri6wjWql4YHL/z3tI/DGpDkMfA0siocbHW+GZSfmMZ8c
 EOcAECY+tqrAgFL1oESTinglGNZ2V1f/IyKB8DiUDgDuMkV6Zw3083Ph7xB/bw9T
 Jffg6nvkST2mzZNTKgU8W3extW/X9GxxleYLED2jwEYioOFVAlvSKZTFD65NVUya
 1l3YH68jROnDSoHmgOup0C6i51/e7QFuBNdyR/8UK2OyOXkJ1wneXutUIiysWWty
 dY9+kxSSdpNuZvflGrZlP7yoEjGgRO/893owUcusiSgLTpQpNs2y7OVjCBwsHGa7
 AIWyu+FyOnNnmO6oiYkmNQlE3bAoz3z0CO7IuP5lm5HRgMIxp+k2k8yOHon/PcuF
 juQsbMydGcNVAjJlQuxCDh1uijOGLbDol/NekpUnyL02oy294raiWdy4IUaqWIHG
 Y5i1yeVwFbr7QsI8RCbEliCRwP5YMWSp4irEUozJTDplAn4AnJ5AJdYq9KM5JHMm
 qBHXl/asWbxGQZhmig2xzojZ+HVPFVWilpBz7OvsMXJaulrRqxV3DgAudIlZPP4B
 mB8DPzctwmgzmlgCqzRW
 =+aYs
 -----END PGP SIGNATURE-----

Merge tag 'jfs-3.11-rc8' of git://github.com/kleikamp/linux-shaggy

Pull jfs fix from Dave Kleikamp:
 "One JFS patch to fix an incompatibility with NFSv4 resulting in the
  nfs client reporting a readdir loop"

* tag 'jfs-3.11-rc8' of git://github.com/kleikamp/linux-shaggy:
  jfs: fix readdir cookie incompatibility with NFSv4
2013-08-26 19:22:49 -07:00
Eric W. Biederman
e51db73532 userns: Better restrictions on when proc and sysfs can be mounted
Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.

Verify that the mounted filesystem is not covered in any significant
way.  I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.

Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs.  This makes this
test local to the filesystems involved and the results current of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-26 19:17:03 -07:00
Eric W. Biederman
4ce5d2b1a8 vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces
Don't copy bind mounts of /proc/<pid>/ns/mnt between namespaces.
These files hold references to a mount namespace and copying them
between namespaces could result in a reference counting loop.

The current mnt_ns_loop test prevents loops on the assumption that
mounts don't cross between namespaces.  Unfortunately unsharing a
mount namespace and shared substrees can both cause mounts to
propogate between mount namespaces.

Add two flags CL_COPY_UNBINDABLE and CL_COPY_MNT_NS_FILE are added to
control this behavior, and CL_COPY_ALL is redefined as both of them.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-26 18:42:15 -07:00
Eric W. Biederman
aee1c13dd0 proc: Restrict mounting the proc filesystem
Don't allow mounting the proc filesystem unless the caller has
CAP_SYS_ADMIN rights over the pid namespace.  The principle here is if
you create or have capabilities over it you can mount it, otherwise
you get to live with what other people have mounted.

Andy pointed out that this is needed to prevent users in a user
namespace from remounting proc and specifying different hidepid and gid
options on already existing proc mounts.

Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-26 11:36:58 -07:00
Dan Carpenter
0d0ab120d1 xfs: check for underflow in xfs_iformat_fork()
The "di_size" variable comes from the disk and it's a signed 64 bit.
We check the upper limit but we should check for negative numbers as
well.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-26 11:28:08 -05:00
Fengguang Wu
98f7462c43 xfs: xfs_dir3_sfe_put_ino can be static
TO: Dave Chinner <david@fromorbit.com>
CC: Ben Myers <bpm@sgi.com>
CC: linux-kernel@vger.kernel.org 

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-26 11:27:40 -05:00
Will Deacon
50192abe02 fs/9p: avoid accessing utsname after namespace has been torn down
During trinity fuzzing in a kvmtool guest, I stumbled across the
following:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
PC is at v9fs_file_do_lock+0xc8/0x1a0
LR is at v9fs_file_do_lock+0x48/0x1a0
[<c01e2ed0>] (v9fs_file_do_lock+0xc8/0x1a0) from [<c0119154>] (locks_remove_flock+0x8c/0x124)
[<c0119154>] (locks_remove_flock+0x8c/0x124) from [<c00d9bf0>] (__fput+0x58/0x1e4)
[<c00d9bf0>] (__fput+0x58/0x1e4) from [<c0044340>] (task_work_run+0xac/0xe8)
[<c0044340>] (task_work_run+0xac/0xe8) from [<c002e36c>] (do_exit+0x6bc/0x8d8)
[<c002e36c>] (do_exit+0x6bc/0x8d8) from [<c002e674>] (do_group_exit+0x3c/0xb0)
[<c002e674>] (do_group_exit+0x3c/0xb0) from [<c002e6f8>] (__wake_up_parent+0x0/0x18)

I believe this is due to an attempt to access utsname()->nodename, after
exit_task_namespaces() has been called, leaving current->nsproxy->uts_ns
as NULL and causing the above dereference.

A similar issue was fixed for lockd in 9a1b6bf818 ("LOCKD: Don't call
utsname()->nodename from nlmclnt_setlockargs"), so this patch attempts
something similar for 9pfs.

Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2013-08-26 10:28:46 -05:00
Jaegeuk Kim
65985d935d f2fs: support the inline xattrs
0. modified inode structure
--------------------------------------
metadata (e.g., i_mtime, i_ctime, etc)
--------------------------------------
direct pointers [0 ~ 873]

inline xattrs (200 bytes by default)

indirect pointers [0 ~ 4]
--------------------------------------
node footer
--------------------------------------

1. setxattr flow
 - read_all_xattrs copies all the xattrs from inline and xattr node block.
 - handle xattr entries
 - write_all_xattrs copies modified xattrs into inline and xattr node block.

2. getxattr flow
 - read_all_xattrs copies all the xattrs from inline and xattr node block.
 - check target entries

3. Usage
 # mount -t f2fs -o inline_xattr $DEV $MNT

 Once mounted with the inline_xattr option, f2fs marks all the newly created
 files to reserve an amount of inline xattr space explicitly inside the inode
 block. Without the mount option, f2fs will not touch any existing files and
 newly created files as well.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:15:23 +09:00
Jaegeuk Kim
4f16fb0f9b f2fs: add the truncate_xattr_node function
The truncate_xattr_node function will be used by inline xattr.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:15:06 +09:00
Jaegeuk Kim
dd9cfe236f f2fs: introduce __find_xattr for readability
The __find_xattr is to search the wanted xattr entry starting from the
base_addr.

If not found, the returned entry is the last empty xattr entry that can be
allocated newly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:15:06 +09:00
Jaegeuk Kim
de93653fe3 f2fs: reserve the xattr space dynamically
This patch enables the number of direct pointers inside on-disk inode block to
be changed dynamically according to the size of inline xattr space.

The number of direct pointers, ADDRS_PER_INODE, can be changed only if the file
has inline xattr flag.

The number of direct pointers that will be used by inline xattrs is defined as
F2FS_INLINE_XATTR_ADDRS.
Current patch assigns F2FS_INLINE_XATTR_ADDRS to 0 temporarily.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:15:01 +09:00
Jaegeuk Kim
444c580f7e f2fs: add flags for inline xattrs
This patch adds basic inode flags for inline xattrs, F2FS_INLINE_XATTR,
and add a mount option, inline_xattr, which is enabled when xattr is set.

If the mount option is enabled, all the files are marked with the inline_xattrs
flag.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 20:02:12 +09:00
Wei Yongjun
6e6b978c32 f2fs: fix error return code in init_f2fs_fs()
Fix to return -ENOMEM in the kset create and add error handling
case instead of 0, as done elsewhere in this function.

Introduced by commit b59d0bae6c.
(f2fs: add sysfs support for controlling the gc_thread)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Namjae Jeon <namjae.jeon@samsung.com>
[Jaegeuk Kim: merge the patch with previous modification]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-26 19:36:46 +09:00
Linus Torvalds
4d4323ea2d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes from the last week or so"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: collect_mounts() should return an ERR_PTR
  bfs: iget_locked() doesn't return an ERR_PTR
  efs: iget_locked() doesn't return an ERR_PTR()
  proc: kill the extra proc_readfd_common()->dir_emit_dots()
  cope with potentially long ->d_dname() output for shmem/hugetlb
2013-08-25 12:25:38 -07:00
Linus Torvalds
5befb98b30 SCSI fixes on 20130824
This is a set of small bug fixes for lpfc and zfcp and a fix for a fairly
 nasty bug in sg where a process which cancels I/O completes in a kernel thread
 which would then try to write back to the now gone userspace and end up
 writing to a random kernel address instead.
 
 Signed-off-by: James Bottomley <JBottomley@Parallels.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJSGIaHAAoJEDeqqVYsXL0MbPUH/3UXceHlgYRrwYZ0C10Ao5XB
 WA8RWsDsX9UJxG68zEd8ED1aRHhmkfm4pEdMQ8DHW7+B7mvNhpb6mF0wxvmS5aIj
 OVI0G+3KmghA3aDWbTtg8ED0wJ4q3ftcyzl4Fhpat+yA4g/BW7iJNDCv17nvZ90f
 hNmdGm23wuYCid7JWNDO79spSp0q6wPJhG6ynJYOtzX1GvpEliZiGB0IOR3K44nW
 cF6+Uigs3+6RGXX9UHOMrk9Ug3YFHfok224vvydcbRXVh8uneB8RQ6tziJVFI8tE
 WPgAv2oDzly8l+ku71CqrjzG7fSwCr9Urlog9cEugE1iUmFCIQm6xJcSSnGJqaY=
 =jKT6
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "This is a set of small bug fixes for lpfc and zfcp and a fix for a
  fairly nasty bug in sg where a process which cancels I/O completes in
  a kernel thread which would then try to write back to the now gone
  userspace and end up writing to a random kernel address instead"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  [SCSI] zfcp: remove access control tables interface (keep sysfs files)
  [SCSI] zfcp: fix schedule-inside-lock in scsi_device list loops
  [SCSI] zfcp: fix lock imbalance by reworking request queue locking
  [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal
  [SCSI] lpfc: Don't force CONFIG_GENERIC_CSUM on
2013-08-24 11:33:21 -07:00
Dan Carpenter
52e220d357 VFS: collect_mounts() should return an ERR_PTR
This should actually be returning an ERR_PTR on error instead of NULL.
That was how it was designed and all the callers expect it.

[AV: actually, that's what "VFS: Make clone_mnt()/copy_tree()/collect_mounts()
return errors" missed - originally collect_mounts() was expected to return
NULL on failure]

Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-24 12:10:29 -04:00
Dan Carpenter
821ff77c6c bfs: iget_locked() doesn't return an ERR_PTR
iget_locked() returns a NULL on error, it doesn't return an ERR_PTR.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-24 12:10:22 -04:00
Dan Carpenter
136eefa48d efs: iget_locked() doesn't return an ERR_PTR()
The iget_locked() function returns NULL on error and never an ERR_PTR.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-24 12:10:22 -04:00
Oleg Nesterov
a5a1955e0c proc: kill the extra proc_readfd_common()->dir_emit_dots()
proc_readfd_common() does dir_emit_dots() twice in a row,
we need to do this only once.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-24 12:10:22 -04:00
Al Viro
118b230225 cope with potentially long ->d_dname() output for shmem/hugetlb
dynamic_dname() is both too much and too little for those - the
output may be well in excess of 64 bytes dynamic_dname() assumes
to be enough (thanks to ashmem feeding really long names to
shmem_file_setup()) and vsnprintf() is an overkill for those
guys.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-24 12:10:17 -04:00
Zhi Yong Wu
00574da199 xfs: introduce object readahead to log recovery
It can take a long time to run log recovery operation because it is
single threaded and is bound by read latency. We can find that it took
most of the time to wait for the read IO to occur, so if one object
readahead is introduced to log recovery, it will obviously reduce the
log recovery time.

Log recovery time stat:

          w/o this patch        w/ this patch

real:        0m15.023s             0m7.802s
user:        0m0.001s              0m0.001s
sys:         0m0.246s              0m0.107s

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-23 14:32:50 -05:00
Jie Liu
8d1d40832b xfs: Simplify xfs_ail_min() with list_first_entry_or_null()
At xfs_ail_min(), we do check if the AIL list is empty or not before
returning the first item in it with list_empty() and list_first_entry().

This can be simplified a bit with a new list operation routine that is
the list_first_entry_or_null() which has been introduced by:

commit 6d7581e62f
    list: introduce list_first_entry_or_null

v2: make xfs_ail_min() as a static inline function and move it to
    xfs_trans_priv.h as per Dave Chinner's comments.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-23 12:57:43 -05:00
Vyacheslav Dubeyko
4bf93b50fd nilfs2: fix issue with counting number of bio requests for BIO_EOPNOTSUPP error detection
Fix the issue with improper counting number of flying bio requests for
BIO_EOPNOTSUPP error detection case.

The sb_nbio must be incremented exactly the same number of times as
complete() function was called (or will be called) because
nilfs_segbuf_wait() will call wail_for_completion() for the number of
times set to sb_nbio:

  do {
      wait_for_completion(&segbuf->sb_bio_event);
  } while (--segbuf->sb_nbio > 0);

Two functions complete() and wait_for_completion() must be called the
same number of times for the same sb_bio_event.  Otherwise,
wait_for_completion() will hang or leak.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-23 09:51:22 -07:00
Vyacheslav Dubeyko
2df37a19c6 nilfs2: remove double bio_put() in nilfs_end_bio_write() for BIO_EOPNOTSUPP error
Remove double call of bio_put() in nilfs_end_bio_write() for the case of
BIO_EOPNOTSUPP error detection.  The issue was found by Dan Carpenter
and he suggests first version of the fix too.

Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-23 09:51:22 -07:00
Richard Weinberger
46677e679f xfs: Register hotcpu notifier after initialization
Currently the code initializizes mp->m_icsb_mutex and other things
_after_ register_hotcpu_notifier().
As the notifier takes mp->m_icsb_mutex it can happen
that it takes the lock before it's initialization.

Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-22 14:05:27 -05:00
NeilBrown
6686390bab NFS: remove incorrect "Lock reclaim failed!" warning.
After reclaiming state that was lost, the NFS client tries to reclaim
any locks, and then checks that each one has NFS_LOCK_INITIALIZED set
(which means that the server has confirmed the lock).
However if the client holds a delegation, nfs_reclaim_locks() simply aborts
(or more accurately it called nfs_lock_reclaim() and that returns without
doing anything).

This is because when a delegation is held, the server doesn't need to
know about locks.

So if a delegation is held, NFS_LOCK_INITIALIZED is not expected, and
its absence is certainly not an error.

So don't print the warnings if NFS_DELGATED_STATE is set.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 14:34:14 -04:00
Greg Kroah-Hartman
09239ed4aa sysfs: group.c: fix up kerneldoc
Fix up the wording of sysfs_create/remove_groups() a bit.

Reported-by: Anthony Foiani <tkil@scrye.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-22 09:23:28 -07:00
Mark Tinguely
3e3c51cee9 xfs: add xfs sb v4 support for dirent filetype field
Add XFS superblock v4 support for the file type field in the
directory entry feature.

This support adds a feature bit for version 4 superblocks and
leaves the original superblock 5 incompatibility bit.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-22 08:49:59 -05:00
Dave Chinner
1c55cece08 xfs: Add write support for dirent filetype field
Add support to propagate and add filetype values into the on-disk
directs. This involves passing the filetype into the xfs_da_args
structure along with the name and namelength for direct operations,
and encoding it into the dirent at the same time we write the inode
number into the dirent.

With write support, add the feature flag to the
XFS_SB_FEAT_INCOMPAT_ALL mask so we can now mount filesystems with
this feature set.

Performance of directory recursion is now much improved. Parallel
walk of ~50 million directory entries across hundreds of directories
improves significantly. Unpatched, no CRCs:

Walking via ls -R

real    3m19.886s
user    6m36.960s
sys     28m19.087s

THis is doing roughly 500 getdents() calls per second, and 250,000
inode lookups per second to determine the inode type at roughly
17,000 read IOPS. CPU usage is 90% kernel space.

With dtype support patched in and the fileset recreated with CRCs
enabled:

Walking via ls -R

real    0m31.316s
user    6m32.975s
sys     0m21.111s

This is doing roughly 3500 getdents() calls per second at 16,000
IOPS. There are no inode lookups at all. CPU usages is almost 100%
userspace.

This is a big win for recursive directory walks that only need to
find file names and file types.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-22 08:44:49 -05:00
Dave Chinner
0cb97766f2 xfs: Add read-only support for dirent filetype field
Add support for the file type field in directory entries so that
readdir can return the type of the inode the dirent points to to
userspace without first having to read the inode off disk.

The encoding of the type field is a single byte that is added to the
end of the directory entry name length. For all intents and
purposes, it appends a "hidden" byte to the name field which
contains the type information. As the directory entry is already of
dynamic size, helpers are already required to access and decode the
direct entry structures.

Hence the relevent extraction and iteration helpers are updated to
understand the hidden byte.  Helpers for reading and writing the
filetype field from the directory entries are also added. Only the
read helpers are used by this patch.  It also adds all the code
necessary to read the type information out of the dirents on disk.

Further we add the superblock feature bit and helpers to indicate
that we understand the on-disk format change. This is not a
compatible change - existing kernels cannot read the new format
successfully - so an incompatible feature flag is added. We don't
yet allow filesystems to mount with this flag yet - that will be
added once write support is added.

Finally, the code to take the type from the VFS, convert it to an
XFS on-disk type and put it into the xfs_name structures passed
around is added, but the directory code does not use this field yet.
That will be in the next patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-22 08:40:24 -05:00
Trond Myklebust
08cb47faa4 NFSv4.1: Add tracepoints for debugging test_stateid events
Add tracepoints to detect issues with the TEST_STATEID operation.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:27 -04:00
Trond Myklebust
2f92ae343e NFSv4.1: Add tracepoints for debugging slot table operations
Add tracepoints to nfs41_setup_sequence and nfs41_sequence_done
to track session and slot table state changes.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:27 -04:00
Trond Myklebust
1037e6eaa3 NFSv4.1: Add tracepoints for debugging layoutget/return/commit
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:26 -04:00
Trond Myklebust
cc668ab30b NFSv4: Add tracepoints for debugging reads and writes
Set up tracepoints to track read, write and commit, as well as
pNFS reads and writes and commits to the data server.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:26 -04:00
Trond Myklebust
b5f875a925 NFSv4: Add tracepoints for debugging getattr
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:25 -04:00
Trond Myklebust
1f2d30b533 NFSv4: Add tracepoints for debugging the idmapper
Add tracepoints to help debug uid/gid mappings to username/group.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:25 -04:00
Trond Myklebust
ca8acf8d84 NFSv4: Add tracepoints for debugging delegations
Set up tracepoints to track when delegations are set, reclaimed,
returned by the client, or recalled by the server.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:24 -04:00
Trond Myklebust
fbc6f7c233 NFSv4: Add tracepoints for debugging rename
Add tracepoints to debug renames.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:23 -04:00
Trond Myklebust
c1578b769a NFSv4: Add tracepoints for debugging inode manipulations
Set up basic tracepoints for debugging NFSv4 setattr, access,
readlink, readdir, get_acl set_acl get_security_label,
and set_security_label.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:23 -04:00
Trond Myklebust
078ea3dfe3 NFSv4: Add tracepoints for debugging lookup/create operations
Set up basic tracepoints for debugging NFSv4 lookup, unlink/remove,
symlink, mkdir, mknod, fs_locations and secinfo.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:22 -04:00
Trond Myklebust
d1b748a5e7 NFSv4: Add tracepoints for debugging file locking
Set up basic tracepoints for debugging NFSv4 file lock/unlock

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:22 -04:00
Trond Myklebust
42113a7539 NFSv4: Add tracepoints for debugging file open
Set up basic tracepoints for debugging NFSv4 file open/close

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:21 -04:00
Trond Myklebust
c6d01c6f9b NFSv4: Add tracepoints for debugging state management problems
Set up basic tracepoints for debugging client id creation/destruction
and session creation/destruction.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:21 -04:00
Trond Myklebust
1fd1085b49 NFS: Add tracepoints for debugging NFS hard links
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:20 -04:00
Trond Myklebust
70ded20170 NFS: Add tracepoints for debugging NFS rename and sillyrename issues
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:19 -04:00
Trond Myklebust
1ca42382af NFS: Add tracepoints for debugging directory changes
Add tracepoints for mknod, mkdir, rmdir, remove (unlink) and symlink.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:19 -04:00
Trond Myklebust
8b0ad3d489 NFS: Add tracepoints for debugging generic file create events
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:18 -04:00
Trond Myklebust
6e0d0be715 NFS: Add event tracing for generic NFS lookups
Add tracepoints for lookup, lookup_revalidate and atomic_open

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:18 -04:00
Trond Myklebust
1472b83eae NFS: Pass in lookup flags from nfs_atomic_open to nfs_lookup
When doing an open of a directory, ensure that we do pass the lookup flags
from nfs_atomic_open into nfs_lookup.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:17 -04:00
Trond Myklebust
f4ce1299b3 NFS: Add event tracing for generic NFS events
Add tracepoints for inode attribute updates, attribute revalidation,
writeback start/end fsync start/end, attribute change start/end,
permission check start/end.

The intention is to enable performance tracing using 'perf'as well as
improving debugging.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:17 -04:00
Trond Myklebust
1264a2f053 NFS: refactor code for calculating the crc32 hash of a filehandle
We want to be able to display the crc32 hash of the filehandle in
tracepoints.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:16 -04:00
Trond Myklebust
c2dd1378fa NFS: Clean up nfs_sillyrename()
Optimise for the case where we only do one lookup.
Clean up the code so it is obvious that silly[] is not a dynamic array.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:16 -04:00
Trond Myklebust
b8a8a0dd50 NFSv4: Fix an incorrect pointer declaration in decode_first_pnfs_layout_type
We always encode to __be32 format in XDR: silences a sparse warning.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Andy Adamson <andros@netapp.com>
2013-08-22 08:58:15 -04:00
Trond Myklebust
393faffe6f NFSv4: Deal with a sparse warning in nfs_idmap_get_key()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
2013-08-22 08:58:15 -04:00
Trond Myklebust
17f26b1246 NFSv4: Deal with some more sparse warnings
Technically, we don't really need to convert these time stamps,
since they are actually cookies.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <Chuck.Lever@oracle.com>
2013-08-22 08:58:14 -04:00
Trond Myklebust
c281fa9c1f NFSv4: Deal with a sparse warning in nfs4_opendata_alloc
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-22 08:58:13 -04:00
Trond Myklebust
a9943d11c1 NFSv3: Deal with a sparse warning in nfs3_proc_create
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-21 20:00:16 -04:00
Greg Kroah-Hartman
2c3a908b4b sysfs: sysfs.h: fix coding style issues
This fixes up the remaining coding style issues in sysfs.h

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:40:05 -07:00
Greg Kroah-Hartman
07ac62a604 sysfs: file.c: fix up broken string warnings
This fixes the coding style warnings in fs/sysfs/file.c for broken
strings across lines.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:37:42 -07:00
Greg Kroah-Hartman
37814ee0ba sysfs: dir.c: fix up odd do/while indentation
This fixes up the odd do/while after an if statement warning in dir.c

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:36:02 -07:00
Greg Kroah-Hartman
060cc749e9 sysfs: fix up uaccess.h coding style warnings
This fixes the uaccess.h warnings in the sysfs.c files.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:34:59 -07:00
Greg Kroah-Hartman
ddfd6d074e sysfs: fix up 80 column coding style issues
This fixes up the 80 column coding style issues in the sysfs .c files.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:33:34 -07:00
Greg Kroah-Hartman
1b18dc2beb sysfs: fix up space coding style issues
This fixes up all of the space-related coding style issues for the sysfs
code.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:28:26 -07:00
Greg Kroah-Hartman
ab9bf4be4d sysfs: remove trailing whitespace
This removes all trailing whitespace errors in the sysfs code.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:21:17 -07:00
Greg Kroah-Hartman
1b866757fc sysfs: fix placement of EXPORT_SYMBOL()
The export should happen after the function, not at the bottom of the
file, so fix that up.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:17:47 -07:00
Greg Kroah-Hartman
9e2a47ed64 sysfs: group: update copyright to add myself and the LF
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:14:11 -07:00
Greg Kroah-Hartman
f9ae443b5a sysfs: group.c: add kerneldoc for sysfs_remove_group
sysfs_remove_group() never had kerneldoc, so add it, and fix up the
kerneldoc for sysfs_remove_groups() which didn't specify the parameters
properly.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:12:34 -07:00
Greg Kroah-Hartman
16aebf1c5d sysfs: group.c: fix up broken string coding style
checkpatch complains about the broken string in the file, and it's
correct, so fix it up.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:10:02 -07:00
Greg Kroah-Hartman
995d8ed943 sysfs: group.c: fix up some * coding style issues
This fixes up the * coding style warnings for the group.c sysfs file.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:07:29 -07:00
Greg Kroah-Hartman
e6c56920fd sysfs: group.c: fix trailing whitespace
There was some trailing spaces in the file, fix that up.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:06:14 -07:00
Greg Kroah-Hartman
d363bc53ef sysfs: group.c: move EXPORT_SYMBOL_GPL() to the proper location
This fixes up the coding style issue of incorrectly placing the
EXPORT_SYMBOL_GPL() macro, it should be right after the function itself,
not at the end of the file.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:04:12 -07:00
Greg Kroah-Hartman
3e9b2bae83 sysfs: add sysfs_create/remove_groups()
These functions are being open-coded in 3 different places in the driver
core, and other driver subsystems will want to start doing this as well,
so move it to the sysfs core to keep it all in one place, where we know
it is written properly.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-21 16:02:19 -07:00
Roland Dreier
35dc248383 [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal
There is a nasty bug in the SCSI SG_IO ioctl that in some circumstances
leads to one process writing data into the address space of some other
random unrelated process if the ioctl is interrupted by a signal.
What happens is the following:

 - A process issues an SG_IO ioctl with direction DXFER_FROM_DEV (ie the
   underlying SCSI command will transfer data from the SCSI device to
   the buffer provided in the ioctl)

 - Before the command finishes, a signal is sent to the process waiting
   in the ioctl.  This will end up waking up the sg_ioctl() code:

		result = wait_event_interruptible(sfp->read_wait,
			(srp_done(sfp, srp) || sdp->detached));

   but neither srp_done() nor sdp->detached is true, so we end up just
   setting srp->orphan and returning to userspace:

		srp->orphan = 1;
		write_unlock_irq(&sfp->rq_list_lock);
		return result;	/* -ERESTARTSYS because signal hit process */

   At this point the original process is done with the ioctl and
   blithely goes ahead handling the signal, reissuing the ioctl, etc.

 - Eventually, the SCSI command issued by the first ioctl finishes and
   ends up in sg_rq_end_io().  At the end of that function, we run through:

	write_lock_irqsave(&sfp->rq_list_lock, iflags);
	if (unlikely(srp->orphan)) {
		if (sfp->keep_orphan)
			srp->sg_io_owned = 0;
		else
			done = 0;
	}
	srp->done = done;
	write_unlock_irqrestore(&sfp->rq_list_lock, iflags);

	if (likely(done)) {
		/* Now wake up any sg_read() that is waiting for this
		 * packet.
		 */
		wake_up_interruptible(&sfp->read_wait);
		kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN);
		kref_put(&sfp->f_ref, sg_remove_sfp);
	} else {
		INIT_WORK(&srp->ew.work, sg_rq_end_io_usercontext);
		schedule_work(&srp->ew.work);
	}

   Since srp->orphan *is* set, we set done to 0 (assuming the
   userspace app has not set keep_orphan via an SG_SET_KEEP_ORPHAN
   ioctl), and therefore we end up scheduling sg_rq_end_io_usercontext()
   to run in a workqueue.

 - In workqueue context we go through sg_rq_end_io_usercontext() ->
   sg_finish_rem_req() -> blk_rq_unmap_user() -> ... ->
   bio_uncopy_user() -> __bio_copy_iov() -> copy_to_user().

   The key point here is that we are doing copy_to_user() on a
   workqueue -- that is, we're on a kernel thread with current->mm
   equal to whatever random previous user process was scheduled before
   this kernel thread.  So we end up copying whatever data the SCSI
   command returned to the virtual address of the buffer passed into
   the original ioctl, but it's quite likely we do this copying into a
   different address space!

As suggested by James Bottomley <James.Bottomley@hansenpartnership.com>,
add a check for current->mm (which is NULL if we're on a kernel thread
without a real userspace address space) in bio_uncopy_user(), and skip
the copy if we're on a kernel thread.

There's no reason that I can think of for any caller of bio_uncopy_user()
to want to do copying on a kernel thread with a random active userspace
address space.

Huge thanks to Costa Sapuntzakis <costa@purestorage.com> for the
original pointer to this bug in the sg code.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Tested-by: David Milburn <dmilburn@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2013-08-21 10:58:35 -07:00
Chandra Seetharaman
5d5e3d5760 xfs: Add support for the Q_XGETQSTATV
For XFS, add support for Q_XGETQSTATV quotactl command.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 17:00:38 -05:00
Chandra Seetharaman
af30cb446d quota: Add a new quotactl command Q_XGETQSTATV
XFS now supports three types of quotas (user, group and project).

Current version of Q_XGETSTAT has support for only two types of quotas.
In order to support three types of quotas, the interface, specifically
struct fs_quota_stat, need to be expanded. Current version of fs_quota_stat
does not allow expansion without breaking backward compatibility.

So, a quotactl command and new fs_quota_stat structure need to be added.

This patch adds a new command Q_XGETQSTATV to quotactl() which takes
a new data structure fs_quota_statv. This new data structure provides
support for future expansion and backward compatibility.

Callers of the new quotactl command have to set the version of the data
structure being passed, and kernel will fill as much data as requested.
If the kernel does not support the user-space provided version, EINVAL
will be returned. User-space can reduce the version number and call the same
quotactl again.

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

[v2: Applied rjohnston's suggestions as per Chandra's request. -bpm]
2013-08-20 16:53:58 -05:00
Zhi Yong Wu
c2bfbc9b48 xfs: fix the comment of xfs_mountfs()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:53:07 -05:00
Zhi Yong Wu
2533787a43 xfs: fix the comment of xfs_sb_quiet_read_verify()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:51:49 -05:00
Zhi Yong Wu
8ba701ee9e xfs: fix the comment of xlog_recover_do_dquot_buffer()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:49:06 -05:00
Zhi Yong Wu
8e159e72e2 xfs: fix the comment of xfs_log_unmount_write()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:47:18 -05:00
Zhi Yong Wu
0b8182dba6 xfs: fix the comment of xfs_ifree_cluster()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:44:36 -05:00
Zhi Yong Wu
2f21ff1cca xfs: fix the comment of xfs_ialloc_ag_select()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:42:41 -05:00
Zhi Yong Wu
b3c496343b xfs: fix the comment of xfs_extent_busy_update_extent()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:41:08 -05:00
Zhi Yong Wu
8b4ad79cc6 xfs: fix the comment of xfs_setsize_buftarg_early()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:40:39 -05:00
Zhi Yong Wu
ad4809bf22 xfs: fix the comment of xfs_bmap_punch_delalloc_range()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:40:16 -05:00
Zhi Yong Wu
02bb4873db xfs: fix the comment of xfs_bmap_last_before()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:36:28 -05:00
Zhi Yong Wu
a97f4df7b5 xfs: fix the comment of xfs_bmap_validate_ret()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:35:55 -05:00
Zhi Yong Wu
8be11e92b6 xfs: fix the comment of xfs_bmap_count_tree()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:35:32 -05:00
Zhi Yong Wu
c7c1a7d8bb xfs: rename bio_add_buffer() to xfs_bio_add_buffer()
Follow up with xfs naming style.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:35:00 -05:00
Zhi Yong Wu
0a94da24b9 xfs: fix the comment of xlog_find_head()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:34:34 -05:00
Zhi Yong Wu
34be5ff378 xfs: fix the comment of xlog_recover_buffer_pass2()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:30:53 -05:00
Zhi Yong Wu
5c75390924 xfs: remove two unused macro definitions in xfs_linux.h
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:30:23 -05:00
Zhi Yong Wu
1cb9386354 xfs: fix the comment of xfs_btree_get_iroot()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:17:35 -05:00
Zhi Yong Wu
f6c2734924 xfs: fix the comment of xfs_iroot_realloc()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:17:11 -05:00
Zhi Yong Wu
7c3e664051 xfs: remove one blank line in xfs_btree_make_block_unfull()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 15:16:07 -05:00
Zhi Yong Wu
ac0e300fa5 xfs: fix the comment of xlog_write_setup_copy()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:59:31 -05:00
Zhi Yong Wu
99e738b783 xfs: fix the comment of xfs_mod_incore_sb_unlocked()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:59:05 -05:00
Zhi Yong Wu
49d3da149c xfs: fix the comment of xfs_btree_lookup()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:45:23 -05:00
Zhi Yong Wu
b46fe8259b xfs: fix the comment of xfs_buf_free()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:44:39 -05:00
Zhi Yong Wu
0471f62e38 xfs: fix the comment of xfs_check_sizes()
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-20 14:42:16 -05:00
Trond Myklebust
5948a401a7 NFS: Remove the NFSv4 "open optimisation" from nfs_permission
Ever since commit 6168f62cb (Add ACCESS operation to OPEN compound)
the NFSv4 atomic open has primed the access cache, and so nfs_permission
will no longer do an RPC call on the wire.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-20 12:29:27 -04:00
Joe Perches
8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Jaegeuk Kim
d59ff4df7b f2fs: fix wrong BUG_ON condition
This patch removes a false-alaramed BUG_ON.
The previous BUG_ON condition didn't cover the following true scenario.

In f2fs_add_link, 1) get_new_data_page gives an uptodate page successfully,
and then, 2) init_inode_metadata returns -ENOSPC.
At this moment, a new clean data page is remained in the page cache, but its
block address still indicates NEW_ADDR.
After then, even if sync is called, this clean data page cannot be written to
the disk due to the clean state.

So this means that get_lock_data_page should make a new empty page when its
block address is NEW_ADDR and its page is not uptodated.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-20 19:32:48 +09:00
Zhao Hongjiang
9890ff3f23 f2fs: fix memory leak when init f2fs filesystem fail
When any of the caches create fails in init_f2fs_fs(), the other caches which are
create successful should be free.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-20 18:58:44 +09:00
Steven Whitehouse
7286b31eab GFS2: Take glock reference in examine_bucket()
We need to check the glock ref counter in a race free way
in order to ensure that the gfs2_glock_hold() call will
succeed. The easiest way to do that is to simply take the
reference count early in the common code of examine_bucket,
skipping any glocks with zero ref count.

That means that the examiner functions all need to put their
reference on the glock once they've performed their function.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>
2013-08-20 09:35:09 +01:00
Linus Torvalds
fd3930f70c proc: more readdir conversion bug-fixes
In the previous commit, Richard Genoud fixed proc_root_readdir(), which
had lost the check for whether all of the non-process /proc entries had
been returned or not.

But that in turn exposed _another_ bug, namely that the original readdir
conversion patch had yet another problem: it had lost the return value
of proc_readdir_de(), so now checking whether it had completed
successfully or not didn't actually work right anyway.

This reinstates the non-zero return for the "end of base entries" that
had also gotten lost in commit f0c3b5093a ("[readdir] convert
procfs").  So now you get all the base entries *and* you get all the
process entries, regardless of getdents buffer size.

(Side note: the Linux "getdents" manual page actually has a nice example
application for testing getdents, which can be easily modified to use
different buffers.  Who knew? Man-pages can be useful)

Reported-by: Emmanuel Benisty <benisty.e@gmail.com>
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Cc: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-19 16:26:12 -07:00
Aruna Balakrishnaiah
3f8f80f0cf pstore/ram: Read and write to the 'compressed' flag of pstore
In pstore write, add character 'C'(compressed) or 'D'(decompressed)
in the header while writing to Ram persistent buffer. In pstore read,
read the header and update the 'compressed' flag accordingly.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 11:53:50 -07:00
Aruna Balakrishnaiah
9ad2cbe0a9 pstore: Add file extension to pstore file if compressed
In case decompression fails, add a ".enc.z" to indicate the file has
compressed data. This will help user space utilities to figure
out the file contents.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 11:53:27 -07:00
Aruna Balakrishnaiah
adb42f5e10 pstore: Add decompression support to pstore
Based on the flag 'compressed' set or not, pstore will decompress the
data returning a plain text file. If decompression fails for a particular
record it will have the compressed data in the file which can be
decompressed with 'openssl' command line tool.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 11:53:20 -07:00
Aruna Balakrishnaiah
9a4e139820 pstore: Introduce new argument 'compressed' in the read callback
Backends will set the flag 'compressed' after reading the log from
persistent store to indicate the data being returned to pstore is
compressed or not.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 10:18:11 -07:00
Aruna Balakrishnaiah
b0aad7a99c pstore: Add compression support to pstore
Add compression support to pstore which will help in capturing more data.
Initially, pstore will make a call to kmsg_dump with a bigger buffer
and will pass the size of bigger buffer to kmsg_dump and then compress
the data to registered buffer of registered size.

In case compression fails, pstore will capture the uncompressed
data by making a call again to kmsg_dump with registered_buffer
of registered size.

Pstore will indicate the data is compressed or not with a flag
in the write callback.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 10:18:11 -07:00
Aruna Balakrishnaiah
90ce4ca668 pstore/Kconfig: Select ZLIB_DEFLATE and ZLIB_INFLATE when PSTORE is selected
Pstore will make use of deflate and inflate algorithm to compress and decompress
the data. So when Pstore is enabled select zlib_deflate and zlib_inflate.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 10:18:10 -07:00
Aruna Balakrishnaiah
b3b515bbd6 pstore: Add new argument 'compressed' in pstore write callback
Addition of new argument 'compressed' in the write call back will
help the backend to know if the data passed from pstore is compressed
or not (In case where compression fails.). If compressed, the backend
can add a tag indicating the data is compressed while writing to
persistent store.

Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 10:18:10 -07:00
Dan Carpenter
c39524e674 pstore: d_alloc_name() doesn't return an ERR_PTR
d_alloc_name() returns NULL on error.  Also I changed the error code
from -ENOSPC to -ENOMEM to reflect that we were short on RAM not disk
space.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-08-19 10:13:59 -07:00
Richard Genoud
94fc5d9de5 proc: return on proc_readdir error
Commit f0c3b5093a ("[readdir] convert procfs") introduced a bug on the
listing of the proc file-system.  The return value of proc_readdir()
isn't tested anymore in the proc_root_readdir function.

This lead to an "interesting" behaviour when we are using the getdents()
system call with a buffer too small: instead of failing, it returns the
first entries of /proc (enough to fill the given buffer), plus the PID
directories.

This is not triggered on glibc (as getdents is called with a 32KB
buffer), but on uclibc, the buffer size is only 1KB, thus some proc
entries are missing.

See https://lkml.org/lkml/2013/8/12/288 for more background.

Signed-off-by: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-19 09:47:27 -07:00
Linus Torvalds
d6a5e06cd1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes
Pull gfs2 fixes from Steven Whitehouse:
 "Out of these five patches, the one for ensuring that the number of
  revokes is not exceeded, and the one for checking the glock is not
  already held in gfs2_getxattr are the two most important.  The latter
  can be triggered by selinux.

  The other three patches are very small and fix mostly fairly trivial
  issues"

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: Check for glock already held in gfs2_getxattr
  GFS2: alloc_workqueue() doesn't return an ERR_PTR
  GFS2: don't overrun reserved revokes
  GFS2: WQ_NON_REENTRANT is meaningless and going away
  GFS2: Fix typo in gfs2_create_inode()
2013-08-19 09:30:12 -07:00
Steven Whitehouse
7c0ef28a2c GFS2: Move gfs2_sync_meta to lops.c
Since gfs2_sync_meta() is only called from a single file, lets move
it to lops.c where it is used, and mark it static. At the same
time, we can clean up the meta_io.h header too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19 17:26:32 +01:00
Steven Whitehouse
7bd9ee58a4 GFS2: Check for glock already held in gfs2_getxattr
Since the introduction of atomic_open, gfs2_getxattr can be
called with the glock already held, so we need to allow for
this.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>
2013-08-19 09:33:57 +01:00
Dan Carpenter
dfc4616dde GFS2: alloc_workqueue() doesn't return an ERR_PTR
alloc_workqueue() returns a NULL on error, it doesn't return an ERR_PTR.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19 09:33:43 +01:00
Benjamin Marzinski
1bc333f4cf GFS2: don't overrun reserved revokes
When run during fsync, a gfs2_log_flush could happen between the
time when gfs2_ail_flush checked the number of blocks to revoke,
and when it actually started the transaction to do those revokes.
This occassionally caused it to need more revokes than it reserved,
causing gfs2 to crash.

Instead of just reserving enough revokes to handle the blocks that
currently need them, this patch makes gfs2_ail_flush reserve the
maximum number of revokes it can, without increasing the total number
of reserved log blocks. This patch also passes the number of reserved
revokes to __gfs2_ail_flush() so that it doesn't go over its limit
and cause a crash like we're seeing. Non-fsync calls to __gfs2_ail_flush
will still cause a BUG() necessary revokes are skipped.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19 09:33:16 +01:00
Tejun Heo
d08fa65a81 GFS2: WQ_NON_REENTRANT is meaningless and going away
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away.  Remove its usages.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
2013-08-19 09:33:01 +01:00
Steven Whitehouse
2523d47a79 GFS2: Fix typo in gfs2_create_inode()
PTR_RET should be PTR_ERR

Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19 09:32:29 +01:00
Gu Zheng
7b40527508 f2fs: fix a compound statement label error
An error "label at end of compound statement" will occur if CONFIG_F2FS_STAT_FS
disabled.
fs/f2fs/segment.c:556:1: error: label at end of compound statement
So clean up the 'out' label to fix it.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-19 11:51:08 +09:00
Jin Xu
92c4342fb7 f2fs: avoid writing inode redundantly when creating a file
In f2fs_write_inode, updating inode after f2fs_balance_fs is not
a optimized way in the case that f2fs_gc is performed ahead. The
inode page will be unnecessarily written out twice, one of which
is in f2fs_gc->...->sync_node_pages and the other is in
update_inode_page.

Let's update the inode page in prior to f2fs_balance_fs to avoid
this.

To reproduce it,
$ touch file (before this step, should make the device need f2fs_gc)
$ sync (or wait the bdi to write dirty inode)

Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-19 09:43:25 +09:00
Dan Carpenter
e27dae4d66 f2fs: alloc_page() doesn't return an ERR_PTR
alloc_page() returns a NULL on failure, it never returns an ERR_PTR.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-19 09:42:29 +09:00
Linus Torvalds
a08797e853 Two jbd2 bug fixes, one of which is a regression fix.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJSD2uJAAoJENNvdpvBGATwPwEP/R1Yu0UReTDVSmRwn8MGbk5O
 M7QwPgUVnEdqr86v4qx7kqOmKtwSQKfSr7+07UGwEBVasPPI1ezXT5c2VAwz146z
 5EClKIJoZoKIdWwM0NvUYxb7p5HgTLubho5RTsHmjOGIy37ju+5rFKZ44aHt9siY
 J3w5v1EruHdDXwjyiZcIeHFUBGV34x9zPtIGy3yY06gEPAUMeEXShiH/Uj3EroKH
 50PkwECBvqFuO4HpunhSYfNc3csTir2ZAx4judaV0s5ebyxzyZ398ql581q73mvP
 m5KREYYVLQnN9VzCnsjLsnyZboh6XoTxmlmblNVOr0LihyKLS/Nof6pvrWEqgnBf
 6NpR6Y8cPVUq+rhhXwFx/RIkVh2EIME0T5dZplr7fVaPxzI1ZKud8p18WifxMAix
 y3rbdZiEA//OuH5u5XqV0/zv3aJUP7XjdT8dia5iq3FazFOdyT1VfVXzAfaXyqZk
 JlFguSrOJbQe+vgk2gFLWLH8L4wsnGlSUv1AeDEboriushmZqYlAn10y6ENnM5yW
 IMFNA6Wyz35MghsKMKV57H7B05VS8em6NGtHZVhTBt7JoiFn7PBvic/KcK/czpDy
 W2Y8K8azhuscmQfa/RjuVDtwakbNwTu341e7hIUOy41qL/x9nNr07ZRAeF68Nesd
 p4uRmJxLSW9PReYp9n4V
 =OroQ
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull jbd2 bug fixes from Ted Ts'o:
 "Two jbd2 bug fixes, one of which is a regression fix"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: Fix oops in jbd2_journal_file_inode()
  jbd2: Fix use after free after error in jbd2_journal_dirty_metadata()
2013-08-17 10:43:19 -07:00
Jan Kara
90e775b71a ext4: fix lost truncate due to race with writeback
The following race can lead to a loss of i_disksize update from truncate
thus resulting in a wrong inode size if the inode size isn't updated
again before inode is reclaimed:

ext4_setattr()				mpage_map_and_submit_extent()
  EXT4_I(inode)->i_disksize = attr->ia_size;
  ...					  ...
					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
					  /* False because i_size isn't
					   * updated yet */
					  if (disksize > i_size_read(inode))
					  /* True, because i_disksize is
					   * already truncated */
					  if (disksize > EXT4_I(inode)->i_disksize)
					    /* Overwrite i_disksize
					     * update from truncate */
					    ext4_update_i_disksize()
  i_size_write(inode, attr->ia_size);

For other places updating i_disksize such race cannot happen because
i_mutex prevents these races. Writeback is the only place where we do
not hold i_mutex and we cannot grab it there because of lock ordering.

We fix the race by doing both i_disksize and i_size update in truncate
atomically under i_data_sem and in mpage_map_and_submit_extent() we move
the check against i_size under i_data_sem as well.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-17 10:09:31 -04:00
Jan Kara
5208386c50 ext4: simplify truncation code in ext4_setattr()
Merge conditions in ext4_setattr() handling inode size changes, also
move ext4_begin_ordered_truncate() call somewhat earlier because it
simplifies error recovery in case of failure. Also add error handling in
case i_disksize update fails.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-17 10:07:17 -04:00
Jan Kara
5f1132b2ba ext4: fix ext4_writepages() in presence of truncate
Inode size can arbitrarily change while writeback is in progress. When
ext4_writepages() has prepared a long extent for mapping and truncate
then reduces i_size, mpage_map_and_submit_buffers() will always map just
one buffer in a page instead of all of them due to lblk < blocks check.
So we end up not using all blocks we've allocated (thus leaking them)
and also delalloc accounting goes wrong manifesting as a warning like:

ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
with only 0 reserved data blocks

Note that the problem can happen only when blocksize < pagesize because
otherwise we have only a single buffer in the page.

Fix the problem by removing the size check from the mapping loop. We
have an extent allocated so we have to use it all before checking for
i_size. We also rename add_page_bufs_to_extent() to
mpage_process_page_bufs() and make that function submit the page for IO
if all buffers (upto EOF) in it are mapped.

Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-17 10:02:33 -04:00
Jan Kara
09930042a2 ext4: move test whether extent to map can be extended to one place
Currently the logic whether the current buffer can be added to an extent
of buffers to map is split between mpage_add_bh_to_extent() and
add_page_bufs_to_extent(). Move the whole logic to
mpage_add_bh_to_extent() which makes things a bit more straightforward
and make following i_size fixes easier.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-17 09:57:56 -04:00
Jan Kara
7d7345322d ext4: fix warning in ext4_da_update_reserve_space()
reaim workfile.dbase test easily triggers warning in
ext4_da_update_reserve_space():

EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
blocks with reserved 9 data blocks)

The problem is that (one of) tests creates file and then randomly writes
to it with O_SYNC. That results in writing back pages of the file in
random order so we create extents for written blocks say 0, 2, 4, 6, 8
- this last allocation also allocates new block for extents. Then we
writeout block 1 so we have extents 0-2, 4, 6, 8 and we release
indirect extent block because extents fit in the inode again. Then we
writeout block 10 and we need to allocate indirect extent block again
which triggers the warning because we don't have the reservation
anymore.

Fix the problem by giving back freed metadata blocks resulting from
extent merging into inode's reservation pool.

Signed-off-by: Jan Kara <jack@suse.cz>
2013-08-17 09:36:54 -04:00
Jan Kara
1c8924eb10 quota: provide interface for readding allocated space into reserved space
ext4 needs to convert allocated (metadata) blocks back into blocks
reserved for delayed allocation. Add functions into quota code for
supporting such operation.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-17 09:32:32 -04:00
Theodore Ts'o
19883bd965 ext4: avoid reusing recently deleted inodes in no journal mode
In no journal mode, if an inode has recently been deleted, we
shouldn't reuse it right away.  Otherwise it's possible, after an
unclean shutdown, to hit a situation where a recently deleted inode
gets reused for some other purpose before the inode table block has
been written to disk.  However, if the directory entry has been
updated, then the directory entry will be pointing at the old inode
contents.

E2fsck will make sure the file system is consistent after the
unclean shutdown.  However, if the recently deleted inode is a
character mode device, or an inode with the immutable bit set, even
after the file system has been fixed up by e2fsck, it can be
possible for a *.pyc file to be pointing at a character mode
device, and when python tries to open the *.pyc file, Hilarity
Ensues.  We could change all of userspace to be very suspicious
about stat'ing files before opening them, and clearing the
immutable flag if necessary --- or we can just avoid reusing an
inode number if it has been recently deleted.

Google-Bug-Id: 10017573

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:06:55 -04:00
Theodore Ts'o
0e20270454 ext4: allocate delayed allocation blocks before rename
When ext4_rename() overwrites an already existing file, call
ext4_alloc_da_blocks() before starting the journal handle which
actually does the rename, instead of doing this afterwards.  This
improves the likelihood that the contents will survive a crash if an
application replaces a file using the sequence:

1)  write replacement contents to foo.new
2)  <omit fsync of foo.new>
3)  rename foo.new to foo

It is still not a guarantee, since ext4_alloc_da_blocks() is *not*
doing a file integrity sync; this means if foo.new is a very large
file, it may not be completely flushed out to disk.

However, for files smaller than a megabyte or so, any dirty pages
should be flushed out before we do the rename operation, and so at the
next journal commit, the CACHE FLUSH command will make sure al of
these pages are safely on the disk platter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:06:53 -04:00
Theodore Ts'o
5b61de7575 ext4: start handle at least possible moment when renaming files
In ext4_rename(), don't start the journal handle until the the
directory entries have been successfully looked up.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:06:14 -04:00
Theodore Ts'o
7869a4a6c5 ext4: add support for extent pre-caching
Add a new fiemap flag which forces the all of the extents in an inode
to be cached in the extent_status tree.  This is critically important
when using AIO to a preallocated file, since if we need to read in
blocks from the extent tree, the io_submit(2) system call becomes
synchronous, and the AIO is no longer "A", which is bad.

In addition, for most files which have an external leaf tree block,
the cost of caching the information in the extent status tree will be
less than caching the entire 4k block in the buffer cache.  So it is
generally a win to keep the extent information cached.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 22:05:14 -04:00
Theodore Ts'o
107a7bd31a ext4: cache all of an extent tree's leaf block upon reading
When we read in an extent tree leaf block from disk, arrange to have
all of its entries cached.  In nearly all cases the in-memory
representation will be more compact than the on-disk representation in
the buffer cache, and it allows us to get the information without
having to traverse the extent tree for successive extents.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16 21:23:41 -04:00
Theodore Ts'o
3be78c7317 ext4: use unsigned int for es_status values
Don't use an unsigned long long for the es_status flags; this requires
that we pass 64-bit values around which is painful on 32-bit systems.
Instead pass the extent status flags around using the low 4 bits of an
unsigned int, and shift them into place when we are reading or writing
es_pblk.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16 21:22:41 -04:00
Theodore Ts'o
c349179b48 ext4: print the block number of invalid extent tree blocks
When we find an invalid extent tree block, report the block number of
the bad block for debugging purposes.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16 21:21:41 -04:00
Theodore Ts'o
7d7ea89e75 ext4: refactor code to read the extent tree block
Refactor out the code needed to read the extent tree block into a
single read_extent_tree_block() function.  In addition to simplifying
the code, it also makes sure that we call the ext4_ext_load_extent
tracepoint whenever we need to read an extent tree block from disk.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
2013-08-16 21:20:41 -04:00
Jan Kara
a361293f5f jbd2: Fix oops in jbd2_journal_file_inode()
Commit 0713ed0cde added
jbd2_journal_file_inode() call into ext4_block_zero_page_range().
However that function gets called from truncate path and thus inode
needn't have jinode attached - that happens in ext4_file_open() but
the file needn't be ever open since mount. Calling
jbd2_journal_file_inode() without jinode attached results in the oops.

We fix the problem by attaching jinode to inode also in ext4_truncate()
and ext4_punch_hole() when we are going to zero out partial blocks.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-08-16 21:19:41 -04:00
Linus Torvalds
2b047252d0 Fix TLB gather virtual address range invalidation corner cases
Ben Tebulin reported:

 "Since v3.7.2 on two independent machines a very specific Git
  repository fails in 9/10 cases on git-fsck due to an SHA1/memory
  failures.  This only occurs on a very specific repository and can be
  reproduced stably on two independent laptops.  Git mailing list ran
  out of ideas and for me this looks like some very exotic kernel issue"

and bisected the failure to the backport of commit 53a59fc67f ("mm:
limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

That commit itself is not actually buggy, but what it does is to make it
much more likely to hit the partial TLB invalidation case, since it
introduces a new case in tlb_next_batch() that previously only ever
happened when running out of memory.

The real bug is that the TLB gather virtual memory range setup is subtly
buggered.  It was introduced in commit 597e1c3580 ("mm/mmu_gather:
enable tlb flush range in generic mmu_gather"), and the range handling
was already fixed at least once in commit e6c495a96c ("mm: fix the TLB
range flushed when __tlb_remove_page() runs out of slots"), but that fix
was not complete.

The problem with the TLB gather virtual address range is that it isn't
set up by the initial tlb_gather_mmu() initialization (which didn't get
the TLB range information), but it is set up ad-hoc later by the
functions that actually flush the TLB.  And so any such case that forgot
to update the TLB range entries would potentially miss TLB invalidates.

Rather than try to figure out exactly which particular ad-hoc range
setup was missing (I personally suspect it's the hugetlb case in
zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
did), this patch just gets rid of the problem at the source: make the
TLB range information available to tlb_gather_mmu(), and initialize it
when initializing all the other tlb gather fields.

This makes the patch larger, but conceptually much simpler.  And the end
result is much more understandable; even if you want to play games with
partial ranges when invalidating the TLB contents in chunks, now the
range information is always there, and anybody who doesn't want to
bother with it won't introduce subtle bugs.

Ben verified that this fixes his problem.

Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-16 08:52:46 -07:00
Mats Kärrman
c23e9b75cc UBIFS: remove invalid warn msg with tst_recovery enabled
Signed-off-by: Mats Karrman <mats.karrman@tritech.se>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2013-08-16 16:42:08 +03:00
Dave Kleikamp
44512449c0 jfs: fix readdir cookie incompatibility with NFSv4
NFSv4 reserves readdir cookie values 0-2 for special entries (. and ..),
but jfs allows a value of 2 for a non-special entry. This incompatibility
can result in the nfs client reporting a readdir loop.

This patch doesn't change the value stored internally, but adds one to
the value exposed to the iterate method.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
2013-08-15 17:22:29 -05:00
Dave Chinner
2ad01f53dc xfs: use reference counts to free clean buffer items
When a transaction is cancelled and the buffer log item is clean in
the transaction, the buffer log item is unconditionally freed. If
the log item is in the AIL, however, this leads to a use after free
condition as the item still has other users.

In this case, xfs_buf_item_relse() should only be called on clean
buffer items if the reference count has dropped to zero. This
ensures only the last user frees the item.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-15 16:42:29 -05:00
Dwight Engen
8c567a7fab xfs: add capability check to free eofblocks ioctl
Check for CAP_SYS_ADMIN since the caller can truncate preallocated
blocks from files they do not own nor have write access to. A more
fine grained access check was considered: require the caller to
specify their own uid/gid and to use inode_permission to check for
write, but this would not catch the case of an inode not reachable
via path traversal from the callers mount namespace.

Add check for read-only filesystem to free eofblocks ioctl.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-15 14:25:01 -05:00
Dwight Engen
b9fe505258 xfs: create internal eofblocks structure with kuid_t types
Have eofblocks ioctl convert uid_t to kuid_t into internal structure.
Update internal filter matching to compare ids with kuid_t types.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-15 14:24:10 -05:00
Dwight Engen
7aab1b2887 xfs: convert kuid_t to/from uid_t for internal structures
Use uint32 from init_user_ns for xfs internal uid/gid
representation in xfs_icdinode, xfs_dqid_t.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-15 14:22:40 -05:00
Dwight Engen
fd5e2aa865 xfs: ioctl check for capabilities in the current user namespace
Use inode_capable() to check if SUID|SGID bits should be cleared to match
similar check in inode_change_ok().

The check for CAP_LINUX_IMMUTABLE was not modified since all other file
systems also check against init_user_ns rather than current_user_ns.

Only allow changing of projid from init_user_ns.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-15 14:19:25 -05:00
Dwight Engen
288bbe0eeb xfs: convert kuid_t to/from uid_t in ACLs
Change permission check for setting ACL to use inode_owner_or_capable()
which will additionally allow a CAP_FOWNER user in a user namespace to
be able to set an ACL on an inode covered by the user namespace mapping.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-15 14:18:31 -05:00
Dwight Engen
c5eeb7ec3e xfs: create wrappers for converting kuid_t to/from uid_t
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-15 14:17:34 -05:00
Li Wang
ad7a60de88 ceph: punch hole support
This patch implements fallocate and punch hole support for Ceph kernel client.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
2013-08-15 11:12:17 -07:00
Yan, Zheng
3871cbb9a4 ceph: fix request max size
ceph_check_caps() requests new max size only when there is Fw cap.
If we call check_max_size() while there is no Fw cap. It updates
i_wanted_max_size and calls ceph_check_caps(), but ceph_check_caps()
does nothing. Later when Fw cap is issued, we call check_max_size()
again. But i_wanted_max_size is equal to 'endoff' at this time, so
check_max_size() doesn't call ceph_check_caps() and we end up with
waiting for the new max size forever.

The fix is duplicate ceph_check_caps()'s "request max size" code in
check_max_size(), and make try_get_cap_refs() wait for the Fw cap
before retry requesting new max size.

This patch also removes the "endoff > (inode->i_size << 1)" check
in check_max_size(). It's useless because there is no corresponding
logic in ceph_check_caps().

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-15 11:12:11 -07:00
Yan, Zheng
b0d7c22310 ceph: introduce i_truncate_mutex
I encountered below deadlock when running fsstress

wmtruncate work      truncate                 MDS
---------------  ------------------  --------------------------
                   lock i_mutex
                                      <- truncate file
lock i_mutex (blocked)
                                      <- revoking Fcb (filelock to MIX)
                   send request ->
                                         handle request (xlock filelock)

At the initial time, there are some dirty pages in the page cache.
When the kclient receives the truncate message, it reduces inode size
and creates some 'out of i_size' dirty pages. wmtruncate work can't
truncate these dirty pages because it's blocked by the i_mutex. Later
when the kclient receives the cap message that revokes Fcb caps, It
can't flush all dirty pages because writepages() only flushes dirty
pages within the inode size.

When the MDS handles the 'truncate' request from kclient, it waits
for the filelock to become stable. But the filelock is stuck in
unstable state because it can't finish revoking kclient's Fcb caps.

The truncate pagecache locking has already caused lots of trouble
for use. I think it's time simplify it by introducing a new mutex.
We use the new mutex to prevent concurrent truncate_inode_pages().
There is no need to worry about race between buffered write and
truncate_inode_pages(), because our "get caps" mechanism prevents
them from concurrent execution.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-15 11:12:06 -07:00
Milosz Tanski
b150f5c1c7 ceph: cleanup the logic in ceph_invalidatepage
The invalidatepage code bails if it encounters a non-zero page offset. The
current logic that does is non-obvious with multiple if statements.

This should be logically and functionally equivalent.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-15 11:12:02 -07:00
Sage Weil
ee3e542fec Merge remote-tracking branch 'linus/master' into testing 2013-08-15 11:11:45 -07:00
yonghua zheng
8c8296223f fs/proc/task_mmu.c: fix buffer overflow in add_page_map()
Recently we met quite a lot of random kernel panic issues after enabling
CONFIG_PROC_PAGE_MONITOR.  After debuggind we found this has something
to do with following bug in pagemap:

In struct pagemapread:

  struct pagemapread {
      int pos, len;
      pagemap_entry_t *buffer;
      bool v2;
  };

pos is number of PM_ENTRY_BYTES in buffer, but len is the size of
buffer, it is a mistake to compare pos and len in add_page_map() for
checking buffer is full or not, and this can lead to buffer overflow and
random kernel panic issue.

Correct len to be total number of PM_ENTRY_BYTES in buffer.

[akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
Signed-off-by: Yonghua Zheng <younghua.zheng@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:50 -07:00
Jeff Liu
d6394b5900 ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id()
Fix a NULL pointer deference while removing an empty directory, which
was introduced by commit 3704412bdb ("[readdir] convert ocfs2").

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<(null)>]           (null)
  PGD 6da85067 PUD 6da89067 PMD 0
  Oops: 0010 [#1] SMP
  CPU: 0 PID: 6564 Comm: rmdir Tainted: G           O 3.11.0-rc1 #4
  RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
  Call Trace:
    ocfs2_dir_foreach+0x49/0x50 [ocfs2]
    ocfs2_empty_dir+0x12c/0x3e0 [ocfs2]
    ocfs2_unlink+0x56e/0xc10 [ocfs2]
    vfs_rmdir+0xd5/0x140
    do_rmdir+0x1cb/0x1e0
    SyS_rmdir+0x16/0x20
    system_call_fastpath+0x16/0x1b
  Code:  Bad RIP value.
  RIP  [<          (null)>]           (null)
  RSP <ffff88006daddc10>
  CR2: 0000000000000000

[dan.carpenter@oracle.com: fix pointer math]
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: David Weber <wb@munzinger.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:49 -07:00
Tiger Yang
c7dd3392ad ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page
Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with a NULL as
the struct file pointer, it finally result in a null pointer dereference
in ocfs2_duplicate_clusters_by_page.

This patch replace file pointer with inode pointer in
cow_duplicate_clusters to fix this issue.

[jeff.liu@oracle.com: rebased patch against linux-next tree]
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Tao Ma <tm@tao.ma>
Tested-by: David Weber <wb@munzinger.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:49 -07:00
Jie Liu
6115ea2884 ocfs2: Revert 40bd62e to avoid regression in extended allocation
Revert commit 40bd62eb7f ("fs/ocfs2/journal.h: add bits_wanted while
calculating credits in ocfs2_calc_extend_credits").

Unfortunately this change broke fallocate even if there is insufficient
disk space for the preallocation, which is a serious problem.

  # df -h
  /dev/sda8        22G  1.2G   21G   6% /ocfs2
  # fallocate -o 0 -l 200M /ocfs2/testfile
  fallocate: /ocfs2/test: fallocate failed: No space left on device

and a kernel warning:

  CPU: 3 PID: 3656 Comm: fallocate Tainted: G        W  O 3.11.0-rc3 #2
  Call Trace:
    dump_stack+0x77/0x9e
    warn_slowpath_common+0xc4/0x110
    warn_slowpath_null+0x2a/0x40
    start_this_handle+0x6c/0x640 [jbd2]
    jbd2__journal_start+0x138/0x300 [jbd2]
    jbd2_journal_start+0x23/0x30 [jbd2]
    ocfs2_start_trans+0x166/0x300 [ocfs2]
    __ocfs2_extend_allocation+0x38f/0xdb0 [ocfs2]
    ocfs2_allocate_unwritten_extents+0x3c9/0x520
    __ocfs2_change_file_space+0x5e0/0xa60 [ocfs2]
    ocfs2_fallocate+0xb1/0xe0 [ocfs2]
    do_fallocate+0x1cb/0x220
    SyS_fallocate+0x6f/0xb0
    system_call_fastpath+0x16/0x1b
  JBD2: fallocate wants too many credits (51216 > 4381)

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:49 -07:00
Michal Hocko
b610ded719 hugetlb: fix lockdep splat caused by pmd sharing
Dave has reported the following lockdep splat:

  =================================
  [ INFO: inconsistent lock state ]
  3.11.0-rc1+ #9 Not tainted
  ---------------------------------
  inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
  kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
   (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
  {RECLAIM_FS-ON-W} state was registered at:
     mark_held_locks+0x81/0xe7
     lockdep_trace_alloc+0x5e/0xbc
     __alloc_pages_nodemask+0x8b/0x9b6
     __get_free_pages+0x20/0x31
     get_zeroed_page+0x12/0x14
     __pmd_alloc+0x1c/0x6b
     huge_pmd_share+0x265/0x283
     huge_pte_alloc+0x5d/0x71
     hugetlb_fault+0x7c/0x64a
     handle_mm_fault+0x255/0x299
     __do_page_fault+0x142/0x55c
     do_page_fault+0xd/0x16
     error_code+0x6c/0x74
  irq event stamp: 3136917
  hardirqs last  enabled at (3136917):  _raw_spin_unlock_irq+0x27/0x50
  hardirqs last disabled at (3136916):  _raw_spin_lock_irq+0x15/0x78
  softirqs last  enabled at (3136180):  __do_softirq+0x137/0x30f
  softirqs last disabled at (3136175):  irq_exit+0xa8/0xaa
  other info that might help us debug this:
   Possible unsafe locking scenario:
         CPU0
         ----
    lock(&mapping->i_mmap_mutex);
    <Interrupt>
      lock(&mapping->i_mmap_mutex);

  *** DEADLOCK ***
  no locks held by kswapd0/49.

  stack backtrace:
  CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
  Hardware name: Dell Inc.                 Precision WorkStation 490    /0DT031, BIOS A08 04/25/2008
  Call Trace:
    dump_stack+0x4b/0x79
    print_usage_bug+0x1d9/0x1e3
    mark_lock+0x1e0/0x261
    __lock_acquire+0x623/0x17f2
    lock_acquire+0x7d/0x195
    mutex_lock_nested+0x6c/0x3a7
    page_referenced+0x87/0x5e3
    shrink_page_list+0x3d9/0x947
    shrink_inactive_list+0x155/0x4cb
    shrink_lruvec+0x300/0x5ce
    shrink_zone+0x53/0x14e
    kswapd+0x517/0xa75
    kthread+0xa8/0xaa
    ret_from_kernel_thread+0x1b/0x28

which is a false positive caused by hugetlb pmd sharing code which
allocates a new pmd from withing mapping->i_mmap_mutex.  If this
allocation causes reclaim then the lockdep detector complains that we
might self-deadlock.

This is not correct though, because hugetlb pages are not reclaimable so
their mapping will be never touched from the reclaim path.

The patch tells lockup detector that hugetlb i_mmap_mutex is special by
assigning it a separate lockdep class so it won't report possible
deadlocks on unrelated mappings.

[peterz@infradead.org: comment for annotation]
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:48 -07:00
Cyrill Gorcunov
41bb3476b3 mm: save soft-dirty bits on file pages
Andy reported that if file page get reclaimed we lose the soft-dirty bit
if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get
encoded into pte entry.  Thus when #pf happens on such non-present pte
we can restore it back.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:48 -07:00
Cyrill Gorcunov
179ef71cbc mm: save soft-dirty bits on swapped pages
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set
get swapped out, the bit is getting lost and no longer available when
pte read back.

To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in
pte entry for the page being swapped out.  When such page is to be read
back from a swap cache we check for bit presence and if it's there we
clear it and restore the former _PAGE_SOFT_DIRTY bit back.

One of the problem was to find a place in pte entry where we can save
the _PTE_SWP_SOFT_DIRTY bit while page is in swap.  The _PAGE_PSE was
chosen for that, it doesn't intersect with swap entry format stored in
pte.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13 17:57:47 -07:00
Dave Chinner
4bb928cdb9 xfs: split the CIL lock
The xc_cil_lock is used for two purposes - to protect the CIL
itself, and to protect the push/commit state and lists. These are
two logically separate structures and operations, so can have their
own locks. This means that pushing on the CIL and the commit wait
ordering won't contend for a lock with other transactions that are
completing concurrently. As the CIL insertion is the hottest path
throught eh CIL, this is a big win.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 16:21:21 -05:00
Dave Chinner
991aaf65ff xfs: Combine CIL insert and prepare passes
Now that all the log item preparation and formatting is done under
the CIL lock, we can get rid of the intermediate log vector chain
used to track items to be inserted into the CIL.

We can already find all the items to be committed from the
transaction handle, so as long as we attach the log vectors to the
item before we insert the items into the CIL, we don't need to
create a log vector chain to pass around.

This means we can move all the item insertion code into and optimise
it into a pair of simple passes across all the items in the
transaction. The first pass does the formatting and accounting, the
second inserts them all into the CIL.

We keep this two pass split so that we can separate the CIL
insertion - which must be done under the CIL spinlock - from the
formatting. We could insert each item into the CIL with a single
pass, but that massively increases the number of times we have to
grab the CIL spinlock. It is much more efficient (and hence
scalable) to do a batch operation and insert all objects in a single
lock grab.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 16:20:09 -05:00
Dave Chinner
f5baac354d xfs: avoid CIL allocation during insert
Now that we have the size of the log vector that has been allocated,
we can determine if we need to allocate a new log vector for
formatting and insertion. We only need to allocate a new vector if
it won't fit into the existing buffer.

However, we need to hold the CIL context lock while we do this so
that we can't race with a push draining the currently queued log
vectors. It is safe to do this as long as we do GFP_NOFS allocation
to avoid avoid memory allocation recursing into the filesystem.
Hence we can safely overwrite the existing log vector on the CIL if
it is large enough to hold all the dirty regions of the current
item.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 16:19:03 -05:00
Dave Chinner
7492c5b42d xfs: Reduce allocations during CIL insertion
Now that we have the size of the object before the formatting pass
is called, we can allocation the log vector and it's buffer in a
single allocation rather than two separate allocations.

Store the size of the allocated buffer in the log vector so that
we potentially avoid allocation for future modifications of the
object.

While touching this code, remove the IOP_FORMAT definition.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 16:12:30 -05:00
Dave Chinner
166d13688a xfs: return log item size in IOP_SIZE
To begin optimising the CIL commit process, we need to have IOP_SIZE
return both the number of vectors and the size of the data pointed
to by the vectors. This enables us to calculate the size ofthe
memory allocation needed before the formatting step and reduces the
number of memory allocations per item by one.

While there, kill the IOP_SIZE macro.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 16:10:21 -05:00
Eric Sandeen
050a1952c3 xfs:free bp in xlog_find_tail() error path
xlog_find_tail() currently leaks a bp on one error path.

There is no error target, so manually free the bp before
returning the error.

Found by Coverity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 15:49:51 -05:00
Eric Sandeen
5d0a654974 xfs: free bp in xlog_find_zeroed() error path
xlog_find_zeroed() currently leaks a bp on one error path.

Using the bp_err: target resolves this.

Found by Coverity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 15:48:51 -05:00
Eric Sandeen
6dd93e9e5e xfs: avoid double-free in xfs_attr_node_addname
xfs_attr_node_addname()'s error handling tests whether it
should free "state" in the out: error handling label:

out:
        if (state)
                xfs_da_state_free(state);

but an earlier free doesn't set state to NULL afterwards; this
could lead to a double free.  Fix it by setting state to NULL
after it's freed.

This was found by Coverity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 15:48:01 -05:00
Jie Liu
2c2bcc0735 xfs: call roundup_64() to calculate the min_logblks
Replace roundup() with roundup_64() as we calculate min_logblks
with 64-bit divisions.  Hence, call roundup() will cause the
following error while compiling a 32-bit kernel:

fs/built-in.o: In function `xfs_log_calc_minimum_size':
fs/xfs/xfs_log_rlimit.c:140: undefined reference to `__udivdi3'

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-13 14:19:11 -05:00
Jie Liu
3e7b91cf8c xfs: Validate log space at mount time
Validate log space during log mount stage, the underlying function
will drop a warning message via syslog in critical level if the log
space is too small or too large.

[ dchinner: For CRC enable filesystems, abort the mounting of the
filesystem as mkfs should never make a log too small for the given
filesystem configuration. ]

[ dchinner: make a note of the fact that the log size limits in
block counts are in units of filesystem blocks, not basic blocks. ]

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:50:35 -05:00
Jie Liu
5a96a94547 xfs: Add xfs_log_rlimit.c
Add source files for xfs_log_rlimit.c The new file is used for log
size calculations and validation shared with userspace.

[dchinner: xfs_log_calc_max_attrsetm_res() does not modify the
tr_attrsetm reservation, just calculates the maximum. ]

[dchinner: rework loop in xfs_log_get_max_trans_res() ]

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:49:38 -05:00
Jie Liu
e773fc934f xfs: Refactor xfs_ticket_alloc() to extract a new helper
Refactor xlog_ticket_alloc() to extract a new helper, i.e.
xfs_log_calc_unit_res().

This helper would be used to calculate the total log reservation
size by adding extra log operation/transation headers for a new
log ticket.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:49:02 -05:00
Jie Liu
f749278c5a xfs: Get rid of all XFS_XXX_LOG_RES() macro
Get rid of all XFS_XXX_LOG_RES() macros since they are obsoleted now.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:48:08 -05:00
Jie Liu
3d3c8b5222 xfs: refactor xfs_trans_reserve() interface
With the new xfs_trans_res structure has been introduced, the log
reservation size, log count as well as log flags are pre-initialized
at mount time.  So it's time to refine xfs_trans_reserve() interface
to be more neat.

Also, introduce a new helper M_RES() to return a pointer to the
mp->m_resv structure to simplify the input.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:47:34 -05:00
Jie Liu
783cb6d172 xfs: Make writeid transaction use tr_writeid
tr_writeid is defined at mp->m_resv structure, however, it does not
really being used when it should be..

This patch changes it to tr_writeid to fetch the correct log
reservation size.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:46:59 -05:00
Jie Liu
20996c9326 xfs: Introduce tr_fsyncts to m_reservation
A preparation step.

For now fsync_ts transaction use the pre-calculated log reservation
size of tr_swrite.  This patch introduce a new item tr_fsyncts to
mp->m_reservations structure so that we can fetch the log
reservation value for it in a same manner to others.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:46:29 -05:00
Jie Liu
0eadd10288 xfs: Introduce a new structure to hold transaction reservation items
Introduce a new structure xfs_trans_res to hold transaction
reservation item info per log ticket.

We also need to improve xfs_trans_resv_calc() by initializing the
log count as well as log flags for permanent log reservation.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:45:49 -05:00
Dave Chinner
9356fe22af xfs: make struct xfs_perag kernel only
The struct xfs_perag has many kernel-only definitions in it,
requiring a __KERNEL__ guard so userspace can use it to. Move it to
xfs_mount.h so that it it kernel-only, and let userspace redefine
it's own version of the structure containing only what it needs.
This gets rid of another __KERNEL__ check in the XFS header files.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:44:36 -05:00
Dave Chinner
4f3d71f68b xfs: move kernel specific type definitions to xfs.h
xfs_types.h is shared with userspace, so having kernel specific
types defined in it is problematic. Move all the kernel specific
defines to xfs_linux.h so we can remove the __KERNEL__ guards from
xfs_types.h.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:04:08 -05:00
Linus Torvalds
584d88b2cd Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "A set of small cifs fixes, including 3 relating to symlink handling"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: don't instantiate new dentries in readdir for inodes that need to be revalidated immediately
  cifs: set sb->s_d_op before calling d_make_root()
  cifs: fix bad error handling in crypto code
  cifs: file: initialize oparms.reconnect before using it
  Do not attempt to do cifs operations reading symlinks with SMB2
  cifs: extend the buffer length enought for sprintf() using
2013-08-12 15:02:53 -07:00
Linus Torvalds
fd4f35d0fa A number of miscellaneous ext4 bugs fixes for v3.11, including a fix
so that if ext4 is built as a module, to allow it to be unloaded.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJSCOS0AAoJENNvdpvBGATwOYkQALgOTX2ma7hgCr91ldg3Q69e
 bejRm2Prdjs3HXa3MwC5yZ0i2nDXxTjMXidnVcD2QgLwS6rPGQQfrGhQ1OtJqLxA
 Gzf6mHIWIaRG9/JtTkzp3ucONbhc13LnaJd9B2bHASlq4+5aJhlziblwapMSvVbI
 YixhejVREmXdZ/typvnQOsoPfXzH9NLOTY/asT9IcZK6flhczNIlMC0/zRZQw0+I
 pDK/akmdOLpkd57UvJcwdcGIgcioSrg+JXk++EsDe82v2bpxleG2IdLbVZ76whIS
 A3FHEJvUfiFPj+GhKMQJyI1sa1ZGmD7nH8n2Yf6TVqZ4jgFx4ox9d83c0PpSLn0+
 VTZD601lfAoThUwKRTk45FUI4siZ1zBP13cbTH1DJXvGVllehhC9zBCezlNma4vr
 cI+K3d8VBHV7bKSRSgRVcB2pY+bJE1qP5XJXGnfDhjEiecNgXFRm0i2UtYSEl6R7
 aQGAdfjxp6tf+gme6c13zFK43FEJ9/DMaC8l0eamiQyH5U4OhOD8Q4ExnHjrfN69
 V3yZc4AZq46J9+2H/Jc457NYiFH2YK74jpAcpsVV0fXKFDR8Ss+rKXmfB+TGS45f
 Q1eHbKLuA3fHzHfwJCozqnUBxQO41M4vEd/ktqenwsKzeOw0qH3qynxEUd6xgka5
 1pcnJZnSGbm1d8pwYLMc
 =NuDG
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull more ext4 bugfixes from Ted Ts'o:
 "A number of miscellaneous ext4 bugs fixes for v3.11, including a fix
  so that if ext4 is built as a module, to allow it to be unloaded"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT
  ext4: fix mount/remount error messages for incompatible mount options
  ext4: allow the mount options nodelalloc and data=journal
2013-08-12 15:00:40 -07:00
Dave Chinner
9b90b0d9da xfs: xfs_filestreams.h doesn't need __KERNEL__
Because it is only used within the kernel.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 17:00:11 -05:00
Dave Chinner
cb9eabff58 xfs: remove __KERNEL__ check from xfs_dir2_leaf.c
It's actually an ifndef section, which means it is only included in
userspace. however, it's deep within the libxfs code, so it's
unlikely that the condition checked in userspace can actually occur
(search an empty leaf) through the libxfs interfaces. i.e. if it can
happen in usrspace, it can happen in the kernel, so remove it from
userspace too....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:59:14 -05:00
Dave Chinner
b49a0c1883 xfs: remove __KERNEL__ from debug code
There is no reason the remaining kernel-only debug code needs to
remain kernel-only. Kill the __KERNEL__ part of the defines, and let
userspace handle the debug code appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:58:37 -05:00
Dave Chinner
63d20d6e36 xfs: kill __KERNEL__ check for debug code in allocation code
Userspace running debug builds is relatively rare, so there's need
to special case the allocation algorithm code coverage debug switch.
As it is, userspace defines random numbers to 0, so  invert the
logic of the switch so it is effectively a no-op in userspace.
This kills another couple of __KERNEL__ users.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:57:51 -05:00
Dave Chinner
94b406091b xfs: don't special case shared superblock mounts
Neither kernel or userspace support shared read-only mounts, so
don't bother special casing the support check to be different
between kernel and userspace. The same check can be used as neither
like it...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:57:16 -05:00
Dave Chinner
a133d952b4 xfs: consolidate extent swap code
So we don't need xfs_dfrag.h in userspace anymore, move the extent
swap ioctl structure definition to xfs_fs.h where most of the other
ioctl structure definitions are.

Now that we don't need separate files for extent swapping, separate
the basic file descriptor checking code to xfs_ioctl.c, and the code
that does the extent swap operation to xfs_bmap_util.c.  This
cleanly separates the user interface code from the physical
mechanism used to do the extent swap.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:56:06 -05:00
Dave Chinner
e546cb79ef xfs: consolidate xfs_utils.c
There are a few small helper functions in xfs_util, all related to
xfs_inode modifications. Move them all to xfs_inode.c so all
xfs_inode operations are consiolidated in the one place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:55:17 -05:00
Dave Chinner
f6bba2017a xfs: consolidate xfs_rename.c
Move the rename code to xfs_inode.c to continue consolidating
all the kernel xfs_inode operations in the one place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:54:09 -05:00
Dave Chinner
c24b5dfadc xfs: kill xfs_vnodeops.[ch]
Now we have xfs_inode.c for holding kernel-only XFS inode
operations, move all the inode operations from xfs_vnodeops.c to
this new file as it holds another set of kernel-only inode
operations. The name of this file traces back to the days of Irix
and it's vnodes which we don't have anymore.

Essentially this move consolidates the inode locking functions
and a bunch of XFS inode operations into the one file. Eventually
the high level functions will be merged into the VFS interface
functions in xfs_iops.c.

This leaves only internal preallocation, EOF block manipulation and
hole punching functions in vnodeops.c. Move these to xfs_bmap_util.c
where we are already consolidating various in-kernel physical extent
manipulation and querying functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:53:39 -05:00
Dave Chinner
836a94ad59 xfs: fix issues that cause userspace warnings
Some of the code shared with userspace causes compilation warnings
from things turned off in the kernel code, such as differences in
variable signedness. Fix those issues.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:52:54 -05:00
Dave Chinner
c5c249b424 xfs: minor cleanups
These come from syncing the shared userspace and kernel code. Small
whitespace and trivial cleanups.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:46:08 -05:00
Dave Chinner
6898811459 xfs: create xfs_bmap_util.[ch]
There is a bunch of code in xfs_bmap.c that is kernel specific and
not shared with userspace. To minimise the difference between the
kernel and userspace code, shift this unshared code to
xfs_bmap_util.c, and the declarations to xfs_bmap_util.h.

The biggest issue here is xfs_bmap_finish() - userspace has it's own
definition of this function, and so we need to move it out of
xfs_bmap.[ch]. This means several other files need to include
xfs_bmap_util.h as well.

It also introduces and interesting dance for the stack switching
code in xfs_bmapi_allocate(). The stack switching/workqueue code is
actually moved to xfs_bmap_util.c, so that userspace can simply use
a #define in a header file to connect the dots without needing to
know about the stack switch code at all.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:45:17 -05:00
Dave Chinner
ff55068c20 xfs: introduce xfs_sb.c for sharing with libxfs
xfs_mount.c is shared with userspace, but the only functions that
are shared are to do with physical superblock manipulations. This
means that less than 25% of the xfs_mount.c code is actually shared
with userspace. Move all the superblock functions to xfs_sb.c and
share that instead with libxfs.

Note that this will leave all the in-core transaction related
superblock counter modifications in xfs_mount.c as none of that is
shared with userspace. With a few more small changes, xfs_mount.h
won't need to be shared with userspace anymore, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:44:11 -05:00
Dave Chinner
1fb7e48db6 xfs: split out the remote symlink handling
The remote symlink format definition and manipulation needs to be
shared with userspace, but the in-kernel interfaces do not. Split
the remote symlink format handling out into xfs_symlink_remote.[ch]
fo it can easily be shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:43:38 -05:00
Dave Chinner
fde2227ce1 xfs: split out attribute fork truncation code into separate file
The attribute inactivation code is not used by userspace, so like
the attribute listing, split it out into a separate file to minimise
the differences between the filesystem shared with libxfs in
userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:42:30 -05:00
Dave Chinner
abec5f2bf9 xfs: split out attribute listing code into separate file
The attribute listing code is not used by userspace, so like the
directory readdir code, split it out into a separate file to
minimise the differences between the filesystem shared with libxfs
in userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:41:29 -05:00
Dave Chinner
2b9ab5ab9c xfs: reshuffle dir2 definitions around for userspace
Many of the definitions within xfs_dir2_priv.h are needed in
userspace outside libxfs. Definitions within xfs_dir2_priv.h are
wholly contained within libxfs, so we need to shuffle some of the
definitions around to keep consistency across files shared between
user and kernel space.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:40:57 -05:00
Dave Chinner
4a8af273de xfs: move getdents code into it's own file
The directory readdir code is not used by userspace, but it is
intermingled with files that are shared with userspace. This makes
it difficult to compare the differences between the userspac eand
kernel files are the userspace files don't have the getdents code in
them. Move all the kernel getdents code to a separate file to bring
the shared content between userspace and kernel files closer
together.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:39:56 -05:00
Dave Chinner
1fd7115eda xfs: introduce xfs_inode_buf.c for inode buffer operations
The only thing remaining in xfs_inode.[ch] are the operations that
read, write or verify physical inodes in their underlying buffers.
Move all this code to xfs_inode_buf.[ch] and so we can stop sharing
xfs_inode.[ch] with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:39:05 -05:00
Dave Chinner
7bb85ef360 xfs: move unrelated definitions out of xfs_inode.h
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:37:57 -05:00
Dave Chinner
5c4d97d01a xfs: move inode fork definitions to a new header file
The inode fork definitions are a combination of on-disk format
definition and in-memory tracking and manipulation. They are both
shared with userspace, so move them all into their own file so
sharing is easy to do and track.  This removes all inode fork
related information from xfs_inode.h.

Do the same for the all the C code that currently resides in
xfs_inode.c for the same reason.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:37:32 -05:00
Dave Chinner
7fd36c4418 xfs: split out transaction reservation code
The transaction reservation size calculations is used by both kernel
and userspace, but most of the transaction code in xfs_trans.c is
kernel specific. Split all the transaction reservation code out into
it's own files to make sharing with userspace simpler. This just
leaves kernel-only definitions in xfs_trans.h, so it doesn't need to
be shared with userspace anymore, either.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:36:16 -05:00
Dave Chinner
d386b32b55 xfs: sync minor header differences needed by userspace.
Little things like exported functions, __KERNEL__ protections, and
so on that ensure user and kernel shared headers are identical.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:35:41 -05:00
Dave Chinner
76456fc2a6 xfs: introduce xfs_quota_defs.h
There are a lot of quota flag definitions that are shared by user
and kernel space. Move them all to xfs_quota_defs.h so we can
unshare xfs_quota.h and remove the __KERNEL__ regions from it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:20:18 -05:00
Dave Chinner
c7298202e5 xfs: introduce xfs_rtalloc_defs.h
There are quite a few realtime device definitions shared with
userspace. Move them from xfs_rtalloc.h to xfs_rt_alloc_defs.h
so we don't need to share xfs_rtalloc.h with userspace anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:13:10 -05:00
Dave Chinner
2a3c0acc35 xfs: split out on-disk transaction definitions
There's a bunch of definitions in xfs_trans.h that define on-disk
formats - transaction headers that get written into the log, log
item type definitions, etc. Split out everything into a separate
file so that all which remains in xfs_trans.h are kernel only
definitions.

Also, remove the duplicate magic number definitions for
XFS_TRANS_MAGIC...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:11:57 -05:00
Dave Chinner
9cd047f3a3 xfs: separate icreate log format definitions from xfs_icreate_item.h
The on disk log format definitions for the icreate log item are
intertwined with the kernel-only in-memory log item definitions.
Separate the log format definitions out into their own header file
so they can easily be shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:10:35 -05:00
Dave Chinner
6ca1c9063d xfs: separate dquot on disk format definitions out of xfs_quota.h
The on disk format definitions of the on-disk dquot, log formats and
quota off log formats are all intertwined with other definitions for
quotas. Separate them out into their own header file so they can
easily be shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:09:52 -05:00
Dave Chinner
9fbe24d95e xfs: split out EFI/EFD log item format definition
The EFI/EFD item format definitions are shared with userspace. Split
the out of header files that contain kernel only defintions to make
it simple to shared them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:07:13 -05:00
Dave Chinner
a8da0da25c xfs: split out buf log item format definitions
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:06:37 -05:00
Dave Chinner
69432832fd xfs: split out inode log item format definition
The log item format definitions are shared with userspace. Split
them out of header files that contain kernel only defintions to make
it simple to shared them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:05:19 -05:00
Dave Chinner
fc06c6d064 xfs: separate out log format definitions
The on-disk format definitions for the log are spread randoms
through a couple of header files. Consolidate it all in a single
file that can be shared easily with userspace. This means that
xfs_log.h and xfs_log_priv.h no longer need to be shared with
userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-08-12 16:03:51 -05:00
David Teigland
c6ca7bc91d dlm: remove signal blocking
The signal blocking was incorrect and unnecessary
so just remove it.

Signed-off-by: David Teigland <teigland@redhat.com>
2013-08-12 15:22:43 -05:00
Jan Kara
91aa11fae1 jbd2: Fix use after free after error in jbd2_journal_dirty_metadata()
When jbd2_journal_dirty_metadata() returns error,
__ext4_handle_dirty_metadata() stops the handle. However callers of this
function do not count with that fact and still happily used now freed
handle. This use after free can result in various issues but very likely
we oops soon.

The motivation of adding __ext4_journal_stop() into
__ext4_handle_dirty_metadata() in commit 9ea7a0df seems to be only to
improve error reporting. So replace __ext4_journal_stop() with
ext4_journal_abort_handle() which was there before that commit and add
WARN_ON_ONCE() to dump stack to provide useful information.

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org	# 3.2+
2013-08-12 09:53:28 -04:00
Theodore Ts'o
cde2d7a796 ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT
Previously we weren't swapping only some of the extent_status LRU
fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl.  The
much safer thing to do is to just completely flush the extent status
tree when doing the swap.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: stable@vger.kernel.org
2013-08-12 09:29:30 -04:00
Jaegeuk Kim
479bd73ac4 f2fs: should cover i_xattr_nid with its xattr node page lock
Previously, f2fs_setxattr assigns i_xattr_nid in the inode page inconsistently.

The scenario is:

= Thread 1 =         = Thread 2 =     = fi->i_xattr_nid =  = on-disk nid =

f2fs_setxattr                                   0                 0
  new_node_page                                 X                 0
                   sync_inode_page              X                 X
                   checkpoint                   X                 X -.
    grab_cache_page                             X                 X  |
--> allocate a new xattr node block or -ENOSPC      <----------------'

At this moment, the checkpoint stores inconsistent data where the inode has
i_xattr_nid but actual xattr node block is not allocated yet.

So, we should assign the real i_xattr_nid only after its xattr node block is
allocated.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-12 16:04:53 +09:00
Jaegeuk Kim
9c02740c01 f2fs: check the free space first in new_node_page
Let's check the free space in prior to the main process of allocating a new node
page.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-12 16:00:46 +09:00
Gu Zheng
41dfde135f f2fs: clean up the needless end 'return' of void function
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-12 11:49:22 +09:00
Linus Torvalds
d92581fcad Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are assorted fixes, mostly from Josef nailing down xfstests
  runs.  Zach also has a long standing fix for problems with readdir
  wrapping f_pos (or ctx->pos)

  These patches were spread out over different bases, so I rebased
  things on top of rc4 and retested overnight"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: don't loop on large offsets in readdir
  Btrfs: check to see if root_list is empty before adding it to dead roots
  Btrfs: release both paths before logging dir/changed extents
  Btrfs: allow splitting of hole em's when dropping extent cache
  Btrfs: make sure the backref walker catches all refs to our extent
  Btrfs: fix backref walking when we hit a compressed extent
  Btrfs: do not offset physical if we're compressed
  Btrfs: fix extent buffer leak after backref walking
  Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents
  btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified
2013-08-10 15:21:47 -07:00
Linus Torvalds
b8ea0d06ff NFS client bugfixes for 3.11
- Stable patch for lockd to fix Oopses due to inappropriate calls to
   utsname()->nodename
 - Stable patches for sunrpc to fix Oopses on shutdown when using
   AF_LOCAL sockets with rpcbind
 - Fix memory leak and error checking issues in nfs4_proc_lookup_mountpoint
 - Fix a regression with the sync mount option failing to work for nfs4 mounts
 - Fix a writeback performance issue when doing cache invalidation
 - Remove an incorrect call to nfs_setsecurity in nfs_fhget
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJSBRqqAAoJEGcL54qWCgDyejoP/RNfkebEex2ugAAfymMOUNjA
 qc8X/YO8KDQD5T0X8P9MFXzgSSMPT5YfemxfKDpkjJlYwij8O9D8r1BAfgmhhKFI
 bVp9J8pH5Y3jXgLFVsubEc9N9CXLI0yRzcOFfzapYmGJV/ZO4cBAi8HMowFtbREj
 l2uk9H8XQ1p8KqVWXhFKoXVL2b9q0CGj5z3+RkE/rD5uuZyusbTB6OhNFArPnwXk
 aCx373mlvpIAhU6DkueCiXaHH06Mff2Vlu6eBkGrdfBGC6l+x1nwevxYH750fqSU
 8O94rQkw9++qnIIvBJF/g1NzqyDhychJcXtgGLdxYUdWH3c8tevJZxCEj7U/dIJQ
 ndEaZGFxSFfdnxrwJBWtB+xsEfe9K4no9JwlkyVi8oZ5j2NUv7cJpA5cdQ4IUf/1
 uqTlIxtPHQhHWCUUKpGLlhLiZyvwOPtJvuBl/Pc9UYQbyNtSqjYBk9mamrrIC6FK
 mF6jXgWe9x+miBqWYrEdPNLGdx/hUhhqGweYPJa6jTcxif+2l2xGfscWYI89Io/e
 qy8YNcHUrRci+o+YfY4lLhk88WBZogzFOYc4jDaLCEL1TE5B2k/jHr9v39V5S7Ks
 63bhmfCTxB82uMzFeUWbiPzWQzt030pXPYy/PUPaucy/+QaZC/0lZ3HJKRKI37JJ
 ygSD7ndCZJCG+PxLkk49
 =W12x
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Stable patch for lockd to fix Oopses due to inappropriate calls to
   utsname()->nodename

 - Stable patches for sunrpc to fix Oopses on shutdown when using
   AF_LOCAL sockets with rpcbind

 - Fix memory leak and error checking issues in nfs4_proc_lookup_mountpoint

 - Fix a regression with the sync mount option failing to work for nfs4
   mounts

 - Fix a writeback performance issue when doing cache invalidation

 - Remove an incorrect call to nfs_setsecurity in nfs_fhget

* tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix up nfs4_proc_lookup_mountpoint
  NFS: Remove unnecessary call to nfs_setsecurity in nfs_fhget()
  NFSv4: Fix the sync mount option for nfs4 mounts
  NFS: Fix writeback performance issue on cache invalidation
  SUNRPC: If the rpcbind channel is disconnected, fail the call to unregister
  SUNRPC: Don't auto-disconnect from the local rpcbind socket
  LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs
2013-08-10 15:20:37 -07:00
Linus Torvalds
022e5d098b Merge branch 'for-3.11' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
 "Some fixes for a 4.1 feature that in retrospect probably should have
  waited for 3.12....  But it appears to be working now"

* 'for-3.11' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID
  nfsd4: Fix MACH_CRED NULL dereference
2013-08-10 15:19:58 -07:00
Milosz Tanski
fe2a801b50 ceph: Remove bogus check in invalidatepage
The early bug checks are moot because the VMA layer ensures those things.

1. It will not call invalidatepage unless PagePrivate (or PagePrivate2) are set
2. It will not call invalidatepage without taking a PageLock first.
3. Guantrees that the inode page is mapped.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:58 -07:00
Sage Weil
2f75e9e179 ceph: replace hold_mutex flag with goto
All of the early exit paths need to drop the mutex; it is only the normal
path through the function that does not.  Skip the unlock in that case
with a goto out_unlocked.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-09 17:55:48 -07:00
majianpeng
0e5dd45ce4 ceph: Move the place for EOLDSNAPC handle in ceph_aio_write to easily understand
Only for ceph_sync_write, the osd can return EOLDSNAPC.so move the
related codes after the call ceph_sync_write.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:43 -07:00
Yan, Zheng
6f60f88947 ceph: fix freeing inode vs removing session caps race
remove_session_caps() uses iterate_session_caps() to remove caps,
but iterate_session_caps() skips inodes that are being deleted.
So session->s_nr_caps can be non-zero after iterate_session_caps()
return.

We can fix the issue by waiting until deletions are complete.
__wait_on_freeing_inode() is designed for the job, but it is not
exported, so we use lookup inode function to access it.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-09 17:55:32 -07:00
majianpeng
2fbcbff1d6 ceph: Add check returned value on func ceph_calc_ceph_pg.
Func ceph_calc_ceph_pg maybe failed.So add check for returned value.

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:21 -07:00
majianpeng
7ab9b38070 ceph: Don't use ceph-sync-mode for synchronous-fs.
Sending reads and writes through the sync read/write paths bypasses the
page cache, which is not expected or generally a good idea.  Removing
the write check is safe as there is a conditional vfs_fsync_range() later
in ceph_aio_write that already checks for the same flag (via
IS_SYNC(inode)).

Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:18 -07:00
Dan Carpenter
688bac461b ceph: cleanup types in striped_read()
We pass in a u64 value for "len" and then immediately truncate away the
upper 32 bits.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-08-09 17:55:15 -07:00
Yan, Zheng
ca20c99191 ceph: trim deleted inode
The MDS uses caps message to notify clients about deleted inode.
when receiving a such message, invalidate any alias of the inode.
This makes the kernel release the inode ASAP.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:55:10 -07:00
Yan, Zheng
85ce127a9a ceph: wake up writer if vmtruncate work get blocked
To write data, the writer first acquires the i_mutex, then try getting
caps. The writer may sleep while holding the i_mutex. If the MDS revokes
Fb cap in this case, vmtruncate work can't do its job because i_mutex
is locked. We should wake up the writer and let it truncate the pages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:54:33 -07:00
Yan, Zheng
ad88f23f42 ceph: drop CAP_LINK_SHARED when sending "link" request to MDS
To handle "link" request, the MDS need to xlock inode's linklock,
which requires revoking any CAP_LINK_SHARED.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:54:32 -07:00
Nathaniel Yazdani
c338c07c51 ceph: fix null pointer dereference
When register_session() is given an out-of-range argument for mds,
ceph_mdsmap_get_addr() will return a null pointer, which would be given to
ceph_con_open() & be dereferenced, causing a kernel oops. This fixes bug #4685
in the Ceph bug tracker <http://tracker.ceph.com/issues/4685>.

Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:52:58 -07:00
majianpeng
494ddd11be ceph: Don't forget the 'up_read(&osdc->map_sem)' if met error.
CC: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09 17:49:39 -07:00
Zach Brown
db62efbbf8 btrfs: don't loop on large offsets in readdir
When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.

But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX.  It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.

So let's set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t.   Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:34:56 -04:00
Josef Bacik
cfad392b22 Btrfs: check to see if root_list is empty before adding it to dead roots
A user reported a panic when running with autodefrag and deleting snapshots.
This is because we could end up trying to add the root to the dead roots list
twice.  To fix this check to see if we are empty before adding ourselves to the
dead roots list.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:23 -04:00
Josef Bacik
f3b15ccdbb Btrfs: release both paths before logging dir/changed extents
The ceph guys tripped over this bug where we were still holding onto the
original path that we used to copy the inode with when logging.  This is based
on Chris's fix which was reported to fix the problem.  We need to drop the paths
in two cases anyway so just move the drop up so that we don't have duplicate
code.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:16 -04:00
Josef Bacik
ee20a98314 Btrfs: allow splitting of hole em's when dropping extent cache
I noticed while running multi-threaded fsync tests that sometimes fsck would
complain about an improper gap.  This happens because we fail to add a hole
extent to the file, which was happening when we'd split a hole EM because
btrfs_drop_extent_cache was just discarding the whole em instead of splitting
it.  So this patch fixes this by allowing us to split a hole em properly, which
means that added holes actually get logged properly and we no longer see this
fsck error.  Thankfully we're tolerant of these sort of problems so a user would
not see any adverse effects of this bug, other than fsck complaining.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:09 -04:00
Josef Bacik
ed8c4913da Btrfs: make sure the backref walker catches all refs to our extent
Because we don't mess with the offset into the extent for compressed we will
properly find both extents for this case

[extent a][extent b][rest of extent a]

but because we already added a ref for the front half we won't add the inode
information for the second half.  This causes us to leak that memory and not
print out the other offset when we do logical-resolve.  So fix this by calling
ulist_add_merge and then add our eie to the existing entry if there is one.
With this patch we get both offsets out of logical-resolve.  With this and the
other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
set.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:03 -04:00
Josef Bacik
8ca15e05e6 Btrfs: fix backref walking when we hit a compressed extent
If you do btrfs inspect-internal logical-resolve on a compressed extent that has
been partly overwritten it won't find anything.  This is because we try and
match the extent offset we've searched for based on the extent offset in the
data extent entry.  However this doesn't work for compressed extents because the
offsets are for the uncompressed size, not the compressed size.  So instead only
do this check if we are not compressed, that way we can get an actual entry for
the physical offset rather than nothing for compressed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:56 -04:00
Josef Bacik
b76bb70136 Btrfs: do not offset physical if we're compressed
xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
offsetting the physical based on the extent offset.  This is perfectly fine with
uncompressed extents, however the extent offset is into the uncompressed area,
not the compressed.  So we can return a physical value that isn't at all within
the area we have allocated on disk.  Fix this by returning the start of the
extent if it is compressed no matter what the offset.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:50 -04:00
Liu Bo
b5b9b5b318 Btrfs: fix extent buffer leak after backref walking
commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks)
takes an extra increment on the reference of allocated dummy extent buffer, so now we
cannot free this dummy one, and end up with extent buffer leak.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:42 -04:00
Liu Bo
e68afa49ae Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents
For partial extents, snapshot-aware defrag does not work as expected,
since
a) we use the wrong logical offset to search for parents, which should be
   disk_bytenr + extent_offset, not just disk_bytenr,
b) 'offset' returned by the backref walking just refers to key.offset, not
   the 'offset' stored in btrfs_extent_data_ref which is
   (key.offset - extent_offset).

The reproducer:
$ mkfs.btrfs sda
$ mount sda /mnt
$ btrfs sub create /mnt/sub
$ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
$ btrfs sub snap /mnt/sub /mnt/snap1
$ btrfs sub snap /mnt/sub /mnt/snap2
$ sync; btrfs filesystem defrag /mnt/sub/foo;
$ umount /mnt
$ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared.

This addresses the above two problems.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:17 -04:00
Jie Liu
7cddc19392 btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified
Create a small file and fallocate it to a big size with
FALLOC_FL_KEEP_SIZE option, then truncate it back to the
small size again, the disk free space is not changed back
in this case. i.e,

total 4
-rw-r--r-- 1 root root 512 Jun 28 11:35 test

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G   56K  7.2G   1% /mnt

-rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G  5.1G  2.2G  70% /mnt

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G  5.1G  2.2G  70% /mnt

With this fix, the truncated up space is back as:
Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G   56K  7.2G   1% /mnt

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:28:56 -04:00
Oleg Nesterov
201d3dfa4d dlm: kill the unnecessary and wrong device_close()->recalc_sigpending()
device_close()->recalc_sigpending() is not needed, sigprocmask() takes
care of TIF_SIGPENDING correctly.

And without ->siglock it is racy and wrong, it can wrongly clear
TIF_SIGPENDING and miss a signal.

But even with this patch device_close() is still buggy:

  1. sigprocmask() should not be used, we have set_task_blocked(),
     but this is minor.

  2. We should never block SIGKILL or SIGSTOP, and this is what
     the code tries to do.

  3. This can't protect against SIGKILL or SIGSTOP anyway. Another
     thread can do signal_wake_up(), say, do_signal_stop() or
     complete_signal() or debugger.

  4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clears
     TIF_SIGPENDING, say, freezing() or ->jobctl.

  5. device_write() looks equally wrong by the same reason.

Looks like, this tries to protect some wait_event_interruptible() logic
from signals, it should be turned into uninterruptible wait.  Or we need
to implement something like signals_stop/start for such a use-case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-09 10:48:20 -07:00
Jan Kara
97a2847d06 Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeffm/linux-reiserfs into for_next_testing
Reiserfs locking fixes.
2013-08-09 10:49:58 +02:00
Paul Gortmaker
e99a03c6f5 jbd: use a single printk for jbd_debug()
Backport of jbd2 commit 169f1a2a87
("jbd2: use a single printk for jbd_debug()")

Since the jbd_debug() is implemented with two separate printk()
calls, it can lead to corrupted and misleading debug output like
the following (see lines marked with "*"):

[  290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
[  290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
[  290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
[* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
[* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
[* 290.339382] JBD2: starting commit of transaction 42104
[  290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
[  290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

i.e. the debug output from log_wait_commit and journal_commit_transaction
have become interleaved.  The output should have been:

(fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
(fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

It is expected that this is not easy to replicate -- I was only able
to cause it on preempt-rt kernels, and even then only under heavy
I/O load.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-08-09 10:49:00 +02:00
Jaegeuk Kim
d71b5564c0 f2fs: introduce cur_cp_version function to reduce code size
This patch introduces a new inline function, cur_cp_version, to reduce redundant
codes.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-09 15:25:37 +09:00
Jaegeuk Kim
e518ff81c3 f2fs: fix inconsistency between xattr node blocks and its inode
Previously xattr node blocks are stored to the COLD_NODE log, which means that
our roll-forward mechanism doesn't recover the xattr node blocks at all.
Only the direct node blocks in the WARM_NODE log can be recovered.

So, let's resolve the issue simply by conducting checkpoint during fsync when a
file has a modified xattr node block.

This approach is able to degrade the performance, but normally the checkpoint
overhead is shown at the initial fsync call after the xattr entry changes.
Once the checkpoint is done, no additional overhead would be occurred.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-09 15:25:24 +09:00
Jaegeuk Kim
dbe6a5ff4f f2fs: fix the use of XATTR_NODE_OFFSET
This patch fixes the use of XATTR_NODE_OFFSET.

o The offset should not use several MSB bits which are used by marking node
blocks.

o IS_DNODE should handle XATTR_NODE_OFFSET to avoid potential abnormality
during the fsync call.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-09 14:57:56 +09:00
Piotr Sarna
6ae6514b33 ext4: fix mount/remount error messages for incompatible mount options
Commit 5688978 ("ext4: improve handling of conflicting mount options")
introduced incorrect messages shown while choosing wrong mount options.

First of all, both cases of incorrect mount options,
"data=journal,delalloc" and "data=journal,dioread_nolock" result in
the same error message.

Secondly, the problem above isn't solved for remount option: the
mismatched parameter is simply ignored.  Moreover, ext4_msg states
that remount with options "data=journal,delalloc" succeeded, which is
not true.

To fix it up, I added a simple check after parse_options() call to
ensure that data=journal and delalloc/dioread_nolock parameters are
not present at the same time.

Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-08 23:02:24 -04:00
Theodore Ts'o
59d9fa5c2e ext4: allow the mount options nodelalloc and data=journal
Commit 26092bf ("ext4: use a table-driven handler for mount options")
wrongly disallows the specifying the mount options nodelalloc and
data=journal simultaneously.  This is incorrect; it should have only
disallowed the combination of delalloc and data=journal
simultaneously.

Reported-by: Piotr Sarna <p.sarna@partner.samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-08-08 23:01:24 -04:00
Tejun Heo
8af01f56a0 cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/
The names of the two struct cgroup_subsys_state accessors -
cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
The former clashes with the type name and the latter doesn't even
indicate it's somehow related to cgroup.

We're about to revamp large portion of cgroup API, so, let's rename
them so that they're less awkward.  Most per-controller usages of the
accessors are localized in accessor wrappers and given the amount of
scheduled changes, this isn't gonna add any noticeable headache.

Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
to task_css().  This patch is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2013-08-08 20:11:22 -04:00
Jeff Mahoney
d2d0395fd1 reiserfs: locking, release lock around quota operations
Previous commits released the write lock across quota operations but
missed several places.  In particular, the free operations can also
call into the file system code and take the write lock, causing
deadlocks.

This patch introduces some more helpers and uses them for quota call
sites.  Without this patch applied, reiserfs + quotas runs into deadlocks
under anything more than trivial load.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2013-08-08 17:34:47 -04:00
Jeff Mahoney
278f6679f4 reiserfs: locking, handle nested locks properly
The reiserfs write lock replaced the BKL and uses similar semantics.

Frederic's locking code makes a distinction between when the lock is nested
and when it's being acquired/released, but I don't think that's the right
distinction to make.

The right distinction is between the lock being released at end-of-use and
the lock being released for a schedule. The unlock should return the depth
and the lock should restore it, rather than the other way around as it is now.

This patch implements that and adds a number of places where the lock
should be dropped.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2013-08-08 17:34:46 -04:00
Jeff Mahoney
4c05141df5 reiserfs: locking, push write lock out of xattr code
The reiserfs xattr code doesn't need the write lock and sleeps all over
the place. We can simplify the locking by releasing it and reacquiring
after the xattr call.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2013-08-08 17:27:19 -04:00
Linus Torvalds
55f5bfd4c9 ext4 bugfixes for v3.11-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR+YUoAAoJENNvdpvBGATwxsgQAJzxe8x1SWW8jrvasMELW3bS
 I7gl3lu789GROmZo5MI6+XI1DSCjHcMTisoPnPGx8q9LNhiYQGjp9Loxgl+L+Cem
 wbDfkSHrhxtDoEw2rdpPb/hFHOpsv8t6TT53GVbdc5l/lxwyvdrblXxhWbLuTRCQ
 G0X2VoxvTph7SyVPpRqoU4U99vW3g6LRI0dsxyU8kLc/kp7/t89gwh6Gj0wlb6tb
 4pswvEuxK4HauMPjajCIaEeGpQoHbvCVpB4P9HvEtL/wklc5hNIAWy5+KYjMFWil
 8Vc6RsJv6DZc66z+X4PT1oYrx6bkhNKNRGBETV2/dzeLU/7o2ttUIQ0vlOlN9gKV
 xSne19OKwqhhRMuWrc3LDzunlFiLex2zHbZJVDOeKNXmEJVpO9ZPVkvpoi/g0vOE
 DPwcc89J+763CVmXu8uyPqgQRfJTp2dXNggP3lINCDi1mFayrBqR0S6rQ1B2m/6R
 8HYDjJ6gBpXBcs0FTGh1cfbpI8GpPx9Xu3Q/FtjvczEXUO3831KCm+k7O3cUjB72
 I7I3wmyYQvUanOc39fg+EZ0cXpiSIZG4Mjv+8vWgADFlCMUKMaC+vp2PuKmBxTuk
 7dAkxKr2kXfMirGWF1R/vo/CaKnCWmJbAJDFJ8Mt/Nk3zn3NOl/kCRtxndlj7k+m
 /s8DHlVLyD8dgvzRO9gp
 =xqB3
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o.

Misc ext4 fixes, delayed by Ted moving mail servers and email getting
marked as spam due to bad spf records.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: add WARN_ON to check the length of allocated blocks
  ext4: fix retry handling in ext4_ext_truncate()
  ext4: destroy ext4_es_cachep on module unload
  ext4: make sure group number is bumped after a inode allocation race
2013-08-08 09:38:19 -07:00
Andy Adamson
97431204ea NFSv4.1 Use clientid management rpc_clnt for secinfo_no_name
As per RFC 5661 Security Considerations

Commit 4edaa308 "NFS: Use "krb5i" to establish NFSv4 state whenever possible"
uses the nfs_client cl_rpcclient for all clientid management operations.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-08 11:46:25 -04:00
Andy Adamson
5ec16a8500 NFSv4.1 Use clientid management rpc_clnt for secinfo
As per RFC 3530 and RFC 5661 Security Considerations

Commit 4edaa308 "NFS: Use "krb5i" to establish NFSv4 state whenever possible"
uses the nfs_client cl_rpcclient for all clientid management operations.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-08 11:46:25 -04:00
Jaegeuk Kim
c2d715d144 f2fs: fix a build failure due to missing the kobject header
This patch should resolve the following error reported by kbuild test robot.

All error/warnings:

   In file included from fs/f2fs/dir.c:13:0:
   >> fs/f2fs/f2fs.h:435:17: error: field 's_kobj' has incomplete type
        struct kobject s_kobj;

The failure was caused by missing the kobject header file in dir.c.
So, this patch move the header file to the right location, f2fs.h.

CC: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-08 15:56:49 +09:00
Trond Myklebust
b72888cb0b NFSv4: Fix up nfs4_proc_lookup_mountpoint
Currently, we do not check the return value of client = rpc_clone_client(),
nor do we shut down the resulting cloned rpc_clnt in the case where a
NFS4ERR_WRONGSEC has caused nfs4_proc_lookup_common() to replace the
original value of 'client' (causing a memory leak).

Fix both issues and simplify the code by moving the call to
rpc_clone_client() until after nfs4_proc_lookup_common() has
done its business.

Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 20:47:26 -04:00
Benjamin LaHaise
f30d704fe1 aio: table lookup: verify ctx pointer
Another shortcoming of the table lookup patch was revealed where the pointer
was not being tested before being dereferenced.  Verify this to avoid the
NULL pointer dereference.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-08-07 18:23:48 -04:00
Trond Myklebust
eddffa4084 NFS: Remove unnecessary call to nfs_setsecurity in nfs_fhget()
We only need to call it on the creation of the inode.

Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Steve Dickson <SteveD@redhat.com>
Cc: Dave Quigley <dpquigl@davequigley.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 17:07:41 -04:00
Scott Mayhew
e890db0104 NFSv4: Fix the sync mount option for nfs4 mounts
The sync mount option stopped working for NFSv4 mounts after commit
c02d7adf8c (NFSv4: Replace nfs4_path_walk() with
FS path lookup in a private namespace).  If MS_SYNCHRONOUS is set in the
super_block that we're cloning from, then it should be set in the new
super_block as well.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 17:07:41 -04:00
Trond Myklebust
f8806c843f NFS: Fix writeback performance issue on cache invalidation
If a cache invalidation is triggered, and we happen to have a lot of
writebacks cached at the time, then the call to invalidate_inode_pages2()
will end up calling ->launder_page() on each and every dirty page in order
to sync its contents to disk, thus defeating write coalescing.
The following patch ensures that we try to sync the inode to disk before
calling invalidate_inode_pages2() so that we do the writeback as efficiently
as possible.

Reported-by: William Dauchy <william@gandi.net>
Reported-by: Pascal Bouchareine <pascal@gandi.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: William Dauchy <william@gandi.net>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2013-08-07 17:07:40 -04:00
Linus Torvalds
b7bc9e7d80 Oleg Nesterov has been working hard in closing all the holes that can
lead to race conditions between deleting an event and accessing an event
 debugfs file. This included a fix to the debugfs system (acked by
 Greg Kroah-Hartman). We think that all the holes have been patched and
 hopefully we don't find more. I haven't marked all of them for stable
 because I need to examine them more to figure out how far back some of
 the changes need to go.
 
 Along the way, some other fixes have been made. Alexander Z Lam fixed
 some logic where the wrong buffer was being modifed.
 
 Andrew Vagin found a possible corruption for machines that actually
 allocate cpumask, as a reference to one was being zeroed out by mistake.
 
 Dhaval Giani found a bad prototype when tracing is not configured.
 
 And I not only had some changes to help Oleg, but also finally fixed
 a long standing bug that Dave Jones and others have been hitting, where
 a module unload and reload can cause the function tracing accounting
 to get screwed up.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQEcBAABAgAGBQJR/7FvAAoJEOdOSU1xswtMCAgH/RpxS5mQcQnhrGfu1qnSk+3v
 CyL1X9IkvgP5f+N59tbU3ZuE8Q3YQvTII35na38fgbHbwWrhYjLSvlf3Pwtre8zr
 Rjm9b8zA6iSobHu3DyuuupzMsLE+SpBakVSmG6mi8izZyuCi0YHMZMnHTGNnW9Vv
 YFqkfGgQQWMqiUHLcueetOqWAjR2JiJaLfr47rsRLNflF6dB2MbG/7YbYJ0TBk7E
 hemVmR7JH7606eJZ6nR7bh/hYv0sL2SSqVVaOHO5nWFLFbI8CybpNVGcoD+nihZY
 M7LrZT1C1mn3nKrfDhyXy4dwwx6wBON0fY23eJTHMVNnc0tvY1xqH52vTdPILWg=
 =UfsE
 -----END PGP SIGNATURE-----

Merge tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Oleg Nesterov has been working hard in closing all the holes that can
  lead to race conditions between deleting an event and accessing an
  event debugfs file.  This included a fix to the debugfs system (acked
  by Greg Kroah-Hartman).  We think that all the holes have been patched
  and hopefully we don't find more.  I haven't marked all of them for
  stable because I need to examine them more to figure out how far back
  some of the changes need to go.

  Along the way, some other fixes have been made.  Alexander Z Lam fixed
  some logic where the wrong buffer was being modifed.

  Andrew Vagin found a possible corruption for machines that actually
  allocate cpumask, as a reference to one was being zeroed out by
  mistake.

  Dhaval Giani found a bad prototype when tracing is not configured.

  And I not only had some changes to help Oleg, but also finally fixed a
  long standing bug that Dave Jones and others have been hitting, where
  a module unload and reload can cause the function tracing accounting
  to get screwed up"

* tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix reset of time stamps during trace_clock changes
  tracing: Make TRACE_ITER_STOP_ON_FREE stop the correct buffer
  tracing: Fix trace_dump_stack() proto when CONFIG_TRACING is not set
  tracing: Fix fields of struct trace_iterator that are zeroed by mistake
  tracing/uprobes: Fail to unregister if probe event files are in use
  tracing/kprobes: Fail to unregister if probe event files are in use
  tracing: Add comment to describe special break case in probe_remove_event_call()
  tracing: trace_remove_event_call() should fail if call/file is in use
  debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs)
  ftrace: Check module functions being traced on reload
  ftrace: Consolidate some duplicate code for updating ftrace ops
  tracing: Change remove_event_file_dir() to clear "d_subdirs"->i_private
  tracing: Introduce remove_event_file_dir()
  tracing: Change f_start() to take event_mutex and verify i_private != NULL
  tracing: Change event_filter_read/write to verify i_private != NULL
  tracing: Change event_enable/disable_read() to verify i_private != NULL
  tracing: Turn event/id->i_private into call->event.type
2013-08-07 13:01:30 -07:00
Andy Adamson
bc4b2a86a5 NFSv4.1 Increase NFS4_DEF_SLOT_TABLE_SIZE
Increase NFS4_DEF_SLOT_TABLE_SIZE which is used as the client ca_maxreequests
value in CREATE_SESSION.  Current non-dynamic session slot server
implementations use the client ca_maxrequests as a maximum slot number: 64
session slots can handle most workloads.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 13:10:40 -04:00
Andy Adamson
f8407299f6 NFS Remove unused authflavour parameter from init_client
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 13:09:30 -04:00
Chuck Lever
73d8bde5e4 NFS: Never use user credentials for lease renewal
Never try to use a non-UID 0 user credential for lease management,
as that credential can change out from under us.  The server will
block NFSv4 lease recovery with NFS4ERR_CLID_INUSE.

Since the mechanism to acquire a credential for lease management
is now the same for all minor versions, replace the minor version-
specific callout with a single function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 13:06:08 -04:00
Chuck Lever
d688f7b8f6 NFS: Use root's credential for lease management when keytab is missing
Commit 05f4c350 "NFS: Discover NFSv4 server trunking when mounting"
Fri Sep 14 17:24:32 2012 introduced Uniform Client String support,
which forces our NFS client to establish a client ID immediately
during a mount operation rather than waiting until a user wants to
open a file.

Normally machine credentials (eg. from a keytab) are used to perform
a mount operation that is protected by Kerberos.  Before 05fc350,
SETCLIENTID used a machine credential, or fell back to a regular
user's credential if no keytab is available.

On clients that don't have a keytab, performing SETCLIENTID early
means there's no user credential to fall back on, since no regular
user has kinit'd yet.  05f4c350 seems to have broken the ability
to mount with sec=krb5 on clients that don't have a keytab in
kernels 3.7 - 3.10.

To address this regression, commit 4edaa308 (NFS: Use "krb5i" to
establish NFSv4 state whenever possible), Sat Mar 16 15:56:20 2013,
was merged in 3.10.  This commit forces the NFS client to fall back
to AUTH_SYS for lease management operations if no keytab is
available.

Neil Brown noticed that, since root is required to kinit to do a
sec=krb5 mount when a client doesn't have a keytab, we can try to
use root's Kerberos credential before AUTH_SYS.

Now, when determining a principal and flavor to use for lease
management, the NFS client tries in this order:

  1.  Flavor: AUTH_GSS, krb5i
      Principal: service principal (via keytab)

  2.  Flavor: AUTH_GSS, krb5i
      Principal: user principal established for UID 0 (via kinit)

  3.  Flavor: AUTH_SYS
      Principal: UID 0 / GID 0

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 13:05:10 -04:00
Trond Myklebust
6da1a03436 NFSv4: Refuse mount attempts with proto=udp
RFC3530 disallows the use of udp as a transport protocol for NFSv4.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 12:37:04 -04:00
Jeff Layton
9597c13b2f nfs: verify open flags before allowing an atomic open
Currently, you can open a NFSv4 file with O_APPEND|O_DIRECT, but cannot
fcntl(F_SETFL,...) with those flags. This flag combination is explicitly
forbidden on NFSv3 opens, and it seems like it should also be on NFSv4.

Reported-by: Chao Ye <cye@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-08-07 12:16:22 -04:00
Weston Andros Adamson
58cd57bfd9 nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID
- don't BUG_ON() when not SP4_NONE
 - calculate recv and send reserve sizes correctly

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-08-07 12:06:07 -04:00
J. Bruce Fields
c47205914c nfsd4: Fix MACH_CRED NULL dereference
Fixes a NULL-dereference on attempts to use MACH_CRED protection over
auth_sys.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-08-07 12:05:51 -04:00
Jeff Layton
757c4f6260 cifs: don't instantiate new dentries in readdir for inodes that need to be revalidated immediately
David reported that commit c2b93e06 (cifs: only set ops for inodes in
I_NEW state) caused a regression with mfsymlinks. Prior to that patch,
if a mfsymlink dentry was instantiated at readdir time, the inode would
get a new set of ops when it was revalidated. After that patch, this
did not occur.

This patch addresses this by simply skipping instantiating dentries in
the readdir codepath when we know that they will need to be immediately
revalidated. The next attempt to use that dentry will cause a new lookup
to occur (which is basically what we want to happen anyway).

Cc: <stable@vger.kernel.org>
Cc: "Stefan (metze) Metzmacher" <metze@samba.org>
Cc: Sachin Prabhu <sprabhu@redhat.com>
Reported-and-Tested-by: David McBride <dwm37@cam.ac.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-08-07 10:57:06 -05:00
Jin Xu
a569469e96 f2fs: fix a deadlock in fsync
This patch fixes a deadlock bug that occurs quite often when there are
concurrent write and fsync on a same file.

Following is the simplified call trace when tasks get hung.

fsync thread:
- f2fs_sync_file
 ...
 - f2fs_write_data_pages
 ...
  - update_extent_cache
  ...
   - update_inode
    - wait_on_page_writeback

bdi writeback thread
- __writeback_single_inode
 - f2fs_write_data_pages
  - mutex_lock(sbi->writepages)

The deadlock happens when the fsync thread waits on a inode page that has
been added to the f2fs' cached bio sbi->bio[NODE], and unfortunately,
no one else could be able to submit the cached bio to block layer for
writeback. This is because the fsync thread already hold a sbi->fs_lock and
the sbi->writepages lock, causing the bdi thread being blocked when attempt
to write data pages for the same inode. At the same time, f2fs_gc thread
does not notice the situation and could not help. Even the sync syscall
gets blocked.

To fix it, we could submit the cached bio first before waiting on a inode page
that is being written back.

Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
[Jaegeuk Kim: add more cases to use f2fs_wait_on_page_writeback]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-06 22:00:36 +09:00
Namjae Jeon
df273efc36 f2fs: remove redundant code from f2fs_write_begin
This code is being used for nobh_write_end() function.
But since now f2fs_write_end function is added so
there is no need for this code.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-06 22:00:35 +09:00
Namjae Jeon
d2dc095f42 f2fs: add sysfs entries to select the gc policy
Add sysfs entry gc_idle to control the gc policy. Where
gc_idle = 1 corresponds to selecting a cost benefit approach,
while gc_idle = 2 corresponds to selecting a greedy approach
to garbage collection. The selection is mutually exclusive one
approach will work at any point. If gc_idle = 0, then this
option is disabled.

Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: change the select_gc_type() flow slightly]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-06 22:00:18 +09:00
Namjae Jeon
b59d0bae6c f2fs: add sysfs support for controlling the gc_thread
Add sysfs entries to control the timing parameters for
f2fs gc thread.

Various Sysfs options introduced are:
gc_min_sleep_time: Min Sleep time for GC in ms
gc_max_sleep_time: Max Sleep time for GC in ms
gc_no_gc_sleep_time: Default Sleep time for GC in ms

Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: fix an umount bug and some minor changes]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-08-06 21:53:34 +09:00
Trond Myklebust
9a1b6bf818 LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs
Firstly, nlmclnt_setlockargs can be called from a reclaimer thread, in
which case we're in entirely the wrong namespace.

Secondly, commit 8aac62706a (move
exit_task_namespaces() outside of exit_notify()) now means that
exit_task_work() is called after exit_task_namespaces(), which
triggers an Oops when we're freeing up the locks.

Fix this by ensuring that we initialise the nlm_host's rpc_client at mount
time, so that the cl_nodename field is initialised to the value of
utsname()->nodename that the net namespace uses. Then replace the
lockd callers of utsname()->nodename.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Toralf Förster <toralf.foerster@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org # 3.10.x
2013-08-05 15:03:46 -04:00
Benjamin LaHaise
da90382c2e aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
In the patch "aio: convert the ioctx list to table lookup v3", incorrect
handling in the ioctx_alloc() error path was introduced that lead to an
ioctx being added via ioctx_add_table() while freed when the ioctx_alloc()
call returned -EAGAIN due to hitting the aio_max_nr limit.  Fix this by
only calling ioctx_add_table() as the last step in ioctx_alloc().

Also, several unnecessary rcu_dereference() calls were added that lead to
RCU warnings where the system was already protected by a spin lock for
accessing mm->ioctx_table.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-08-05 13:21:43 -04:00
Zheng Liu
3d62c45b38 vfs: add missing check for __O_TMPFILE in fcntl_init()
As comment in include/uapi/asm-generic/fcntl.h described, when
introducing new O_* bits, we need to check its uniqueness in
fcntl_init().  But __O_TMPFILE bit is missing.  So fix it.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-05 18:25:32 +04:00
Andy Lutomirski
bb2314b479 fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink
Every now and then someone proposes a new flink syscall, and this spawns
a long discussion of whether it would be a security problem.  I think
that this is missing the point: flink is *already* allowed without
privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.

Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
write it, and link it in is very convenient.  The only problem is that
it requires that /proc be mounted so that you can do:

linkat(AT_FDCWD, "/proc/self/fd/<tmpfd>", dfd, path, AT_SYMLINK_NOFOLLOW)

This sucks -- it's much nicer to do:

linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)

Let's allow it.

If this turns out to be excessively scary, it we could instead require
that the inode in question be I_LINKABLE, but this seems pointless given
the /proc situation

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-05 18:24:11 +04:00
Andy Lutomirski
e305f48bc4 fs: Fix file mode for O_TMPFILE
O_TMPFILE, like O_CREAT, should respect the requested mode and should
create regular files.

This fixes two bugs: O_TMPFILE required privilege (because the mode
ended up as 000) and it produced bogus inodes with no type.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-05 18:24:10 +04:00
Al Viro
672fe15d09 reiserfs: fix deadlock in umount
Since remove_proc_entry() started to wait for IO in progress (i.e.
since 2007 or so), the locking in fs/reiserfs/proc.c became wrong;
if procfs read happens between the moment when umount() locks the
victim superblock and removal of /proc/fs/reiserfs/<device>/*,
we'll get a deadlock - read will wait for s_umount (in sget(),
called by r_start()), while umount will wait in remove_proc_entry()
for that read to finish, holding s_umount all along.

Fortunately, the same change allows a much simpler race avoidance -
all we need to do is remove the procfs entries in the very beginning
of reiserfs ->kill_sb(); that'll guarantee that pointer to superblock
will remain valid for the duration for procfs IO, so we don't need
sget() to keep the sucker alive.  As the matter of fact, we can
get rid of the home-grown iterator completely, and use single_open()
instead.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-05 17:37:37 +04:00
Paul Gortmaker
a91de85247 jbd: relocate assert after state lock in journal_commit_transaction()
The state lock is taken after we are doing an assert on the state
value, not before.  So we might in fact be doing an assert on a
transient value.  Ensure the state check is within the scope of
the state lock being taken.

Backport of jbd2 commit 3ca841c106
("jbd2: relocate assert after state lock in journal_commit_transaction()")

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-08-01 23:25:38 +02:00
Gu Zheng
62c610460d ocfs2/refcounttree: add the missing NULL check of the return value of find_or_create_page()
Add the missing NULL check of the return value of find_or_create_page() in
function ocfs2_duplicate_clusters_by_page().

[akpm@linux-foundation.org: fix layout, per Joel]
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31 14:41:02 -07:00
Jan Kara
e729eac6f6 udf: Refuse RW mount of the filesystem instead of making it RO
Refuse RW mount of udf filesystem. So far we just silently changed it
to RO mount but when the media is writeable, block layer won't notice
this change and thus will think device is used RW and will block eject
button of the drive. That is unexpected by users because for
non-writeable media eject button works just fine.

Userspace mount(8) command handles this just fine and retries mounting
with MS_RDONLY set so userspace shouldn't see any regression.  Plus any
tool mounting udf is likely confronted with the case of read-only
media where block layer already refuses to mount the filesystem without
MS_RDONLY set so our behavior shouldn't be anything new for it.

Reported-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-07-31 22:14:51 +02:00
Jan Kara
d759bfa4e7 udf: Standardize return values in mount sequence
Change all function used in filesystem discovery during mount to user
standard kernel return values - -errno on error, 0 on success instead
of 1 on failure and 0 on success. This allows us to pass error number
(not just failure / success) so we can abort device scanning earlier
in case of errors like EIO or ENOMEM . Also we will be able to return
EROFS in case writeable mount is requested but writing isn't supported.

Signed-off-by: Jan Kara <jack@suse.cz>
2013-07-31 22:14:50 +02:00
Jan Kara
17b7f7cf58 isofs: Refuse RW mount of the filesystem instead of making it RO
Refuse RW mount of isofs filesystem. So far we just silently changed it
to RO mount but when the media is writeable, block layer won't notice
this change and thus will think device is used RW and will block eject
button of the drive. That is unexpected by users because for
non-writeable media eject button works just fine.

Userspace mount(8) command handles this just fine and retries mounting
with MS_RDONLY set so userspace shouldn't see any regression.  Plus any
tool mounting isofs is likely confronted with the case of read-only
media where block layer already refuses to mount the filesystem without
MS_RDONLY set so our behavior shouldn't be anything new for it.

Reported-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-07-31 22:14:50 +02:00
Eric Sandeen
cf7eff4666 ext3: allow specifying external journal by pathname mount option
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.

The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.

Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do i.e.

# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...

and it'll mount even if the devnum has changed, as shown here:

# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0
# mkfs.ext3 -L mylabel -J device=/dev/loop0 /dev/sdb1

Change the journal device number:

# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile

And today it will fail:

# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

# dmesg | tail -n 1
[17343.240702] EXT3-fs (sdb1): error: couldn't read superblock of external journal

But with this new mount option, we can specify the new path:

# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#

(which does update the encoded device number, incidentally):

# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device:	          0x0701

But best of all we can just always mount by journal-path, and
it'll always work:

# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#

So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-07-31 22:11:15 +02:00
Jeff Layton
66ffd113f5 cifs: set sb->s_d_op before calling d_make_root()
Currently, the s_root dentry doesn't get its d_op pointer set to
anything. This breaks lookups in the root of case-insensitive mounts
since that relies on having d_hash and d_compare routines that know to
treat the filename as case-insensitive.

cifs.ko has been broken this way for a long time, but commit 1c929cfe6
("switch cifs"), added a cryptic comment which is removed in the patch
below, which makes me wonder if this was done deliberately for some
reason. It's not clear to me why we'd want the s_root not to have d_op
set properly.

It may have something to do with d_automount or d_revalidate on the
root, but my suspicion in looking over the code is that Al was just
trying to preserve the existing behavior when changing this code over to
use s_d_op.

This patch changes it so that we set s_d_op before calling d_make_root
and removes the comment. I tested mounting, accessing and unmounting
several types of shares (including DFS referrals) and everything still
seemed to work OK afterward. I could be missing something however, so
please do let me know if I am.

Reported-by: Jan-Marek Glogowski <glogow@fbihome.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-07-31 13:45:02 -05:00
Jeff Layton
ba48202932 cifs: fix bad error handling in crypto code
Jarod reported an Oops like when testing with fips=1:

CIFS VFS: could not allocate crypto hmacmd5
CIFS VFS: could not crypto alloc hmacmd5 rc -2
CIFS VFS: Error -2 during NTLMSSP authentication
CIFS VFS: Send error in SessSetup = -2
BUG: unable to handle kernel NULL pointer dereference at 000000000000004e
IP: [<ffffffff812b5c7a>] crypto_destroy_tfm+0x1a/0x90
PGD 0
Oops: 0000 [#1] SMP
Modules linked in: md4 nls_utf8 cifs dns_resolver fscache kvm serio_raw virtio_balloon virtio_net mperf i2c_piix4 cirrus drm_kms_helper ttm drm i2c_core virtio_blk ata_generic pata_acpi
CPU: 1 PID: 639 Comm: mount.cifs Not tainted 3.11.0-0.rc3.git0.1.fc20.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff88007bf496e0 ti: ffff88007b080000 task.ti: ffff88007b080000
RIP: 0010:[<ffffffff812b5c7a>]  [<ffffffff812b5c7a>] crypto_destroy_tfm+0x1a/0x90
RSP: 0018:ffff88007b081d10  EFLAGS: 00010282
RAX: 0000000000001f1f RBX: ffff880037422000 RCX: ffff88007b081fd8
RDX: 000000000000001f RSI: 0000000000000006 RDI: fffffffffffffffe
RBP: ffff88007b081d30 R08: ffff880037422000 R09: ffff88007c090100
R10: 0000000000000000 R11: 00000000fffffffe R12: fffffffffffffffe
R13: ffff880037422000 R14: ffff880037422000 R15: 00000000fffffffe
FS:  00007fc322f4f780(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000004e CR3: 000000007bdaa000 CR4: 00000000000006e0
Stack:
 ffffffff81085845 ffff880037422000 ffff8800375e7400 ffff880037422000
 ffff88007b081d48 ffffffffa0176022 ffff880037422000 ffff88007b081d60
 ffffffffa015c07b ffff880037600600 ffff88007b081dc8 ffffffffa01610e1
Call Trace:
 [<ffffffff81085845>] ? __cancel_work_timer+0x75/0xf0
 [<ffffffffa0176022>] cifs_crypto_shash_release+0x82/0xf0 [cifs]
 [<ffffffffa015c07b>] cifs_put_tcp_session+0x8b/0xe0 [cifs]
 [<ffffffffa01610e1>] cifs_mount+0x9d1/0xad0 [cifs]
 [<ffffffffa014ff50>] cifs_do_mount+0xa0/0x4d0 [cifs]
 [<ffffffff811ab6e9>] mount_fs+0x39/0x1b0
 [<ffffffff811c466f>] vfs_kern_mount+0x5f/0xf0
 [<ffffffff811c6a9e>] do_mount+0x23e/0xa20
 [<ffffffff811c66e6>] ? copy_mount_options+0x36/0x170
 [<ffffffff811c7303>] SyS_mount+0x83/0xc0
 [<ffffffff8165c8d9>] system_call_fastpath+0x16/0x1b
Code: eb 9e 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 49 89 fc 53 48 83 ec 08 48 85 ff 74 46 <48> 83 7e 48 00 48 8b 5e 50 74 4b 48 89 f7 e8 83 fc ff ff 4c 8b
RIP  [<ffffffff812b5c7a>] crypto_destroy_tfm+0x1a/0x90
 RSP <ffff88007b081d10>
CR2: 000000000000004e

The cifs code allocates some crypto structures. If that fails, it
returns an error, but it leaves the pointers set to their PTR_ERR
values. Then later when it tries to clean up, it sees that those values
are non-NULL and then passes them to the routine that frees them.

Fix this by setting the pointers to NULL after collecting the error code
in this situation.

Cc: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-07-31 13:44:59 -05:00
Oleg Nesterov
776164c1fa debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs)
debugfs_remove_recursive() is wrong,

1. it wrongly assumes that !list_empty(d_subdirs) means that this
   dir should be removed.

   This is not that bad by itself, but:

2. if d_subdirs does not becomes empty after __debugfs_remove()
   it gives up and silently fails, it doesn't even try to remove
   other entries.

   However ->d_subdirs can be non-empty because it still has the
   already deleted !debugfs_positive() entries.

3. simple_release_fs() is called even if __debugfs_remove() fails.

Suppose we have

	dir1/
		dir2/
			file2
		file1

and someone opens dir1/dir2/file2.

Now, debugfs_remove_recursive(dir1/dir2) succeeds, and dir1/dir2 goes
away.

But debugfs_remove_recursive(dir1) silently fails and doesn't remove
this directory. Because it tries to delete (the already deleted)
dir1/dir2/file2 again and then fails due to "Avoid infinite loop"
logic.

Test-case:

	#!/bin/sh

	cd /sys/kernel/debug/tracing
	echo 'p:probe/sigprocmask sigprocmask' >> kprobe_events
	sleep 1000 < events/probe/sigprocmask/id &
	echo -n >| kprobe_events

	[ -d events/probe ] && echo "ERR!! failed to rm probe"

And after that it is not possible to create another probe entry.

With this patch debugfs_remove_recursive() skips !debugfs_positive()
files although this is not strictly needed. The most important change
is that it does not try to make ->d_subdirs empty, it simply scans
the whole list(s) recursively and removes as much as possible.

Link: http://lkml.kernel.org/r/20130726151256.GC19472@redhat.com

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2013-07-31 12:16:31 -04:00
Benjamin LaHaise
6878ea72a5 aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
In the event that an overflow/underflow occurs while calculating req_batch,
clamp the minimum at 1 request instead of doing a BUG_ON().

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-31 10:34:18 -04:00
Dan Carpenter
f0c5e565bb f2fs: remove an unneeded kfree(NULL)
This kfree() is no longer needed after a79dc083d7 "f2fs: move
bio_private allocation out of f2fs_bio_alloc()".  The "bio->bi_private"
is NULL here so it's a no-op.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-07-31 19:07:01 +09:00
Andi Shyti
fe090e4e44 cifs: file: initialize oparms.reconnect before using it
In the cifs_reopen_file function, if the following statement is
asserted:

(tcon->unix_ext && cap_unix(tcon->ses) &&
            (CIFS_UNIX_POSIX_PATH_OPS_CAP &
            (tcon->fsUnixInfo.Capability)))

and we succeed to open with cifs_posix_open, the function jumps
to the label reopen_success and checks for oparms.reconnect
which is not initialized.

This issue has been reported by scan.coverity.com

Signed-off-by: Andi Shyti <andi@etezian.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-07-30 23:54:49 -05:00
Steve French
1b244081af Do not attempt to do cifs operations reading symlinks with SMB2
When use of symlinks is enabled (mounting with mfsymlinks option) to
non-Samba servers, we always tried to use cifs, even when we
were mounted with SMB2 or SMB3, which causes the server to drop the
network connection.

This patch separates out the protocol specific operations for cifs from
the code which recognizes symlinks, and fixes the problem where
with SMB2 mounts we attempt cifs operations to open and read
symlinks.  The next patch will add support for SMB2 for opening
and reading symlinks.  Additional followon patches will address
the similar problem creating symlinks.

Signed-off-by: Steve French <smfrench@gmail.com>
2013-07-30 23:54:45 -05:00
Chen Gang
057d6332b2 cifs: extend the buffer length enought for sprintf() using
For cifs_set_cifscreds() in "fs/cifs/connect.c", 'desc' buffer length
is 'CIFSCREDS_DESC_SIZE' (56 is less than 256), and 'ses->domainName'
length may be "255 + '\0'".

The related sprintf() may cause memory overflow, so need extend related
buffer enough to hold all things.

It is also necessary to be sure of 'ses->domainName' must be less than
256, and define the related macro instead of hard code number '256'.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Scott Lovenberg <scott.lovenberg@gmail.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-07-30 23:54:40 -05:00
Tejun Heo
7a378c9aea xfs: WQ_NON_REENTRANT is meaningless and going away
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away.  Remove its usages.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: xfs@oss.sgi.com
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2013-07-30 13:11:17 -05:00
Benjamin LaHaise
db446a08c2 aio: convert the ioctx list to table lookup v3
On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
> On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
> > When using a large number of threads performing AIO operations the
> > IOCTX list may get a significant number of entries which will cause
> > significant overhead. For example, when running this fio script:
> >
> > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=512; thread; loops=100
> >
> > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
> > 30% CPU time spent by lookup_ioctx:
> >
> >  32.51%  [guest.kernel]  [g] lookup_ioctx
> >   9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
> >   4.40%  [guest.kernel]  [g] lock_release
> >   4.19%  [guest.kernel]  [g] sched_clock_local
> >   3.86%  [guest.kernel]  [g] local_clock
> >   3.68%  [guest.kernel]  [g] native_sched_clock
> >   3.08%  [guest.kernel]  [g] sched_clock_cpu
> >   2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
> >   2.60%  [guest.kernel]  [g] memcpy
> >   2.33%  [guest.kernel]  [g] lock_acquired
> >   2.25%  [guest.kernel]  [g] lock_acquire
> >   1.84%  [guest.kernel]  [g] do_io_submit
> >
> > This patchs converts the ioctx list to a radix tree. For a performance
> > comparison the above FIO script was run on a 2 sockets 8 core
> > machine. This are the results (average and %rsd of 10 runs) for the
> > original list based implementation and for the radix tree based
> > implementation:
> >
> > cores         1         2         4         8         16        32
> > list       109376 ms  69119 ms  35682 ms  22671 ms  19724 ms  16408 ms
> > %rsd         0.69%      1.15%     1.17%     1.21%     1.71%     1.43%
> > radix       73651 ms  41748 ms  23028 ms  16766 ms  15232 ms   13787 ms
> > %rsd         1.19%      0.98%     0.69%     1.13%    0.72%      0.75%
> > % of radix
> > relative    66.12%     65.59%    66.63%    72.31%   77.26%     83.66%
> > to list
> >
> > To consider the impact of the patch on the typical case of having
> > only one ctx per process the following FIO script was run:
> >
> > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=1; thread; loops=100
> >
> > on the same system and the results are the following:
> >
> > list        58892 ms
> > %rsd         0.91%
> > radix       59404 ms
> > %rsd         0.81%
> > % of radix
> > relative    100.87%
> > to list
>
> So, I was just doing some benchmarking/profiling to get ready to send
> out the aio patches I've got for 3.11 - and it looks like your patch is
> causing a ~1.5% throughput regression in my testing :/
... <snip>

I've got an alternate approach for fixing this wart in lookup_ioctx()...
Instead of using an rbtree, just use the reserved id in the ring buffer
header to index an array pointing the ioctx.  It's not finished yet, and
it needs to be tidied up, but is most of the way there.

		-ben
--
"Thought is the essence of where you are now."
--
kmo> And, a rework of Ben's code, but this was entirely his idea
kmo>		-Kent

bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
free memory.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 12:56:36 -04:00
Benjamin LaHaise
4cd81c3dfc aio: double aio_max_nr in calculations
With the changes to use percpu counters for aio event ring size calculation,
existing increases to aio_max_nr are now insufficient to allow for the
allocation of enough events.  Double the value used for aio_max_nr to account
for the doubling introduced by the percpu slack.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 12:06:37 -04:00
Kent Overstreet
d29c445b63 aio: Kill ki_dtor
sock_aio_dtor() is dead code - and stuff that does need to do cleanup
can simply do it before calling aio_complete().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 11:53:12 -04:00
Kent Overstreet
57282d8fd7 aio: Kill ki_users
The kiocb refcount is only needed for cancellation - to ensure a kiocb
isn't freed while a ki_cancel callback is running. But if we restrict
ki_cancel callbacks to not block (which they currently don't), we can
simply drop the refcount.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 11:53:12 -04:00
Kent Overstreet
8bc92afcf7 aio: Kill unneeded kiocb members
The old aio retry infrastucture needed to save the various arguments to
to aio operations. But with the retry infrastructure gone, we can trim
struct kiocb quite a bit.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 11:53:12 -04:00
Kent Overstreet
73a7075e3f aio: Kill aio_rw_vect_retry()
This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 11:53:12 -04:00
Kent Overstreet
5ffac122db aio: Don't use ctx->tail unnecessarily
aio_complete() (arguably) needs to keep its own trusted copy of the tail
pointer, but io_getevents() doesn't have to use it - it's already using
the head pointer from the ring buffer.

So convert it to use the tail from the ring buffer so it touches fewer
cachelines and doesn't contend with the cacheline aio_complete() needs.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 11:53:11 -04:00
Kent Overstreet
bec68faaf3 aio: io_cancel() no longer returns the io_event
Originally, io_event() was documented to return the io_event if
cancellation succeeded - the io_event wouldn't be delivered via the ring
buffer like it normally would.

But this isn't what the implementation was actually doing; the only
driver implementing cancellation, the usb gadget code, never returned an
io_event in its cancel function. And aio_complete() was recently changed
to no longer suppress event delivery if the kiocb had been cancelled.

This gets rid of the unused io_event argument to kiocb_cancel() and
kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
kiocb->ki_cancel() returned success.

Also tweak the refcounting in kiocb_cancel() to make more sense.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-07-30 11:53:11 -04:00