Commit Graph

39687 Commits

Author SHA1 Message Date
Jens Axboe
b520252aa2 fs: make inode_to_bdi() handle NULL inode
Running a heavy fs workload, I ran into a situation where we pass
down a page for writeback/swap that doesn't have an inode mapping:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffff8119589f>] inode_to_bdi+0xf/0x50
PGD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: wl(O) tun cfg80211 btusb joydev hid_apple hid_generic usbhid hid bcm5974 usb_storage nouveau snd_hda_codec_hdmi snd_hda_codec_cirrus snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_intel kvm_intel snd_hda_controller snd_hda_codec kvm snd_hwdep snd_pcm applesmc input_polldev snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_timer snd_seq_device snd xhci_pci xhci_hcd ttm thunderbolt soundcore apple_gmux apple_bl bluetooth binfmt_misc fuse nls_iso8859_1 nls_cp437 vfat fat [last unloaded: wl]
CPU: 4 PID: 50 Comm: kswapd0 Tainted: G     U     O   3.19.0-rc5+ #60
Hardware name: Apple Inc. MacBookPro11,3/Mac-2BD1B31983FE1663, BIOS MBP112.88Z.0138.B02.1310181745 10/18/2013
task: ffff880462e917f0 ti: ffff880462edc000 task.ti: ffff880462edc000
RIP: 0010:[<ffffffff8119589f>]  [<ffffffff8119589f>] inode_to_bdi+0xf/0x50
RSP: 0000:ffff880462edf8e8  EFLAGS: 00010282
RAX: ffffffff81c4cd80 RBX: ffffea0001b3abc0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff880462edf8f8 R08: 00000000001e8500 R09: ffff880460f7cb68
R10: ffff880462edfa00 R11: 0000000000000101 R12: 0000000000000000
R13: ffffffff81c4cd98 R14: 0000000000000000 R15: ffff880460f7c9c0
FS:  0000000000000000(0000) GS:ffff88047f300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 00000002b6341000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 ffffea0001b3abc0 ffffffff81c4cd80 ffff880462edf948 ffffffff811244aa
 ffffffff811565b0 ffff880460f7c9c0 ffff880462edf948 ffffea0001b3abc0
 0000000000000001 ffff880462edfb40 ffff880008b999c0 ffff880460f7c9c0
Call Trace:
 [<ffffffff811244aa>] __test_set_page_writeback+0x3a/0x170
 [<ffffffff811565b0>] ? SyS_madvise+0x790/0x790
 [<ffffffff81156bb6>] __swap_writepage+0x216/0x280
 [<ffffffff8133d592>] ? radix_tree_insert+0x32/0xe0
 [<ffffffff81157741>] ? swap_info_get+0x61/0xf0
 [<ffffffff81159bfc>] ? page_swapcount+0x4c/0x60
 [<ffffffff81156c4d>] swap_writepage+0x2d/0x50
 [<ffffffff81131658>] shmem_writepage+0x198/0x2c0
 [<ffffffff8112cae4>] shrink_page_list+0x464/0xa00
 [<ffffffff8112d666>] shrink_inactive_list+0x266/0x500
 [<ffffffff8112e215>] shrink_lruvec+0x5d5/0x720
 [<ffffffff8112e3bb>] shrink_zone+0x5b/0x190
 [<ffffffff8112ee3f>] kswapd+0x48f/0x8d0
 [<ffffffff8112e9b0>] ? try_to_free_pages+0x4c0/0x4c0
 [<ffffffff81067be2>] kthread+0xd2/0xf0
 [<ffffffff81060000>] ? workqueue_congested+0x30/0x80
 [<ffffffff81067b10>] ? kthread_create_on_node+0x180/0x180
 [<ffffffff816b556c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81067b10>] ? kthread_create_on_node+0x180/0x180
Code: 00 48 c7 c7 8d 8d a4 81 e8 3f 62 eb ff e9 fc fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 <48> 8b 5f 28 48 89 df e8 15 f8 00 00 85 c0 75 11 48 8b 83 d8 00
RIP  [<ffffffff8119589f>] inode_to_bdi+0xf/0x50
 RSP <ffff880462edf8e8>
CR2: 0000000000000028
---[ end trace eb0e21aa7dad3ddf ]---

Handle this in inode_to_bdi() by punting it to noop_backing_dev_info,
if mapping->host is NULL.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-22 08:13:17 -07:00
chandan
95449a1626 Btrfs: insert_new_root: Fix lock type of the extent buffer.
btrfs_alloc_tree_block() returns an extent buffer on which a blocked lock has
been taken. Hence assign the appropriate value to path->locks[level].

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-22 05:42:23 -08:00
Anand Jain
78f55e5e1f Btrfs: fix unused members in struct btrfs_root
There isn't any real use of following members of struct btrfs_root
so delete them.

struct kobject root_kobj;
struct completion kobj_unregister;

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:22:37 -08:00
Yang Dongsheng
0ee13fe28c btrfs: qgroup: move WARN_ON() to the correct location.
In function qgroup_excl_accounting(), we need to WARN when
qg->excl is less than what we want to free, same to child
and parents. But currently, for parent qgroup, the WARN_ON()
is located after freeing qg->excl. It will WARN out even we
free it normally.

This patch move this WARN_ON() before freeing qg->excl.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:22:37 -08:00
Liu Bo
26455d3318 Btrfs: cleanup unused run_most
"run_most" is not used anymore.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:22:16 -08:00
Zhao Lei
570193454a Rename all ref_count to refs in struct
refs is better than ref_count to record a struct's ref count.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Suggested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:50 -08:00
Zhao Lei
ffe2d2034b Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply
So we can check raid56 with:
 (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
instead of long:
 (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
10f1190016 Btrfs: Include map_type in raid_bio
Corrent code use many kinds of "clever" way to determine operation
target's raid type, as:
  raid_map != NULL
  or
  raid_map[MAX_NR] == RAID[56]_Q_STRIPE

To make code easy to maintenance, this patch put raid type into
bbio, and we can always get raid type from bbio with a "stupid"
way.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
be50a8ddaa Btrfs: Simplify scrub_setup_recheck_block()'s argument
scrub_setup_recheck_block() have many arguments but most of them
can be get from one of them, we can remove them to make code clean.
Some other cleanup for that function also included in this patch.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
b968fed1c3 Btrfs: Combine per-page recover in dev-replace and scrub
The code are similar, combine them to make code clean and easy to maintenance.
Some lost condition are also completed with benefit of this combination.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
8d6738c1bd Btrfs: Separate finding-right-mirror and writing-to-target's process in scrub_handle_errored_block()
In corrent code, code of finding-right-mirror and writing-to-target
are mixed in logic, if we find a right mirror but failed in writing
to target, it will treat as "hadn't found right block", and fill the
target with sblock_bad.

Actually, "failed in writing to target" does not mean "source
block is wrong", this patch separate above two condition in logic,
and do some cleanup to make code clean.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:49 -08:00
Zhao Lei
dc5f7a3bd8 Btrfs: Break loop when reach BTRFS_MAX_MIRRORS in scrub_setup_recheck_block()
Use break instead of useless loop should be more suitable in this
case.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
7653947fe6 Btrfs: btrfs_rm_dev_replace_blocked(): Use wait_event()
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
09dd7a01c3 Btrfs: Cleanup btrfs_bio_counter_inc_blocked()
1: Remove no-need DEFINE_WAIT(wait)
2: Add likely() for BTRFS_FS_STATE_DEV_REPLACING condition
3: Use while loop instead of goto

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
114ab50d82 Btrfs: Remove noneed force_write in scrub_write_block_to_dev_replace
It is always 1 in this place, because !1 case was already jumped
out in previous code.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
b25c94c580 Btrfs: Fix a jump typo of nodatasum_case to avoid wrong WARN_ON()
if (sctx->is_dev_replace && !is_metadata && !have_csum) {
    ...
    goto nodatasum_case;
}
...
nodatasum_case:
    WARN_ON(sctx->is_dev_replace);

In above code, nodatasum_case marker should be moved after
WARN_ON().

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
6e9606d2a2 Btrfs: add ref_count and free function for btrfs_bio
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
   to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
   forced data type checking in compile.

Changelog v1->v2:
 Rename following by David Sterba's suggestion:
 put_btrfs_bio() -> btrfs_put_bio()
 get_btrfs_bio() -> btrfs_get_bio()
 bbio->ref_count -> bbio->refs

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Zhao Lei
8e5cfb55d3 Btrfs: Make raid_map array be inlined in btrfs_bio structure
It can make code more simple and clear, we need not care about
free bbio and raid_map together.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Zhao Lei
cc7539edea Btrfs: sort raid_map before adding tgtdev stripes
It can avoid complex calculation of real stripes in sort,
moreover, we can clean up code of sorting tgtdev_map because it
will be in order initially.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Zhao Lei
e34c330d63 Btrfs: fix a out-of-bound access of raid_map
We add the number of stripes on target devices into bbio->num_stripes
if we are under device replacement, and we just sort the raid_map of
those stripes that not on the target devices, so if when we need
real raid_map, we need skip the stripes on the target devices.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:47 -08:00
Filipe Manana
df8d116ffa Btrfs: fix fsync log replay for inodes with a mix of regular refs and extrefs
If we have an inode with a large number of hard links, some of which may
be extrefs, turn a regular ref into an extref, fsync the inode and then
replay the fsync log (after a crash/reboot), we can endup with an fsync
log that makes the replay code always fail with -EOVERFLOW when processing
the inode's references.

This is easy to reproduce with the test case I made for xfstests. Its steps
are the following:

   _scratch_mkfs "-O extref" >> $seqres.full 2>&1
   _init_flakey
   _mount_flakey

   # Create a test file with 3001 hard links. This number is large enough to
   # make btrfs start using extrefs at some point even if the fs has the maximum
   # possible leaf/node size (64Kb).
   echo "hello world" > $SCRATCH_MNT/foo
   for i in `seq 1 3000`; do
       ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
   done

   # Make sure all metadata and data are durably persisted.
   sync

   # Now remove one link, add a new one with a new name, add another new one with
   # the same name as the one we just removed and fsync the inode.
   rm -f $SCRATCH_MNT/foo_link_0001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_0001
   rm -f $SCRATCH_MNT/foo_link_0002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3002
   ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3003
   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

   # Simulate a crash/power loss. This makes sure the next mount
   # will see an fsync log and will replay that log.

   _load_flakey_table $FLAKEY_DROP_WRITES
   _unmount_flakey

   _load_flakey_table $FLAKEY_ALLOW_WRITES
   _mount_flakey

   # Check that the number of hard links is correct, we are able to remove all
   # the hard links and read the file's data. This is just to verify we don't
   # get stale file handle errors (due to dangling directory index entries that
   # point to inodes that no longer exist).
   echo "Link count: $(stat --format=%h $SCRATCH_MNT/foo)"
   [ -f $SCRATCH_MNT/foo ] || echo "Link foo is missing"
   for ((i = 1; i <= 3003; i++)); do
       name=foo_link_`printf "%04d" $i`
       if [ $i -eq 2 ]; then
           [ -f $SCRATCH_MNT/$name ] && echo "Link $name found"
       else
           [ -f $SCRATCH_MNT/$name ] || echo "Link $name is missing"
       fi
   done
   rm -f $SCRATCH_MNT/foo_link_*
   cat $SCRATCH_MNT/foo
   rm -f $SCRATCH_MNT/foo

   status=0
   exit

The fix is simply to correct the overflow condition when overwriting a
reference item because it was wrong, trying to increase the item in the
fs/subvol tree by an impossible amount. Also ensure that we don't insert
one normal ref and one ext ref for the same dentry - this happened because
processing a dir index entry from the parent in the log happened when
the normal ref item was full, which made the logic insert an extref and
later when the normal ref had enough room, it would be inserted again
when processing the ref item from the child inode in the log.

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon. This test only passes if the previous
patch titled "Btrfs: fix fsync when extend references are added to an inode"
is applied too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:05 -08:00
Filipe Manana
2c2c452b0c Btrfs: fix fsync when extend references are added to an inode
If we added an extended reference to an inode and fsync'ed it, the log
replay code would make our inode have an incorrect link count, which
was lower then the expected/correct count.
This resulted in stale directory index entries after deleting some of
the hard links, and any access to the dangling directory entries resulted
in -ESTALE errors because the entries pointed to inode items that don't
exist anymore.

This is easy to reproduce with the test case I made for xfstests, and
the bulk of that test is:

    _scratch_mkfs "-O extref" >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create a test file with 3001 hard links. This number is large enough to
    # make btrfs start using extrefs at some point even if the fs has the maximum
    # possible leaf/node size (64Kb).
    echo "hello world" > $SCRATCH_MNT/foo
    for i in `seq 1 3000`; do
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_`printf "%04d" $i`
    done

    # Make sure all metadata and data are durably persisted.
    sync

    # Add one more link to the inode that ends up being a btrfs extref and fsync
    # the inode.
    ln $SCRATCH_MNT/foo $SCRATCH_MNT/foo_link_3001
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

    # Simulate a crash/power loss. This makes sure the next mount
    # will see an fsync log and will replay that log.

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Now after the fsync log replay btrfs left our inode with a wrong link count N,
    # which was smaller than the correct link count M (N < M).
    # So after removing N hard links, the remaining M - N directory entries were
    # still visible to user space but it was impossible to do anything with them
    # because they pointed to an inode that didn't exist anymore. This resulted in
    # stale file handle errors (-ESTALE) when accessing those dentries for example.
    #
    # So remove all hard links except the first one and then attempt to read the
    # file, to verify we don't get an -ESTALE error when accessing the inodel
    #
    # The btrfs fsck tool also detected the incorrect inode link count and it
    # reported an error message like the following:
    #
    # root 5 inode 257 errors 2001, no inode item, link count wrong
    #   unresolved ref dir 256 index 2978 namelen 13 name foo_link_2976 filetype 1 errors 4, no inode ref
    #
    # The fstests framework automatically calls fsck after a test is run, so we
    # don't need to call fsck explicitly here.

    rm -f $SCRATCH_MNT/foo_link_*
    cat $SCRATCH_MNT/foo

    status=0
    exit

So make sure an fsync always flushes the delayed inode item, so that the
fsync log contains it (needed in order to trigger the link count fixup
code) and fix the extref counting function, which always return -ENOENT
to its caller (and made it assume there were always 0 extrefs).

This issue has been present since the introduction of the extrefs feature
(2012).

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Filipe Manana
d36808e0d4 Btrfs: fix directory inconsistency after fsync log replay
If we have an inode (file) with a link count greater than 1, remove
one of its hard links, fsync the inode, power fail/crash and then
replay the fsync log on the next mount, we end up getting the parent
directory's metadata inconsistent - its i_size still reflects the
deleted hard link and has dangling index entries (with no matching
inode reference entries). This prevents the directory from ever being
deletable, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE
even if all of its children inodes are deleted, and the dangling index
entries can never be removed (as they point to an inode that does not
exist anymore).

This is easy to reproduce with the following excerpt from the test case
for xfstests that I just made:

    _scratch_mkfs >> $seqres.full 2>&1

    _init_flakey
    _mount_flakey

    # Create a test file with 2 hard links in the same directory.
    mkdir -p $SCRATCH_MNT/a/b
    echo "hello world" > $SCRATCH_MNT/a/b/foo
    ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar

    # Make sure all metadata and data are durably persisted.
    sync

    # Now remove one of the hard links and fsync the inode.
    rm -f $SCRATCH_MNT/a/b/bar
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo

    # Simulate a crash/power loss. This makes sure the next mount
    # will see an fsync log and will replay that log.

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    # Remove the last hard link of the file and attempt to remove its parent
    # directory - this failed in btrfs because the fsync log and replay code
    # didn't decrement the parent directory's i_size and left dangling directory
    # index entries - this made the btrfs rmdir implementation always fail with
    # the error -ENOTEMPTY.
    #
    # The dangling directory index entries were visible to user space, but it was
    # impossible to do anything on them (unlink, open, read, write, stat, etc)
    # because the inode they pointed to did not exist anymore.
    #
    # The parent directory's metadata inconsistency (stale index entries) was
    # also detected by btrfs' fsck tool, which is run automatically by the fstests
    # framework when the test finishes. The error message reported by fsck was:
    #
    # root 5 inode 259 errors 2001, no inode item, link count wrong
    #   unresolved ref dir 258 index 3 namelen 3 name bar filetype 1 errors 4, no inode ref
    #
    rm -f $SCRATCH_MNT/a/b/*
    rmdir $SCRATCH_MNT/a/b
    rmdir $SCRATCH_MNT/a

To fix this just make sure that after an unlink, if the inode is fsync'ed,
he parent inode is fully logged in the fsync log.

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Filipe Manana
6219872dc6 Btrfs: lookup for block group only if needed when freeing a tree block
Very often our extent buffer's header generation doesn't match the current
transaction's id or it is also referenced by other trees (snapshots), so
we don't need the corresponding block group cache object. Therefore only
search for it if we are going to use it, so we avoid an unnecessary search
in the block groups rbtree (and acquiring and releasing its spinlock).

Freeing a tree block is performed when COWing or deleting a node/leaf,
which implies we are holding the node/leaf's parent node lock, therefore
reducing the amount of time spent when freeing a tree block helps reducing
the amount of time we are holding the parent node's lock.

For example, for a run of xfstests/generic/083, the block group cache
object was needed only 682 times for a total of 226691 calls to free
a tree block.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
730a78c741 btrfs: remove a no-op unfreeze superbock callback
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
9ee49a047d btrfs: switch extent_state state to unsigned
Currently there's a 4B hole in the structure between refs and state and there
are only 16 bits used so we can make it unsigned. This will get a better
packing and may save some stack space for local variables.

The size of extent_state gets reduced by 8B and there are usually a lot
of slab objects.

struct extent_state {
	u64                        start;                /*     0     8 */
	u64                        end;                  /*     8     8 */
	struct rb_node             rb_node;              /*    16    24 */
	wait_queue_head_t          wq;                   /*    40    24 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	atomic_t                   refs;                 /*    64     4 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int          state;                /*    72     8 */
	u64                        private;              /*    80     8 */

	/* size: 88, cachelines: 2, members: 7 */
	/* sum members: 84, holes: 1, sum holes: 4 */
	/* last cacheline: 24 bytes */
};

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
David Sterba
5efa0490cc btrfs: set proper message level for skinny metadata
This has been confusing people for too long, the message is really just
informative.

CC: <stable@vger.kernel.org> # 3.10+
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
f0954c6637 btrfs: update message levels after checksum errors
The errors are worth noting and might get missed with INFO level.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
aa8ee31209 btrfs: update message levels during failed mount
All error conditions from open_ctree shall be ERR. Warning would
suggest that something's wrong and we can continue.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
David Sterba
68b663d13c btrfs: update message levels for errors
Several messages that point to some internal problem, level INFO is
wrong here.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
Filipe Manana
a8df6fe666 Btrfs: fix setup_leaf_for_split() to avoid leaf corruption
We were incorrectly detecting when the target key didn't exist anymore
after releasing the path and re-searching the tree. This could make
us split or duplicate (btrfs_split_item() and btrfs_duplicate_item()
are its only callers at the moment) an item when we should not.

For the case of duplicating an item, we currently only duplicate
checksum items (csum tree) and file extent items (fs/subvol trees).
For the checksum items we end up overriding the item completely,
but for file extent items we update only some of their fields in
the copy (done in __btrfs_drop_extents), which means we can end up
having a logical corruption for some values.

Also for the case where we duplicate a file extent item it will make
us produce a leaf with a wrong key order, as btrfs_duplicate_item()
advances us to the next slot and then its caller sets a smaller key
on the new item at that slot (like in __btrfs_drop_extents() e.g.).
Alternatively if the tree search in setup_leaf_for_split() leaves
with path->slots[0] == btrfs_header_nritems(path->nodes[0]), we end
up accessing beyond the leaf's end (when we check if the item's size
has changed) and make our caller insert an item at the invalid slot
btrfs_header_nritems(path->nodes[0]) + 1, causing an invalid memory
access if the leaf is full or nearly full.

This issue has been present since the introduction of this function
in 2009:

    Btrfs: Add btrfs_duplicate_item
    commit ad48fd7546

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:03 -08:00
Chris Mason
57bbddd7fb Merge branch 'cleanup/blocksize-diet-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:49:35 -08:00
Chris Mason
d354183488 Merge branch 'fix/find-item-path-leak' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:45:25 -08:00
Jeff Layton
8116bf4cb6 locks: update comments that refer to inode->i_flock
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2015-01-21 20:44:01 -05:00
Josef Bacik
ce93ec548c Btrfs: track dirty block groups on their own list
Currently any time we try to update the block groups on disk we will walk _all_
block groups and check for the ->dirty flag to see if it is set.  This function
can get called several times during a commit.  So if you have several terabytes
of data you will be a very sad panda as we will loop through _all_ of the block
groups several times, which makes the commit take a while which slows down the
rest of the file system operations.

This patch introduces a dirty list for the block groups that we get added to
when we dirty the block group for the first time.  Then we simply update any
block groups that have been dirtied since the last time we called
btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
free space cache out so it is much cleaner.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 17:36:52 -08:00
Josef Bacik
e7070be198 Btrfs: change how we track dirty roots
I've been overloading root->dirty_list to keep track of dirty roots and which
roots need to have their commit roots switched at transaction commit time.  This
could cause us to lose an update to the root which could corrupt the file
system.  To fix this use a state bit to know if the root is dirty, and if it
isn't set we go ahead and move the root to the dirty list.  This way if we
re-dirty the root after adding it to the switch_commit list we make sure to
update it.  This also makes it so that the extent root is always the last root
on the dirty list to try and keep the amount of churn down at this point in the
commit.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 17:35:49 -08:00
Brian Foster
4d949021aa xfs: remove incorrect error negation in attr_multi ioctl
xfs_compat_attrmulti_by_handle() calls memdup_user() which returns a
negative error code. The error code is negated by the caller and thus
incorrectly converted to a positive error code.

Remove the error negation such that the negative error is passed
correctly back up to userspace.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 10:04:24 +11:00
Dave Chinner
438c3c8d2b Merge branch 'xfs-buf-type-fixes' into for-next 2015-01-22 09:51:30 +11:00
Dave Chinner
3443a3bca5 xfs: set superblock buffer type correctly
When the superblock is modified in a transaction, the commonly
modified fields are not actually copied to the superblock buffer to
avoid the buffer lock becoming a serialisation point. However, there
are some other operations that modify the superblock fields within
the transaction that don't directly log to the superblock but rely
on the changes to be applied during the transaction commit (to
minimise the buffer lock hold time).

When we do this, we fail to mark the buffer log item as being a
superblock buffer and that can lead to the buffer not being marked
with the corect type in the log and hence causing recovery issues.
Fix it by setting the type correctly, similar to xfs_mod_sb()...

cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:30:23 +11:00
Dave Chinner
fe22d552b8 xfs: set buf types when converting extent formats
Conversion from local to extent format does not set the buffer type
correctly on the new extent buffer when a symlink data is moved out
of line.

Fix the symlink code and leave a comment in the generic bmap code
reminding us that the format-specific data copy needs to set the
destination buffer type appropriately.

cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:30:06 +11:00
Dave Chinner
f19b872b08 xfs: inode unlink does not set AGI buffer type
This leads to log recovery throwing errors like:

XFS (md0): Mounting V5 Filesystem
XFS (md0): Starting recovery (logdev: internal)
XFS (md0): Unknown buffer type 0!
XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1
ffff8800ffc53800: 58 41 47 49 .....

Which is the AGI buffer magic number.

Ensure that we set the type appropriately in both unlink list
addition and removal.

cc: <stable@vger.kernel.org> # 3.10 to current
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:29:40 +11:00
Dave Chinner
0d612fb570 xfs: ensure buffer types are set correctly
Jan Kara reported that log recovery was finding buffers with invalid
types in them. This should not happen, and indicates a bug in the
logging of buffers. To catch this, add asserts to the buffer
formatting code to ensure that the buffer type is in range when the
transaction is committed.

We don't set a type on buffers being marked stale - they are not
going to get replayed, the format item exists only for recovery to
be able to prevent replay of the buffer, so the type does not
matter. Hence that needs special casing here.

cc: <stable@vger.kernel.org> # 3.10 to current
Reported-by: Jan Kara <jack@suse.cz>
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:29:05 +11:00
Dave Chinner
465e2def7c Merge branch 'xfs-sb-logging-rework' into for-next
Conflicts:
	fs/xfs/xfs_mount.c
2015-01-22 09:20:53 +11:00
Anna Schumaker
2ef47eb1ae NFS: Fix use of nfs_attr_use_mounted_on_fileid()
This function call was being optimized out during nfs_fhget(), leading
to situations where we have a valid fileid but still want to use the
mounted_on_fileid.  For example, imagine we have our server configured
like this:

server % df
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       9.1G  6.5G  1.9G  78% /
/dev/vdb1       487M  2.3M  456M   1% /exports
/dev/vdc1       487M  2.3M  456M   1% /exports/vol1
/dev/vdd1       487M  2.3M  456M   1% /exports/vol2

If our client mounts /exports and tries to do a "chown -R" across the
entire mountpoint, we will get a nasty message warning us about a circular
directory structure.  Running chown with strace tells me that each directory
has the same device and inode number:

newfstatat(AT_FDCWD, "/nfs/", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol1", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
newfstatat(4, "vol2", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0

With this patch the mounted_on_fileid values are used for st_ino, so the
directory loop warning isn't reported.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:15:41 -05:00
Dave Chinner
074e427ba7 xfs: sanitise sb_bad_features2 handling
We currently have to ensure that every time we update sb_features2
that we update sb_bad_features2. Now that we log and format the
superblock in it's entirety we actually don't have to care because
we can simply update the sb_bad_features2 when we format it into the
buffer. This removes the need for anything but the mount and
superblock formatting code to care about sb_bad_features2, and
hence removes the possibility that we forget to update bad_features2
when necessary in the future.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:10:33 +11:00
Dave Chinner
61e63ecb57 xfs: consolidate superblock logging functions
We now have several superblock loggin functions that are identical
except for the transaction reservation and whether it shoul dbe a
synchronous transaction or not. Consolidate these all into a single
function, a single reserveration and a sync flag and call it
xfs_sync_sb().

Also, xfs_mod_sb() is not really a modification function - it's the
operation of logging the superblock buffer. hence change the name of
it to reflect this.

Note that we have to change the mp->m_update_flags that are passed
around at mount time to a boolean simply to indicate a superblock
update is needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:10:31 +11:00
Dave Chinner
4d11a40239 xfs: remove bitfield based superblock updates
When we log changes to the superblock, we first have to write them
to the on-disk buffer, and then log that. Right now we have a
complex bitfield based arrangement to only write the modified field
to the buffer before we log it.

This used to be necessary as a performance optimisation because we
logged the superblock buffer in every extent or inode allocation or
freeing, and so performance was extremely important. We haven't done
this for years, however, ever since the lazy superblock counters
pulled the superblock logging out of the transaction commit
fast path.

Hence we have a bunch of complexity that is not necessary that makes
writing the in-core superblock to disk much more complex than it
needs to be. We only need to log the superblock now during
management operations (e.g. during mount, unmount or quota control
operations) so it is not a performance critical path anymore.

As such, remove the complex field based logging mechanism and
replace it with a simple conversion function similar to what we use
for all other on-disk structures.

This means we always log the entirity of the superblock, but again
because we rarely modify the superblock this is not an issue for log
bandwidth or CPU time. Indeed, if we do log the superblock
frequently, delayed logging will minimise the impact of this
overhead.

[Fixed gquota/pquota inode sharing regression noticed by bfoster.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-22 09:10:26 +11:00
Trond Myklebust
3175e1dcec NFSv4.1: Fix an Oops in nfs41_walk_client_list
If we start state recovery on a client that failed to initialise correctly,
then we are very likely to Oops.

Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Link: http://lkml.kernel.org/r/130621862.279655.1421851650684.JavaMail.zimbra@desy.de
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:10:09 -05:00
Peng Tao
ee8a1a8b16 nfs: fix dio deadlock when O_DIRECT flag is flipped
We only support swap file calling nfs_direct_IO. However, application
might be able to get to nfs_direct_IO if it toggles O_DIRECT flag
during IO and it can deadlock because we grab inode->i_mutex in
nfs_file_direct_write(). So return 0 for such case. Then the generic
layer will fall back to buffer IO.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-21 17:10:08 -05:00
Jan Kara
a39427007e xfs: Remove some pointless quota checks
xfs_fs_get_xstate() and xfs_fs_get_xstatev() check whether there's quota
running before calling xfs_qm_scall_getqstat() or
xfs_qm_scall_getqstatv(). Thus we are certain that superblock supports
quota and xfs_sb_version_hasquota() check is pointless. Similarly we
know that when quota is running, mp->m_quotainfo will be allocated.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:33 +01:00
Jan Kara
fbf64b3df3 xfs: Remove some useless flags tests
'flags' have XFS_ALL_QUOTA_ACCT cleared immediately on function entry.
There's no point in checking these bits later in the function. Also
because we check something is going to change, we know some enforcement
bits are being added and thus there's no point in testing that later.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:32 +01:00
Jan Kara
8a2fdd4a49 xfs: Remove useless test
Q_XQUOTARM is never passed to xfs_fs_set_xstate() so remove the test.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:32 +01:00
Jan Kara
ca6cb0918e quota: Verify flags passed to Q_SETINFO
Currently flags passed via Q_SETINFO were just stored. This makes it
hard to add new flags since in theory userspace could be just setting /
clearing random flags. Since currently there is only one userspace
settable flag and that is somewhat obscure flags only for ancient v1
quota format, I'm reasonably sure noone operates these flags and
hopefully we are fine just adding the check that passed flags are sane.
If we indeed find some userspace program that gets broken by the strict
check, we can always remove it again.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:31 +01:00
Jan Kara
9c45101e88 quota: Cleanup flags definitions
Currently all quota flags were defined just in kernel-private headers.
Export flags readable / writeable from userspace to userspace via
include/uapi/linux/quota.h.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:30 +01:00
Jan Kara
96827adcc2 ocfs2: Move OLQF_CLEAN flag out of generic quota flags
OLQF_CLEAN flag is used by OCFS2 on disk to recognize whether quota
recovery is needed or not. We also somewhat abuse mem_dqinfo->dqi_flags
to pass this flag around. Use private flags for this to avoid clashes
with other quota flags / not pollute generic quota flag namespace.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:30 +01:00
Jan Kara
c119c5b974 quota: Don't store flags for v2 quota format
Currently, v2 quota format blindly stored flags from in-memory dqinfo on
disk, although there are no flags supported. Since it is stupid to store
flags which have no effect, just store 0 unconditionally and don't
bother loading it from disk.

Note that userspace could have stored some flags there via Q_SETINFO
quotactl and then later read them (although flags have no effect) but
I'm pretty sure noone does that (most definitely quota-tools don't and
quota interface doesn't have too much other users).

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-21 19:21:07 +01:00
Ingo Molnar
f49028292c Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

  - Documentation updates.

  - Miscellaneous fixes.

  - Preemptible-RCU fixes, including fixing an old bug in the
    interaction of RCU priority boosting and CPU hotplug.

  - SRCU updates.

  - RCU CPU stall-warning updates.

  - RCU torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-21 06:12:21 +01:00
Qu Wenruo
a53f4f8e9c btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
Commit 6b5fe46dfa (btrfs: do commit in sync_fs if there are pending
changes) will call btrfs_start_transaction() in sync_fs(), to handle
some operations needed to be done in next transaction.

However this can cause deadlock if the filesystem is frozen, with the
following sys_r+w output:
[  143.255932] Call Trace:
[  143.255936]  [<ffffffff816c0e09>] schedule+0x29/0x70
[  143.255939]  [<ffffffff811cb7f3>] __sb_start_write+0xb3/0x100
[  143.255971]  [<ffffffffa040ec06>] start_transaction+0x2e6/0x5a0
[btrfs]
[  143.255992]  [<ffffffffa040f1eb>] btrfs_start_transaction+0x1b/0x20
[btrfs]
[  143.256003]  [<ffffffffa03dc0ba>] btrfs_sync_fs+0xca/0xd0 [btrfs]
[  143.256007]  [<ffffffff811f7be0>] sync_fs_one_sb+0x20/0x30
[  143.256011]  [<ffffffff811cbd01>] iterate_supers+0xe1/0xf0
[  143.256014]  [<ffffffff811f7d75>] sys_sync+0x55/0x90
[  143.256017]  [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17
[  143.256111] Call Trace:
[  143.256114]  [<ffffffff816c0e09>] schedule+0x29/0x70
[  143.256119]  [<ffffffff816c3405>] rwsem_down_write_failed+0x1c5/0x2d0
[  143.256123]  [<ffffffff8133f013>] call_rwsem_down_write_failed+0x13/0x20
[  143.256131]  [<ffffffff811caae8>] thaw_super+0x28/0xc0
[  143.256135]  [<ffffffff811db3e5>] do_vfs_ioctl+0x3f5/0x540
[  143.256187]  [<ffffffff811db5c1>] SyS_ioctl+0x91/0xb0
[  143.256213]  [<ffffffff816c49d2>] system_call_fastpath+0x12/0x17

The reason is like the following:
(Holding s_umount)
VFS sync_fs staff:
|- btrfs_sync_fs()
   |- btrfs_start_transaction()
      |- sb_start_intwrite()
      (Waiting thaw_fs to unfreeze)
					VFS thaw_fs staff:
					thaw_fs()
					(Waiting sync_fs to release
					 s_umount)

So deadlock happens.
This can be easily triggered by fstest/generic/068 with inode_cache
mount option.

The fix is to check if the fs is frozen, if the fs is frozen, just
return and waiting for the next transaction.

Cc: David Sterba <dsterba@suse.cz>
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
[enhanced comment, changed to SB_FREEZE_WRITE]
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-20 17:20:21 -08:00
Qu Wenruo
6c9fe14f9d btrfs: Fix the bug that fs_info->pending_changes is never cleared.
Fs_info->pending_changes is never cleared since the original code uses
cmpxchg(&fs_info->pending_changes, 0, 0), which will only clear it if
pending_changes is already 0.

This will cause a lot of problem when mount it with inode_cache mount
option.
If the btrfs is mounted as inode_cache, pending_changes will always be
1, even when the fs is frozen.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-20 17:19:40 -08:00
Christoph Hellwig
df0ce26cb4 fs: remove default_backing_dev_info
Now that default_backing_dev_info is not used for writeback purposes we can
git rid of it easily:

 - instead of using it's name for tracing unregistered bdi we just use
   "unknown"
 - btrfs and ceph can just assign the default read ahead window themselves
   like several other filesystems already do.
 - we can assign noop_backing_dev_info as the default one in alloc_super.
   All filesystems already either assigned their own or
   noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:05:38 -07:00
Christoph Hellwig
7b14a21389 nfs: don't call bdi_unregister
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.

Addintionally remove the call during mount failure, as
deactivate_super_locked will already call ->kill_sb and clean up
the bdi for us.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:09 -07:00
Christoph Hellwig
e4d2750909 ceph: remove call to bdi_unregister
bdi_destroy already does all the work, and if we delay freeing the
anon bdev we can get away with just that single call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:07 -07:00
Christoph Hellwig
b83ae6d421 fs: remove mapping->backing_dev_info
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:05 -07:00
Christoph Hellwig
de1414a654 fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
Now that we got rid of the bdi abuse on character devices we can always use
sb->s_bdi to get at the backing_dev_info for a file, except for the block
device special case.  Export inode_to_bdi and replace uses of
mapping->backing_dev_info with it to prepare for the removal of
mapping->backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:04 -07:00
Christoph Hellwig
26ff13047e nilfs2: set up s_bdi like the generic mount_bdev code
mapping->backing_dev_info will go away, so don't rely on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:02 -07:00
Christoph Hellwig
495a276e1c block_dev: get bdev inode bdi directly from the block device
Directly grab the backing_dev_info from the request_queue instead of
detouring through the address_space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:01 -07:00
Christoph Hellwig
564f00f6c0 block_dev: only write bdev inode on close
Since 018a17bdc8 ("bdi: reimplement bdev_inode_switch_bdi()") the
block device code writes out all dirty data whenever switching the
backing_dev_info for a block device inode.  But a block device inode can
only be dirtied when it is in use, which means we only have to write it
out on the final blkdev_put, but not when doing a blkdev_get.

Factoring out the write out from the bdi list switch prepares from
removing the list switch later in the series.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:02:59 -07:00
Christoph Hellwig
b4caecd480 fs: introduce f_op->mmap_capabilities for nommu mmap support
Since "BDI: Provide backing device capability information [try #3]" the
backing_dev_info structure also provides flags for the kind of mmap
operation available in a nommu environment, which is entirely unrelated
to it's original purpose.

Introduce a new nommu-only file operation to provide this information to
the nommu mmap code instead.  Splitting this from the backing_dev_info
structure allows to remove lots of backing_dev_info instance that aren't
otherwise needed, and entirely gets rid of the concept of providing a
backing_dev_info for a character device.  It also removes the need for
the mtd_inodefs filesystem.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:02:58 -07:00
Christoph Hellwig
a7a2c680a2 fs: deduplicate noop_backing_dev_info
hugetlbfs, kernfs and dlmfs can simply use noop_backing_dev_info instead
of creating a local duplicate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:02:54 -07:00
Sachin Prabhu
ca7df8e0bb Complete oplock break jobs before closing file handle
Commit
c11f1df500
requires writers to wait for any pending oplock break handler to
complete before proceeding to write. This is done by waiting on bit
CIFS_INODE_PENDING_OPLOCK_BREAK in cifsFileInfo->flags. This bit is
cleared by the oplock break handler job queued on the workqueue once it
has completed handling the oplock break allowing writers to proceed with
writing to the file.

While testing, it was noticed that the filehandle could be closed while
there is a pending oplock break which results in the oplock break
handler on the cifsiod workqueue being cancelled before it has had a
chance to execute and clear the CIFS_INODE_PENDING_OPLOCK_BREAK bit.
Any subsequent attempt to write to this file hangs waiting for the
CIFS_INODE_PENDING_OPLOCK_BREAK bit to be cleared.

We fix this by ensuring that we also clear the bit
CIFS_INODE_PENDING_OPLOCK_BREAK when we remove the oplock break handler
from the workqueue.

The bug was found by Red Hat QA while testing using ltp's fsstress
command.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Acked-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Jeff Layton <jlayton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <steve.french@primarydata.com>
2015-01-19 20:20:46 -06:00
Giel van Schijndel
f99dbfa4b3 cifs: use memzero_explicit to clear stack buffer
When leaving a function use memzero_explicit instead of memset(0) to
clear stack allocated buffers. memset(0) may be optimized away.

This particular buffer is highly likely to contain sensitive data which
we shouldn't leak (it's named 'passwd' after all).

Signed-off-by: Giel van Schijndel <me@mortis.eu>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Reported-at: http://www.viva64.com/en/b/0299/
Reported-by: Andrey Karpov
Reported-by: Svyatoslav Razmyslov
Signed-off-by: Steve French <steve.french@primarydata.com>
2015-01-19 15:32:13 -06:00
Satoru Takeuchi
6e1103a6e9 btrfs: fix state->private cast on 32 bit machines
Suppress the following warning displayed on building 32bit (i686) kernel.

===============================================================================
...
   CC [M]  fs/btrfs/extent_io.o
fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’:
fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of
different size [-Wint-to-pointer-cast]
    failrec = (struct io_failure_record *)state->private;
...
===============================================================================

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:06:06 -08:00
Filipe Manana
75c68e9fbb Btrfs: fix race deleting block group from space_info->ro_bgs list
When removing a block group we were deleting it from its space_info's
ro_bgs list without the correct protection - the space info's spinlock.
Fix this by doing the list delete while holding the spinlock of the
corresponding space info, which is the correct lock for any operation
on that list.

This issue was introduced in the 3.19 kernel by the following change:

    Btrfs: move read only block groups onto their own list V2
    commit 633c0aad4c

I ran into a kernel crash while a task was running statfs, which iterates
the space_info->ro_bgs list while holding the space info's spinlock,
and another task was deleting it from the same list, without holding that
spinlock, as part of the block group remove operation (while running the
function btrfs_remove_block_group). This happened often when running the
stress test xfstests/generic/038 I recently made.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:05:45 -08:00
Tsutomu Itoh
379d6854a2 Btrfs: fix incorrect freeing in scrub_stripe
The address that should be freed is not 'ppath' but 'path'.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:05:44 -08:00
David Sterba
98bd5c547e btrfs: sync ioctl, handle errors after transaction start
The version merged to 3.19 did not handle errors from start_trancaction
and could pass an invalid pointer to commit_transaction.

Fixes: 6b5fe46dfa ("btrfs: do commit in sync_fs if there are pending changes")
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:05:44 -08:00
Theodore Ts'o
3edc18d845 ext4: reserve codepoints used by the ext4 encryption feature
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-19 16:00:58 -05:00
Darrick J. Wong
b6924225c2 jbd2: complain about descriptor block checksum errors
We should complain in dmesg when journal recovery fails on account of
the descriptor block being corrupt, so that the diagnostic data can
be recovered.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-19 15:59:58 -05:00
Al Viro
378ff1a53b fix deadlock in cifs_ioctl_clone()
It really needs to check that src is non-directory *and* use
{un,}lock_two_nodirectories().  As it is, it's trivial to cause
double-lock (ioctl(fd, CIFS_IOC_COPYCHUNK_FILE, fd)) and if the
last argument is an fd of directory, we are asking for trouble
by violating the locking order - all directories go before all
non-directories.  If the last argument is an fd of parent
directory, it has 50% odds of locking child before parent,
which will cause AB-BA deadlock if we race with unlink().

Cc: stable@vger.kernel.org @ 3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-18 23:49:26 -05:00
Johannes Berg
053c095a82 netlink: make nlmsg_end() and genlmsg_end() void
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.

This makes the very common pattern of

  if (genlmsg_end(...) < 0) { ... }

be a whole bunch of dead code. Many places also simply do

  return nlmsg_end(...);

and the caller is expected to deal with it.

This also commonly (at least for me) causes errors, because it is very
common to write

  if (my_function(...))
    /* error condition */

and if my_function() does "return nlmsg_end()" this is of course wrong.

Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.

Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did

-	return nlmsg_end(...);
+	nlmsg_end(...);
+	return 0;

I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.

One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18 01:03:45 -05:00
alex chen
a6b8978c54 pstore: Fix sprintf format specifier in pstore_dump()
We should use sprintf format specifier "%u" instead of "%d" for
argument of type 'unsigned int' in pstore_dump().

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-01-16 16:01:29 -08:00
Mark Salyzyn
9d5438f462 pstore: Add pmsg - user-space accessible pstore object
A secured user-space accessible pstore object. Writes
to /dev/pmsg0 are appended to the buffer, on reboot
the persistent contents are available in
/sys/fs/pstore/pmsg-ramoops-[ID].

One possible use is syslogd, or other daemon, can
write messages, then on reboot provides a means to
triage user-space activities leading up to a panic
as a companion to the pstore dmesg or console logs.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-01-16 16:01:10 -08:00
Mark Salyzyn
f44f96528a pstore: Handle zero-sized prz in series
ramoops_pstore_read fails to return the next in a prz
series after first zero-sized entry, not venturing to
the next non-zero entry.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-01-16 15:47:23 -08:00
Mark Salyzyn
ff6bf6e802 pstore: Remove superfluous memory size check
All previous checks will fail with error if memory size
is not sufficient to register a zone, so this legacy
check has become redundant.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-01-16 15:45:46 -08:00
Mark Salyzyn
dbaffde764 pstore: Use scnprintf() in pstore_mkfile()
No guarantees that the names will not exceed the
name buffer with future adjustments.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-01-16 15:30:50 -08:00
Jeff Layton
3d8e560de4 locks: consolidate NULL i_flctx checks in locks_remove_file
We have each of the locks_remove_* variants doing this individually.
Have the caller do it instead, and have locks_remove_flock and
locks_remove_lease just assume that it's a valid pointer.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2015-01-16 16:08:50 -05:00
Jeff Layton
9bd0f45b70 locks: keep a count of locks on the flctx lists
This makes things a bit more efficient in the cifs and ceph lock
pushing code.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 16:08:50 -05:00
Jeff Layton
7448cc37b1 locks: clean up the lm_change prototype
Now that we use standard list_heads for tracking leases, we can have
lm_change take a pointer to the lease to be modified instead of a
double pointer.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 16:08:50 -05:00
Jeff Layton
6109c85037 locks: add a dedicated spinlock to protect i_flctx lists
We can now add a dedicated spinlock without expanding struct inode.
Change to using that to protect the various i_flctx lists.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 16:08:49 -05:00
Jeff Layton
8634b51f6c locks: convert lease handling to file_lock_context
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 16:08:17 -05:00
Jeff Layton
bd61e0a9c8 locks: convert posix locks to file_lock_context
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 16:08:16 -05:00
Jeff Layton
5263e31e45 locks: move flock locks to file_lock_context
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 15:09:25 -05:00
Jeff Layton
c362781cad ceph: move spinlocking into ceph_encode_locks_to_buffer and ceph_count_locks
There is only a single call site for each of these functions, and the
caller takes the i_lock prior to calling them and drops it just
afterward. Move the spinlocking into the functions instead.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 15:09:25 -05:00
Jeff Layton
4a075e39c8 locks: add a new struct file_locking_context pointer to struct inode
The current scheme of using the i_flock list is really difficult to
manage. There is also a legitimate desire for a per-inode spinlock to
manage these lists that isn't the i_lock.

Start conversion to a new scheme to eventually replace the old i_flock
list with a new "file_lock_context" object.

We start by adding a new i_flctx to struct inode. For now, it lives in
parallel with i_flock list, but will eventually replace it. The idea is
to allocate a structure to sit in that pointer and act as a locus for
all things file locking.

We allocate a file_lock_context for an inode when the first lock is
added to it, and it's only freed when the inode is freed. We use the
i_lock to protect the assignment, but afterward it should mostly be
accessed locklessly.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 15:05:54 -05:00
Jeff Layton
dd459bb197 locks: have locks_release_file use flock_lock_file to release generic flock locks
...instead of open-coding it and removing flock locks directly. This
helps consolidate the flock lock removal logic into a single spot.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2015-01-16 15:05:54 -05:00
Jeff Layton
6dee60f69d locks: add new struct list_head to struct file_lock
...that we can use to queue file_locks to per-ctx list_heads. Go ahead
and convert locks_delete_lock and locks_dispose_list to use it instead
of the fl_block list.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16 15:05:54 -05:00
Linus Torvalds
62b1530065 Driver core fixes for 3.19-rc5
Here is one kernfs fix for a reported issue for 3.19-rc5.
 
 It has been in linux-next for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlS5V8QACgkQMUfUDdst+ynQTgCdEOUn6oftKCkErl4WWX9q0+ZT
 4CIAoLuGH9Gdn5tIVlqJ1tVmESnsgn0T
 =P84S
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fix from Greg KH:
 "Here is one kernfs fix for a reported issue for 3.19-rc5.

  It has been in linux-next for a while"

* tag 'driver-core-3.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  kernfs: Fix kernfs_name_compare
2015-01-17 08:16:52 +13:00
Linus Torvalds
a2a32cd1d7 NFS client bugfixes for Linux 3.19
Highlights include:
 
 - Stable fix for a NFSv3/lockd race
 - Fixes for several NFSv4.1 client id trunking bugs
 - Remove an incorrect test when checking for delegated opens
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUuSAcAAoJEGcL54qWCgDyVGkQAJTTzJ8TcuDaXPz7a+8vYR/F
 0Z0iJ7Q2VzfVwLKReQdSMUhX8E6OsVe/HlmFT1paNdIOnF7gQvTclP7/GZ/EaDtA
 csBKkEcxoCRbF6OhCdeSMJC+iXs0cugCgYrkusPSGJj8dG+qa6Hix8rDfZKbM4yp
 sQphMozeI/FLHlckBTsViBCECcyR1FxjOrpXc1RPTz1u2dw9dDgZABdX00ubve2Y
 5dGIGyriFUiZzqFp+7GLXa2GITuLnda+IPPywyzF2MQy4XYWH63bXoWbyXL3kzQ7
 P7hLsGMOfl+pFOBnwYyXhhfqdsS/lOt8V+18Rrr1NhFx7VtQUae/VDPwLxnnoq0P
 ZtYiuJmIihzHfr4N5V2QR9TcuNO7fg6/1hxHZkOYxaYkH2IezFRWbtj9J/UJS+8q
 XDQzEvXMxtXCHKDQvmW653oQIdCFRR9YvgsN/YaxIB3vWPKL2tupeQWe/ZEdVvX3
 WemIxyc6NBS86otauVXOIKZFgiJ4mynpgH22xb43CJ33b049vJVu/NOQWHgrhHqQ
 7vpNDvpjyErjtoMVdN/x2L21uVTucZDXF59hfUD/Q9kE5qXGDRlqiFsDbkVvd+0C
 SXKBngtslBvgU3dYblAuJ9uMkTYkSE/vNkzVibPXJr4/cysVyMeeRDysVPHRZpw2
 8jH+yyHBxvK8XSDTxIkh
 =hz4r
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - Stable fix for a NFSv3/lockd race
   - Fixes for several NFSv4.1 client id trunking bugs
   - Remove an incorrect test when checking for delegated opens"

* tag 'nfs-for-3.19-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Remove incorrect check in can_open_delegated()
  NFS: Ignore transport protocol when detecting server trunking
  NFSv4/v4.1: Verify the client owner id during trunking detection
  NFSv4: Cache the NFSv4/v4.1 client owner_id in the struct nfs_client
  NFSv4.1: Fix client id trunking on Linux
  LOCKD: Fix a race when initialising nlmsvc_timeout
2015-01-17 07:59:06 +13:00
Linus Torvalds
cb59670870 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "This fixes a regression in the latest fuse update plus a fix for a
  rather theoretical memory ordering issue"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: add memory barrier to INIT
  fuse: fix LOOKUP vs INIT compat handling
2015-01-16 14:58:16 +13:00
Rickard Strandqvist
917937025a nfsd: nfs4state: Remove unused function
Remove the function renew_client() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:42 -05:00
Rickard Strandqvist
4cb7208a41 lockd: xdr: Remove unused function
Remove the function nlm_encode_fh() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 13:46:27 -05:00
David Sterba
1d4c08e0a6 btrfs: expand btrfs_find_item if found_key is NULL
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper
for simple btrfs_search_slot.

After we've removed all such callers, passing a NULL key is not valid
anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:48 +01:00
David Sterba
9c4f61f01d btrfs: simplify insert_orphan_item
We can search and add the orphan item in one go,
btrfs_insert_orphan_item will find out if the item already exists.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:48 +01:00
David Sterba
c234a24de9 btrfs: cleanup, remove inode_ref_info helper
A simple wrapper around btrfs_find_item.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:47 +01:00
David Sterba
14692cc150 btrfs: cleanup, remove inode_item_info helper
It's only a simple wrapper around btrfs_find_item, the locally defined
key is not used.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:47 +01:00
David Sterba
381cf6587f btrfs: fix leak of path in btrfs_find_item
If btrfs_find_item is called with NULL path it allocates one locally but
does not free it. Affected paths are inserting an orphan item for a file
and for a subvol root.

Move the path allocation to the callers.

CC: <stable@vger.kernel.org> # 3.14+
Fixes: 3f870c2899 ("btrfs: expand btrfs_find_item() to include find_orphan_item functionality")
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-01-14 19:23:46 +01:00
Matthew Wilcox
dd22f551ac block: Change direct_access calling convention
In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-13 21:58:11 -07:00
NeilBrown
52d304eb4e locks: fix NULL-deref in generic_delete_lease
commit 0efaa7e82f
  locks: generic_delete_lease doesn't need a file_lock at all

moves the call to fl->fl_lmops->lm_change() to a place in the
code where fl might be a non-lease lock.
When that happens, fl_lmops is NULL and an Oops ensures.

So add an extra test to restore correct functioning.

Reported-by: Linda Walsh <suse@tlinx.org>
Link: https://bugzilla.suse.com/show_bug.cgi?id=912569
Cc: stable@vger.kernel.org (v3.18)
Fixes: 0efaa7e82f
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2015-01-13 07:00:55 -05:00
alex chen
3566c96476 GFS2: fix sprintf format specifier
Sprintf format specifier "%d" and "%u" are mixed up in
gfs2_recovery_done() and freeze_show(). So correct them.

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2015-01-13 10:48:57 +00:00
Rickard Strandqvist
e4999bbe08 jffs2: compr_rubin: Remove unused function
Remove the function pulledbits() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Reviewed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2015-01-12 20:40:03 -08:00
Fabian Frederick
bbe48dd811 udf: destroy sbi mutex in put_super
Call mutex_destroy() on superblock mutex in udf_put_super()
otherwise mutex debugging code isn't able to detect that
mutex is used after being freed.
(thanks to Jan Kara for complete definition).

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-12 10:56:04 +01:00
Matt Fleming
17473c32ce Updates included:
*Rename the function efi_guid_unparse to reflect better its behavior. - Borislav Petkov
 
  *Update the EFI firmware Kconfig help with the URL of efibootmgr. - Peter Jones
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUrfd/AAoJECYUxVC6df4Sd7kP/RP4uzJQnQTteg2ys2Zk2sEe
 rWdjxtcQurnP5P07SqNV/dZIWQx6hG/l61herFzryq9JKNye70fCd8bGl4sOPiHW
 i64FJtmCF+3vgLhJK2s6Ysub/xPdC0zkRXIIHvulX1nwZRWXiPzkJPzFlkCbXaW1
 O3ykSBIT4EwXAe84dN1Oio476Zyf/rS81rKakYXKwYNrfSKTmyH2FUlZODaSL+0a
 oGpae/2kps6rqdlOq19J5VKcYE3CTzy+Pv3QhqVlgywWF0AIUOHrcWpsWTv4IhKr
 v8blQcy7QZ/QWOEg7Nnus48/Mh0sNdAy497ZV4/tgmBfkGUcHD5qX5/Fy6Yp+/0R
 YJfRv5fTDyxYp6krdEeUnOphaXZT2jkG8uRdLJxNBONWBkwf/i9QXcmwuJc0+sgF
 7NKP8/kbqc3A6blQ/RNQRdeLW0hAAXfrVHj/zW8Fbi0PJTn+ALTTyvjDhMwYOr46
 IVbm1CzGQLXGNDFWvxRlMbzQa5PktiiqOxymTNJ6HZwP8TnS5IE5f/iEVpj5Yd+z
 zbYQcpBiN8VnQsKBmq04865/OrNgQRcUC+OU6Mz1iIIEIF8NLBDoRfE/oX0I+dEC
 jc0wTszmr0uXAu6dP4tgQEaQ0oNdOE806+BZ++3bpaZ3GCScjYtDOnSdsFB5ByYp
 dsAqvfPW8uY+I27SXMat
 =aLDw
 -----END PGP SIGNATURE-----

Merge tag 'rneri-efi-next' of https://github.com/ricardon/efi into next

Pull EFI changes for v3.20 from Ricardo Neri, who kindly picked them up
while I was out on annual leave in December.

Updates included:

 *Rename the function efi_guid_unparse to reflect better its behavior. - Borislav Petkov

 *Update the EFI firmware Kconfig help with the URL of efibootmgr. - Peter Jones

Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2015-01-12 09:33:25 +00:00
Linus Torvalds
5ab551d662 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes: group scheduling corner case fix, two deadline scheduler
  fixes, effective_load() overflow fix, nested sleep fix, 6144 CPUs
  system fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()
  sched/deadline: Avoid double-accounting in case of missed deadlines
  sched/deadline: Fix migration of SCHED_DEADLINE tasks
  sched: Fix odd values in effective_load() calculations
  sched, fanotify: Deal with nested sleeps
  sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation
2015-01-11 11:51:49 -08:00
Linus Torvalds
dc9319f5a3 Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linux
Pull two nfsd bugfixes from Bruce Fields.

* 'for-3.19' of git://linux-nfs.org/~bfields/linux:
  rpc: fix xdr_truncate_encode to handle buffer ending on page boundary
  nfsd: fix fi_delegees leak when fi_had_conflict returns true
2015-01-09 18:10:48 -08:00
Linus Torvalds
20ebb34528 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull two Ceph fixes from Sage Weil:
 "These are both pretty trivial: a sparse warning fix and size_t printk
  thing"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: fix sparse endianness warnings
  ceph: use %zu for len in ceph_fill_inline_data()
2015-01-09 17:55:00 -08:00
Linus Torvalds
03c751a5e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "None of these are huge, but my commit does fix a regression from 3.18
  that could cause lost files during log replay.

  This also adds Dave Sterba to the list of Btrfs maintainers.  It
  doesn't mean we're doing things differently, but Dave has really been
  helping with the maintainer workload for years"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: don't delay inode ref updates during log replay
  Btrfs: correctly get tree level in tree_backref_for_extent
  Btrfs: call inode_dec_link_count() on mkdir error path
  Btrfs: abort transaction if we don't find the block group
  Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()
  Btrfs: add more maintainers
2015-01-09 17:46:07 -08:00
kbuild test robot
08e4126e1e f2fs: pids_lock can be static
fs/f2fs/trace.c:19:12: sparse: symbol 'pids_lock' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:29 -08:00
Jaegeuk Kim
351f4fba84 f2fs: add f2fs_destroy_trace_ios to free radix tree
This patch removes radix tree after finishing tracing IOs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:28 -08:00
Jaegeuk Kim
c05086506f f2fs: add spin_lock to cover radix operations in IO tracer
This patch adds spin_lock to cover radix tree operations in IO tracer.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:28 -08:00
Jaegeuk Kim
dd4e4b59b1 f2fs: add nat/sit entries into status
This patch adds NAT/SIT entry informations.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:28 -08:00
Jaegeuk Kim
7aed0d4537 f2fs: free radix_tree_nodes used by nat_set entries
In the normal case, the radix_tree_nodes are freed successfully.
But, when cp_error was detected, we should destroy them forcefully.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:28 -08:00
Jaegeuk Kim
df1991391f f2fs: fix wrong unlock_page call
This patch removes wrongly called unlock_page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:27 -08:00
Chao Yu
9e5ba77fdb f2fs: get rid of kzalloc in __recover_inline_status
We use kzalloc to allocate memory in __recover_inline_status, and use this
all-zero memory to check the inline date content of inode page by comparing
them. This is low effective and not needed, let's check inline date content
directly.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
[Jaegeuk Kim: make the code more neat]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:27 -08:00
Jaegeuk Kim
38aa0889b2 f2fs: align direct_io'ed data to section
This patch aligns the start block address of a file for direct io to the f2fs's
section size.

Some flash devices manage an over 4KB-sized page as a write unit, and if the
direct_io'ed data are written but not aligned to that unit, the performance can
be degraded due to the partial page copies.

Thus, since f2fs has a section that is well aligned to FTL units, we can align
the block address to the section size so that f2fs avoids this misalignment.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:27 -08:00
Jaegeuk Kim
41ef94b35c f2fs: remove uncovered code path
This patch removes unnecessary function calls.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:27 -08:00
Jaegeuk Kim
3547ea961d f2fs: avoid potential unnecessary codes
This patch relocates some operations to avoid unnecessary execution.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:26 -08:00
Jaegeuk Kim
e1509cf294 f2fs: clean up to remove parameter
This patch uses dn->data_blkaddr as a parameter for the destination block
address.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:26 -08:00
Chao Yu
062920734c f2fs: reuse inode_entry_slab in gc procedure for using slab more effectively
There are two slab cache inode_entry_slab and winode_slab using the same
structure as below:

struct dir_inode_entry {
	struct list_head list;	/* list head */
	struct inode *inode;	/* vfs inode pointer */
};

struct inode_entry {
	struct list_head list;
	struct inode *inode;
};

It's a little waste that the two cache can not share their memory space for each
other.
So in this patch we remove one redundant winode_slab slab cache, then use more
universal name struct inode_entry as remaining data structure name of slab,
finally we reuse the inode_entry_slab to store dirty dir item and gc item for
more effective.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:26 -08:00
Chao Yu
2ace38e00e f2fs: cleanup parameters for trace_f2fs_submit_{read_,write_,page_,page_m}bio with fio
Cleanup parameters for trace_f2fs_submit_{read_,write_,page_,page_m}bio with fio
as one parameter.

Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:26 -08:00
Chao Yu
3e1c8f125e f2fs: cleanup trace event of f2fs_submit_page_{m,}bio with DECLARE_EVENT_CLASS
This patch adds missing parameter _type_ for trace_f2fs_submit_page_bio, then
use DECLARE_EVENT_CLASS/DEFINE_EVENT_CONDITION pair to cleanup some trace event
code related to f2fs_submit_page_{m,}bio.

Additionally, after we remove redundant code, size of code can be reduced:
   text    data     bss     dec     hex filename
 176787    8712      56  185555   2d4d3 f2fs.ko.org
 174408    8648      56  183112   2cb48 f2fs.ko

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:25 -08:00
Jaegeuk Kim
09eb483e89 f2fs: fix missing cold bit during recovery
In do_recover_data, we find and update previous node pages after updating
its new block addresses.
After then, we call fill_node_footer without reset field, we erase its
cold bit so that this new cold node block is written to wrong log area.
This patch fixes not to miss its old flag.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:25 -08:00
Changman Lee
b9a2c25207 f2fs: add block count by in-place-update in stat info
This patch adds block count by in-place-update in stat.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:25 -08:00
Jaegeuk Kim
dd802406e3 f2fs: avoid double lock for cp_rwsem
The __f2fs_add_link is covered by cp_rwsem all the time.
This calls init_inode_metadata, which conducts some acl operations including
memory allocation with GFP_KERNEL previously.
But, under memory pressure, f2fs_write_data_page can be called, which also
grabs cp_rwsem too.

In this case, this incurs a deadlock pointed by Chao.
Thread #1        Thread #2
 down_read
                 down_write
  down_read
 -> here down_read should wait forever.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:25 -08:00
Jaegeuk Kim
db9f7c1a95 f2fs: activate f2fs_trace_ios
This patch activates f2fs_trace_ios.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:24 -08:00
Jaegeuk Kim
9e4ded3f30 f2fs: activate f2fs_trace_pid
This patch activates f2fs_trace_pid.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:24 -08:00
Jaegeuk Kim
0e689d0369 f2fs: add key functions for f2fs_io_tracer
This patch adds two key functions to trace process ids and IOs.
The basic idea is to
1. remain process ids, pids, in page->private.
2. show pids in IO traces.

So, later we can retrieve process information according to IO traces.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:24 -08:00
Jaegeuk Kim
63f92ddc8a f2fs: add f2fs_io_tracer support
This patch adds:
 o initial trace.c and trace.h with skeleton functions
 o Kconfig and Makefile to activate this feature

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:24 -08:00
Jaegeuk Kim
cf04e8eb55 f2fs: use f2fs_io_info to clean up messy parameters during IO path
This patch cleans up parameters on IO paths.
The key idea is to use f2fs_io_info adding a parameter, block address, and then
use this structure as parameters.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:23 -08:00
Chao Yu
9ecf4b80bd f2fs: use ra_meta_pages to simplify readahead code in restore_node_summary
Use more common function ra_meta_pages() with META_POR to readahead node blocks
in restore_node_summary() instead of ra_sum_pages(), hence we can simplify the
readahead code there, and also we can remove unused function ra_sum_pages().

changes from v2:
 o use invalidate_mapping_pages as before suggested by Changman Lee.
changes from v1:
 o fix one bug when using truncate_inode_pages_range which is pointed out by
   Jaegeuk Kim.

Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:23 -08:00
Chao Yu
5c27f4ee44 f2fs: merge two uchar variable in struct node_info to reduce memory cost
This patch moves one member of struct nat_entry: _flag_ to struct node_info,
so _version_ in struct node_info and _flag_ which are unsigned char type will
merge to one 32-bit space in register/memory. So the size of nat_entry will be
reduced from 28 bytes to 24 bytes (for 64-bit machine, reduce its size from 40
bytes to 32 bytes) and then slab memory using by f2fs will be reduced.

changes from v2:
 o update description of memory usage gain for 64-bit machine suggested by
   Changman Lee.
changes from v1:
 o introduce inline copy_node_info() to copy valid data from node info suggested
   by Jaegeuk Kim, it can avoid bug.

Reviewed-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:23 -08:00
Chao Yu
3fa06d7bc9 f2fs: readahead contiguous current summary blocks in checkpoint
Let's add readahead code for reading contiguous compact/normal summary blocks
in checkpoint, then we will gain better performance in mount procedure.

Changes from v1
  o remove inappropriate 'unlikely' in npages_for_summary_flush.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:23 -08:00
Jaegeuk Kim
5df1f1da7a f2fs: use missing the use of f2fs_kunmap_page
This patch calls f2fs_kunmap_page which I missed before.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:22 -08:00
Jaegeuk Kim
042b7816aa f2fs: remove unnecessary call to invalidate inmemory pages
Now we use inmemory pages for atomic write only and provide abort procedure,
we don't need to truncate them explicitly.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:22 -08:00
Jaegeuk Kim
d7bc2484b8 f2fs: fix small discards not to issue redundantly
The ckpt_valid_map and cur_valid_map are synced by seg_info_to_raw_sit.

In the case of small discards, the candidates are selected before sync,
while fitrim selects candidates after sync.

So, for small discards, we need to add candidates only just being obsoleted.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:22 -08:00
Jaegeuk Kim
1e84371ffe f2fs: change atomic and volatile write policies
This patch adds two new ioctls to release inmemory pages grabbed by atomic
writes.
 o f2fs_ioc_abort_volatile_write
  - If transaction was failed, all the grabbed pages and data should be written.
 o f2fs_ioc_release_volatile_write
  - This is to enhance the performance of PERSIST mode in sqlite.

In order to avoid huge memory consumption which causes OOM, this patch changes
volatile writes to use normal dirty pages, instead blocked flushing to the disk
as long as system does not suffer from memory pressure.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:22 -08:00
Jaegeuk Kim
70c640b1d6 f2fs: don't need to call lock_op and lock_page for abort
We don't need to call lock_op and lock_page at the aborting path.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:21 -08:00
Jaegeuk Kim
88a70a69c0 f2fs: fix wrong condition check to trigger f2fs_sync_fs
If there is not enough available memory, we need to trigger f2fs_sync_fs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:21 -08:00
Jaegeuk Kim
cd52b6368f f2fs: remove checking dirty_exceed
We don't need to force to write dirty_exceeded for f2fs_balance_fs_bg.
This flag was only meaningful to write bypassing conditions.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-01-09 17:02:21 -08:00
Rasmus Villemoes
72392ed0eb kernfs: Fix kernfs_name_compare
Returning a difference from a comparison functions is usually wrong
(see acbbe6fbb2 "kcmp: fix standard comparison bug" for the long
story). Here there is the additional twist that if the void pointers
ns and kn->ns happen to differ by a multiple of 2^32,
kernfs_name_compare returns 0, falsely reporting a match to the
caller.

Technically 'hash - kn->hash' is ok since the hashes are restricted to
31 bits, but it's better to avoid that subtlety.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-09 15:51:08 -08:00
hujianyang
4330397e4e ovl: discard independent cursor in readdir()
Since the ovl_dir_cache is stable during a directory reading, the cursor
of struct ovl_dir_file don't need to be an independent entry in the list
of a merged directory.

This patch changes *cursor* to a pointer which points to the entry in the
ovl_dir_cache. After this, we don't need to check *is_cursor* either.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-01-09 14:55:57 +01:00
Bob Peterson
8f6cb409f0 GFS2: Eliminate __gfs2_glock_remove_from_lru
Since the only caller of function __gfs2_glock_remove_from_lru locks the
same spin_lock as gfs2_glock_remove_from_lru, the functions can be combined.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2015-01-09 11:33:13 +00:00
Peter Zijlstra
536ebe9ca9 sched, fanotify: Deal with nested sleeps
As per e23738a730 ("sched, inotify: Deal with nested sleeps").

fanotify_read is a wait loop with sleeps in. Wait loops rely on
task_struct::state and sleeps do too, since that's the only means of
actually sleeping. Therefore the nested sleeps destroy the wait loop
state and the wait loop breaks the sleep functions that assume
TASK_RUNNING (mutex_lock).

Fix this by using the new woken_wake_function and wait_woken() stuff,
which registers wakeups in wait and thereby allows shrinking the
task_state::state changes to the actual sleep part.

Reported-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric Paris <eparis@redhat.com>
Link: http://lkml.kernel.org/r/20141216152838.GZ3337@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-09 11:18:12 +01:00
Dave Chinner
6bcf0939ff Merge branch 'xfs-misc-fixes-for-3.20-2' into for-next 2015-01-09 11:06:17 +11:00
Nicholas Mc Guire
43fd1fce96 xfs: fix implicit bool to int conversion
try_wait_for_completion returns bool so the wrapper function
xfs_dqflock_nowait should probably also return bool and not int.

Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09 10:48:58 +11:00
Christoph Hellwig
d32057fc84 xfs: pass a 64-bit count argument to xfs_iomap_write_unwritten
The code is already ready for it, and the pnfs layout commit code expects
to be able to pass a larger than 32-bit argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09 10:48:12 +11:00
Dave Chinner
64af7a6ea5 xfs: remove deprecated sysctls
xfsbufd_centisecs and age_buffer_centisecs were due for removal in
3.14. We forgot to do that - it's now well past time to remove these
deprecated, unused sysctls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09 10:47:43 +11:00
Dave Chinner
aa5d95c1b5 xfs: move xfs_bmap_finish prototype
This function is used libxfs code, but is implemented separately in
userspace. Move the function prototype to xfs_bmap.h so that the
prototype is shared even if the implementations aren't.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09 10:47:14 +11:00
Dave Chinner
9799b438ce xfs: move struct xfs_bmalloca to libxfs
It no long is used for stack splits, so strip the kernel workqueue
bits from it and push it back into libxfs/xfs_bmap.h so that
it can be shared with the userspace code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09 10:46:49 +11:00
Dave Chinner
5ebdc213ac xfs: move xfs_types.h to libxfs
The types used by the core XFS code are common between kernel and
userspace. xfs_types.h is duplicated in both kernel and userspace,
so move it to libxfs along with all the other shared code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09 10:46:31 +11:00
Dave Chinner
2155355fda xfs: move xfs_fs.h to libxfs
Ioctl API definitions are shared with userspace, so move the header
file that defines them all to libxfs along with all the other code
shared with userspace.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-01-09 10:45:13 +11:00
David Drysdale
75069f2b5b vfs: renumber FMODE_NONOTIFY and add to uniqueness check
Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc.  The
clashing O_PATH value was added in commit 5229645bdc ("vfs: add
nonconflicting values for O_PATH") but this can't be changed as it is
user-visible.

FMODE_NONOTIFY is only used internally in the kernel, but it is in the
same numbering space as the other O_* flags, as indicated by the comment
at the top of include/uapi/asm-generic/fcntl.h (and its use in
fs/notify/fanotify/fanotify_user.c).  So renumber it to avoid the clash.

All of this has happened before (commit 12ed2e36c9: "fanotify:
FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will
happen again -- so update the uniqueness check in fcntl_init() to
include __FMODE_NONOTIFY.

Signed-off-by: David Drysdale <drysdale@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08 15:10:52 -08:00
Xue jiufei
53dc20b9a3 ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when link file
In ocfs2_link(), the parent directory inode passed to function
ocfs2_lookup_ino_from_name() is wrong.  Parameter dir is the parent of
new_dentry not old_dentry.  We should get old_dir from old_dentry and
lookup old_dentry in old_dir in case another node remove the old dentry.

With this change, hard linking works again, when paths are relative with
at least one subdirectory.  This is how the problem was reproducable:

  # mkdir a
  # mkdir b
  # touch a/test
  # ln a/test b/test
  ln: failed to create hard link `b/test' => `a/test': No such file or  directory

However when creating links in the same dir, it worked well.

Now the link gets created.

Fixes: 0e048316ff ("ocfs2: check existence of old dentry in ocfs2_link()")
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: Szabo Aron - UBIT <aron@ubit.hu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Tested-by: Aron Szabo <aron@ubit.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08 15:10:51 -08:00
Joseph Qi
eb4f73b4ca ocfs2: remove bogus check in dlm_process_recovery_data
In dlm_process_recovery_data, only when dlm_new_lock failed the ret will
be set to -ENOMEM.  And in this case, newlock is definitely NULL.  So
test newlock is meaningless, remove it.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-01-08 15:10:51 -08:00
Ilya Dryomov
0668ff52e2 ceph: use %zu for len in ceph_fill_inline_data()
len is size_t, should be printed with %zu.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-01-08 20:36:56 +03:00
Seunghun Lee
3cdf6fe910 ovl: Prevent rw remount when it should be ro mount
Overlayfs should be mounted read-only when upper-fs is read-only or nonexistent.
But now it can be remounted read-write and this can cause kernel panic.
So we should prevent read-write remount when the above situation happens.

Signed-off-by: Seunghun Lee <waydi1@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-01-08 14:47:21 +01:00
hujianyang
a425c037f3 ovl: Fix opaque regression in ovl_lookup
Current multi-layer support overlayfs has a regression in
.lookup(). If there is a directory in upperdir and a regular
file has same name in lowerdir in a merged directory, lower
file is hidden and upper directory is set to opaque in former
case. But it is changed in present code.

In lowerdir lookup path, if a found inode is not directory,
the type checking of previous inode is missing. This inode
will be copied to the lowerstack of ovl_entry directly.

That will lead to several wrong conditions, for example,
the reading of the directory in upperdir may return an error
like:

   ls: reading directory .: Not a directory

This patch makes the lowerdir lookup path check the opaque
for non-directory file too.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-01-08 14:47:20 +01:00
hujianyang
2f83fd8c28 ovl: Fix kernel panic while mounting overlayfs
The function ovl_fill_super() in recently multi-layer support
version will incorrectly return 0 at error handling path and
then cause kernel panic.

This failure can be reproduced by mounting a overlayfs with
upperdir and workdir in different mounts.

And also, If the memory allocation of *lower_mnt* fail, this
function may return an zero either.

This patch fix this problem by setting *err* to proper error
number before jumping to error handling path.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-01-08 14:47:20 +01:00
Borislav Petkov
26e022727f efi: Rename efi_guid_unparse to efi_guid_to_str
Call it what it does - "unparse" is plain-misleading.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
2015-01-07 19:07:44 -08:00
J. Bruce Fields
0ec016e3e0 nfsd4: tweak rd_dircount accounting
RFC 3530 14.2.24 says

	This value represents the length of the names of the directory
	entries and the cookie value for these entries.  This length
	represents the XDR encoding of the data (names and cookies)...

The "xdr encoding" of the name should probably include the 4 bytes for
the length.

But this is all just a hint so not worth e.g. backporting to stable.

Also reshuffle some lines to more clearly group together the
dircount-related code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-07 14:48:10 -05:00
Jeff Layton
67db103448 nfsd: fi_delegees doesn't need to be an atomic_t
fi_delegees is always handled under the fi_lock, so there's no need to
use an atomic_t for this field.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-07 14:05:35 -05:00
Jeff Layton
94ae1db226 nfsd: fix fi_delegees leak when fi_had_conflict returns true
Currently, nfs4_set_delegation takes a reference to an existing
delegation and then checks to see if there is a conflict. If there is
one, then it doesn't release that reference.

Change the code to take the reference after the check and only if there
is no conflict.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-07 13:38:21 -05:00
Jan Kara
23b133bdc4 udf: Check length of extended attributes and allocation descriptors
Check length of extended attributes and allocation descriptors when
loading inodes from disk. Otherwise corrupted filesystems could confuse
the code and make the kernel oops.

Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-07 13:51:58 +01:00
Jan Kara
7914495427 udf: Remove repeated loads blocksize
Store blocksize in a local variable in udf_fill_inode() since it is used
a lot of times.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-07 13:46:16 +01:00
Oscar Forner Martinez
e4a93be6cb isofs: Fix bug in the way to check if the year is a leap year
Changed the whole algorithm for a call to mktime64 that takes
care of all that details.

Signed-off-by: Oscar Forner Martinez <oscar.forner.martinez@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-07 09:51:49 +01:00
Linus Torvalds
3b421b80be Revert a potential seek_data/hole regression which shows up when using
ext4 to handle ext3 file systems, plus two minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUqxTQAAoJENNvdpvBGATwUC8QAIYfP02XYyGrBQIoAoCiRJji
 TmhAe2Amy9xHgBq2I8Yz3bi8j1wGuiqc57fYgwwVScf10tzmO/HuwnAZZDAmg9hK
 ZWp9WyPoyf+U/nbkIfC5mRh3Qz0dt1pt6R3uQDUlcUuAamdMBrdJnhkQC6WMbpU2
 fAqsJT3/yGrLnMF29eVqJzcxb5KORJ8hEcD7kwkvJwe4sGm3C7iDsjS0i63YWDz4
 QNclW6zF4THhmuVNxwRupOgMQNSq8sHg8U23nP4DZLvLE7GlgtwfDvehU7uBfw5n
 WO5UfsEYLoeODNmujUJCtjXNLpzDXmrtByyWbbTK7EX3MmV94ym4uu5lHLfyMiTc
 o2ppxcsKBVcOsPWnFwuhJ5p/Wyy0Uld9Q3P6b5ymhyzDhkuwcTURpeRxBRXHEgcm
 nY5GE1bBdO7OigDz/+DFL/Zgr8EO7hW72hrBaLDWMEbrrl0asZw/ReC/bnreMmm4
 sP87DB+MqRXzRs8aOPWmCofJwGSgCYmOq2nqNCAaxgk/ofvrDURrnZYfLrbzspGa
 hqE1W0X5hvQydcifi4qq2Na76+Js3atSY38EOH/HNknSqlQjysnkW4ajTDWk/GFy
 M/fKUCfIl1tmCMN2myZzl89E7uMSyod75ycd0BQy36iHPE14JVvk/u7GfcKHLs53
 1rAPpW90a72GX2Z9+xxA
 =uPYz
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Revert a potential seek_data/hole regression which shows up when using
  ext4 to handle ext3 file systems, plus two minor bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove spurious KERN_INFO from ext4_warning call
  Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"
  ext4: prevent online resize with backup superblock
2015-01-06 14:05:40 -08:00
Pranith Kumar
83fe27ea53 rcu: Make SRCU optional by using CONFIG_SRCU
SRCU is not necessary to be compiled by default in all cases. For tinification
efforts not compiling SRCU unless necessary is desirable.

The current patch tries to make compiling SRCU optional by introducing a new
Kconfig option CONFIG_SRCU which is selected when any of the components making
use of SRCU are selected.

If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.

   text    data     bss     dec     hex filename
   2007       0       0    2007     7d7 kernel/rcu/srcu.o

Size of arch/powerpc/boot/zImage changes from

   text    data     bss     dec     hex filename
 831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
 829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after

so the savings are about ~2000 bytes.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
2015-01-06 11:04:29 -08:00
Miklos Szeredi
9759bd5189 fuse: add memory barrier to INIT
Theoretically we need to order setting of various fields in fc with
fc->initialized.

No known bug reports related to this yet.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2015-01-06 10:45:35 +01:00
Miklos Szeredi
21f621741a fuse: fix LOOKUP vs INIT compat handling
Analysis from Marc:

 "Commit 7078187a79 ("fuse: introduce fuse_simple_request() helper")
  from the above pull request triggers some EIO errors for me in some tests
  that rely on fuse

  Looking at the code changes and a bit of debugging info I think there's a
  general problem here that fuse_get_req checks and possibly waits for
  fc->initialized, and this was always called first.  But this commit
  changes the ordering and in many places fc->minor is now possibly used
  before fuse_get_req, and we can't be sure that fc has been initialized.
  In my case fuse_lookup_init sets req->out.args[0].size to the wrong size
  because fc->minor at that point is still 0, leading to the EIO error."

Fix by moving the compat adjustments into fuse_simple_request() to after
fuse_get_req().

This is also more readable than the original, since now compatibility is
handled in a single function instead of cluttering each operation.

Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Fixes: 7078187a79 ("fuse: introduce fuse_simple_request() helper")
2015-01-06 10:45:35 +01:00
Trond Myklebust
4e379d36c0 NFSv4: Remove incorrect check in can_open_delegated()
Remove an incorrect check for NFS_DELEGATION_NEED_RECLAIM in
can_open_delegated(). We are allowed to cache opens even in
a situation where we're doing reboot recovery.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05 19:40:54 -08:00
Chuck Lever
7a01edf005 NFS: Ignore transport protocol when detecting server trunking
Detect server trunking across transport protocols. Otherwise, an
RDMA mount and a TCP mount of the same server will end up with
separate nfs_clients using the same clientid4.

Reported-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05 19:40:54 -08:00
Trond Myklebust
55b9df93dd NFSv4/v4.1: Verify the client owner id during trunking detection
While we normally expect the NFSv4 client to always send the same client
owner to all servers, there are a couple of situations where that is not
the case:
 1) In NFSv4.0, switching between use of '-omigration' and not will cause
    the kernel to switch between using the non-uniform and uniform client
    strings.
 2) In NFSv4.1, or NFSv4.0 when using uniform client strings, if the
    uniquifier string is suddenly changed.

This patch will catch those situations by checking the client owner id
in the trunking detection code, and will do the right thing if it notices
that the strings differ.

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05 19:40:53 -08:00
Trond Myklebust
ceb3a16c07 NFSv4: Cache the NFSv4/v4.1 client owner_id in the struct nfs_client
Ensure that we cache the NFSv4/v4.1 client owner_id so that we can
verify it when we're doing trunking detection.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05 19:40:53 -08:00
Trond Myklebust
1fc0703af3 NFSv4.1: Fix client id trunking on Linux
Currently, our trunking code will check for session trunking, but will
fail to detect client id trunking. This is a problem, because it means
that the client will fail to recognise that the two connections represent
shared state, even if they do not permit a shared session.
By removing the check for the server minor id, and only checking the
major id, we will end up doing the right thing in both cases: we close
down the new nfs_client and fall back to using the existing one.

Fixes: 05f4c350ee ("NFS: Discover NFSv4 server trunking when mounting")
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # 3.7.x
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05 19:40:53 -08:00
Trond Myklebust
06bed7d18c LOCKD: Fix a race when initialising nlmsvc_timeout
This commit fixes a race whereby nlmclnt_init() first starts the lockd
daemon, and then calls nlm_bind_host() with the expectation that
nlmsvc_timeout has already been initialised. Unfortunately, there is no
no synchronisation between lockd() and lockd_up() to guarantee that this
is the case.

Fix is to move the initialisation of nlmsvc_timeout into lockd_create_svc

Fixes: 9a1b6bf818 ("LOCKD: Don't call utsname()->nodename...")
Cc: Bruce Fields <bfields@fieldses.org>
Cc: stable@vger.kernel.org # 3.10.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-01-05 19:40:53 -08:00
Leif Lindholm
62c204ddfe fs: Make efivarfs a pseudo filesystem, built by default with EFI
efivars is currently enabled under MISC_FILESYSTEMS, which is decribed
as "such as filesystems that came from other operating systems".
In reality, it is a pseudo filesystem, providing access to the kernel
UEFI variable interface.

Since this is the preferred interface for accessing UEFI variables, over
the legacy efivars interface, also build it by default as a module if
CONFIG_EFI.

Signed-off-by: Leif Lindholm <leif.lindholm@linaro.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2015-01-05 14:15:58 +00:00
Fabian Frederick
6744e90b0f ext3: destroy sbi mutexes in put_super
Call mutex_destroy() on superblock mutexes in ext3_put_super().
Otherwise mutex debugging code isn't able to detect that mutex is used
after being freed.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-05 11:13:55 +01:00
Jan Kara
2c561bc362 udf: Update Kconfig description
Update description of UDF in Kconfig to mention that UDF is also
suitable for removable USB disks.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-01-05 11:04:37 +01:00
Jakub Wilk
363307e6e5 ext4: remove spurious KERN_INFO from ext4_warning call
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02 15:31:14 -05:00
Theodore Ts'o
ad7fefb109 Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"
This reverts commit 14516bb7bb.

This was causing regression test failures with generic/285 with an ext3
filesystem using CONFIG_EXT4_USE_FOR_EXT23.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-01-02 15:16:00 -05:00
Chris Mason
6f8960541b Btrfs: don't delay inode ref updates during log replay
Commit 1d52c78afb (Btrfs: try not to ENOSPC on log replay) added a
check to skip delayed inode updates during log replay because it
confuses the enospc code.  But the delayed processing will end up
ignoring delayed refs from log replay because the inode itself wasn't
put through the delayed code.

This can end up triggering a warning at commit time:

WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34()

Which is repeated for each commit because we never process the delayed
inode ref update.

The fix used here is to change btrfs_delayed_delete_inode_ref to return
an error if we're currently in log replay.  The caller will do the ref
deletion immediately and everything will work properly.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.18 and any stable series that picked 1d52c78afb
2015-01-02 14:47:56 -05:00
Filipe Manana
a1317f455a Btrfs: correctly get tree level in tree_backref_for_extent
If we are using skinny metadata, the block's tree level is in the offset
of the key and not in a btrfs_tree_block_info structure following the
extent item (it doesn't exist). Therefore fix it.

Besides returning the correct level in the tree, this also prevents reading
past the leaf's end in the case where the extent item is the last item in
the leaf (eb) and it has only 1 inline reference - this is because
sizeof(struct btrfs_tree_block_info) is greater than
sizeof(struct btrfs_extent_inline_ref).

Got it while running a scrub which produced the following warning:

    BTRFS: checksum error at logical 42123264 on dev /dev/sde, sector 15840: metadata node (level 24) in tree 5

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:56 -05:00
Wang Shilong
c7cfb8a540 Btrfs: call inode_dec_link_count() on mkdir error path
In btrfs_mkdir(), if it fails to create dir, we should
clean up existed items, setting inode's link properly
to make sure it could be cleaned up properly.

Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:55 -05:00
Josef Bacik
df95e7f0d9 Btrfs: abort transaction if we don't find the block group
We shouldn't BUG_ON() if there is corruption.  I hit this while testing my block
group patch and the abort worked properly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:55 -05:00
Dan Carpenter
6b6d24b389 Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()
The only way that "ret" is set is when we call scrub_pages_for_parity()
so the skip to "if (ret) " test doesn't make sense and causes a static
checker warning.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:55 -05:00
Linus Torvalds
5faa0154fe Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "A set of three minor cifs fixes"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: make new inode cache when file type is different
  Fix signed/unsigned pointer warning
  Convert MessageID in smb2_hdr to LE
2014-12-29 21:09:57 -08:00
Linus Torvalds
b9d4a35f0a Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF & isofs fixes from Jan Kara:
 "A couple of UDF fixes of handling of corrupted media and one iso9660
  fix of the same"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Reduce repeated dereferences
  udf: Check component length before reading it
  udf: Check path length when reading symlink
  udf: Verify symlink size before loading it
  udf: Verify i_size when loading inode
  isofs: Fix unchecked printing of ER records
2014-12-29 20:43:10 -08:00
Theodore Ts'o
011fa99404 ext4: prevent online resize with backup superblock
Prevent BUG or corrupted file systems after the following:

mkfs.ext4 /dev/vdc 100M
mount -t ext4 -o sb=40961 /dev/vdc /vdc
resize2fs /dev/vdc

We previously prevented online resizing using the old resize ioctl.
Move the code to ext4_resize_begin(), so the check applies for all of
the resize ioctl's.

Reported-by: Maxim Malkov <malkov@ispras.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-26 23:58:21 -05:00
Al Viro
e1f1fe798d jfs: get rid of homegrown endianness helpers
Get rid of le24 stuff, along with the bitfields use - all that stuff
can be done with standard stuff, in sparse-verifiable manner.  Moreover,
that way (shift-and-mask) often generates better code - gcc optimizer
sucks on bitfields...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
----
2014-12-23 17:01:24 -06:00
Dave Chinner
efdca7aa3c Merge branch 'xfs-misc-fixes-for-3.20-1' into for-next 2014-12-24 09:49:53 +11:00
Jan Kara
1a43ec03dd xfs: Keep sb_bad_features2 consistent with sb_features2
Currently when we modify sb_features2, we store the same value also in
sb_bad_features2. However in most places we forget to mark field
sb_bad_features2 for logging and thus it can happen that a change to it
is lost. This results in an inconsistent sb_features2 and
sb_bad_features2 fields e.g. after xfstests test xfs/187.

Fix the problem by changing XFS_SB_FEATURES2 to actually mean both
sb_features2 and sb_bad_features2 fields since this is always what we
want to log. This isn't ideal because the fact that XFS_SB_FEATURES2
means two fields could cause some problem in future however the code is
hopefully less error prone that it is now.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24 09:48:35 +11:00
Eric Sandeen
77af574eef xfs: remove extra newlines from xfs messages
xfs_warn() and friends add a newline by default, but some
messages add another one.

Particularly for the failing write message below, this can
waste a lot of console real estate!

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24 09:47:27 +11:00
Brian Foster
96ab7954bc xfs: initialize log buf I/O completion wq on log alloc
Log buffer I/O completion passes through the high priority
m_log_workqueue rather than the default metadata buffer workqueue. The
log buffer wq is initialized at I/O submission time. The log buffers are
reused once initialized, however, so this is not necessary.

Initialize the log buffer I/O completion workqueue pointers once when
the log is allocated and log buffers initialized rather than on every
log buffer I/O submission.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24 09:46:23 +11:00
Carlos Maiolino
d31a182545 xfs: Add support to RENAME_EXCHANGE flag
Adds a new function named xfs_cross_rename(), responsible for
handling requests from sys_renameat2() using RENAME_EXCHANGE flag.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24 08:51:42 +11:00
Carlos Maiolino
dbe1b5ca26 xfs: Make xfs_vn_rename compliant with renameat2() syscall
To be able to support RENAME_EXCHANGE flag from renameat2() system
call, XFS must have its inode_operations updated, exporting .rename2
method, instead of .rename.

This patch just replaces the (now old) .rename method by .rename2,
using the same infra-structure, but checking rename flags.  Calls to
.rename2 using RENAME_EXCHANGE flag, although now handled inside
XFS, still return -EINVAL.

RENAME_NOREPLACE is handled via VFS and we don't need to care about
it inside xfs_vn_rename.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-24 08:51:38 +11:00
Nakajima Akira
9e6d722f3d cifs: make new inode cache when file type is different
In spite of different file type,
 if file is same name and same inode number, old inode cache is used.
This causes that you can not cd directory, can not cat SymbolicLink.
So this patch is that if file type is different, return error.

Reproducible sample :
1. create file 'a' at cifs client.
2. repeat rm and mkdir 'a' 4 times at server, then direcotry 'a' having same inode number is created.
   (Repeat 4 times, then same inode number is recycled.)
   (When server is under RHEL 6.6, 1 time is O.K.  Always same inode number is recycled.)
3. ls -li at client, then you can not cd directory, can not remove directory.

SymbolicLink has same problem.

Bug link:
https://bugzilla.kernel.org/show_bug.cgi?id=90011

Signed-off-by: Nakajima Akira <nakajima.akira@nttcom.co.jp>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2014-12-22 14:16:21 -06:00
Jan Kara
3ee3039c5b udf: Reduce repeated dereferences
Replace repeated dereferences like dir->i_sb by storing superblock
pointer in a variable and using that.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-21 22:42:37 +01:00
Jan Kara
e237ec37ec udf: Check component length before reading it
Check that length specified in a component of a symlink fits in the
input buffer we are reading. Also properly ignore component length for
component types that do not use it. Otherwise we read memory after end
of buffer for corrupted udf image.

Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-21 22:42:19 +01:00
Linus Torvalds
ecb5ec044a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile #3 from Al Viro:
 "Assorted fixes and patches from the last cycle"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  [regression] chunk lost from bd9b51
  vfs: make mounts and mountstats honor root dir like mountinfo does
  vfs: cleanup show_mountinfo
  init: fix read-write root mount
  unfuck binfmt_misc.c (broken by commit e6084d4)
  vm_area_operations: kill ->migrate()
  new helper: iter_is_iovec()
  move_extent_per_page(): get rid of unused w_flags
  lustre: get rid of playing with ->fs
  btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
2014-12-19 18:19:19 -08:00
Linus Torvalds
298647e31a Fixes for filename decryption and encrypted view plus a cleanup
- The filename decryption routines were, at times, writing a zero byte one
   character past the end of the filename buffer
 - The encrypted view feature attempted, and failed, to roll its own form of
   enforcing a read-only mount instead of letting the VFS enforce it
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJUlGlWAAoJENaSAD2qAscKgdwQAJLCgt5xgoij8uH65mYVy6PE
 ++E8KggCHchhUcFh19EuuG++ENzwZedqKqbbRhmZt+svla08uPDO0jY+sYaOZzqB
 HMzZyyL4Fw/DiVsK6LUMGfN2CS7tua9ReK0dj4tUc3jwBplnQnGrQlUPbgdPRJJp
 XJDKHVtHGQPQC1dJAeProCJTg23jv5wXacly2I5VxZuh5DnDWjuo2KkzqGMKfadU
 9bxOf5DjVwQ/X6apgkNxG/8Q6J8a+80K7SGs/aRELIuRL0+mIj5AX9ht+p5Z0XI+
 E8jU4HM2biuqj8PtxK/WbyF1bHiREVyOLdneW+n3pcqGTLMZ3TLpDURkNd0cwZcd
 AGoZtUZZuQ/xvzE9IrEj+KYeINnZzPQpyJu9jXXp1CsQJ8u/wUG9e0kcJrUI1dZ/
 q26zhPaVhJ9x0go/BNOxzTUpOtuMWXttAVZGICthp74I5l+DgfSZ4J4CeQmFMMRs
 kbiTg5h4qsrD4jnEU/BCscpCLoz4qlF6+q/+lspwkg7OwvhHc0qSMJn3UjZ/eYzb
 ncDp6gc4Iju3RhWnVEZphA4ttbZJX7ahER4y0NdIG65Fa37AUEUxy0OmGfoyFOmC
 KVtlHkl2hKoCB5+ujLzL7WimH33ogtVhtOJTZlXzS0lsdSfSxUWU4/OZ22jTy0hL
 Bnx6uoNIPWsr6b5OJiA0
 =jTyq
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs fixes from Tyler Hicks:
 "Fixes for filename decryption and encrypted view plus a cleanup

   - The filename decryption routines were, at times, writing a zero
     byte one character past the end of the filename buffer

   - The encrypted view feature attempted, and failed, to roll its own
     form of enforcing a read-only mount instead of letting the VFS
     enforce it"

* tag 'ecryptfs-3.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: Remove buggy and unnecessary write in file name decode routine
  eCryptfs: Remove unnecessary casts when parsing packet lengths
  eCryptfs: Force RO mount when encrypted view is enabled
2014-12-19 18:15:12 -08:00
Linus Torvalds
5c68eac68b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This is part two of our merge window patches.

  These are all from Filipe, and fix some really hard to find races that
  can cause corruptions.  Most of them involved block group removal
  (balance) or discard"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove non-sense btrfs_error_discard_extent() function
  Btrfs: fix fs corruption on transaction abort if device supports discard
  Btrfs: always clear a block group node when removing it from the tree
  Btrfs: ensure deletion from pinned_chunks list is protected
2014-12-19 18:10:42 -08:00
Linus Torvalds
ac88ee3b6c Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core fix from Thomas Gleixner:
 "A single fix plugging a long standing race between proc/stat and
  proc/interrupts access and freeing of interrupt descriptors"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Prevent proc race against freeing of irq descriptors
2014-12-19 13:26:08 -08:00
Jan Kara
0e5cc9a40a udf: Check path length when reading symlink
Symlink reading code does not check whether the resulting path fits into
the page provided by the generic code. This isn't as easy as just
checking the symlink size because of various encoding conversions we
perform on path. So we have to check whether there is still enough space
in the buffer on the fly.

CC: stable@vger.kernel.org
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-19 14:12:08 +01:00
Jan Kara
a1d47b2629 udf: Verify symlink size before loading it
UDF specification allows arbitrarily large symlinks. However we support
only symlinks at most one block large. Check the length of the symlink
so that we don't access memory beyond end of the symlink block.

CC: stable@vger.kernel.org
Reported-by: Carl Henrik Lunde <chlunde@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-19 13:17:10 +01:00
Jan Kara
e159332b9a udf: Verify i_size when loading inode
Verify that inode size is sane when loading inode with data stored in
ICB. Otherwise we may get confused later when working with the inode and
inode size is too big.

CC: stable@vger.kernel.org
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-19 13:13:05 +01:00
Jan Kara
4e2024624e isofs: Fix unchecked printing of ER records
We didn't check length of rock ridge ER records before printing them.
Thus corrupted isofs image can cause us to access and print some memory
behind the buffer with obvious consequences.

Reported-and-tested-by: Carl Henrik Lunde <chlunde@ping.uio.no>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-19 11:29:24 +01:00
Linus Torvalds
018cb13eb3 Merge branch 'akpm' (patches from Andrew)
Merge misc patches from Andrew Morton:
 "A few stragglers"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  tools/testing/selftests/Makefile: alphasort the TARGETS list
  mm/zsmalloc: adjust order of functions
  ocfs2: fix journal commit deadlock
  ocfs2/dlm: fix race between dispatched_work and dlm_lockres_grab_inflight_worker
  ocfs2: reflink: fix slow unlink for refcounted file
  mm/memory.c:do_shared_fault(): add comment
  .mailmap: Santosh Shilimkar has moved
  .mailmap: update akpm@osdl.org
  lib/show_mem.c: add cma reserved information
  fs/proc/meminfo.c: include cma info in proc/meminfo
  mm: cma: split cma-reserved in dmesg log
  hfsplus: fix longname handling
  mm/mempolicy.c: remove unnecessary is_valid_nodemask()
2014-12-18 19:08:25 -08:00
Junxiao Bi
136f49b917 ocfs2: fix journal commit deadlock
For buffer write, page lock will be got in write_begin and released in
write_end, in ocfs2_write_end_nolock(), before it unlock the page in
ocfs2_free_write_ctxt(), it calls ocfs2_run_deallocs(), this will ask
for the read lock of journal->j_trans_barrier.  Holding page lock and
ask for journal->j_trans_barrier breaks the locking order.

This will cause a deadlock with journal commit threads, ocfs2cmt will
get write lock of journal->j_trans_barrier first, then it wakes up
kjournald2 to do the commit work, at last it waits until done.  To
commit journal, kjournald2 needs flushing data first, it needs get the
cache page lock.

Since some ocfs2 cluster locks are holding by write process, this
deadlock may hung the whole cluster.

unlock pages before ocfs2_run_deallocs() can fix the locking order, also
put unlock before ocfs2_commit_trans() to make page lock is unlocked
before j_trans_barrier to preserve unlocking order.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18 19:08:11 -08:00
Joseph Qi
1e58958160 ocfs2/dlm: fix race between dispatched_work and dlm_lockres_grab_inflight_worker
Commit ac4fef4d23 ("ocfs2/dlm: do not purge lockres that is queued for
assert master") may have the following possible race case:

  dlm_dispatch_assert_master       dlm_wq
  ========================================================================
  queue_work(dlm->quedlm_worker,
      &dlm->dispatched_work);
                                 dispatch work,
                                 dlm_lockres_drop_inflight_worker
                                 *BUG_ON(res->inflight_assert_workers == 0)*
  dlm_lockres_grab_inflight_worker
  inflight_assert_workers++

So ensure inflight_assert_workers to be increased first.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Xue jiufei <xuejiufei@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18 19:08:11 -08:00
Junxiao Bi
f62f12b3a4 ocfs2: reflink: fix slow unlink for refcounted file
When running ocfs2 test suite multiple nodes reflink stress test, for a
4 nodes cluster, every unlink() for refcounted file needs about 700s.

The slow unlink is caused by the contention of refcount tree lock since
all nodes are unlink files using the same refcount tree.  When the
unlinking file have many extents(over 1600 in our test), most of the
extents has refcounted flag set.  In ocfs2_commit_truncate(), it will
execute the following call trace for every extents.  This means it needs
get and released refcount tree lock about 1600 times.  And when several
nodes are do this at the same time, the performance will be very low.

  ocfs2_remove_btree_range()
  --  ocfs2_lock_refcount_tree()
  ----  ocfs2_refcount_lock()
  ------  __ocfs2_cluster_lock()

ocfs2_refcount_lock() is costly, move it to ocfs2_commit_truncate() to
do lock/unlock once can improve a lot performance.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Wengang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18 19:08:11 -08:00
Pintu Kumar
47f8f9297d fs/proc/meminfo.c: include cma info in proc/meminfo
This patch include CMA info (CMATotal, CMAFree) in /proc/meminfo.
Currently, in a CMA enabled system, if somebody wants to know the total
CMA size declared, there is no way to tell, other than the dmesg or
/var/log/messages logs.

With this patch we are showing the CMA info as part of meminfo, so that it
can be determined at any point of time.  This will be populated only when
CMA is enabled.

Below is the sample output from a ARM based device with RAM:512MB and CMA:16MB.

  MemTotal:         471172 kB
  MemFree:          111712 kB
  MemAvailable:     271172 kB
  .
  .
  .
  CmaTotal:          16384 kB
  CmaFree:            6144 kB

This patch also fix below checkpatch errors that were found during these changes.

  ERROR: space required after that ',' (ctx:ExV)
  199: FILE: fs/proc/meminfo.c:199:
  +       ,atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
          ^

  ERROR: space required after that ',' (ctx:ExV)
  202: FILE: fs/proc/meminfo.c:202:
  +       ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
          ^

  ERROR: space required after that ',' (ctx:ExV)
  206: FILE: fs/proc/meminfo.c:206:
  +       ,K(totalcma_pages)
          ^

  total: 3 errors, 0 warnings, 2 checks, 236 lines checked

Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18 19:08:10 -08:00
Sougata Santra
89ac9b4d3d hfsplus: fix longname handling
Longname is not correctly handled by hfsplus driver.  If an attempt to
create a longname(>255) file/directory is made, it succeeds by creating a
file/directory with HFSPLUS_MAX_STRLEN and incorrect catalog key.  Thus
leaving the volume in an inconsistent state.  This patch fixes this issue.

Although lookup is always called first to create a negative entry, so just
doing a check in lookup would probably fix this issue.  I choose to
propagate error to other iops as well.

Please NOTE: I have factored out hfsplus_cat_build_key_with_cnid from
hfsplus_cat_build_key, to avoid unncessary branching.

Thanks a lot.

  TEST:
  ------
  dir="TEST_DIR"
  cdir=`pwd`
  name255="_123456789_123456789_123456789_123456789_123456789_123456789\
  _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
  _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
  _123456789_123456789_123456789_123456789_123456789_1234"
  name256="${name255}5"

  mkdir $dir
  cd $dir
  touch $name255
  rm -f $name255
  touch $name256
  ls -la
  cd $cdir
  rm -rf $dir

  RESULT:
  -------
  [sougata@ultrabook tmp]$ cdir=`pwd`
  [sougata@ultrabook tmp]$
  name255="_123456789_123456789_123456789_123456789_123456789_123456789\
   > _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
   > _123456789_123456789_123456789_123456789_123456789_123456789_123456789\
   > _123456789_123456789_123456789_123456789_123456789_1234"
  [sougata@ultrabook tmp]$ name256="${name255}5"
  [sougata@ultrabook tmp]$
  [sougata@ultrabook tmp]$ mkdir $dir
  [sougata@ultrabook tmp]$ cd $dir
  [sougata@ultrabook TEST_DIR]$ touch $name255
  [sougata@ultrabook TEST_DIR]$ rm -f $name255
  [sougata@ultrabook TEST_DIR]$ touch $name256
  [sougata@ultrabook TEST_DIR]$ ls -la
  ls: cannot access
  _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234:
  No such file or directory
  total 0
  drwxrwxr-x 1 sougata sougata 3 Feb 20 19:56 .
  drwxrwxrwx 1 root    root    6 Feb 20 19:56 ..
  -????????? ? ?       ?       ?            ?
  _123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_123456789_1234
  [sougata@ultrabook TEST_DIR]$ cd $cdir
  [sougata@ultrabook tmp]$ rm -rf $dir
  rm: cannot remove `TEST_DIR': Directory not empty

-ENAMETOOLONG returned from hfsplus_asc2uni was not propaged to iops.
This allowed hfsplus to create files/directories with HFSPLUS_MAX_STRLEN
and incorrect keys, leaving the FS in an inconsistent state.  This patch
fixes this issue.

Signed-off-by: Sougata Santra <sougata@tuxera.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18 19:08:10 -08:00
Eric W. Biederman
c297abfdf1 mnt: Fix a memory stomp in umount
While reviewing the code of umount_tree I realized that when we append
to a preexisting unmounted list we do not change pprev of the former
first item in the list.

Which means later in namespace_unlock hlist_del_init(&mnt->mnt_hash) on
the former first item of the list will stomp unmounted.first leaving
it set to some random mount point which we are likely to free soon.

This isn't likely to hit, but if it does I don't know how anyone could
track it down.

[ This happened because we don't have all the same operations for
  hlist's as we do for normal doubly-linked lists. In particular,
  list_splice() is easy on our standard doubly-linked lists, while
  hlist_splice() doesn't exist and needs both start/end entries of the
  hlist.  And commit 38129a13e6 incorrectly open-coded that missing
  hlist_splice().

  We should think about making these kinds of "mindless" conversions
  easier to get right by adding the missing hlist helpers   - Linus ]

Fixes: 38129a13e6 switch mnt_hash to hlist
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-18 11:22:02 -08:00
Linus Torvalds
44e8967d59 Ceph: remove left-over reject file
Neither Sage nor I noticed that Zheng Yan had mistakenly committed
fs/ceph/super.h.rej as part of commit 31c542a199 ("ceph: add inline
data to pagecache").

Remove it.

Requested-by: Yan, Zheng <ukernel@gmail.com>
Cc: Sage Weil <sweil@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-17 18:47:01 -08:00
Linus Torvalds
57666509b7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "The big item here is support for inline data for CephFS and for
  message signatures from Zheng.  There are also several bug fixes,
  including interrupted flock request handling, 0-length xattrs, mksnap,
  cached readdir results, and a message version compat field.  Finally
  there are several cleanups from Ilya, Dan, and Markus.

  Note that there is another series coming soon that fixes some bugs in
  the RBD 'lingering' requests, but it isn't quite ready yet"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
  ceph: fix setting empty extended attribute
  ceph: fix mksnap crash
  ceph: do_sync is never initialized
  libceph: fixup includes in pagelist.h
  ceph: support inline data feature
  ceph: flush inline version
  ceph: convert inline data to normal data before data write
  ceph: sync read inline data
  ceph: fetch inline data when getting Fcr cap refs
  ceph: use getattr request to fetch inline data
  ceph: add inline data to pagecache
  ceph: parse inline data in MClientReply and MClientCaps
  libceph: specify position of extent operation
  libceph: add CREATE osd operation support
  libceph: add SETXATTR/CMPXATTR osd operations support
  rbd: don't treat CEPH_OSD_OP_DELETE as extent op
  ceph: remove unused stringification macros
  libceph: require cephx message signature by default
  ceph: introduce global empty snap context
  ceph: message versioning fixes
  ...
2014-12-17 16:03:12 -08:00
Linus Torvalds
87c31b39ab Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace related fixes from Eric Biederman:
 "As these are bug fixes almost all of thes changes are marked for
  backporting to stable.

  The first change (implicitly adding MNT_NODEV on remount) addresses a
  regression that was created when security issues with unprivileged
  remount were closed.  I go on to update the remount test to make it
  easy to detect if this issue reoccurs.

  Then there are a handful of mount and umount related fixes.

  Then half of the changes deal with the a recently discovered design
  bug in the permission checks of gid_map.  Unix since the beginning has
  allowed setting group permissions on files to less than the user and
  other permissions (aka ---rwx---rwx).  As the unix permission checks
  stop as soon as a group matches, and setgroups allows setting groups
  that can not later be dropped, results in a situtation where it is
  possible to legitimately use a group to assign fewer privileges to a
  process.  Which means dropping a group can increase a processes
  privileges.

  The fix I have adopted is that gid_map is now no longer writable
  without privilege unless the new file /proc/self/setgroups has been
  set to permanently disable setgroups.

  The bulk of user namespace using applications even the applications
  using applications using user namespaces without privilege remain
  unaffected by this change.  Unfortunately this ix breaks a couple user
  space applications, that were relying on the problematic behavior (one
  of which was tools/selftests/mount/unprivileged-remount-test.c).

  To hopefully prevent needing a regression fix on top of my security
  fix I rounded folks who work with the container implementations mostly
  like to be affected and encouraged them to test the changes.

    > So far nothing broke on my libvirt-lxc test bed. :-)
    > Tested with openSUSE 13.2 and libvirt 1.2.9.
    > Tested-by: Richard Weinberger <richard@nod.at>

    > Tested on Fedora20 with libvirt 1.2.11, works fine.
    > Tested-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>

    > Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
    > Just to be sure I was testing the right thing I also tested using
    > my unprivileged nsexec testcases, and they failed on setgroup/setgid
    > as now expected, and succeeded there without your patches.
    > Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>

    > I tested this with Sandstorm.  It breaks as is and it works if I add
    > the setgroups thing.
    > Tested-by: Andy Lutomirski <luto@amacapital.net> # breaks things as designed :("

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  userns: Unbreak the unprivileged remount tests
  userns; Correct the comment in map_write
  userns: Allow setting gid_maps without privilege when setgroups is disabled
  userns: Add a knob to disable setgroups on a per user namespace basis
  userns: Rename id_map_mutex to userns_state_mutex
  userns: Only allow the creator of the userns unprivileged mappings
  userns: Check euid no fsuid when establishing an unprivileged uid mapping
  userns: Don't allow unprivileged creation of gid mappings
  userns: Don't allow setgroups until a gid mapping has been setablished
  userns: Document what the invariant required for safe unprivileged mappings.
  groups: Consolidate the setgroups permission checks
  mnt: Clear mnt_expire during pivot_root
  mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
  mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
  umount: Do not allow unmounting rootfs.
  umount: Disallow unprivileged mount force
  mnt: Update unprivileged remount test
  mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
2014-12-17 12:31:40 -08:00
Linus Torvalds
d6666be6f0 MTD updates for 3.19:
* Add device tree support for DoC3
 
  * SPI NOR:
 
     Refactoring, for better layering between spi-nor.c and its driver users
     (e.g., m25p80.c)
 
     New flash device support
 
     Support 6-byte ID strings
 
  * NAND
 
     New NAND driver for Allwinner SoC's (sunxi)
 
     GPMI NAND: add support for raw (no ECC) access, for testing purposes
 
     Add ATO manufacturer ID
 
     A few odd driver fixes
 
  * MTD tests:
 
     Allow testers to compensate for OOB bitflips in oobtest
 
     Fix a torturetest regression
 
  * nandsim: Support longer ID byte strings
 
 And more.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUj6PbAAoJEFySrpd9RFgtahcP/RGvknk9lnitaZI7+aZPP8Zs
 AopfiuisLNv3v87EEBAWGYjRYm6vuuhO1z3K55iOIlemBVzoMIP0jf68XGy9uXnL
 Ox6AHqxm55wwmc+CHry5/GssaqE6GzdPm8TBP+AGGNhHrhc+raJL55R0QJaoYVwX
 pUxkhWaa4lZ6CrOIMQ3n+MEnduilHZoFIcXSc1UtI0y9IXsf1m0Qs8M5jGN8BQ16
 HOVNJN9wOXF89swoi7bKsyAn+QFUYgdksAisncb6b9r5Ks5KHmcOuS1LM5X9YoUr
 KfeogHfDM68fcaHsSvMU1xmxjXGtE+HFJE52eYNPB1fNbT3JAC13FFj92GeSsZtE
 ekpCQh4OPLazT/2wCUHTQwC7T1dCilwyW7VJB9MSl7fSBo9P7jIiUHxVUdM43kez
 Di02/XWi4IULTwrgzZiTT8yplFrVdvkmKHAAFEIoaVWiF/l4DeSodLGUw7owmNYn
 rJPBPQulpPHRwKZY7gThJuOUXpgbT715GSbvmPYXimHBqmViiPkrhqQ/b/v4PRRs
 Nlfhwbswr7WBq6vmPkd6eOyfdFANmWcZQMp/++BCdI/7mhfaik72GWMTBSuJ7hN5
 PB+z95soHaKBWlbiDGGGPvuqJmPkOVq1R8itQdIYBWEh7eNSHecwVxyUJJ+V3oPv
 QkD7mEP2ZozZe3Ys2EJQ
 =gDW8
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20141215' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 "Summary:
   - Add device tree support for DoC3

   - SPI NOR:
        Refactoring, for better layering between spi-nor.c and its
        driver users (e.g., m25p80.c)

        New flash device support

        Support 6-byte ID strings

   - NAND:
        New NAND driver for Allwinner SoC's (sunxi)

        GPMI NAND: add support for raw (no ECC) access, for testing
        purposes

        Add ATO manufacturer ID

        A few odd driver fixes

   - MTD tests:
        Allow testers to compensate for OOB bitflips in oobtest

        Fix a torturetest regression

   - nandsim: Support longer ID byte strings

  And more"

* tag 'for-linus-20141215' of git://git.infradead.org/linux-mtd: (63 commits)
  mtd: tests: abort torturetest on erase errors
  mtd: physmap_of: fix potential NULL dereference
  mtd: spi-nor: allow NULL as chip name and try to auto detect it
  mtd: nand: gpmi: add raw oob access functions
  mtd: nand: gpmi: add proper raw access support
  mtd: nand: gpmi: add gpmi_copy_bits function
  mtd: spi-nor: factor out write_enable() for erase commands
  mtd: spi-nor: add support for s25fl128s
  mtd: spi-nor: remove the jedec_id/ext_id
  mtd: spi-nor: add id/id_len for flash_info{}
  mtd: nand: correct the comment of function nand_block_isreserved()
  jffs2: Drop bogus if in comment
  mtd: atmel_nand: replace memcpy32_toio/memcpy32_fromio with memcpy
  mtd: cafe_nand: drop duplicate .write_page implementation
  mtd: m25p80: Add support for serial flash Spansion S25FL132K
  MTD: m25p80: fix inconsistency in m25p_ids compared to spi_nor_ids
  mtd: spi-nor: improve wait-till-ready timeout loop
  mtd: delete unnecessary checks before two function calls
  mtd: nand: omap: Fix NAND enumeration on 3430 LDP
  mtd: nand: add ATO manufacturer info
  ...
2014-12-17 09:59:26 -08:00
Linus Torvalds
c103b21c20 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
 "The first part makes sure we don't hold up umount with pending async
  requests.  In addition to being a cleanup, this is a small behavioral
  change (for the better) and unlikely to break anything.

  The second part prepares for a cleanup of the fuse device I/O code by
  adding a helper for simple request submission, with some savings in
  line numbers already realized"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: use file_inode() in fuse_file_fallocate()
  fuse: introduce fuse_simple_request() helper
  fuse: reduce max out args
  fuse: hold inode instead of path after release
  fuse: flush requests on umount
  fuse: don't wake up reserved req in fuse_conn_kill()
2014-12-17 09:41:32 -08:00
Yan, Zheng
0aeff37aba ceph: fix setting empty extended attribute
make sure 'value' is not null. otherwise __ceph_setxattr will remove
the extended attribute.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17 20:18:49 +03:00
Yan, Zheng
275dd19ea4 ceph: fix mksnap crash
mksnap reply only contain 'target', does not contain 'dentry'. So
it's wrong to use req->r_reply_info.head->is_dentry to detect traceless
reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17 20:09:53 +03:00
Dan Carpenter
021b77bee2 ceph: do_sync is never initialized
Probably this code was syncing a lot more often then intended because
the do_sync variable wasn't set to zero.

Cc: stable@vger.kernel.org # v3.11+
Fixes: c62988ec09 ('ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:53 +03:00
Yan, Zheng
65a22662bf ceph: support inline data feature
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:53 +03:00
Yan, Zheng
e20d258d73 ceph: flush inline version
After converting inline data to normal data, client need to flush
the new i_inline_version (CEPH_INLINE_NONE) to MDS. This commit makes
cap messages (sent to MDS) contain inline_version and inline_data.
Client always converts inline data to normal data before data write,
so the inline data length part is always zero.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:53 +03:00
Yan, Zheng
28127bdd2f ceph: convert inline data to normal data before data write
Before any data write, convert inline data to normal data and set
i_inline_version to CEPH_INLINE_NONE. The OSD request that saves
inline data to object contains 3 operations (CMPXATTR, WRITE and
SETXATTR). It compares a xattr named 'inline_version' to prevent
old data overwrites newer data.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng
83701246ae ceph: sync read inline data
we can't use getattr to fetch inline data while holding Fr cap,
because it can cause deadlock. If we need to sync read inline data,
drop cap refs first, then use getattr to fetch inline data.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng
3738daa68a ceph: fetch inline data when getting Fcr cap refs
we can't use getattr to fetch inline data after getting Fcr caps,
because it can cause deadlock. The solution is try bringing inline
data to page cache when not holding any cap, and hope the inline
data page is still there after getting the Fcr caps. If the page
is still there, pin it in page cache for later IO.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng
01deead041 ceph: use getattr request to fetch inline data
Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data
in getattr reply will be copied to the page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng
31c542a199 ceph: add inline data to pagecache
Request reply and cap message can contain inline data. add inline data
to the page cache if there is Fc cap.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng
fb01d1f8b0 ceph: parse inline data in MClientReply and MClientCaps
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:52 +03:00
Yan, Zheng
715e4cd405 libceph: specify position of extent operation
allow specifying position of extent operation in multi-operations
osd request. This is required for cephfs to convert inline data to
normal data (compare xattr, then write object).

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:52 +03:00
Ilya Dryomov
ca3995ad13 ceph: remove unused stringification macros
These were used to report git versions a long time ago.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:51 +03:00
Yan, Zheng
97c85a828f ceph: introduce global empty snap context
Current snaphost code does not properly handle moving inode from one
empty snap realm to another empty snap realm. After changing inode's
snap realm, some dirty pages' snap context can be not equal to inode's
i_head_snap. This can trigger BUG() in ceph_put_wrbuffer_cap_refs()

The fix is introduce a global empty snap context for all empty snap
realm. This avoids triggering the BUG() for filesystem with no snapshot.

Fixes: http://tracker.ceph.com/issues/9928

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:51 +03:00
John Spray
7cfa0313d0 ceph: message versioning fixes
There were two places we were assigning version in host byte order
instead of network byte order.

Also in MSG_CLIENT_SESSION we weren't setting compat_version in the
header to reflect continued compatability with older MDSs.

Fixes: http://tracker.ceph.com/issues/9945

Signed-off-by: John Spray <john.spray@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-12-17 20:09:51 +03:00
Yan, Zheng
33d0733796 libceph: message signature support
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:50 +03:00
SF Markus Elfring
e96a650a81 ceph, rbd: delete unnecessary checks before two function calls
The functions ceph_put_snap_context() and iput() test whether their
argument is NULL and then return immediately. Thus the test around the
call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[idryomov@redhat.com: squashed rbd.c hunk, changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:50 +03:00
Yan, Zheng
70db4f3629 ceph: introduce a new inode flag indicating if cached dentries are ordered
After creating/deleting/renaming file, offsets of sibling dentries may
change. So we can not use cached dentries to satisfy readdir. But we can
still use the cached dentries to conclude -ENOENT for lookup.

This patch introduces a new inode flag indicating if child dentries are
ordered. The flag is set at the same time marking a directory complete.
After creating/deleting/renaming file, we clear the flag on directory
inode. This prevents ceph_readdir() from using cached dentries to satisfy
readdir syscall.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:50 +03:00
Yan, Zheng
9280be24dc ceph: fix file lock interruption
When a lock operation is interrupted, current code sends a unlock request to
MDS to undo the lock operation. This method does not work as expected because
the unlock request can drop locks that have already been acquired.

The fix is use the newly introduced CEPH_LOCK_FCNTL_INTR/CEPH_LOCK_FLOCK_INTR
requests to interrupt blocked file lock request. These requests do not drop
locks that have alread been acquired, they only interrupt blocked file lock
request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-12-17 20:09:49 +03:00
Dmitry V. Levin
9d4d65748a vfs: make mounts and mountstats honor root dir like mountinfo does
As we already show mountpoints relative to the root directory, thanks
to the change made back in 2000, change show_vfsmnt() and show_vfsstat()
to skip out-of-root mountpoints the same way as show_mountinfo() does.

Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 08:27:15 -05:00
Dmitry V. Levin
9ad4dc4f73 vfs: cleanup show_mountinfo
Starting with commit v3.2-rc4-1-g02125a8, seq_path_root() no longer
changes the value of its "struct path *root" argument.
Starting with commit v3.2-rc7-104-g8c9379e, the "struct path *root"
argument of seq_path_root() is const.
As result, the temporary variable "root" in show_mountinfo() that
holds a copy of struct path root is no longer needed.

Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 08:27:15 -05:00
Al Viro
7d65cf10e3 unfuck binfmt_misc.c (broken by commit e6084d4)
scanarg(s, del) never returns s; the empty field results in s + 1.
Restore the correct checks, and move NUL-termination into scanarg(),
while we are at it.

Incidentally, mixing "coding style cleanups" (for small values of cleanup)
with functional changes is a Bad Idea(tm)...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 08:27:14 -05:00
Al Viro
50062175ff vm_area_operations: kill ->migrate()
the only instance this method has ever grown was one in kernfs -
one that call ->migrate() of another vm_ops if it exists.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 08:26:51 -05:00
Al Viro
b1bc6d7f16 move_extent_per_page(): get rid of unused w_flags
... and comparing get_fs() with KERNEL_DS used only to initialize that

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 06:43:56 -05:00
Al Viro
98af592f5b btrfs: filp_open() returns ERR_PTR() on failure, not NULL...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 06:43:56 -05:00
Linus Torvalds
603ba7e41b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile #2 from Al Viro:
 "Next pile (and there'll be one or two more).

  The large piece in this one is getting rid of /proc/*/ns/* weirdness;
  among other things, it allows to (finally) make nameidata completely
  opaque outside of fs/namei.c, making for easier further cleanups in
  there"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  coda_venus_readdir(): use file_inode()
  fs/namei.c: fold link_path_walk() call into path_init()
  path_init(): don't bother with LOOKUP_PARENT in argument
  fs/namei.c: new helper (path_cleanup())
  path_init(): store the "base" pointer to file in nameidata itself
  make default ->i_fop have ->open() fail with ENXIO
  make nameidata completely opaque outside of fs/namei.c
  kill proc_ns completely
  take the targets of /proc/*/ns/* symlinks to separate fs
  bury struct proc_ns in fs/proc
  copy address of proc_ns_ops into ns_common
  new helpers: ns_alloc_inum/ns_free_inum
  make proc_ns_operations work with struct ns_common * instead of void *
  switch the rest of proc_ns_operations to working with &...->ns
  netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
  make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
  common object embedded into various struct ....ns
2014-12-16 15:53:03 -08:00
Linus Torvalds
31f48fc8f2 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull isofs and reiserfs fixes from Jan Kara:
 "A reiserfs and an isofs fix.  They arrived after I sent you my first
  pull request and I don't want to delay them unnecessarily till rc2"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  isofs: Fix infinite looping over CE entries
  reiserfs: destroy allocated commit workqueue
2014-12-16 15:46:01 -08:00
Linus Torvalds
0b233b7c79 Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "A comparatively quieter cycle for nfsd this time, but still with two
  larger changes:

   - RPC server scalability improvements from Jeff Layton (using RCU
     instead of a spinlock to find idle threads).

   - server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna
     Schumaker, enabling fallocate on new clients"

* 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits)
  nfsd4: fix xdr4 count of server in fs_location4
  nfsd4: fix xdr4 inclusion of escaped char
  sunrpc/cache: convert to use string_escape_str()
  sunrpc: only call test_bit once in svc_xprt_received
  fs: nfsd: Fix signedness bug in compare_blob
  sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt
  sunrpc: convert to lockless lookup of queued server threads
  sunrpc: fix potential races in pool_stats collection
  sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it
  sunrpc: require svc_create callers to pass in meaningful shutdown routine
  sunrpc: have svc_wake_up only deal with pool 0
  sunrpc: convert sp_task_pending flag to use atomic bitops
  sunrpc: move rq_cachetype field to better optimize space
  sunrpc: move rq_splice_ok flag into rq_flags
  sunrpc: move rq_dropme flag into rq_flags
  sunrpc: move rq_usedeferral flag to rq_flags
  sunrpc: move rq_local field to rq_flags
  sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
  nfsd: minor off by one checks in __write_versions()
  sunrpc: release svc_pool_map reference when serv allocation fails
  ...
2014-12-16 15:25:31 -08:00
Jan Kara
f54e18f1b8 isofs: Fix infinite looping over CE entries
Rock Ridge extensions define so called Continuation Entries (CE) which
define where is further space with Rock Ridge data. Corrupted isofs
image can contain arbitrarily long chain of these, including a one
containing loop and thus causing kernel to end in an infinite loop when
traversing these entries.

Limit the traversal to 32 entries which should be more than enough space
to store all the Rock Ridge data.

Reported-by: P J P <ppandit@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-15 15:53:26 +01:00
Linus Torvalds
67e2c38838 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security layer updates from James Morris:
 "In terms of changes, there's general maintenance to the Smack,
  SELinux, and integrity code.

  The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT,
  which allows IMA appraisal to require signatures.  Support for reading
  keys from rootfs before init is call is also added"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits)
  selinux: Remove security_ops extern
  security: smack: fix out-of-bounds access in smk_parse_smack()
  VFS: refactor vfs_read()
  ima: require signature based appraisal
  integrity: provide a hook to load keys when rootfs is ready
  ima: load x509 certificate from the kernel
  integrity: provide a function to load x509 certificate from the kernel
  integrity: define a new function integrity_read_file()
  Security: smack: replace kzalloc with kmem_cache for inode_smack
  Smack: Lock mode for the floor and hat labels
  ima: added support for new kernel cmdline parameter ima_template_fmt
  ima: allocate field pointers array on demand in template_desc_init_fields()
  ima: don't allocate a copy of template_fmt in template_desc_init_fields()
  ima: display template format in meas. list if template name length is zero
  ima: added error messages to template-related functions
  ima: use atomic bit operations to protect policy update interface
  ima: ignore empty and with whitespaces policy lines
  ima: no need to allocate entry for comment
  ima: report policy load status
  ima: use path names cache
  ...
2014-12-14 20:36:37 -08:00
Linus Torvalds
e6b5be2be4 Driver core patches for 3.19-rc1
Here's the set of driver core patches for 3.19-rc1.
 
 They are dominated by the removal of the .owner field in platform
 drivers.  They touch a lot of files, but they are "simple" changes, just
 removing a line in a structure.
 
 Other than that, a few minor driver core and debugfs changes.  There are
 some ath9k patches coming in through this tree that have been acked by
 the wireless maintainers as they relied on the debugfs changes.
 
 Everything has been in linux-next for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlSOD20ACgkQMUfUDdst+ylLPACg2QrW1oHhdTMT9WI8jihlHVRM
 53kAoLeteByQ3iVwWurwwseRPiWa8+MI
 =OVRS
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core update from Greg KH:
 "Here's the set of driver core patches for 3.19-rc1.

  They are dominated by the removal of the .owner field in platform
  drivers.  They touch a lot of files, but they are "simple" changes,
  just removing a line in a structure.

  Other than that, a few minor driver core and debugfs changes.  There
  are some ath9k patches coming in through this tree that have been
  acked by the wireless maintainers as they relied on the debugfs
  changes.

  Everything has been in linux-next for a while"

* tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
  Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
  fs: debugfs: add forward declaration for struct device type
  firmware class: Deletion of an unnecessary check before the function call "vunmap"
  firmware loader: fix hung task warning dump
  devcoredump: provide a one-way disable function
  device: Add dev_<level>_once variants
  ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
  ath: use seq_file api for ath9k debugfs files
  debugfs: add helper function to create device related seq_file
  drivers/base: cacheinfo: remove noisy error boot message
  Revert "core: platform: add warning if driver has no owner"
  drivers: base: support cpu cache information interface to userspace via sysfs
  drivers: base: add cpu_device_create to support per-cpu devices
  topology: replace custom attribute macros with standard DEVICE_ATTR*
  cpumask: factor out show_cpumap into separate helper function
  driver core: Fix unbalanced device reference in drivers_probe
  driver core: fix race with userland in device_add()
  sysfs/kernfs: make read requests on pre-alloc files use the buffer.
  sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
  fs: sysfs: return EGBIG on write if offset is larger than file size
  ...
2014-12-14 16:10:09 -08:00
Linus Torvalds
7a02d08969 These patches optionally add LZ4 compression support to Squashfs.
LZ4 is a lightweight compression algorithm which can be used
 on embedded systems to reduce CPU and memory overhead (in comparison
 to the standard zlib compression).
 
 These patches add the wrapper code to allow Squashfs to use
 the existing LZ4 decompression code, and the necessary configuration
 option.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJUjgIJAAoJEJAch/D1fbHUArwP/iDiDSpqxdfTQHwUKxB57skO
 0iBzg6bXPlqwmsnllegg0SV0vOvFjpWiXWpOOtAVCJlfbov8DgUndsigpvG3UhD/
 qJpNXAJ+xSdOzBqj3bS7SOu2DwPY8Gz4rxGQcNN3PsuOVR/EUgAnNlv22ZHY10A5
 XQVyPbkwZ73TrZ2uKA8leWArFtCbM4oYGpxP+ramEox8nVFEOtixn5IcX5WkbGEL
 Yt0NRw8K8vDIIETWVariugUFE4C1olFk+YmqqAw7cmDGJ70cEg5jh9ocNkwDIZPj
 I9BNtkggBRMaCPwGsH6IvahMFUyWLQUgGayfY/fgbRiB9ZuYIQ1lyPDhzbgWczoE
 o34eXAIDdmfPrmYlDEBDkYnXXtwuYqdOVYOtcEnyFEYqpHfaeS2h2s9nTiM+rz21
 v0UEaDRmPtlkK/ZdLKUsrOf+8y9ejkT0R67swFaguHshL6EHey7X5ghmOuwCoL9x
 fzGWtPFR+Nbqga5T3dwf+apvyUVrPaw6gZu36NNim2779ZgpnIPzW6MEYUMhtXCn
 ef2+NvS9AeGyo7kiqlNQrihQWZSN0W/AiVsEeulzk5h+adzSNQ5eipzAO9DAAp16
 8muY4nq51bOGVaWzqJz/KacCmt7i0qUdmS1p4l2uqPp9gH/s/S91yrYn/iszf3AV
 CpwU2i9g3nQu9ecDc1Os
 =mr7J
 -----END PGP SIGNATURE-----

Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next

Pull squashfs update from Phillip Lougher:
 "These patches optionally add LZ4 compression support to Squashfs.

  LZ4 is a lightweight compression algorithm which can be used on
  embedded systems to reduce CPU and memory overhead (in comparison to
  the standard zlib compression).

  These patches add the wrapper code to allow Squashfs to use the
  existing LZ4 decompression code, and the necessary configuration
  option"

* tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
  Squashfs: Add LZ4 compression configuration option
  Squashfs: add LZ4 compression support
2014-12-14 14:42:53 -08:00
Linus Torvalds
7d22286ff7 Merge git://git.kvack.org/~bcrl/aio-next
Pull aio updates from Benjamin LaHaise.

* git://git.kvack.org/~bcrl/aio-next:
  aio: Skip timer for io_getevents if timeout=0
  aio: Make it possible to remap aio ring
2014-12-14 13:36:57 -08:00
Kevin Cernekee
97c7134ae2 Fix signed/unsigned pointer warning
Commit 2ae83bf938 ("[CIFS] Fix setting time before epoch (negative
time values)") changed "u64 t" to "s64 t", which makes do_div() complain
about a pointer signedness mismatch:

      CC      fs/cifs/netmisc.o
    In file included from ./arch/mips/include/asm/div64.h:12:0,
                     from include/linux/kernel.h:124,
                     from include/linux/list.h:8,
                     from include/linux/wait.h:6,
                     from include/linux/net.h:23,
                     from fs/cifs/netmisc.c:25:
    fs/cifs/netmisc.c: In function ‘cifs_NTtimeToUnix’:
    include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                                ^
    fs/cifs/netmisc.c:941:22: note: in expansion of macro ‘do_div’
       ts.tv_nsec = (long)do_div(t, 10000000) * 100;

Introduce a temporary "u64 abs_t" variable to fix this.

Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2014-12-14 14:55:57 -06:00
Sachin Prabhu
9235d09873 Convert MessageID in smb2_hdr to LE
We have encountered failures when When testing smb2 mounts on ppc64
machines when using both Samba as well as Windows 2012.

On poking around, the problem was determined to be caused by the
high endian MessageID passed in the header for smb2. On checking the
corresponding MID for smb1 is converted to LE before being sent on the
wire.

We have tested this patch successfully on a ppc64 machine.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
2014-12-14 14:55:45 -06:00
Fam Zheng
5f785de588 aio: Skip timer for io_getevents if timeout=0
In this case, it is basically a polling. Let's not involve timer at all
because that would hurt performance for application event loops.

In an arbitrary test I've done, io_getevents syscall elapsed time
reduces from 50000+ nanoseconds to a few hundereds.

Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-12-13 17:50:20 -05:00
Pavel Emelyanov
e4a0d3e720 aio: Make it possible to remap aio ring
There are actually two issues this patch addresses. Let me start with
the one I tried to solve in the beginning.

So, in the checkpoint-restore project (criu) we try to dump tasks'
state and restore one back exactly as it was. One of the tasks' state
bits is rings set up with io_setup() call. There's (almost) no problems
in dumping them, there's a problem restoring them -- if I dump a task
with aio ring originally mapped at address A, I want to restore one
back at exactly the same address A. Unfortunately, the io_setup() does
not allow for that -- it mmaps the ring at whatever place mm finds
appropriate (it calls do_mmap_pgoff() with zero address and without
the MAP_FIXED flag).

To make restore possible I'm going to mremap() the freshly created ring
into the address A (under which it was seen before dump). The problem is
that the ring's virtual address is passed back to the user-space as the
context ID and this ID is then used as search key by all the other io_foo()
calls. Reworking this ID to be just some integer doesn't seem to work, as
this value is already used by libaio as a pointer using which this library
accesses memory for aio meta-data.

So, to make restore work we need to make sure that

a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value

Having said that, the patch makes mremap() on aio region update the
kioctx's user_id and mmap_base values.

Here appears the 2nd issue I mentioned in the beginning of this mail.
If (regardless of the C/R dances I do) someone creates an io context
with io_setup(), then mremap()-s the ring and then destroys the context,
the kill_ioctx() routine will call munmap() on wrong (old) address.
This will result in a) aio ring remaining in memory and b) some other
vma get unexpectedly unmapped.

What do you think?

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-12-13 17:49:50 -05:00
Linus Torvalds
caf292ae5b Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block
Pull block driver core update from Jens Axboe:
 "This is the pull request for the core block IO changes for 3.19.  Not
  a huge round this time, mostly lots of little good fixes:

   - Fix a bug in sysfs blktrace interface causing a NULL pointer
     dereference, when enabled/disabled through that API.  From Arianna
     Avanzini.

   - Various updates/fixes/improvements for blk-mq:

        - A set of updates from Bart, mostly fixing buts in the tag
          handling.

        - Cleanup/code consolidation from Christoph.

        - Extend queue_rq API to be able to handle batching issues of IO
          requests. NVMe will utilize this shortly. From me.

        - A few tag and request handling updates from me.

        - Cleanup of the preempt handling for running queues from Paolo.

        - Prevent running of unmapped hardware queues from Ming Lei.

        - Move the kdump memory limiting check to be in the correct
          location, from Shaohua.

        - Initialize all software queues at init time from Takashi. This
          prevents a kobject warning when CPUs are brought online that
          weren't online when a queue was registered.

   - Single writeback fix for I_DIRTY clearing from Tejun.  Queued with
     the core IO changes, since it's just a single fix.

   - Version X of the __bio_add_page() segment addition retry from
     Maurizio.  Hope the Xth time is the charm.

   - Documentation fixup for IO scheduler merging from Jan.

   - Introduce (and use) generic IO stat accounting helpers for non-rq
     drivers, from Gu Zheng.

   - Kill off artificial limiting of max sectors in a request from
     Christoph"

* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
  bio: modify __bio_add_page() to accept pages that don't start a new segment
  blk-mq: Fix uninitialized kobject at CPU hotplugging
  blktrace: don't let the sysfs interface remove trace from running list
  blk-mq: Use all available hardware queues
  blk-mq: Micro-optimize bt_get()
  blk-mq: Fix a race between bt_clear_tag() and bt_get()
  blk-mq: Avoid that __bt_get_word() wraps multiple times
  blk-mq: Fix a use-after-free
  blk-mq: prevent unmapped hw queue from being scheduled
  blk-mq: re-check for available tags after running the hardware queue
  blk-mq: fix hang in bt_get()
  blk-mq: move the kdump check to blk_mq_alloc_tag_set
  blk-mq: cleanup tag free handling
  blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
  blk: introduce generic io stat accounting help function
  blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
  genhd: check for int overflow in disk_expand_part_tbl()
  blk-mq: add blk_mq_free_hctx_request()
  blk-mq: export blk_mq_free_request()
  blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
  ...
2014-12-13 14:14:23 -08:00
Jan Kara
37d469e767 fsnotify: remove destroy_list from fsnotify_mark
destroy_list is used to track marks which still need waiting for srcu
period end before they can be freed.  However by the time mark is added to
destroy_list it isn't in group's list of marks anymore and thus we can
reuse fsnotify_mark->g_list for queueing into destroy_list.  This saves
two pointers for each fsnotify_mark.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:53 -08:00
Jan Kara
0809ab69a2 fsnotify: unify inode and mount marks handling
There's a lot of common code in inode and mount marks handling.  Factor it
out to a common helper function.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:53 -08:00
Heinrich Schuchardt
820c12d5d6 fallocate: create FAN_MODIFY and IN_MODIFY events
The fanotify and the inotify API can be used to monitor changes of the
file system.  System call fallocate() modifies files.  Hence it should
trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY)
events.  The most interesting case is FALLOC_FL_COLLAPSE_RANGE because
this value allows to create arbitrary file content from random data.

This patch adds the missing call to fsnotify_modify().

The FAN_MODIFY and IN_MODIFY event will be created when fallocate()
succeeds.  It will even be created if the file length remains unchanged,
e.g.  when calling fanotify with flag FALLOC_FL_KEEP_SIZE.

This logic was primarily chosen to keep the coding simple.

It resembles the logic of the write() system call.

When we call write() we always create a FAN_MODIFY event, even in the case
of overwriting with identical data.

Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was
actually changed.

Furthermore even if if the filesize remains unchanged, fallocate() may
influence whether a subsequent write() will succeed and hence the
fallocate() call may be considered a modification.

The fallocate(2) man page teaches: After a successful call, subsequent
writes into the range specified by offset and len are guaranteed not to
fail because of lack of disk space.

So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in
different outcomes of a subsequent write depending on the values of offset
and len.

Signed-off-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:53 -08:00
Fabian Frederick
92cab82b2c fs/affs/file.c: remove obsolete pagesize check
linux kernel doesn't manage page sizes below 4kb.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:52 -08:00
Fabian Frederick
9abb408307 fs/affs/file.c: add support to O_DIRECT
Based on ext2_direct_IO

Tested with O_DIRECT file open and sysbench/mariadb with 1% written
queries improvement (update_non_index test) on a volume created with
mkaffs.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:51 -08:00
Fabian Frederick
1ee54b099a fs/affs/amigaffs.c: use va_format instead of buffer/vnsprintf
-Remove ErrorBuffer and use %pV

-Add __printf to enable argument mistmatch warnings

Original patch by Joe Perches.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:51 -08:00
Fabian Frederick
7633978b43 fs/affs/file.c: forward declaration clean-up
-Move file_operations to avoid forward declarations.

-Remove unused declarations.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:51 -08:00
David Drysdale
51f39a1f0c syscalls: implement execveat() system call
This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts).  The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.

Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.

Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns.  The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).

Related history:
 - https://lkml.org/lkml/2006/12/27/123 is an example of someone
   realizing that fexecve() is likely to fail in a chroot environment.
 - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
   documenting the /proc requirement of fexecve(3) in its manpage, to
   "prevent other people from wasting their time".
 - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
   problem where a process that did setuid() could not fexecve()
   because it no longer had access to /proc/self/fd; this has since
   been fixed.

This patch (of 4):

Add a new execveat(2) system call.  execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers.  This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found.  This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).

Based on patches by Meredydd Luff.

Signed-off-by: David Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:51 -08:00
Namjae Jeon
c0ef0cc9d2 fat: fix data past EOF resulting from fsx testsuite
When running FSX with direct I/O mode, fsx resulted in DATA past EOF issues.

  fsx ./file2 -Z -r 4096 -w 4096
  ...
  ..
  truncating to largest ever: 0x907c
  fallocating to largest ever: 0x11137
  truncating to largest ever: 0x2c6fe
  truncating to largest ever: 0x2cfdf
  fallocating to largest ever: 0x40000
  Mapped Read: non-zero data past EOF (0x18628) page offset 0x629 is 0x2a4e
  ...
  ..

The reason being, it is doing a truncate down, but the zeroing does not
happen on the last block boundary when offset is not aligned.  Even though
it calls truncate_setsize()->truncate_inode_pages()->
truncate_inode_pages_range() and considers the partial zeroout but it
retrieves the page using find_lock_page() - which only looks the page in
the cache.  So, zeroing out does not happen in case of direct IO.

Make a truncate page based around block_truncate_page for FAT filesystem
and invoke that helper to zerout in case the offset is not aligned with
the blocksize.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:51 -08:00
Jan Kara
f441ada004 befs: remove dead code
Coverity id: 1042674

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:51 -08:00
David Rientjes
5cec38ac86 fs, seq_file: fallback to vmalloc instead of oom kill processes
Since commit 058504edd0 ("fs/seq_file: fallback to vmalloc allocation"),
seq_buf_alloc() falls back to vmalloc() when the kmalloc() for contiguous
memory fails.  This was done to address order-4 slab allocations for
reading /proc/stat on large machines and noticed because
PAGE_ALLOC_COSTLY_ORDER < 4, so there is no infinite loop in the page
allocator when allocating new slab for such high-order allocations.

Contiguous memory isn't necessary for caller of seq_buf_alloc(), however.
Other GFP_KERNEL high-order allocations that are <=
PAGE_ALLOC_COSTLY_ORDER will simply loop forever in the page allocator and
oom kill processes as a result.

We don't want to kill processes so that we can allocate contiguous memory
in situations when contiguous memory isn't necessary.

This patch does the kmalloc() allocation with __GFP_NORETRY for high-order
allocations.  This still utilizes memory compaction and direct reclaim in
the allocation path, the only difference is that it will fail immediately
instead of oom kill processes when out of memory.

[akpm@linux-foundation.org: add comment]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:49 -08:00
Johannes Weiner
6b4f7799c6 mm: vmscan: invoke slab shrinkers from shrink_zone()
The slab shrinkers are currently invoked from the zonelist walkers in
kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
eligible LRU pages and assemble a nodemask to pass to NUMA-aware
shrinkers, which then again have to walk over the nodemask.  This is
redundant code, extra runtime work, and fairly inaccurate when it comes to
the estimation of actually scannable LRU pages.  The code duplication will
only get worse when making the shrinkers cgroup-aware and requiring them
to have out-of-band cgroup hierarchy walks as well.

Instead, invoke the shrinkers from shrink_zone(), which is where all
reclaimers end up, to avoid this duplication.

Take the count for eligible LRU pages out of get_scan_count(), which
considers many more factors than just the availability of swap space, like
zone_reclaimable_pages() currently does.  Accumulate the number over all
visited lruvecs to get the per-zone value.

Some nodes have multiple zones due to memory addressing restrictions.  To
avoid putting too much pressure on the shrinkers, only invoke them once
for each such node, using the class zone of the allocation as the pivot
zone.

For now, this integrates the slab shrinking better into the reclaim logic
and gets rid of duplicative invocations from kswapd, direct reclaim, and
zone reclaim.  It also prepares for cgroup-awareness, allowing
memcg-capable shrinkers to be added at the lruvec level without much
duplication of both code and runtime work.

This changes kswapd behavior, which used to invoke the shrinkers for each
zone, but with scan ratios gathered from the entire node, resulting in
meaningless pressure quantities on multi-zone nodes.

Zone reclaim behavior also changes.  It used to shrink slabs until the
same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
merely invokes the shrinkers once with the zone's scan ratio, which makes
the shrinkers go easier on caches that implement aging and would prefer
feeding back pressure from recently used slab objects to unused LRU pages.

[vdavydov@parallels.com: assure class zone is populated]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:48 -08:00
Davidlohr Bueso
c8c06efa8b mm: convert i_mmap_mutex to rwsem
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting
similar data, one for file backed pages and the other for anon memory.  To
this end, this lock can also be a rwsem.  In addition, there are some
important opportunities to share the lock when there are no tree
modifications.

This conversion is straightforward.  For now, all users take the write
lock.

[sfr@canb.auug.org.au: update fremap.c]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:45 -08:00
Davidlohr Bueso
83cde9e8ba mm: use new helper functions around the i_mmap_mutex
Convert all open coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:45 -08:00
Thomas Gleixner
c291ee6221 genirq: Prevent proc race against freeing of irq descriptors
Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.

CPU0				CPU1
				show_interrupts()
				  desc = irq_to_desc(X);
free_desc(desc)
  remove_from_radix_tree();
  kfree(desc);
				  raw_spinlock_irq(&desc->lock);

/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremly hard to trigger, but possible.

The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.

For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.

Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.

Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.

Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.

Fixes: 1f5a5b87f7 "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
2014-12-13 13:33:07 +01:00
hujianyang
cead89bb08 ovl: Use macros to present ovl_xattr
This patch adds two macros:

OVL_XATTR_PRE_NAME and OVL_XATTR_PRE_LEN

to present ovl_xattr name prefix and its length. Also, a
new macro OVL_XATTR_OPAQUE is introduced to replace old
*ovl_opaque_xattr*.

Fix the length of "trusted.overlay." to *16*.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:52 +01:00
hujianyang
1ba38725a3 ovl: Cleanup redundant blank lines
This patch removes redundant blanks lines in overlayfs.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:52 +01:00
Miklos Szeredi
a78d9f0d5d ovl: support multiple lower layers
Allow "lowerdir=" option to contain multiple lower directories separated by
a colon (e.g. "lowerdir=/bin:/usr/bin").  Colon characters in filenames can
be escaped with a backslash.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:52 +01:00
Miklos Szeredi
53a08cb9b8 ovl: make upperdir optional
Make "upperdir=" mount option optional.  If "upperdir=" is not given, then
the "workdir=" option is also optional (and ignored if given).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:51 +01:00
Miklos Szeredi
ab508822ca ovl: improve mount helpers
Move common checks into ovl_mount_dir() helper.

Create helper for looking up lower directories.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:49 +01:00
Miklos Szeredi
3b7a9a249a ovl: mount: change order of initialization
Move allocation of root entry above to where it's needed.

Move initializations related to upperdir and workdir near each other.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:48 +01:00
Miklos Szeredi
4ebc581828 ovl: allow statfs if no upper layer
Handle "no upper layer" case in statfs.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:46 +01:00
Miklos Szeredi
09e10322b7 ovl: lookup ENAMETOOLONG on lower means ENOENT
"Suppose you have in one of the lower layers a filesystem with
->lookup()-enforced upper limit on name length.  Pretty much every local fs
has one, but... they are not all equal.  255 characters is the common upper
limit, but e.g. jffs2 stops at 254, minixfs upper limit is somewhere from
14 to 60, depending upon version, etc.  You are doing a lookup for
something that is present in upper layer, but happens to be too long for
one of the lower layers.  Too bad - ENAMETOOLONG for you..."

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:45 +01:00
Miklos Szeredi
3e01cee3b9 ovl: check whiteout on lowest layer as well
Not checking whiteouts on lowest layer was an optimization (there's nothing
to white out there), but it could result in inconsitent behavior when a
layer previously used as upper/middle is later used as lowest. 

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:45 +01:00
Miklos Szeredi
3d3c6b8939 ovl: multi-layer lookup
Look up dentry in all relevant layers.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:44 +01:00
Miklos Szeredi
9d7459d834 ovl: multi-layer readdir
If multiple lower layers exist, merge them as well in readdir according to
the same rules as merging upper with lower.  I.e. take whiteouts and opaque
directories into account on all but the lowers layer.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:44 +01:00
Miklos Szeredi
5ef88da56a ovl: helper to iterate layers
Add helper to iterate through all the layers, starting from the upper layer
(if exists) and continuing down through the lower layers.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:43 +01:00
Miklos Szeredi
dd662667e6 ovl: add mutli-layer infrastructure
Add multiple lower layers to 'struct ovl_fs' and 'struct ovl_entry'.

ovl_entry will have an array of paths, instead of just the dentry.  This
allows a compact array containing just the layers which exist at current
point in the tree (which is expected to be a small number for the majority
of dentries).

The number of layers is not limited by this infrastructure.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:43 +01:00
Miklos Szeredi
263b4a0fee ovl: dont replace opaque dir
When removing an empty opaque directory, then it makes no sense to replace
it with an exact replica of itself before removal.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:43 +01:00
Miklos Szeredi
1afaba1ecb ovl: make path-type a bitmap
OVL_PATH_PURE_UPPER -> __OVL_PATH_UPPER | __OVL_PATH_PURE
OVL_PATH_UPPER      -> __OVL_PATH_UPPER
OVL_PATH_MERGE      -> __OVL_PATH_UPPER | __OVL_PATH_MERGE
OVL_PATH_LOWER      -> 0

Multiple R/O layers will allow __OVL_PATH_MERGE without __OVL_PATH_UPPER.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:42 +01:00
Miklos Szeredi
49c21e1cac ovl: check whiteout while reading directory
Don't make a separate pass for checking whiteouts, since we can do it while
reading the upper directory.

This will make it easier to handle multiple layers.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-13 00:59:42 +01:00
Jiri Slaby
fa0c554073 reiserfs: destroy allocated commit workqueue
When resirefs is trying to mount a partition, it creates a commit
workqueue (sbi->commit_wq). But when mount fails later, the workqueue
is not freed.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reported-by: auxsvr@gmail.com
Reported-by: Benoît Monin <benoit.monin@gmx.fr>
Cc: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org # >= 3.16
Cc: reiserfs-devel@vger.kernel.org
Fixes: 797d9016ce
Signed-off-by: Jan Kara <jack@suse.cz>
2014-12-12 22:18:07 +01:00
Linus Torvalds
6ce4436c9c Couple of pstore-ram enhancements to allow use of different memory attributes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUi0B6AAoJEKurIx+X31iByH8P/jfMgzyUO+KpJMA1DbgCAG7x
 WPJgbMUyPwB63DH09RyMEmiwf61Rl1klXTPVNY0Dnj7qRJOmpB9U3vGIfO4HpD84
 5IZMBlc+Jl+kJCxSAJYbTJTZLsIMjFGOfuVTvlY+HnMBitQVBumKptmC0DoBBqgz
 yYy5MHRMaVoHcogyMyBiknmxdxu6/ruUKY+6yyvdUESt0SCcJG8V6Qik7TMmnx47
 NvIIPzfibvvLLnd8IOEj2fwh8XMtJdfcCxPpAEvEaNq0jZEDF9K22jttTQvl9r92
 NQf7JKQQrNfzloRZ3flKax5ZMGi9RkcirTLLdJ4I2xMGVHOA4XUAjsSCYR6INuuJ
 Ox00FnuiIrADNw37m52Y+ujPTF1C2PQUNK69gwsLd84MSjy+95F2dlC5cC3Yt4N5
 rpstXxWELZTqjMGD8GTPOpv6zlg799IbFexr4H6KTc+47EX0MNayJiI6L597gYnq
 gIiPmDnnz6WlWp4HHgBIwjNAH3Tbf/uU3MlgzqS3Ftd7YkYmLnxvClhrwgErviFn
 Nfnz2LtGuMxMHSt0uSWxODVEaR4reKRVJBvhRSGWL1PufylEyt0YWayiqpohuKD9
 6X/RufWK5qdCBHytoGyMUZ57oqxth9QSVG4RBkGPmaZgMq/5DdyOhBfW0yInjMuo
 AuDMmqrU5yFTitLMGcsG
 =kcmD
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull pstore update #2 from Tony Luck:
 "Couple of pstore-ram enhancements to allow use of different memory
  attributes"

* tag 'please-pull-morepstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  pstore-ram: Allow optional mapping with pgprot_noncached
  pstore-ram: Fix hangs by using write-combine mappings
2014-12-12 11:34:13 -08:00
Linus Torvalds
bdeb03cada Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "From a feature point of view, most of the code here comes from Miao
  Xie and others at Fujitsu to implement scrubbing and replacing devices
  on raid56.  This has been in development for a while, and it's a big
  improvement.

  Filipe and Josef have a great assortment of fixes, many of which solve
  problems corruptions either after a crash or in error conditions.  I
  still have a round two from Filipe for next week that solves
  corruptions with discard and block group removal"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
  Btrfs: make get_caching_control unconditionally return the ctl
  Btrfs: fix unprotected deletion from pending_chunks list
  Btrfs: fix fs mapping extent map leak
  Btrfs: fix memory leak after block remove + trimming
  Btrfs: make btrfs_abort_transaction consider existence of new block groups
  Btrfs: fix race between writing free space cache and trimming
  Btrfs: fix race between fs trimming and block group remove/allocation
  Btrfs, replace: enable dev-replace for raid56
  Btrfs: fix freeing used extents after removing empty block group
  Btrfs: fix crash caused by block group removal
  Btrfs: fix invalid block group rbtree access after bg is removed
  Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  ...
2014-12-12 11:15:23 -08:00
Linus Torvalds
a7cb7bb664 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree update from Jiri Kosina:
 "Usual stuff: documentation updates, printk() fixes, etc"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
  intel_ips: fix a type in error message
  cpufreq: cpufreq-dt: Move newline to end of error message
  ps3rom: fix error return code
  treewide: fix typo in printk and Kconfig
  ARM: dts: bcm63138: change "interupts" to "interrupts"
  Replace mentions of "list_struct" to "list_head"
  kernel: trace: fix printk message
  scsi: mpt2sas: fix ioctl in comment
  zbud, zswap: change module author email
  clocksource: Fix 'clcoksource' typo in comment
  arm: fix wording of "Crotex" in CONFIG_ARCH_EXYNOS3 help
  gpio: msm-v1: make boolean argument more obvious
  usb: Fix typo in usb-serial-simple.c
  PCI: Fix comment typo 'COMFIG_PM_OPS'
  powerpc: Fix comment typo 'CONIFG_8xx'
  powerpc: Fix comment typos 'CONFiG_ALTIVEC'
  clk: st: Spelling s/stucture/structure/
  isci: Spelling s/stucture/structure/
  usb: gadget: zero: Spelling s/infrastucture/infrastructure/
  treewide: Fix company name in module descriptions
  ...
2014-12-12 10:08:06 -08:00
Linus Torvalds
ccb5a4910d This pull request includes the following UBI/UBIFS changes:
* UBI debug messages now include the UBI device number. This change
   is responsible for the big diffstat since it touched every debugging
   print statement.
 * An Xattr bug-fix which fixes SELinux support
 * Several error path fixes in UBI/UBIFS
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUixUjAAoJECmIfjd9wqK0n0oQAL5fAGFszUnmPa+NHi1IgDlv
 dUBcN9GrXM8CN5LxQX2NH4WuxyY9gZpQsZtDXolutICbHT55De/plQyJUE5XHXnq
 U2SHir1wsHnUeDJEqlAKE4zXWUEwY4C5mqDZh8fPUM+pyFNmlt4L/mi4hjkFmpqt
 1gPqJ9boa3fwrT3jdaClJTXN5d+8Y1JahQwuSINsX6rInB/cfh2FFZ2fxWWogxvf
 BoN1iQdbWJrmkd2KLLbQqOeI5LwBT5jdf0Z0hkwHEsDCA0ZiKBCoQRZKjkwlaTCZ
 JSQ2Fv/RkUGg+YJJgC5xJnpR4VlGyn6X2z7/W5idhKzELlmHrKaw3bXZJJoTElPr
 kRSpcq02eF3pJKLMpvFuV6rLpqbkpGDML3+VtZ1Fta3cqQ0E9TvrSJAII5j+EiNG
 D03IkCVX5ozmeZr08DmSx8W7HJ4beMs5E8eDkEaAS7AhQU5pmCcai1vZJQZjrsV0
 5yCmYsArN6yYS79mrH6eQuoKsJ48mOoU6zC8vmvu5uar6HzfK9eC6J3JndH1lGp1
 iDXJ9TS1AX/jFdZWAdyJic29TQi1hPhZITdhLfT11MZtYLWT1CpvNgBa4MefpD0X
 YBsgVjQXA96F4Aix3ILWGEyaKbHUOmqIBpKy95tRpGgMxwlpagsqm2jn1e8Sb4Kd
 H9YCeVsracNeK1E2ua13
 =sZ7o
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.19-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Artem Bityutskiy:
 "This includes the following UBI/UBIFS changes:
   - UBI debug messages now include the UBI device number.  This change
     is responsible for the big diffstat since it touched every
     debugging print statement.
   - An Xattr bug-fix which fixes SELinux support
   - Several error path fixes in UBI/UBIFS"

* tag 'upstream-3.19-rc1' of git://git.infradead.org/linux-ubifs:
  UBI: Fix invalid vfree()
  UBI: Fix double free after do_sync_erase()
  UBIFS: fix a couple bugs in UBIFS xattr length calculation
  UBI: vtbl: Use ubi_eba_atomic_leb_change()
  UBI: Extend UBI layer debug/messaging capabilities
  UBIFS: fix budget leak in error path
2014-12-12 09:57:22 -08:00
Linus Torvalds
c05e14f7b3 xfs: update for 3.19-rc1
This update contains:
 o more on-disk format header consolidation
 o move some structures shared with userspace to libxfs
 o new per-mount workqueue to fix for deadlocks between nested loop
   mounted filesystems
 o various bug fixes for ENOSPC, stats, quota off and preallocation
 o a bunch of compiler warning fixes for set-but-unused variables
 o various code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJUihOWAAoJEK3oKUf0dfodYbkP/iXuIYOhpmc1rUORMDl2JDBc
 iTjXqz1Ydp6vJrq2+3qeAsCbJciNdZ72eNKdvgRbFAN4BW8tv1Wc9QR5m2ZIpCkf
 7buCzbkI64j9HoNAiZJhrMp/eyJ0X1hRGk1ANUaBT9ouXWOBDaOD/sNj9cMptWOA
 72BpTMN0FszAJxW6rNEk1M/i+W2ly0qmD0QJPQU18Z62NU5E+D/uMkg2xif4dhwK
 CSNMgCIv0X1qmve2lMOgwHbgkmHRwbXKSb4Z5vV8pDUh49tkRtxJ2ky7mE7aglrq
 pjChpEqDktkCL/RHAT3XJ77tRIyBXwvpC7ewHXiYBY83OcGfRFv0jMCJ+R+1b3KD
 p1faOVwd/H0tStd+0rF+tMMn8TuujQ597upLGhWdy1BpY3nnkJ7iJ8lyJv+aiCzr
 Oh3DvyX1XgxnEo7yVr+x64TFz/GPkyuvVPSfL3gspqEZErC4BN+AEP/3fF+5SGed
 x9QplB+lcy7IpzB+HURPZL4TqWl4Ib29pArZY1mQ1rJz6IFFbDSzj6lo36YDBrP8
 HRG2LDxgc1udPPMxdZ3PAV3nt4/ufaxSTmT5HGV0Aj+hjkSfLvBDFMuVz9t6vfn9
 YN3ocKWxJr2QISc0fcQ/hsBDiHVyoFgDOikBAetaqpdoM7OM7FHtLXtwLDILldx9
 DZAIS0msNrjc7gGCrbxj
 =2SJP
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs update from Dave Chinner:
 "There's relatively little change in this update; it is mainly bug
  fixes, cleanups and more of the on-going libxfs restructuring and
  on-disk format header consolidation work.

  Details:
   - more on-disk format header consolidation
   - move some structures shared with userspace to libxfs
   - new per-mount workqueue to fix for deadlocks between nested loop
     mounted filesystems
   - various bug fixes for ENOSPC, stats, quota off and preallocation
   - a bunch of compiler warning fixes for set-but-unused variables
   - various code cleanups"

* tag 'xfs-for-linus-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (24 commits)
  xfs: split metadata and log buffer completion to separate workqueues
  xfs: fix set-but-unused warnings
  xfs: move type conversion functions to xfs_dir.h
  xfs: move ftype conversion functions to libxfs
  xfs: lobotomise xfs_trans_read_buf_map()
  xfs: active inodes stat is broken
  xfs: cleanup xfs_bmse_merge returns
  xfs: cleanup xfs_bmse_shift_one goto mess
  xfs: fix premature enospc on inode allocation
  xfs: overflow in xfs_iomap_eof_align_last_fsb
  xfs: fix simple_return.cocci warning in xfs_bmse_shift_one
  xfs: fix simple_return.cocci warning in xfs_file_readdir
  libxfs: fix simple_return.cocci warnings
  xfs: remove unnecessary null checks
  xfs: merge xfs_inum.h into xfs_format.h
  xfs: move most of xfs_sb.h to xfs_format.h
  xfs: merge xfs_ag.h into xfs_format.h
  xfs: move acl structures to xfs_format.h
  xfs: merge xfs_dinode.h into xfs_format.h
  xfs: catch invalid negative blknos in _xfs_buf_find()
  ...
2014-12-12 09:48:17 -08:00
Linus Torvalds
9bfccec24e Lots of bugs fixes, including Zheng and Jan's extent status shrinker
fixes, which should improve CPU utilization and potential soft lockups
 under heavy memory pressure, and Eric Whitney's bigalloc fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUiRUwAAoJENNvdpvBGATwltQP/3sjHtFw+RUvKgQ8vX9M2THk
 4b9j0ja0mrD3ObTXUxdDuOh1q09MsfSUiOYK6KZOav3nO/dRODqZnWgXz/zJt3LC
 R97s4velgzZi3F2ijnLiCo5RVZahN9xs8bUHZ85orMIr5wogwGdaUpnoqZSg0Ehr
 PIFnTNORyNXBwEm3XPjUmENTdyq9FZ8DsS6ACFzgFi79QTSyJFEM4LAl2XaqwMGV
 fVhNwnOGIyT8lHZAtDcobkaC86NjakmpW2Ip3p9/UEQtynh16UeVXKEO3K7CcQ+L
 YJRDNnSIlGpR1OJp+v6QJPUd8q4fc/8JW9AxxsLak0eqkszuB+MxoQXOCFV5AWaf
 jrs4TV3y0hCuB4OwuYUpnfcU1o+O7p39MqXMv8SA1ZBPbijN/LQSMErFtXj2oih6
 3gJHUWLwELGeR+d9JlI29zxhOeOIotX255UBgj2oasQ0X3BW3qAgQ4LmP3QY90Pm
 BUmxiMoIWB9N3kU4XQGf+Kyy8JeMLJj0frHDxI3XLz+B+IlWCCkBH6y3AD/a13kS
 HHMMLOwHGEs0lYEKsm89dkcij5GuKd8eKT8Q0+CvKD9Z6HPdYvQxoazmF87Q6j/7
 ZmshaVxtWaLpNbDaXVg+IgZifJAN0+mVzVHRhY9TSjx8k9qLdSgSEqYWjkSjx9Ij
 nNB2zVrHZDMvZ7MCZy85
 =ZrTc
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Lots of bugs fixes, including Zheng and Jan's extent status shrinker
  fixes, which should improve CPU utilization and potential soft lockups
  under heavy memory pressure, and Eric Whitney's bigalloc fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (26 commits)
  ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
  ext4: fix suboptimal seek_{data,hole} extents traversial
  ext4: ext4_inline_data_fiemap should respect callers argument
  ext4: prevent fsreentrance deadlock for inline_data
  ext4: forbid journal_async_commit in data=ordered mode
  jbd2: remove unnecessary NULL check before iput()
  ext4: Remove an unnecessary check for NULL before iput()
  ext4: remove unneeded code in ext4_unlink
  ext4: don't count external journal blocks as overhead
  ext4: remove never taken branch from ext4_ext_shift_path_extents()
  ext4: create nojournal_checksum mount option
  ext4: update comments regarding ext4_delete_inode()
  ext4: cleanup GFP flags inside resize path
  ext4: introduce aging to extent status tree
  ext4: cleanup flag definitions for extent status tree
  ext4: limit number of scanned extents in status tree shrinker
  ext4: move handling of list of shrinkable inodes into extent status code
  ext4: change LRU to round-robin in extent status tree shrinker
  ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
  ext4: fix block reservation for bigalloc filesystems
  ...
2014-12-12 09:28:03 -08:00
David Sterba
ce3e69847e btrfs: sink parameter len to alloc_extent_buffer
Because we're using globally known nodesize. Do the same for the sanity
test function variant.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:26:57 +01:00
David Sterba
3f556f7853 btrfs: unify extent buffer allocation api
Make the extent buffer allocation interface consistent.  Cloned eb will
set a valid fs_info.  For dummy eb, we can drop the length parameter and
set it from fs_info.

The built-in sanity checks may pass a NULL fs_info that's queried for
nodesize, but we know it's 4096.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:26:55 +01:00
David Sterba
23d79d81b1 btrfs: use GFP_NOFS in __alloc_extent_buffer directly
Same mask from all callers.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:23 +01:00
David Sterba
7476dfdaad btrfs: sink blocksize parameter to tree_block_processed
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:22 +01:00
David Sterba
a83fffb75d btrfs: sink blocksize parameter to btrfs_find_create_tree_block
Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.

Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:21 +01:00
David Sterba
fe864576de btrfs: sink blocksize parameter to btrfs_init_new_buffer
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:20 +01:00
David Sterba
c0dcaa4d7b btrfs: sink blocksize parameter to reada_tree_block_flagged
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:20 +01:00
David Sterba
b6ae40ec76 btrfs: remove blocksize from reada_extent
Replace with global nodesize instead.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:19 +01:00
David Sterba
d3e46fea1b btrfs: sink blocksize parameter to readahead_tree_block
All callers pass nodesize.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:18 +01:00
Miklos Szeredi
1c68271cf1 fuse: use file_inode() in fuse_file_fallocate()
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-12 10:04:51 +01:00
Miklos Szeredi
7078187a79 fuse: introduce fuse_simple_request() helper
The following pattern is repeated many times:

	req = fuse_get_req_nopages(fc);
	/* Initialize req->(in|out).args */
	fuse_request_send(fc, req);
	err = req->out.h.error;
	fuse_put_request(req);

Create a new replacement helper:

	/* Initialize args */
	err = fuse_simple_request(fc, &args);

In addition to reducing the code size, this will ease moving from the
complex arg-based to a simpler page-based I/O on the fuse device.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-12 09:49:05 +01:00
Miklos Szeredi
f704dcb538 fuse: reduce max out args
The third out-arg is never actually used.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-12 09:49:05 +01:00
Miklos Szeredi
baebccbe99 fuse: hold inode instead of path after release
path_put() in release could trigger a DESTROY request in fuseblk.  The
possible deadlock was worked around by doing the path_put() with
schedule_work().

This complexity isn't needed if we just hold the inode instead of the path.
Since we now flush all requests before destroying the super block we can be
sure that all held inodes will be dropped.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-12 09:49:04 +01:00
Miklos Szeredi
580640ba5d fuse: flush requests on umount
Use fuse_abort_conn() instead of fuse_conn_kill() in fuse_put_super().
This flushes and aborts requests still on any queues.  But since we've
already reset fc->connected, those requests would not be useful anyway and
would be flushed when the fuse device is closed.

Next patches will rely on requests being flushed before the superblock is
destroyed.

Use fuse_abort_conn() in cuse_process_init_reply() too, since it makes no
difference there, and we can get rid of fuse_conn_kill().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-12 09:49:04 +01:00
Miklos Szeredi
0c4dd4ba14 fuse: don't wake up reserved req in fuse_conn_kill()
Waking up reserved_req_waitq from fuse_conn_kill() doesn't make sense since
we aren't chaging ff->reserved_req here, which is what this waitqueue
signals.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-12-12 09:49:04 +01:00
Linus Torvalds
c0222ac086 Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:
 "This is an unusually large pull request for MIPS - in parts because
  lots of patches missed the 3.18 deadline but primarily because some
  folks opened the flood gates.

   - Retire the MIPS-specific phys_t with the generic phys_addr_t.
   - Improvments for the backtrace code used by oprofile.
   - Better backtraces on SMP systems.
   - Cleanups for the Octeon platform code.
   - Cleanups and fixes for the Loongson platform code.
   - Cleanups and fixes to the firmware library.
   - Switch ATH79 platform to use the firmware library.
   - Grand overhault to the SEAD3 and Malta interrupt code.
   - Move the GIC interrupt code to drivers/irqchip
   - Lots of GIC cleanups and updates to the GIC code to use modern IRQ
     infrastructures and features of the kernel.
   - OF documentation updates for the GIC bindings
   - Move GIC clocksource driver to drivers/clocksource
   - Merge GIC clocksource driver with clockevent driver.
   - Further updates to bring the GIC clocksource driver up to date.
   - R3000 TLB code cleanups
   - Improvments to the Loongson 3 platform code.
   - Convert pr_warning to pr_warn.
   - Merge a bunch of small lantiq and ralink fixes that have been
     staged/lingering inside the openwrt tree for a while.
   - Update archhelp for IP22/IP32
   - Fix a number of issues for Loongson 1B.
   - New clocksource and clockevent driver for Loongson 1B.
   - Further work on clk handling for Loongson 1B.
   - Platform work for Broadcom BMIPS.
   - Error handling cleanups for TurboChannel.
   - Fixes and optimization to the microMIPS support.
   - Option to disable the FTLB.
   - Dump more relevant information on machine check exception
   - Change binfmt to allow arch to examine PT_*PROC headers
   - Support for new style FPU register model in O32
   - VDSO randomization.
   - BCM47xx cleanups
   - BCM47xx reimplement the way the kernel accesses NVRAM information.
   - Random cleanups
   - Add support for ATH25 platforms
   - Remove pointless locking code in some PCI platforms.
   - Some improvments to EVA support
   - Minor Alchemy cleanup"

* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (185 commits)
  MIPS: Add MFHC0 and MTHC0 instructions to uasm.
  MIPS: Cosmetic cleanups of page table headers.
  MIPS: Add CP0 macros for extended EntryLo registers
  MIPS: Remove now unused definition of phys_t.
  MIPS: Replace use of phys_t with phys_addr_t.
  MIPS: Replace MIPS-specific 64BIT_PHYS_ADDR with generic PHYS_ADDR_T_64BIT
  PCMCIA: Alchemy Don't select 64BIT_PHYS_ADDR in Kconfig.
  MIPS: lib: memset: Clean up some MIPS{EL,EB} ifdefery
  MIPS: iomap: Use __mem_{read,write}{b,w,l} for MMIO
  MIPS: <asm/types.h> fix indentation.
  MAINTAINERS: Add entry for BMIPS multiplatform kernel
  MIPS: Enable VDSO randomization
  MIPS: Remove a temporary hack for debugging cache flushes in SMTC configuration
  MIPS: Remove declaration of obsolete arch_init_clk_ops()
  MIPS: atomic.h: Reformat to fit in 79 columns
  MIPS: Apply `.insn' to fixup labels throughout
  MIPS: Fix microMIPS LL/SC immediate offsets
  MIPS: Kconfig: Only allow 32-bit microMIPS builds
  MIPS: signal.c: Fix an invalid cast in ISA mode bit handling
  MIPS: mm: Only build one microassembler that is suitable
  ...
2014-12-11 17:56:37 -08:00
Eric W. Biederman
9cc46516dd userns: Add a knob to disable setgroups on a per user namespace basis
- Expose the knob to user space through a proc file /proc/<pid>/setgroups

  A value of "deny" means the setgroups system call is disabled in the
  current processes user namespace and can not be enabled in the
  future in this user namespace.

  A value of "allow" means the segtoups system call is enabled.

- Descendant user namespaces inherit the value of setgroups from
  their parents.

- A proc file is used (instead of a sysctl) as sysctls currently do
  not allow checking the permissions at open time.

- Writing to the proc file is restricted to before the gid_map
  for the user namespace is set.

  This ensures that disabling setgroups at a user namespace
  level will never remove the ability to call setgroups
  from a process that already has that ability.

  A process may opt in to the setgroups disable for itself by
  creating, entering and configuring a user namespace or by calling
  setns on an existing user namespace with setgroups disabled.
  Processes without privileges already can not call setgroups so this
  is a noop.  Prodcess with privilege become processes without
  privilege when entering a user namespace and as with any other path
  to dropping privilege they would not have the ability to call
  setgroups.  So this remains within the bounds of what is possible
  without a knob to disable setgroups permanently in a user namespace.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-11 18:06:36 -06:00
Linus Torvalds
70e71ca0af Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) New offloading infrastructure and example 'rocker' driver for
    offloading of switching and routing to hardware.

    This work was done by a large group of dedicated individuals, not
    limited to: Scott Feldman, Jiri Pirko, Thomas Graf, John Fastabend,
    Jamal Hadi Salim, Andy Gospodarek, Florian Fainelli, Roopa Prabhu

 2) Start making the networking operate on IOV iterators instead of
    modifying iov objects in-situ during transfers.  Thanks to Al Viro
    and Herbert Xu.

 3) A set of new netlink interfaces for the TIPC stack, from Richard
    Alpe.

 4) Remove unnecessary looping during ipv6 routing lookups, from Martin
    KaFai Lau.

 5) Add PAUSE frame generation support to gianfar driver, from Matei
    Pavaluca.

 6) Allow for larger reordering levels in TCP, which are easily
    achievable in the real world right now, from Eric Dumazet.

 7) Add a variable of napi_schedule that doesn't need to disable cpu
    interrupts, from Eric Dumazet.

 8) Use a doubly linked list to optimize neigh_parms_release(), from
    Nicolas Dichtel.

 9) Various enhancements to the kernel BPF verifier, and allow eBPF
    programs to actually be attached to sockets.  From Alexei
    Starovoitov.

10) Support TSO/LSO in sunvnet driver, from David L Stevens.

11) Allow controlling ECN usage via routing metrics, from Florian
    Westphal.

12) Remote checksum offload, from Tom Herbert.

13) Add split-header receive, BQL, and xmit_more support to amd-xgbe
    driver, from Thomas Lendacky.

14) Add MPLS support to openvswitch, from Simon Horman.

15) Support wildcard tunnel endpoints in ipv6 tunnels, from Steffen
    Klassert.

16) Do gro flushes on a per-device basis using a timer, from Eric
    Dumazet.  This tries to resolve the conflicting goals between the
    desired handling of bulk vs.  RPC-like traffic.

17) Allow userspace to ask for the CPU upon what a packet was
    received/steered, via SO_INCOMING_CPU.  From Eric Dumazet.

18) Limit GSO packets to half the current congestion window, from Eric
    Dumazet.

19) Add a generic helper so that all drivers set their RSS keys in a
    consistent way, from Eric Dumazet.

20) Add xmit_more support to enic driver, from Govindarajulu
    Varadarajan.

21) Add VLAN packet scheduler action, from Jiri Pirko.

22) Support configurable RSS hash functions via ethtool, from Eyal
    Perry.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1820 commits)
  Fix race condition between vxlan_sock_add and vxlan_sock_release
  net/macb: fix compilation warning for print_hex_dump() called with skb->mac_header
  net/mlx4: Add support for A0 steering
  net/mlx4: Refactor QUERY_PORT
  net/mlx4_core: Add explicit error message when rule doesn't meet configuration
  net/mlx4: Add A0 hybrid steering
  net/mlx4: Add mlx4_bitmap zone allocator
  net/mlx4: Add a check if there are too many reserved QPs
  net/mlx4: Change QP allocation scheme
  net/mlx4_core: Use tasklet for user-space CQ completion events
  net/mlx4_core: Mask out host side virtualization features for guests
  net/mlx4_en: Set csum level for encapsulated packets
  be2net: Export tunnel offloads only when a VxLAN tunnel is created
  gianfar: Fix dma check map error when DMA_API_DEBUG is enabled
  cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call
  net: fec: only enable mdio interrupt before phy device link up
  net: fec: clear all interrupt events to support i.MX6SX
  net: fec: reset fep link status in suspend function
  net: sock: fix access via invalid file descriptor
  net: introduce helper macro for_each_cmsghdr
  ...
2014-12-11 14:27:06 -08:00
Tony Lindgren
027bc8b082 pstore-ram: Allow optional mapping with pgprot_noncached
On some ARMs the memory can be mapped pgprot_noncached() and still
be working for atomic operations. As pointed out by Colin Cross
<ccross@android.com>, in some cases you do want to use
pgprot_noncached() if the SoC supports it to see a debug printk
just before a write hanging the system.

On ARMs, the atomic operations on strongly ordered memory are
implementation defined. So let's provide an optional kernel parameter
for configuring pgprot_noncached(), and use pgprot_writecombine() by
default.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rob Herring <robherring2@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: stable@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2014-12-11 13:38:31 -08:00
Rob Herring
7ae9cb8193 pstore-ram: Fix hangs by using write-combine mappings
Currently trying to use pstore on at least ARMs can hang as we're
mapping the peristent RAM with pgprot_noncached().

On ARMs, pgprot_noncached() will actually make the memory strongly
ordered, and as the atomic operations pstore uses are implementation
defined for strongly ordered memory, they may not work. So basically
atomic operations have undefined behavior on ARM for device or strongly
ordered memory types.

Let's fix the issue by using write-combine variants for mappings. This
corresponds to normal, non-cacheable memory on ARM. For many other
architectures, this change does not change the mapping type as by
default we have:

#define pgprot_writecombine pgprot_noncached

The reason why pgprot_noncached() was originaly used for pstore
is because Colin Cross <ccross@android.com> had observed lost
debug prints right before a device hanging write operation on some
systems. For the platforms supporting pgprot_noncached(), we can
add a an optional configuration option to support that. But let's
get pstore working first before adding new features.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Anton Vorontsov <cbouatmailru@gmail.com>
Cc: Colin Cross <ccross@android.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
[tony@atomide.com: updated description]
Signed-off-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2014-12-11 13:35:49 -08:00
Al Viro
93fe74b2e2 coda_venus_readdir(): use file_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-11 16:28:12 -05:00
Al Viro
d465887f9d fs/namei.c: fold link_path_walk() call into path_init()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-11 16:27:57 -05:00
Al Viro
980f3ea2f6 path_init(): don't bother with LOOKUP_PARENT in argument
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-11 16:27:57 -05:00
Al Viro
893b7775a7 fs/namei.c: new helper (path_cleanup())
All callers of path_init() proceed to do the identical cleanup when
they are done with nameidata.  Don't open-code it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-11 16:27:57 -05:00
Al Viro
5e53084d77 path_init(): store the "base" pointer to file in nameidata itself
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-11 16:27:57 -05:00
Linus Torvalds
b6da0076ba Merge branch 'akpm' (patchbomb from Andrew)
Merge first patchbomb from Andrew Morton:
 - a few minor cifs fixes
 - dma-debug upadtes
 - ocfs2
 - slab
 - about half of MM
 - procfs
 - kernel/exit.c
 - panic.c tweaks
 - printk upates
 - lib/ updates
 - checkpatch updates
 - fs/binfmt updates
 - the drivers/rtc tree
 - nilfs
 - kmod fixes
 - more kernel/exit.c
 - various other misc tweaks and fixes

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits)
  exit: pidns: fix/update the comments in zap_pid_ns_processes()
  exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
  exit: exit_notify: re-use "dead" list to autoreap current
  exit: reparent: call forget_original_parent() under tasklist_lock
  exit: reparent: avoid find_new_reaper() if no children
  exit: reparent: introduce find_alive_thread()
  exit: reparent: introduce find_child_reaper()
  exit: reparent: document the ->has_child_subreaper checks
  exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
  exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
  exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
  exit: proc: don't try to flush /proc/tgid/task/tgid
  exit: release_task: fix the comment about group leader accounting
  exit: wait: drop tasklist_lock before psig->c* accounting
  exit: wait: don't use zombie->real_parent
  exit: wait: cleanup the ptrace_reparented() checks
  usermodehelper: kill the kmod_thread_locker logic
  usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
  fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
  nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
  ...
2014-12-10 18:34:42 -08:00
Al Viro
bd9b51e79c make default ->i_fop have ->open() fail with ENXIO
As it is, default ->i_fop has NULL ->open() (along with all other methods).
The only case where it matters is reopening (via procfs symlink) a file that
didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned
to something sane (default would fail on read/write/ioctl/etc.).

	Unfortunately, such case exists - alloc_file() users, especially
anon_get_file() ones.  There we have tons of opened files of very different
kinds sharing the same inode.  As the result, attempt to reopen those via
procfs succeeds and you get a descriptor you can't do anything with.

	Moreover, in case of sockets we set ->i_fop that will only be used
on such reopen attempts - and put a failing ->open() into it to make sure
those do not succeed.

	It would be simpler to put such ->open() into default ->i_fop and leave
it unchanged both for anon inode (as we do anyway) and for socket ones.  Result:
	* everything going through do_dentry_open() works as it used to
	* sock_no_open() kludge is gone
	* attempts to reopen anon-inode files fail as they really ought to
	* ditto for aio_private_file()
	* ditto for perfmon - this one actually tried to imitate sock_no_open()
trick, but failed to set ->i_fop, so in the current tree reopens succeed and
yield completely useless descriptor.  Intent clearly had been to fail with
-ENXIO on such reopens; now it actually does.
	* everything else that used alloc_file() keeps working - it has ->i_fop
set for its inodes anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10 21:32:15 -05:00
Al Viro
1f55a6ec94 make nameidata completely opaque outside of fs/namei.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10 21:32:13 -05:00
Al Viro
707c5960f1 Merge branch 'nsfs' into for-next 2014-12-10 21:31:59 -05:00
Al Viro
3d3d35b1e9 kill proc_ns completely
procfs inodes need only the ns_ops part; nsfs inodes don't need it at all

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10 21:30:57 -05:00
Al Viro
e149ed2b80 take the targets of /proc/*/ns/* symlinks to separate fs
New pseudo-filesystem: nsfs.  Targets of /proc/*/ns/* live there now.
It's not mountable (not even registered, so it's not in /proc/filesystems,
etc.).  Files on it *are* bindable - we explicitly permit that in do_loopback().

This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
get_proc_ns() is a macro now (it's simply returning ->i_private; would
have been an inline, if not for header ordering headache).
proc_ns_inode() is an ex-parrot.  The interface used in procfs is
ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).

Dentries and inodes are never hashed; a non-counting reference to dentry
is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
if present.  See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
of that mechanism.

As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
it does nd_jump_link() on a consistent <vfsmount,dentry> pair it gets
from ns_get_path().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10 21:30:20 -05:00
Oleg Nesterov
c35a7f18a0 exit: proc: don't try to flush /proc/tgid/task/tgid
proc_flush_task_mnt() always tries to flush task/pid, but this is
pointless if we reap the leader. d_invalidate() is recursive, and
if nothing else the next d_hash_and_lookup(tgid) should fail anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:17 -08:00
Rasmus Villemoes
ddbc22e27e fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
Relying on the sign (after casting to int) of the difference of two
quantities for comparison is usually wrong.  For example, should a-b
turn out to be 2^31, the return value of cmp(a,b) is -2^31; but that
would also be the return value from cmp(b, a).  So a compares less than
b and b compares less than a.  One can also easily find three values
a,b,c such that a compares less than b, b compares less than c, but a
does not compare less than c.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:16 -08:00
Ryusuke Konishi
705304a863 nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
Same story as in commit 41080b5a24 ("nfsd race fixes: ext2") (similar
ext2 fix) except that nilfs2 needs to use insert_inode_locked4() instead
of insert_inode_locked() and a bug of a check for dead inodes needs to
be fixed.

If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at
the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.

If nilfs_iget() is called before nilfs_new_inode() calls
insert_inode_locked4(), it will create an in-core inode and read its
data from the on-disk inode.  But, nilfs_iget() will find i_nlink equals
zero and fail at nilfs_read_inode_common(), which will lead it to call
iget_failed() and cleanly fail.

However, this sanity check doesn't work as expected for reused on-disk
inodes because they leave a non-zero value in i_mode field and it
hinders the test of i_nlink.  This patch also fixes the issue by
removing the test on i_mode that nilfs2 doesn't need.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:16 -08:00
Markus Elfring
72b9918ea4 nilfs2: deletion of an unnecessary check before the function call "iput"
The iput() function tests whether its argument is NULL and then returns
immediately.  Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:16 -08:00
Andreas Rohner
75dc857c46 nilfs2: avoid duplicate segment construction for fsync()
This patch removes filemap_write_and_wait_range() from nilfs_sync_file(),
because it triggers a data segment construction by calling
nilfs_writepages() with WB_SYNC_ALL.  A data segment construction does not
remove the inode from the i_dirty list and it does not clear the
NILFS_I_DIRTY flag.  Therefore nilfs_inode_dirty() still returns true,
which leads to an unnecessary duplicate segment construction in
nilfs_sync_file().

A call to filemap_write_and_wait_range() is not needed, because NILFS2
does not rely on the generic writeback mechanisms.  Instead it implements
its own mechanism to collect all dirty pages and write them into segments.
 It is more efficient to initiate the segment construction directly in
nilfs_sync_file() without the detour over filemap_write_and_wait_range().

Additionally the lock of i_mutex is not needed, because all code blocks
that are protected by i_mutex are also protected by a NILFS transaction:

  Function                i_mutex     nilfs_transaction
  ------------------------------------------------------
  nilfs_ioctl_setflags:   yes         yes
  nilfs_fiemap:           yes         no
  nilfs_write_begin:      yes         yes
  nilfs_write_end:        yes         yes
  nilfs_lookup:           yes         no
  nilfs_create:           yes         yes
  nilfs_link:             yes         yes
  nilfs_mknod:            yes         yes
  nilfs_symlink:          yes         yes
  nilfs_mkdir:            yes         yes
  nilfs_unlink:           yes         yes
  nilfs_rmdir:            yes         yes
  nilfs_rename:           yes         yes
  nilfs_setattr:          yes         yes

For nilfs_lookup() i_mutex is held for the parent directory, to protect it
from modification.  The segment construction does not modify directory
inodes, so no lock is needed.

nilfs_fiemap() reads the block layout on the disk, by using
nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem.

Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:16 -08:00
Jan Kara
a682e9c28c ncpfs: return proper error from NCP_IOC_SETROOT ioctl
If some error happens in NCP_IOC_SETROOT ioctl, the appropriate error
return value is then (in most cases) just overwritten before we return.
This can result in reporting success to userspace although error happened.

This bug was introduced by commit 2e54eb96e2 ("BKL: Remove BKL from
ncpfs").  Propagate the errors correctly.

Coverity id: 1226925.

Fixes: 2e54eb96e2 ("BKL: Remove BKL from ncpfs")
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:13 -08:00
Jungseung Lee
52f5592e54 fs/binfmt_elf.c: fix internal inconsistency relating to vma dump size
vma_dump_size() has been used several times on actual dumper and it is
supposed to return the same value for the same vma.  But vma_dump_size()
could return different values for same vma.

The known problem case is concurrent shared memory removal.  If a vma is
used for a shared memory and that shared memory is removed between
writing program header and dumping vma memory, this will result in a
dump file which is internally consistent.

To fix the problem, we set baseline to get dump size and store the size
into vma_filesz and always use the same vma dump size which is stored in
vma_filsz.  The consistnecy with reality is not actually guranteed, but
it's tolerable since that is fully consistent with base line.

Signed-off-by: Jungseung Lee <js07.lee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:12 -08:00
Andrew Morton
f7e1ad1a1e fs/binfmt_misc.c: use GFP_KERNEL instead of GFP_USER
GFP_USER means "honour cpuset nodes-allowed beancounting".  These are
regular old kernel objects and there seems no reason to give them this
treatment.

Acked-by: Mike Frysinger <vapier@gentoo.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:12 -08:00
Mike Frysinger
e6084d4a08 binfmt_misc: clean up code style a bit
Clean up various coding style issues that checkpatch complains about.
No functional changes here.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:12 -08:00
Mike Frysinger
6b899c4e9a binfmt_misc: add comments & debug logs
When trying to develop a custom format handler, the errors returned all
effectively get bucketed as EINVAL with no kernel messages.  The other
errors (ENOMEM/EFAULT) are internal/obvious and basic.  Thus any time a
bad handler is rejected, the developer has to walk the dense code and
try to guess where it went wrong.  Needing to dive into kernel code is
itself a fairly high barrier for a lot of people.

To improve this situation, let's deploy extensive pr_debug markers at
logical parse points, and add comments to the dense parsing logic.  It
let's you see exactly where the parsing aborts, the string the kernel
received (useful when dealing with shell code), how it translated the
buffers to binary data, and how it will apply the mask at runtime.

Some example output:
  $ echo ':qemu-foo:M::\x7fELF\xAD\xAD\x01\x00:\xff\xff\xff\xff\xff\x00\xff\x00:/usr/bin/qemu-foo:POC' > register
  $ dmesg
  binfmt_misc: register: received 92 bytes
  binfmt_misc: register: delim: 0x3a {:}
  binfmt_misc: register: name: {qemu-foo}
  binfmt_misc: register: type: M (magic)
  binfmt_misc: register: offset: 0x0
  binfmt_misc: register: magic[raw]: 5c 78 37 66 45 4c 46 5c 78 41 44 5c 78 41 44 5c  \x7fELF\xAD\xAD\
  binfmt_misc: register: magic[raw]: 78 30 31 5c 78 30 30 00                          x01\x00.
  binfmt_misc: register:  mask[raw]: 5c 78 66 66 5c 78 66 66 5c 78 66 66 5c 78 66 66  \xff\xff\xff\xff
  binfmt_misc: register:  mask[raw]: 5c 78 66 66 5c 78 30 30 5c 78 66 66 5c 78 30 30  \xff\x00\xff\x00
  binfmt_misc: register:  mask[raw]: 00                                               .
  binfmt_misc: register: magic/mask length: 8
  binfmt_misc: register: magic[decoded]: 7f 45 4c 46 ad ad 01 00                          .ELF....
  binfmt_misc: register:  mask[decoded]: ff ff ff ff ff 00 ff 00                          ........
  binfmt_misc: register:  magic[masked]: 7f 45 4c 46 ad 00 01 00                          .ELF....
  binfmt_misc: register: interpreter: {/usr/bin/qemu-foo}
  binfmt_misc: register: flag: P (preserve argv0)
  binfmt_misc: register: flag: O (open binary)
  binfmt_misc: register: flag: C (preserve creds)

The [raw] lines show us exactly what was received from userspace.  The
lines after that show us how the kernel has decoded things.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:12 -08:00
Yann Droneaud
8d10a03582 fs/file.c: replace get_unused_fd() with get_unused_fd_flags(0)
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

In a further patch, get_unused_fd() will be removed so that new code
start using get_unused_fd_flags(), with the hope O_CLOEXEC could be
used, either by default or choosen by userspace.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:10 -08:00
Yann Droneaud
c6cb898b54 binfmt_misc: replace get_unused_fd() with get_unused_fd_flags(0)
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.

In a further patch, get_unused_fd() will be removed so that new code start
using get_unused_fd_flags(), with the hope O_CLOEXEC could be used, either
by default or choosen by userspace.

Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:10 -08:00
Oleg Nesterov
abdba6e9ea proc: task_state: ptrace_parent() doesn't need pid_alive() check
p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check.  Other callers already use it
directly without additional checks.

Note: with or without this patch ptrace_parent() can return the pointer to
the freed task, this will be explained/fixed later.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Oleg Nesterov
b0fafc1111 proc: task_state: move the main seq_printf() outside of rcu_read_lock()
task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id().  We can calculate
tgid/ngid and drop rcu lock.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Oleg Nesterov
0f4a0d53f2 proc: task_state: deuglify the max_fds calculation
1. The usage of fdt looks very ugly, it can't be NULL if ->files is
   not NULL. We can use "unsigned int max_fds" instead.

2. This also allows to move seq_printf(max_fds) outside of task_lock()
   and join it with the previous seq_printf(). See also the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Oleg Nesterov
4af1036df4 proc: task_state: read cred->group_info outside of task_lock()
task_state() reads cred->group_info under task_lock() because a long ago
it was task_struct->group_info and it was actually protected by
task->alloc_lock.  Today this task_unlock() after rcu_read_unlock() just
adds the confusion, move task_unlock() up.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Nicolas Dichtel
2fc1e948e8 fs/proc.c: use rb_entry_safe() instead of rb_entry()
Better to use existing macro that rewriting them.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Debabrata Banerjee
b208d54b75 procfs: fix error handling of proc_register()
proc_register() error paths are leaking inodes and directory refcounts.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Nicolas Dichtel
710585d492 fs/proc: use a rb tree for the directory entries
When a lot of netdevices are created, one of the bottleneck is the
creation of proc entries.  This serie aims to accelerate this part.

The current implementation for the directories in /proc is using a single
linked list.  This is slow when handling directories with large numbers of
entries (eg netdevice-related entries when lots of tunnels are opened).

This patch replaces this linked list by a red-black tree.

Here are some numbers:

dummy30000.batch contains 30 000 times 'link add type dummy'.

Before the patch:
  $ time ip -b dummy30000.batch
  real    2m31.950s
  user    0m0.440s
  sys     2m21.440s
  $ time rmmod dummy
  real    1m35.764s
  user    0m0.000s
  sys     1m24.088s

After the patch:
  $ time ip -b dummy30000.batch
  real    2m0.874s
  user    0m0.448s
  sys     1m49.720s
  $ time rmmod dummy
  real    1m13.988s
  user    0m0.000s
  sys     1m1.008s

The idea of improving this part was suggested by Thierry Herbelot.

[akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Thierry Herbelot <thierry.herbelot@6wind.com>.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Kirill A. Shutemov
c164e038ee mm: fix huge zero page accounting in smaps report
As a small zero page, huge zero page should not be accounted in smaps
report as normal page.

For small pages we rely on vm_normal_page() to filter out zero page, but
vm_normal_page() is not designed to handle pmds.  We only get here due
hackish cast pmd to pte in smaps_pte_range() -- pte and pmd format is not
necessary compatible on each and every architecture.

Let's add separate codepath to handle pmds.  follow_trans_huge_pmd() will
detect huge zero page for us.

We would need pmd_dirty() helper to do this properly.  The patch adds it
to THP-enabled architectures which don't yet have one.

[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengwei Yin <yfw.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:08 -08:00
Jan Kara
e2ab879e96 fs/char_dev.c: remove pointless assignment from __register_chrdev_region()
At one place we assign major number we found to ret.  That assignment is
then never used and actually doesn't make any sense given how the code is
currently structured (the assignment comes from pre-git times).  Just
remove it.

Coverity id: 1226852.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Dan Carpenter
b3e3e5af60 ocfs2: remove unneeded NULL check
In commit 1faf289454 ("ocfs2_dlm: disallow a domain join if node maps
mismatch") we introduced a new earlier NULL check so this one is not
needed.  Also static checkers complain because we dereference it first
and then check for NULL.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
Dan Carpenter
88d69b92fc ocfs2: remove bogus NULL check in ocfs2_move_extents()
"inode" isn't NULL here, and also we dereference it on the previous line
so static checkers get annoyed.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:04 -08:00
jiangyiwen
61fb9ea4b3 ocfs2: do not set filesystem readonly if link down
Do not set the filesystem readonly if the storage link is down.  In this
case, metadata is not corrupted and only -EIO is returned.  And if it is
indeed corrupted metadata, it has already called ocfs2_error() in
ocfs2_validate_inode_block().

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Xue jiufei
d1e7823874 ocfs2: do not set OCFS2_LOCK_UPCONVERT_FINISHING if nonblocking lock can not be granted at once
ocfs2_readpages() use nonblocking flag to avoid page lock inversion.  It
will trigger cluster hang because that flag OCFS2_LOCK_UPCONVERT_FINISHING
is not cleared if nonblocking lock cannot be granted at once.  The flag
would prevent dc thread from downconverting.  So other nodes cannot
acheive this lockres for ever.

So we should not set OCFS2_LOCK_UPCONVERT_FINISHING when receiving ast if
nonblocking lock had already returned.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Jan Kara
dc17158060 ocfs2: fix error handling when creating debugfs root in ocfs2_init()
Error handling if creation of root of debugfs in ocfs2_init() fails is
broken.  Although error code is set we fail to exit ocfs2_init() with
error and thus initialization ends with success.  Later when mounting a
filesystem, ocfs2 debugfs entries end up being created in the root of
debugfs filesystem which is confusing.

Fix the error handling to bail out.

Coverity id: 1227009.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Goldwyn Rodrigues
86b9c6f3f8 ocfs2: remove filesize checks for sync I/O journal commit
Filesize is not a good indication that the file needs to be synced.
An example where this breaks is:
 1. Open the file in O_SYNC|O_RDWR
 2. Read a small portion of the file (say 64 bytes)
 3. Lseek to starting of the file
 4. Write 64 bytes

If the node crashes, it is not written out to disk because this was not
committed in the journal and the other node which reads the file after
recovery reads stale data (even if the write on the other node was
successful)

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Junxiao Bi
196fe71d64 ocfs2: o2net: fix connect expired
Set nn_persistent_error to -ENOTCONN will stop reconnect since the
"stop" condition in o2net_start_connect() will be true.

    stop = (nn->nn_sc ||
                (nn->nn_persistent_error &&
                (nn->nn_persistent_error != -ENOTCONN || timeout == 0)));

This will make connection never be established if the first connection
request is lost.

Set nn_persistent_error to 0 when connect expired to fix this.  With
this changes, dlm will not be waken up when connect expired, this is OK
since dlm depends on network, dlm can do nothing in this case if waken
up.  Let it wait there for network recover and connect built again to
continue.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Srinivas Eeda
cb79662bc2 ocfs2: o2dlm: fix a race between purge and master query
Node A sends master query request to node B which is the master.  At this
time lockres happens to be on purgelist.  dlm_master_request_handler gets
the dlm spinlock, finds the resource and releases the dlm spin lock.
Right at this dlm_thread on this node could purge the lockres.
dlm_master_request_handler can then acquire lockres spinlock and reply to
Node A that node B is the master even though lockres on node B is purged.

The above scenario will now make node A falsely think node B is the master
which is inconsistent.  Further if another node C tries to master the same
resource, every node will respond they are not the master.  Node C then
masters the resource and sends assert master to all nodes.  This will now
make node A crash with the following message.

dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current
owner is 10!

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Tested-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Jan Kara
f5425fcea7 ocfs2: report error from o2hb_do_disk_heartbeat() to user
Report return value of o2hb_do_disk_heartbeat() as a part of ML_HEARTBEAT
message so that we know whether a heartbeat actually happened or not.
This also makes assigned but otherwise unused 'ret' variable used.

Coverity id: 1227053.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Jan Kara
4a635a113b ocfs2: remove bogus test from ocfs2_read_locked_inode()
'args' are always set for ocfs2_read_locked_inode() and brelse() checks
whether bh is NULL.  So the test (args && bh) is unnecessary (plus the
args part is really confusing anyway).  Remove it.

Coverity id: 1128856.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Jan Kara
2b693005b8 ocfs2: Fix xattr check in ocfs2_get_xattr_nolock()
ocfs2_get_xattr_nolock() checks whether inode has any extended attributes
(OCFS2_HAS_XATTR_FL).  If not, it just sets 'ret' to -ENODATA but
continues with checking inline and external attributes anyway (which is
pointless although it does not harm).  Just return immediately when we
know there are no extended attributes in the inode.

Coverity id: 1226906.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Dan Carpenter
519a286175 ocfs2: fix an off-by-one BUG_ON() statement
The ->si_slots[] array is allocated in ocfs2_init_slot_info() it has
"->max_slots" number of elements so this test should be >= instead of >.

Static checker work.  Compile tested only.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Joseph Qi
f08736bd6c ocfs2/dlm: let sender retry if dlm_dispatch_assert_master failed with -ENOMEM
Do not BUG() if GFP_ATOMIC allocation fails in dlm_dispatch_assert_master.
Instead, return -ENOMEM to the sender and then retry.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:03 -08:00
Fabian Frederick
662e9b2b98 fs/cifs/smb2file.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:02 -08:00
Fabian Frederick
4b99d39b1b fs/cifs/file.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:02 -08:00
Fabian Frederick
bc09d141eb fs/cifs: remove obsolete __constant
Replace all __constant_foo to foo() except in smb2status.h (1700 lines to
update).

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:02 -08:00
Linus Torvalds
cbfe0de303 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS changes from Al Viro:
 "First pile out of several (there _definitely_ will be more).  Stuff in
  this one:

   - unification of d_splice_alias()/d_materialize_unique()

   - iov_iter rewrite

   - killing a bunch of ->f_path.dentry users (and f_dentry macro).

     Getting that completed will make life much simpler for
     unionmount/overlayfs, since then we'll be able to limit the places
     sensitive to file _dentry_ to reasonably few.  Which allows to have
     file_inode(file) pointing to inode in a covered layer, with dentry
     pointing to (negative) dentry in union one.

     Still not complete, but much closer now.

   - crapectomy in lustre (dead code removal, mostly)

   - "let's make seq_printf return nothing" preparations

   - assorted cleanups and fixes

  There _definitely_ will be more piles"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  copy_from_iter_nocache()
  new helper: iov_iter_kvec()
  csum_and_copy_..._iter()
  iov_iter.c: handle ITER_KVEC directly
  iov_iter.c: convert copy_to_iter() to iterate_and_advance
  iov_iter.c: convert copy_from_iter() to iterate_and_advance
  iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
  iov_iter.c: convert iov_iter_zero() to iterate_and_advance
  iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
  iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
  iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
  iov_iter.c: iterate_and_advance
  iov_iter.c: macros for iterating over iov_iter
  kill f_dentry macro
  dcache: fix kmemcheck warning in switch_names
  new helper: audit_file()
  nfsd_vfs_write(): use file_inode()
  ncpfs: use file_inode()
  kill f_dentry uses
  lockd: get rid of ->f_path.dentry->d_sb
  ...
2014-12-10 16:10:49 -08:00
Linus Torvalds
8322b6fddf dlm for 3.19
This set includes one feature, which allows locks that
 have been orphaned to be reacquired.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUiIOTAAoJEDgbc8f8gGmqIHcP/1JW/ji74+bdci533MH1pL9W
 l6HhoiGsNi4bF7QJckKXUKZYDii4VBjf5VkvvA0aNnXoncBr2XW6b96TUPsdbzBd
 St11wSkhlfMOMfg+aalBFd34fm4QwNt5hqmuPAJXbK24Jgy4JMzHsEHSi7yz5WDn
 Pgl2s9+6fDU648Vd0iu1u3jMXY8MP1apGWJkV7tFmt2XE1DyOP+yqghYp2PkQeDk
 hZPdLhO2JwynRliPg99qNLvBurzYwFk1RJ1fbi9WPnvgTp42i+JuxbMtZwh5ehjr
 FLWskJIJmbsm8S95uw4lGFhPE+76Psq4etmoTl+lyT2pZQeNItWX6JQ6u8UcGevD
 xJeokmdhvbF4NRIcgP7b3u3Mue78PdkqAy40nmkBdp4+9uJrXB/+Mts1FBJHgXIH
 jdEGGdVCBSGr7TRkbJ5hMfI51Wyrl6u2JICCBlzwGWbRiXy76u78YAIj+w5s45yL
 HxkRZll9UMNEDlO//Ldhwh0CV0yBW00bdeurwxm6i4xS9vAUEIXByJ62EwbPnMvC
 vD6oufWkfNzAKZoF8gvPwQStt9pXWPNe314QvUVHx6B9VZpcV9VfEqmNN4qzhuBU
 I5a5G03tnjtd1JdcsFfxduDIYVDYTmba/Bj/CLVMECWsBRAvCzr57a3JHL+cabA0
 Lz/LqaTNterF8l4zAo8J
 =u91p
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm update from David Teigland:
 "This set includes one feature, which allows locks that have been
  orphaned to be reacquired"

* tag 'dlm-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: adopt orphan locks
2014-12-10 16:02:12 -08:00
Linus Torvalds
1366f5d312 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota updates from Jan Kara:
 "Quota improvements and some minor cleanups.

  The main portion in the pull request are changes which move i_dquot
  array from struct inode into fs-private part of an inode which saves
  memory for filesystems which don't use VFS quotas"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: One function call less in udf_fill_super() after error detection
  udf: Deletion of unnecessary checks before the function call "iput"
  jbd: Deletion of an unnecessary check before the function call "iput"
  vfs: Remove i_dquot field from inode
  jfs: Convert to private i_dquot field
  reiserfs: Convert to private i_dquot field
  ocfs2: Convert to private i_dquot field
  ext4: Convert to private i_dquot field
  ext3: Convert to private i_dquot field
  ext2: Convert to private i_dquot field
  quota: Use function to provide i_dquot pointers
  xfs: Set allowed quota types
  gfs2: Set allowed quota types
  quota: Allow each filesystem to specify which quota types it supports
  quota: Remove const from function declarations
  quota: Add log level to printk
2014-12-10 15:43:30 -08:00
Linus Torvalds
4b0a268eec Merge tag 'for-f2fs-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
 "This patch-set includes lots of bug fixes based on clean-ups and
  refactored codes.  And inline_dir was introduced and two minor mount
  options were added.  Details from signed tag:

  This series includes the following enhancement with refactored flows.
   - fix inmemory page operations
   - fix wrong inline_data & inline_dir logics
   - enhance memory and IO control under memory pressure
   - consider preemption on radix_tree operation
   - fix memory leaks and deadlocks

  But also, there are a couple of new features:
   - support inline_dir to store dentries inside inode page
   - add -o fastboot to reduce booting time
   - implement -o dirsync

  And a lot of clean-ups and minor bug fixes as well"

* tag 'for-f2fs-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (88 commits)
  f2fs: avoid to ra unneeded blocks in recover flow
  f2fs: introduce is_valid_blkaddr to cleanup codes in ra_meta_pages
  f2fs: fix to enable readahead for SSA/CP blocks
  f2fs: use atomic for counting inode with inline_{dir,inode} flag
  f2fs: cleanup path to need cp at fsync
  f2fs: check if inode state is dirty at fsync
  f2fs: count the number of inmemory pages
  f2fs: release inmemory pages when the file was closed
  f2fs: set page private for inmemory pages for truncation
  f2fs: count inline_xx in do_read_inode
  f2fs: do retry operations with cond_resched
  f2fs: call radix_tree_preload before radix_tree_insert
  f2fs: use rw_semaphore for nat entry lock
  f2fs: fix missing kmem_cache_free
  f2fs: more fast lookup for gc_inode list
  f2fs: cleanup redundant macro
  f2fs: fix to return correct error number in f2fs_write_begin
  f2fs: cleanup if-statement of phase in gc_data_segment
  f2fs: fix to recover converted inline_data
  f2fs: make clean the page before writing
  ...
2014-12-10 15:41:28 -08:00
Linus Torvalds
a6b849578e Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs update from Steve French:
 "Mostly cifs cleanup but also a few cifs fixes"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: remove unneeded condition check
  Set UID in sess_auth_rawntlmssp_authenticate too
  cifs: convert printk(LEVEL...) to pr_<level>
  cifs: convert to print_hex_dump() instead of custom implementation
  cifs: call strtobool instead of custom implementation
  Update MAINTAINERS entry
  Update modinfo cifs version for cifs.ko
  decode_negTokenInit had wrong calling sequence
  Add missing defines for ACL query support
  Add support for original fallocate
2014-12-10 15:37:51 -08:00
Linus Torvalds
1715ac63d3 In contrast to recent merge windows, there are a number of interesting features
this time. There is a set of patches to improve performance in relation to
 block reservations. Some correctness fixes for fallocate, and an update
 to the freeze/thaw code which greatly simplyfies this code path. In
 addition there is a set of clean ups from Al Viro too.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJUher6AAoJEMrg3m4a/8jSaJwP/Ai9cohCBohYgzgBIas0L8zy
 H6BwYwLoUU0E7UlL7RBkjE9ZNL2meFcDM4NGpzXkOcJaJw5hkWHcwSmLBOU1V27N
 v3wgaLd1J2BXwaYMrJ0XTqbdzU63Y27KkXOHPBr+UwEtd3azeugNX2sfgrKg8cqd
 6AM8sbPifGs+2u1viTbtAhirIo/TE2kk60OuBeX6hCNjvN/PcOKKF+ISewtpqfFD
 1vHwjVDX7USuUkjGQRCmM7A032b2YilMf+57Oe/a2Q+CyI7E41259nrwWC0/vcst
 AuKb48WyL6Y6YLMXA2HlqxeYkyEAyr0pk0D4hRYYofebSn3d4mDaxvTU0y/vKuL1
 bD9J3niPv44B9OtrjzbKf0Utsk9cUeYMOcb6ydMTcEYdMIEITG21N/yR1bU2MkYt
 4KpnjcdEtoNteo0OsxtWq2poL0RxlKde8P7wUtwvnrK0wcVDdWbLU1iXf0t2r2RF
 JO9ZSTYrKoFvTpg34zCcUlHBMarZSdP1Kou9hUkTXmZtmirwqR+9T6GtexD60jxz
 TIRMHOf8HXz9wM4kUI442IBaHIW38AsXNEPVUp3vk04qLCqCPmE7ISBvAB4NHbIn
 Yw/X9fJwK3hn+/R9+u09aJKLGDKWwlSOVdTb+yFgQcqz6BcaBoZMdamiKQcOGEk2
 5qQ8J/F5f87BZOvuUUpI
 =t1F/
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

Pull GFS2 update from Steven Whitehouse:
 "In contrast to recent merge windows, there are a number of interesting
  features this time:

  There is a set of patches to improve performance in relation to block
  reservations.  Some correctness fixes for fallocate, and an update to
  the freeze/thaw code which greatly simplyfies this code path.  In
  addition there is a set of clean ups from Al Viro too"

* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: gfs2_atomic_open(): simplify the use of finish_no_open()
  GFS2: gfs2_dir_get_hash_table(): avoiding deferred vfree() is easy here...
  GFS2: use kvfree() instead of open-coding it
  GFS2: gfs2_create_inode(): don't bother with d_splice_alias()
  GFS2: bugger off early if O_CREAT open finds a directory
  GFS2: Deletion of unnecessary checks before two function calls
  GFS2: update freeze code to use freeze/thaw_super on all nodes
  fs: add freeze_super/thaw_super fs hooks
  GFS2: Update timestamps on fallocate
  GFS2: Update i_size properly on fallocate
  GFS2: Use inode_newsize_ok and get_write_access in fallocate
  GFS2: If we use up our block reservation, request more next time
  GFS2: Only increase rs_sizehint
  GFS2: Set of distributed preferences for rgrps
  GFS2: directly return gfs2_dir_check()
2014-12-10 15:34:29 -08:00
Linus Torvalds
08e2fb6ce6 On a system that restricts access to dmesg, don't let people
side-step that by reading copies that pstore saved.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUheJgAAoJEKurIx+X31iB5F0P/jdpAw6cI26icGiOcRvRYvce
 jLq/WbGggxZlx3rtgGpekJmcJ1NBBTLdyx4b86q4q/zstQkoJ9lqGCn63YcIMJNB
 pdctmbkGyoQQXBTAzSCFs6pybMUmtYKMDiT3OJddcCm4fUjd4RQHvNP+5ESsf0lQ
 9YpIS+rZOtB2/5N6/i4+Lnaffc3s5gXw/dJMxOm/laWtRFRyhf22YP18cRp5LmuV
 NHqu1NoeLnar/qL6plPl73lEyZVOPRC01T7OWmmCkcLieYPGkqQlkoXp95VBKf5u
 CvD167oM71OccMa0gOTlCS8a6y5KO6y8I+YAR60iANTLDh+rHZiwNj1gY4v/Z29m
 2ba1xAulQrpCxqml6eVxAKaF+4HXaXVXKqjQIivJcGyfYf6BXLMvC0M3Lsv7XQdz
 HKl++o0JELDEJjVW0i9Wa5CjgcqXdvuRXOoKDaKTZWff2yfUxqIN5Xl7zIV2kgVy
 ZqPDBHJSmHjuzmJ6inhPkmdS2uz94PVSE7ykeaa8iCBbpdsS+FchtF2sRMvUhU23
 ekHsxk0Mk/pS5EBNc6rrrM9NtKrUQMa1e/oT5G7QowksDeNpsPjx92OeUImxgh3x
 +hmObN9vx6SepwVSfjI1rwrMsAknphJfPmyi/XJgkVbfRMCv2we1npvYd6hqFUMV
 daekMzGOi5eqoaWB8hje
 =Ezg0
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux

Pull pstore fixes from Tony Luck:
 "On a system that restricts access to dmesg, don't let people side-step
  that by reading copies that pstore saved"

* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
  syslog: Provide stub check_syslog_permissions
  pstore: Honor dmesg_restrict sysctl on dmesg dumps
  pstore/ram: Strip ramoops header for correct decompression
2014-12-10 15:15:56 -08:00
Linus Torvalds
e20db597b6 NFS client updates for Linux 3.19
Highlights include:
 
 Features:
 - NFSv4.2 client support for hole punching and preallocation.
 - Further RPC/RDMA client improvements.
 - Add more RPC transport debugging tracepoints.
 - Add RPC debugging tools in debugfs.
 
 Bugfixes:
 - Stable fix for layoutget error handling
 - Fix a change in COMMIT behaviour resulting from the recent io code updates
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUhRVTAAoJEGcL54qWCgDyfeUP/RoFo3ImTMbGxfcPJqoELjcO
 lZbQ+27pOE/whFDkWgiOVTwlgGct5a0WRo7GCZmpYJA4q1kmSv4ngTb3nMTCUztt
 xMJ0mBr0BqttVs+ouKiVPm3cejQXedEhttwWcloIXS8lNenlpL29Zlrx2NHdU8UU
 13+souocj0dwIyTYYS/4Lm9KpuCYnpDBpP5ShvQjVaMe/GxJo6GyZu70c7FgwGNz
 Nh9onzZV3mz1elhfizlV38aVA7KWVXtLWIqOFIKlT2fa4nWB8Hc07miR5UeOK0/h
 r+icnF2qCQe83MbjOxYNxIKB6uiA/4xwVc90X4AQ7F0RX8XPWHIQWG5tlkC9jrCQ
 3RGzYshWDc9Ud2mXtLMyVQxHVVYlFAe1WtdP8ZWb1oxDInmhrarnWeNyECz9xGKu
 VzIDZzeq9G8slJXATWGRfPsYr+Ihpzcen4QQw58cakUBcqEJrYEhlEOfLovM71k3
 /S/jSHBAbQqiw4LPMw87bA5A6+ZKcVSsNE0XCtNnhmqFpLc1kKRrl5vaN+QMk5tJ
 v4/zR0fPqH7SGAJWYs4brdfahyejEo0TwgpDs7KHmu1W9zQ0LCVTaYnQuUmQjta6
 WyYwIy3TTibdfR191O0E3NOW82Q/k/NBD6ySvabN9HqQ9eSk6+rzrWAslXCbYohb
 BJfzcQfDdx+lsyhjeTx9
 =wOP3
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:
   - NFSv4.2 client support for hole punching and preallocation.
   - Further RPC/RDMA client improvements.
   - Add more RPC transport debugging tracepoints.
   - Add RPC debugging tools in debugfs.

  Bugfixes:
   - Stable fix for layoutget error handling
   - Fix a change in COMMIT behaviour resulting from the recent io code
     updates"

* tag 'nfs-for-3.19-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits)
  sunrpc: add a debugfs rpc_xprt directory with an info file in it
  sunrpc: add debugfs file for displaying client rpc_task queue
  nfs: Add DEALLOCATE support
  nfs: Add ALLOCATE support
  NFS: Clean up nfs4_init_callback()
  NFS: SETCLIENTID XDR buffer sizes are incorrect
  SUNRPC: serialize iostats updates
  xprtrdma: Display async errors
  xprtrdma: Enable pad optimization
  xprtrdma: Re-write rpcrdma_flush_cqs()
  xprtrdma: Refactor tasklet scheduling
  xprtrdma: unmap all FMRs during transport disconnect
  xprtrdma: Cap req_cqinit
  xprtrdma: Return an errno from rpcrdma_register_external()
  nfs: define nfs_inc_fscache_stats and using it as possible
  nfs: replace nfs_add_stats with nfs_inc_stats when add one
  NFS: Deletion of unnecessary checks before the function call "nfs_put_client"
  sunrpc: eliminate RPC_TRACEPOINTS
  sunrpc: eliminate RPC_DEBUG
  lockd: eliminate LOCKD_DEBUG
  ...
2014-12-10 15:13:13 -08:00
David S. Miller
22f10923dd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/amd/xgbe/xgbe-desc.c
	drivers/net/ethernet/renesas/sh_eth.c

Overlapping changes in both conflict cases.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 15:48:20 -05:00
Linus Torvalds
8139548136 Merge branch 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI updates from Ingo Molnar:
 "Changes in this cycle are:

   - support module unload for efivarfs (Mathias Krause)

   - another attempt at moving x86 to libstub taking advantage of the
     __pure attribute (Ard Biesheuvel)

   - add EFI runtime services section to ptdump (Mathias Krause)"

* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, ptdump: Add section for EFI runtime services
  efi/x86: Move x86 back to libstub
  efivarfs: Allow unloading when build as module
2014-12-10 12:42:16 -08:00
Filipe Manana
1edb647bb9 Btrfs: remove non-sense btrfs_error_discard_extent() function
It doesn't do anything special, it just calls btrfs_discard_extent(),
so just remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:32 -08:00
Filipe Manana
678886bdc6 Btrfs: fix fs corruption on transaction abort if device supports discard
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.

This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.

I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:

  [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

The corruption always happened when a transaction aborted and then fsck complained
like this:

   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
   *** fsck.btrfs output ***
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   read block failed check_tree_block
   Couldn't open file system

In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:

   1) transaction N started, fs_info->pinned_extents pointed to
      fs_info->freed_extents[0];

   2) node/eb 94945280 is created;

   3) eb is persisted to disk;

   4) transaction N commit starts, fs_info->pinned_extents now points to
      fs_info->freed_extents[1], and transaction N completes;

   5) transaction N + 1 starts;

   6) eb is COWed, and btrfs_free_tree_block() called for this eb;

   7) eb range (94945280 to 94945280 + 16Kb) is added to
      fs_info->pinned_extents (fs_info->freed_extents[1]);

   8) Something goes wrong in transaction N + 1, like hitting ENOSPC
      for example, and the transaction is aborted, turning the fs into
      readonly mode. The stack trace I got for example:

      [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
      [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
      [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
      [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
      [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
      [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
      [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
      (...)
      [112065.264493] ---[ end trace dd7903a975a31a08 ]---
      [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
      [112065.264997] BTRFS info (device sdc): forced readonly

   9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
      fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
      turn calls btrfs_destroy_pinned_extent();

   10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
       marked as dirty in fs_info->freed_extents[], and for each one
       it calls discard, if the fs was mounted with "-o discard", and
       adds the range to the free space cache of the respective block
       group;

   11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
       sees the free space entries and performs a discard;

   12) After an umount and mount (or fsck), our eb's location on disk was full
       of zeroes, and it should have been untouched, because it was marked as
       dirty in the fs_info->pinned_extents tree, and therefore used by the
       trees that the last committed superblock points to.

Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.

This isn't a new problem, as it's been present since 2011 (git commit
acce952b02).

Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:31 -08:00
Filipe Manana
01eacb2779 Btrfs: always clear a block group node when removing it from the tree
Always clear a block group's rbnode after removing it from the rbtree to
ensure that any tasks that might be holding a reference on the block group
don't end up accessing stale rbnode left and right child pointers through
next_block_group().

This is a leftover from the change titled:
"Btrfs: fix invalid block group rbtree access after bg is removed"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:29 -08:00
Filipe Manana
a1e7e16ed3 Btrfs: ensure deletion from pinned_chunks list is protected
The call to remove_extent_mapping() actually deletes the extent map
from the list it's included in - fs_info->pinned_chunks - and that
list is protected by the chunk mutex. Therefore make that call
while holding the chunk mutex and remove the redundant list delete
call because it's a noop.

This fixes an overlook of the patch titled
"Btrfs: fix race between fs trimming and block group remove/allocation"
following the same obvervation from the patch titled
"Btrfs: fix unprotected deletion from pending_chunks list".

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:28 -08:00
Daniel Borkmann
87545899b5 net: replace remaining users of arch_fast_hash with jhash
This patch effectively reverts commit 500f808726 ("net: ovs: use CRC32
accelerated flow hash if available"), and other remaining arch_fast_hash()
users such as from nfsd via commit 6282cd5655 ("NFSD: Don't hand out
delegations for 30 seconds after recalling them.") where it has been used
as a hash function for bloom filtering.

While we think that these users are actually not much of concern, it has
been requested to remove the arch_fast_hash() library bits that arose
from [1] entirely as per recent discussion [2]. The main argument is that
using it as a hash may introduce bias due to its linearity (see avalanche
criterion) and thus makes it less clear (though we tried to document that)
when this security/performance trade-off is actually acceptable for a
general purpose library function.

Lets therefore avoid any further confusion on this matter and remove it to
prevent any future accidental misuse of it. For the time being, this is
going to make hashing of flow keys a bit more expensive in the ovs case,
but future work could reevaluate a different hashing discipline.

  [1] https://patchwork.ozlabs.org/patch/299369/
  [2] https://patchwork.ozlabs.org/patch/418756/

Cc: Neil Brown <neilb@suse.de>
Cc: Francesco Fusco <fusco@ntop.org>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10 15:17:45 -05:00
Linus Torvalds
3eb5b893eb Merge branch 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MPX support from Thomas Gleixner:
 "This enables support for x86 MPX.

  MPX is a new debug feature for bound checking in user space.  It
  requires kernel support to handle the bound tables and decode the
  bound violating instruction in the trap handler"

* 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  asm-generic: Remove asm-generic arch_bprm_mm_init()
  mm: Make arch_unmap()/bprm_mm_init() available to all architectures
  x86: Cleanly separate use of asm-generic/mm_hooks.h
  x86 mpx: Change return type of get_reg_offset()
  fs: Do not include mpx.h in exec.c
  x86, mpx: Add documentation on Intel MPX
  x86, mpx: Cleanup unused bound tables
  x86, mpx: On-demand kernel allocation of bounds tables
  x86, mpx: Decode MPX instruction to get bound violation information
  x86, mpx: Add MPX-specific mmap interface
  x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
  x86, mpx: Add MPX to disabled features
  ia64: Sync struct siginfo with general version
  mips: Sync struct siginfo with general version
  mpx: Extend siginfo structure to include bound violation information
  x86, mpx: Rename cfg_reg_u and status_reg
  x86: mpx: Give bndX registers actual names
  x86: Remove arbitrary instruction size limit in instruction decoder
2014-12-10 09:34:43 -08:00
Linus Torvalds
86c6a2fddf Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - 'Nested Sleep Debugging', activated when CONFIG_DEBUG_ATOMIC_SLEEP=y.

     This instruments might_sleep() checks to catch places that nest
     blocking primitives - such as mutex usage in a wait loop.  Such
     bugs can result in hard to debug races/hangs.

     Another category of invalid nesting that this facility will detect
     is the calling of blocking functions from within schedule() ->
     sched_submit_work() -> blk_schedule_flush_plug().

     There's some potential for false positives (if secondary blocking
     primitives themselves are not ready yet for this facility), but the
     kernel will warn once about such bugs per bootup, so the warning
     isn't much of a nuisance.

     This feature comes with a number of fixes, for problems uncovered
     with it, so no messages are expected normally.

   - Another round of sched/numa optimizations and refinements, for
     CONFIG_NUMA_BALANCING=y.

   - Another round of sched/dl fixes and refinements.

  Plus various smaller fixes and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched: Add missing rcu protection to wake_up_all_idle_cpus
  sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK
  sched/numa: Init numa balancing fields of init_task
  sched/deadline: Remove unnecessary definitions in cpudeadline.h
  sched/cpupri: Remove unnecessary definitions in cpupri.h
  sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task()
  sched/fair: Fix stale overloaded status in the busiest group finding logic
  sched: Move p->nr_cpus_allowed check to select_task_rq()
  sched/completion: Document when to use wait_for_completion_io_*()
  sched: Update comments about CLONE_NEWUTS and CLONE_NEWIPC
  sched/fair: Kill task_struct::numa_entry and numa_group::task_list
  sched: Refactor task_struct to use numa_faults instead of numa_* pointers
  sched/deadline: Don't check CONFIG_SMP in switched_from_dl()
  sched/deadline: Reschedule from switched_from_dl() after a successful pull
  sched/deadline: Push task away if the deadline is equal to curr during wakeup
  sched/deadline: Add deadline rq status print
  sched/deadline: Fix artificial overrun introduced by yield_task_dl()
  sched/rt: Clean up check_preempt_equal_prio()
  sched/core: Use dl_bw_of() under rcu_read_lock_sched()
  sched: Check if we got a shallowest_idle_cpu before searching for least_loaded_cpu
  ...
2014-12-09 21:21:34 -08:00
Al Viro
c0371da604 put iov_iter into msghdr
Note that the code _using_ ->msg_iter at that point will be very
unhappy with anything other than unshifted iovec-backed iov_iter.
We still need to convert users to proper primitives.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-09 16:29:03 -05:00
Benjamin Coddington
bf7491f1be nfsd4: fix xdr4 count of server in fs_location4
Fix a bug where nfsd4_encode_components_esc() incorrectly calculates the
length of server array in fs_location4--note that it is a count of the
number of array elements, not a length in bytes.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 082d4bd72a (nfsd4: "backfill" using write_bytes_to_xdr_buf)
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 15:52:17 -05:00
Benjamin Coddington
5a64e56976 nfsd4: fix xdr4 inclusion of escaped char
Fix a bug where nfsd4_encode_components_esc() includes the esc_end char as
an additional string encoding.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Fixes: e7a0444aef "nfsd: add IPv6 addr escaping to fs_location hosts"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 15:51:30 -05:00
Rasmus Villemoes
ef17af2a81 fs: nfsd: Fix signedness bug in compare_blob
Bugs similar to the one in acbbe6fbb2 (kcmp: fix standard comparison
bug) are in rich supply.

In this variant, the problem is that struct xdr_netobj::len has type
unsigned int, so the expression o1->len - o2->len _also_ has type
unsigned int; it has completely well-defined semantics, and the result
is some non-negative integer, which is always representable in a long
long. But this means that if the conditional triggers, we are
guaranteed to return a positive value from compare_blob.

In this case it could be fixed by

-       res = o1->len - o2->len;
+       res = (long long)o1->len - (long long)o2->len;

but I'd rather eliminate the usually broken 'return a - b;' idiom.

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 11:29:14 -05:00
Jeff Layton
0b5707e452 sunrpc: require svc_create callers to pass in meaningful shutdown routine
Currently all svc_create callers pass in NULL for the shutdown parm,
which then gets fixed up to be svc_rpcb_cleanup if the service uses
rpcbind.

Simplify this by instead having the the only caller that requires it
(lockd) pass in svc_rpcb_cleanup and get rid of the special casing.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 11:22:21 -05:00
Jeff Layton
779fb0f3af sunrpc: move rq_splice_ok flag into rq_flags
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 11:22:21 -05:00
Jeff Layton
78b65eb3fd sunrpc: move rq_dropme flag into rq_flags
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 11:22:20 -05:00
Jeff Layton
30660e04b0 sunrpc: move rq_usedeferral flag to rq_flags
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 11:22:20 -05:00
Jeff Layton
7501cc2bcf sunrpc: move rq_local field to rq_flags
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 11:21:21 -05:00
Jeff Layton
4d152e2c9a sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
In a later patch, we're going to need some atomic bit flags. Since that
field will need to be an unsigned long, we mitigate that space
consumption by migrating some other bitflags to the new field. Start
with the rq_secure flag.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-09 11:21:20 -05:00
J. Bruce Fields
2941b0e91b NFS client updates for Linux 3.19
Highlights include:
 
 Features:
 - NFSv4.2 client support for hole punching and preallocation.
 - Further RPC/RDMA client improvements.
 - Add more RPC transport debugging tracepoints.
 - Add RPC debugging tools in debugfs.
 
 Bugfixes:
 - Stable fix for layoutget error handling
 - Fix a change in COMMIT behaviour resulting from the recent io code updates
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUhRVTAAoJEGcL54qWCgDyfeUP/RoFo3ImTMbGxfcPJqoELjcO
 lZbQ+27pOE/whFDkWgiOVTwlgGct5a0WRo7GCZmpYJA4q1kmSv4ngTb3nMTCUztt
 xMJ0mBr0BqttVs+ouKiVPm3cejQXedEhttwWcloIXS8lNenlpL29Zlrx2NHdU8UU
 13+souocj0dwIyTYYS/4Lm9KpuCYnpDBpP5ShvQjVaMe/GxJo6GyZu70c7FgwGNz
 Nh9onzZV3mz1elhfizlV38aVA7KWVXtLWIqOFIKlT2fa4nWB8Hc07miR5UeOK0/h
 r+icnF2qCQe83MbjOxYNxIKB6uiA/4xwVc90X4AQ7F0RX8XPWHIQWG5tlkC9jrCQ
 3RGzYshWDc9Ud2mXtLMyVQxHVVYlFAe1WtdP8ZWb1oxDInmhrarnWeNyECz9xGKu
 VzIDZzeq9G8slJXATWGRfPsYr+Ihpzcen4QQw58cakUBcqEJrYEhlEOfLovM71k3
 /S/jSHBAbQqiw4LPMw87bA5A6+ZKcVSsNE0XCtNnhmqFpLc1kKRrl5vaN+QMk5tJ
 v4/zR0fPqH7SGAJWYs4brdfahyejEo0TwgpDs7KHmu1W9zQ0LCVTaYnQuUmQjta6
 WyYwIy3TTibdfR191O0E3NOW82Q/k/NBD6ySvabN9HqQ9eSk6+rzrWAslXCbYohb
 BJfzcQfDdx+lsyhjeTx9
 =wOP3
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.19-1' into nfsd for-3.19 branch

Mainly what I need is 860a0d9e51 "sunrpc: add some tracepoints in
svc_rqst handling functions", which subsequent server rpc patches from
jlayton depend on.  I'm merging this later tag on the assumption that's
more likely to be a tested and stable point.
2014-12-09 11:12:26 -05:00
Al Viro
ba00410b81 Merge branch 'iov_iter' into for-next 2014-12-08 20:39:29 -05:00
Chao Yu
635aee1fef f2fs: avoid to ra unneeded blocks in recover flow
To improve recovery speed, f2fs try to readahead many contiguous blocks in warm
node segment, but for most time, abnormal power-off do not occur frequently, so
when mount a normal power-off f2fs image, by contrary ra so many blocks and then
invalid them will hurt the performance of mount.
It's better to just ra the first next-block for normal condition.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 14:19:09 -08:00
Chao Yu
66b00c1867 f2fs: introduce is_valid_blkaddr to cleanup codes in ra_meta_pages
This patch does cleanup work, it introduces is_valid_blkaddr() to include
verification code for blkaddr with upper and down boundary value which were in
ra_meta_pages previous.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 14:19:08 -08:00
Chao Yu
13da549460 f2fs: fix to enable readahead for SSA/CP blocks
1.We use zero as upper boundary value for ra SSA/CP blocks, we will skip
readahead as verification failure with max number, it causes low performance.
2.Low boundary value is not accurate for SSA/CP/POR region verification, so
these values need to be redefined.

This patch fixes above issues.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 14:19:07 -08:00
Chao Yu
03e14d522e f2fs: use atomic for counting inode with inline_{dir,inode} flag
As inline_{dir,inode} stat is increased/decreased concurrently by multi threads,
so the value is not so accurate, let's use atomic type for counting accurately.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 10:54:59 -08:00
Changman Lee
51455b1938 f2fs: cleanup path to need cp at fsync
Added some commentaries for code readability and cleaned up if-statement
clearly.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 10:40:22 -08:00
Changman Lee
9c7bb70212 f2fs: check if inode state is dirty at fsync
If inode state is dirty, go straight to write.

Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 10:37:13 -08:00
Jaegeuk Kim
8dcf2ff721 f2fs: count the number of inmemory pages
This patch adds counting # of inmemory pages in the page cache.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 10:35:15 -08:00
Jaegeuk Kim
126622343a f2fs: release inmemory pages when the file was closed
If file is closed, let's drop inmemory pages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 10:35:15 -08:00
Jaegeuk Kim
0722b1011a f2fs: set page private for inmemory pages for truncation
The inmemory pages should be handled by invalidate_page since it needs to be
released int the truncation path.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 10:35:14 -08:00
Jaegeuk Kim
9d1015dd4c f2fs: count inline_xx in do_read_inode
In do_read_inode, if we failed __recover_inline_status, the inode has inline
flag without increasing its count.
Later, f2fs_evict_inode will decrease the count, which causes -1.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 10:35:13 -08:00
Jaegeuk Kim
9be32d72be f2fs: do retry operations with cond_resched
This patch revists retrial paths in f2fs.
The basic idea is to use cond_resched instead of retrying from the very early
stage.

Suggested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-08 10:35:05 -08:00
Namjae Jeon
15d9870633 cifs: remove unneeded condition check
file->private_data can never be null after calling initiate_cifs_search.
So private null check condition is not needed.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2014-12-07 23:43:10 -06:00
Sachin Prabhu
ee9bbf465d Set UID in sess_auth_rawntlmssp_authenticate too
A user complained that they were unable to login to their cifs share
after a kernel update. From the wiretrace we can see that the server
returns different UIDs as response to NTLMSSP_NEGOTIATE and NTLMSSP_AUTH
phases.

With changes in the authentication code, we no longer set the
cifs_sess->Suid returned in response to the NTLM_AUTH phase and continue
to use the UID sent in response to the NTLMSSP_NEGOTIATE phase. This
results in the server denying access to the user when the user attempts
to do a tcon connect.

See https://bugzilla.redhat.com/show_bug.cgi?id=1163927

A test kernel containing patch was tested successfully by the user.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2014-12-07 23:43:02 -06:00
Andy Shevchenko
0b456f04bc cifs: convert printk(LEVEL...) to pr_<level>
The useful macros embed message level in the name. Thus, it cleans up the code
a bit. In cases when it was plain printk() the conversion was done to info
level.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2014-12-07 22:48:07 -06:00
Andy Shevchenko
55d83e0dbb cifs: convert to print_hex_dump() instead of custom implementation
This patch converts custom dumper to use native print_hex_dump() instead. The
cifs_dump_mem() will have an offsets per each line which differs it from the
original code.

In the dump_smb() we may use native print_hex_dump() as well. It will show
slightly different output in ASCII part when character is unprintable,
otherwise it keeps same structure.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
2014-12-07 22:48:01 -06:00
Andy Shevchenko
28e2aed244 cifs: call strtobool instead of custom implementation
Meanwhile it cleans up the code, the behaviour is slightly changed. In case of
providing non-boolean value it will fails with corresponding error. In the
original code the attempt of an update was just ignored in such case.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Reviewed-by: Alexander Bokovoy <ab@samba.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
2014-12-07 22:47:58 -06:00
Steve French
f8098b82aa Update modinfo cifs version for cifs.ko
update cifs version to 2.06

Signed-off-by: Steve French <smfrench@gmail.com>
2014-12-07 22:17:19 -06:00
Steve French
ebdd207e29 decode_negTokenInit had wrong calling sequence
For krb5 enablement of SMB3, decoding negprot, caller now passes
server struct not the old sec_type
2014-12-07 22:17:19 -06:00
Steve French
911a8dfa47 Add missing defines for ACL query support
Add missing defines needed for ACL query support.
 For definitions of these security info type additionalinfo flags
 and also the EA Flags see MS-SMB2 (2.2.37) or MS-DTYP

Signed-of-by: Steven French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
2014-12-07 22:17:19 -06:00
Steve French
9ccf321623 Add support for original fallocate
In many cases the simple fallocate call is
a no op (since the file is already not sparse) or
can simply be converted from a sparse to a non-sparse
file if we are fallocating the whole file and keeping
the size.

Signed-off-by: Steven French <smfrench@gmail.com>
2014-12-07 22:17:19 -06:00
Dmitry Monakhov
50db71abc5 ext4: ext4_da_convert_inline_data_to_extent drop locked page after error
Testcase:
xfstests generic/270
MKFS_OPTIONS="-q -I 256 -O inline_data,64bit"

Call Trace:
 [<ffffffff81144c76>] lock_page+0x35/0x39 -------> DEADLOCK
 [<ffffffff81145260>] pagecache_get_page+0x65/0x15a
 [<ffffffff811507fc>] truncate_inode_pages_range+0x1db/0x45c
 [<ffffffff8120ea63>] ? ext4_da_get_block_prep+0x439/0x4b6
 [<ffffffff811b29b7>] ? __block_write_begin+0x284/0x29c
 [<ffffffff8120e62a>] ? ext4_change_inode_journal_flag+0x16b/0x16b
 [<ffffffff81150af0>] truncate_inode_pages+0x12/0x14
 [<ffffffff81247cb4>] ext4_truncate_failed_write+0x19/0x25
 [<ffffffff812488cf>] ext4_da_write_inline_data_begin+0x196/0x31c
 [<ffffffff81210dad>] ext4_da_write_begin+0x189/0x302
 [<ffffffff810c07ac>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff810ddd13>] ? read_seqcount_begin.clone.1+0x9f/0xcc
 [<ffffffff8114309d>] generic_perform_write+0xc7/0x1c6
 [<ffffffff810c040e>] ? mark_held_locks+0x59/0x77
 [<ffffffff811445d1>] __generic_file_write_iter+0x17f/0x1c5
 [<ffffffff8120726b>] ext4_file_write_iter+0x2a5/0x354
 [<ffffffff81185656>] ? file_start_write+0x2a/0x2c
 [<ffffffff8107bcdb>] ? bad_area_nosemaphore+0x13/0x15
 [<ffffffff811858ce>] new_sync_write+0x8a/0xb2
 [<ffffffff81186e7b>] vfs_write+0xb5/0x14d
 [<ffffffff81186ffb>] SyS_write+0x5c/0x8c
 [<ffffffff816f2529>] system_call_fastpath+0x12/0x17

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-05 21:37:15 -05:00
Jaegeuk Kim
769ec6e5b7 f2fs: call radix_tree_preload before radix_tree_insert
This patch tries to fix:

 BUG: using smp_processor_id() in preemptible [00000000] code: f2fs_gc-254:0/384
  (radix_tree_node_alloc+0x14/0x74) from [<c033d8a0>] (radix_tree_insert+0x110/0x200)
  (radix_tree_insert+0x110/0x200) from [<c02e8264>] (gc_data_segment+0x340/0x52c)
  (gc_data_segment+0x340/0x52c) from [<c02e8658>] (f2fs_gc+0x208/0x400)
  (f2fs_gc+0x208/0x400) from [<c02e8a98>] (gc_thread_func+0x248/0x28c)
  (gc_thread_func+0x248/0x28c) from [<c0139944>] (kthread+0xa0/0xac)
  (kthread+0xa0/0xac) from [<c0105ef8>] (ret_from_fork+0x14/0x3c)

The reason is that f2fs calls radix_tree_insert under enabled preemption.
So, before calling it, we need to call radix_tree_preload.

Otherwise, we should use _GFP_WAIT for the radix tree, and use mutex or
semaphore to cover the radix tree operations.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-05 09:51:04 -08:00
Al Viro
f77c80142e bury struct proc_ns in fs/proc
a) make get_proc_ns() return a pointer to struct ns_common
b) mirror ns_ops in dentry->d_fsdata of ns dentries, so that
is_mnt_ns_file() could get away with fewer dereferences.

That way struct proc_ns becomes invisible outside of fs/proc/*.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:54 -05:00
Al Viro
33c429405a copy address of proc_ns_ops into ns_common
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:47 -05:00
Al Viro
6344c433a4 new helpers: ns_alloc_inum/ns_free_inum
take struct ns_common *, for now simply wrappers around proc_{alloc,free}_inum()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:36 -05:00
Al Viro
64964528b2 make proc_ns_operations work with struct ns_common * instead of void *
We can do that now.  And kill ->inum(), while we are at it - all instances
are identical.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:34:17 -05:00
Al Viro
58be28256d make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:33:24 -05:00
Al Viro
435d5f4bb2 common object embedded into various struct ....ns
for now - just move corresponding ->proc_inum instances over there

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04 14:31:00 -05:00
Jaegeuk Kim
8b26ef98da f2fs: use rw_semaphore for nat entry lock
Previoulsy, we used rwlock for nat_entry lock.
But, now we have a lot of complex operations in set_node_addr.
(e.g., allocating kernel memories, handling radix_trees, and so on)

So, this patches tries to change spinlock to rw_semaphore to give CPUs to other
threads.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-03 21:23:29 -08:00
Jaegeuk Kim
4634d71ed1 f2fs: fix missing kmem_cache_free
This patch fixes missing kmem_cache_free when handling errors.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-03 16:40:28 -08:00
Dave Chinner
6044e4386c Merge branch 'xfs-misc-fixes-for-3.19-2' into for-next
Conflicts:
	fs/xfs/xfs_iops.c
2014-12-04 09:46:17 +11:00
Brian Foster
b29c70f598 xfs: split metadata and log buffer completion to separate workqueues
XFS traditionally sends all buffer I/O completion work to a single
workqueue. This includes metadata buffer completion and log buffer
completion. The log buffer completion requires a high priority queue to
prevent stalls due to log forces getting stuck behind other queued work.

Rather than continue to prioritize all buffer I/O completion due to the
needs of log completion, split log buffer completion off to
m_log_workqueue and move the high priority flag from m_buf_workqueue to
m_log_workqueue.

Add a b_ioend_wq wq pointer to xfs_buf to allow completion workqueue
customization on a per-buffer basis. Initialize b_ioend_wq to
m_buf_workqueue by default in the generic buffer I/O submission path.
Finally, override the default wq with the high priority m_log_workqueue
in the log buffer I/O submission path.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
32296f865e xfs: fix set-but-unused warnings
The kernel compile doesn't turn on these checks by default, so it's
only when I do a kernel-user sync that I find that there are lots of
compiler warnings waiting to be fixed. Fix up these set-but-unused
warnings.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
9a2cc41cda xfs: move type conversion functions to xfs_dir.h
These are currently considered private to libxfs, but they are
widely used by the userspace code to decode, walk and check
directory structures. Hence they really form part of the external
API and as such need to bemoved to xfs_dir2.h.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
1b767ee386 xfs: move ftype conversion functions to libxfs
These functions are needed in userspace for repair and mkfs to
do the right thing. Move them to libxfs so they can be easily
shared.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:17 +11:00
Dave Chinner
2d3d0c53df xfs: lobotomise xfs_trans_read_buf_map()
There's a case in that code where it checks for a buffer match in a
transaction where the buffer is not marked done. i.e. trying to
catch a buffer we have locked in the transaction but have not
completed IO on.

The only way we can find a buffer that has not had IO completed on
it is if it had readahead issued on it, but we never do readahead on
buffers that we have already joined into a transaction. Hence this
condition cannot occur, and buffers locked and joined into a
transaction should always be marked done and not under IO.

Remove this code and re-order xfs_trans_read_buf_map() to remove
duplicated IO dispatch and error handling code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:43:13 +11:00
Dave Chinner
cdc9cec7c0 xfs: active inodes stat is broken
vn_active only ever gets decremented, so it has a very large
negative number.  Make it track the inode count we currently have
allocated properly so we can easily track the size of the inode
cache via tools like PCP.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:40 +11:00
Dave Chinner
4db431f57b xfs: cleanup xfs_bmse_merge returns
Signed-off-by: Dave Chinner <dchinner@redhat.com>

xfs_bmse_merge() has a jump label for return that just returns the
error value. Convert all the code to just return the error directly
and use XFS_WANT_CORRUPTED_RETURN. This also allows the final call
to xfs_bmbt_update() to return directly.

Noticed while reviewing coccinelle return cleanup patches and
wondering why the same return pattern as in xfs_bmse_shift_one()
wasn't picked up by the checker pattern...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:40 +11:00
Dave Chinner
b11bd671ba xfs: cleanup xfs_bmse_shift_one goto mess
xfs_bmse_shift_one() jumps around determining whether to shift or
merge, making the code flow difficult to follow. Clean it up and
use direct error returns (including XFS_WANT_CORRUPTED_RETURN) to
make the code flow better and be easier to read.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:24 +11:00
Dave Chinner
7a1df15616 xfs: fix premature enospc on inode allocation
After growing a filesystem, XFS can fail to allocate inodes even
though there is a large amount of space available in the filesystem
for inodes. The issue is caused by a nearly full allocation group
having enough free space in it to be considered for inode
allocation, but not enough contiguous free space to actually
allocation inodes.  This situation results in successful selection
of the AG for allocation, then failure of the allocation resulting
in ENOSPC being reported to the caller.

It is caused by two possible issues. Firstly, we only consider the
lognest free extent and whether it would fit an inode chunk. If the
extent is not correctly aligned, then we can't allocate an inode
chunk in it regardless of the fact that it is large enough. This
tends to be a permanent error until space in the AG is freed.

The second issue is that we don't actually lock the AGI or AGF when
we are doing these checks, and so by the time we get to actually
allocating the inode chunk the space we thought we had in the AG may
have been allocated. This tends to be a spurious error as it
requires a race to trigger. Hence this case is ignored in this patch
as the reported problem is for permanent errors.

The first issue could be addressed by simply taking into account the
alignment when checking the longest extent. This, however, would
prevent allocation in AGs that have aligned, exact sized extents
free. However, this case should be fairly rare compared to the
number of allocations that occur near ENOSPC that would trigger this
condition.

Hence, when selecting the inode AG, take into account the inode
cluster alignment when checking the lognest free extent in the AG.
If we can't find any AGs with a contiguous free space large
enough to be aligned, drop the alignment addition and just try for
an AG that has enough contiguous free space available for an inode
chunk. This won't prevent issues from occurring, but should avoid
situations where other AGs have lots of free space but the selected
AG can't allocate due to alignment constraints.

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:42:21 +11:00
Peter Watkins
76b5730252 xfs: overflow in xfs_iomap_eof_align_last_fsb
If extsize is set and new_last_fsb is larger than 32 bits, the
roundup to extsize will overflow the align variable. Instead,
combine alignments by rounding stripe size up to extsize.

Signed-off-by: Peter Watkins <treestem@gmail.com>
Reviewed-by: Nathaniel W. Turner <nate@houseofnate.net>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-04 09:30:51 +11:00
Dave Chinner
e77b8547ca Merge branch 'xfs-coccinelle-cleanups' into xfs-misc-fixes-for-3.19-2 2014-12-04 09:18:21 +11:00
Al Viro
1ead0e79bf fat: fix oops on corrupted vfat fs
a) don't bother with ->d_time for positives - we only check it for
   negatives anyway.

b) make sure to set it at unlink and rmdir time - at *that* point
   soon-to-be negative dentry matches then-current directory contents

c) don't go into renaming of old alias in vfat_lookup() unless it
   has the same parent (which it will, unless we are seeing corrupted
   image)

[hirofumi@mail.parknet.co.jp: make change minimum, don't call d_move() for dir]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>	[3.17.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03 09:36:03 -08:00
Chris Mason
9627aeee3e Merge branch 'raid56-scrub-replace' of git://github.com/miaoxie/linux-btrfs into for-linus 2014-12-02 18:42:03 -08:00
Josef Bacik
cb83b7b816 Btrfs: make get_caching_control unconditionally return the ctl
This was written when we didn't do a caching control for the fast free space
cache loading.  However we started doing that a long time ago, and there is
still a small window of time that we could be caching the block group the fast
way, so if there is a caching_ctl at all on the block group just return it, the
callers all wait properly for what they want.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
8dbcd10f69 Btrfs: fix unprotected deletion from pending_chunks list
On block group remove if the corresponding extent map was on the
transaction->pending_chunks list, we were deleting the extent map
from that list, through remove_extent_mapping(), without any
synchronization with chunk allocation (which iterates that list
and adds new elements to it). Fix this by ensure that this is done
while the chunk mutex is held, since that's the mutex that protects
the list in the chunk allocation code path.

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
495e64f4fe Btrfs: fix fs mapping extent map leak
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).

Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
[20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
[20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
[20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana
946ddbe805 Btrfs: fix memory leak after block remove + trimming
There was a free space entry structure memeory leak if a block
group is remove while a free space entry is being trimmed, which
the following diagram explains:

           CPU 1                                          CPU 2

  btrfs_trim_block_group()
      trim_no_bitmap()
          remove free space entry from
          block group cache's rbtree
          do_trimming()

                                                btrfs_remove_block_group()
                                                    btrfs_remove_free_space_cache()

              add back free space entry to
              block group's cache rbtree
  btrfs_put_block_group()

                                                    (...)
                                                    btrfs_put_block_group()
                                                        kfree(bg->free_space_ctl)
                                                        kfree(bg)

The free space entry added after doing the discard of its respective
range ends up never being freed.
Detected after doing an "rmmod btrfs" after running the stress test
recently submitted for fstests:

[ 8234.642212] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
[ 8234.642657] CPU: 1 PID: 32276 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-2+ #1
[ 8234.642660] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 8234.642664]  0000000000000000 ffff8801af1b3eb8 ffffffff8140c7b6 ffff8801dbedd0c0
[ 8234.642670]  ffff8801af1b3ed0 ffffffff811149ce 0000000000000000 ffff8801af1b3ee0
[ 8234.642676]  ffffffffa042dbe7 ffff8801af1b3ef0 ffffffffa0487422 ffff8801af1b3f78
[ 8234.642682] Call Trace:
[ 8234.642692]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
[ 8234.642699]  [<ffffffff811149ce>] kmem_cache_destroy+0x4d/0x92
[ 8234.642731]  [<ffffffffa042dbe7>] btrfs_destroy_cachep+0x63/0x76 [btrfs]
[ 8234.642757]  [<ffffffffa0487422>] exit_btrfs_fs+0x9/0xbe7 [btrfs]
[ 8234.642762]  [<ffffffff810a76a5>] SyS_delete_module+0x155/0x1c6
[ 8234.642768]  [<ffffffff8122a7eb>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 8234.642773]  [<ffffffff814122d2>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
c92f6be34c Btrfs: make btrfs_abort_transaction consider existence of new block groups
If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
[25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
[25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
[25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
[25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
[25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
[25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
[25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
[25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
[25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
[25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
[25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
[25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
[25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
55507ce361 Btrfs: fix race between writing free space cache and trimming
Trimming is completely transactionless, and the way it operates consists
of hiding free space entries from a block group, perform the trim/discard
and then make the free space entries visible again.
Therefore while a free space entry is being trimmed, we can have free space
cache writing running in parallel (as part of a transaction commit) which
will miss the free space entry. This means that an unmount (or crash/reboot)
after that transaction commit and mount again before another transaction
starts/commits after the discard finishes, we will have some free space
that won't be used again unless the free space cache is rebuilt. After the
unmount, fsck (btrfsck, btrfs check) reports the issue like the following
example:

        *** fsck.btrfs output ***
        checking extents
        checking free space cache
        There is no free space entry for 521764864-521781248
        There is no free space entry for 521764864-1103101952
        cache appears valid but isnt 29360128
        Checking filesystem on /dev/sdc
        UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
        found 1235681286 bytes used err is -22
        (...)

Another issue caused by this race is a crash while writing bitmap entries
to the cache, because while the cache writeout task accesses the bitmaps,
the trim task can be concurrently modifying the bitmap or worse might
be freeing the bitmap. The later case results in the following crash:

[55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
[55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
[55650.807166] RIP: 0010:[<ffffffffa037cf45>]  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.807514] RSP: 0018:ffff88007153bc30  EFLAGS: 00010246
[55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
[55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
[55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
[55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
[55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
[55650.808017] FS:  0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
[55650.808017] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
[55650.808017] Stack:
[55650.808017]  ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
[55650.808017]  ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
[55650.808017]  ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
[55650.808017] Call Trace:
[55650.808017]  [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
[55650.808017]  [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
[55650.808017]  [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
[55650.808017]  [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
[55650.808017]  [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
[55650.808017]  [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
[55650.808017]  [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
[55650.808017]  [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[55650.808017]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
[55650.808017] RIP  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
[55650.808017]  RSP <ffff88007153bc30>
[55650.815725] ---[ end trace 1c032e96b149ff86 ]---

Fix this by serializing both tasks in such a way that cache writeout
doesn't wait for the trim/discard of free space entries to finish and
doesn't miss any free space entry.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana
04216820fe Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Zhao Lei
5d3edd8f44 Btrfs, replace: enable dev-replace for raid56
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:20:48 +08:00
Filipe Manana
ae0ab003f2 Btrfs: fix freeing used extents after removing empty block group
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).

That is:

     CPU 1                                CPU 2

                                  btrfs_delete_unused_bgs()
update_block_group()
   add block group to
   fs_info->unused_bgs
                                    got block group from the list
                                    clear_extent_bits for the whole
                                    block group range in freed_extents[]
   set_extent_dirty for the
   range covering the freed
   extent in freed_extents[]
   (fs_info->pinned_extents)

                                  block group deleted, and a new block
                                  group with the same logical address is
                                  created

                                  reserve space from the new block group
                                  for new data or metadata - the reserved
                                  space overlaps the range specified by
                                  CPU 1 for set_extent_dirty()

                                  commit transaction
                                    find all ranges marked as dirty in
                                    fs_info->pinned_extents, clear them
                                    and add them to the free space cache

Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:

[ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
 sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
[ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
[ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
[ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
[ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
[ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
[ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
[ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
[ 2163.430042] Stack:
[ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
[ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
[ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
[ 2163.430042] Call Trace:
[ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
[ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
[ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana
4f69cb987e Btrfs: fix crash caused by block group removal
If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
[58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
[58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
[58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
[58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
[58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
[58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
[58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
[58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
[58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
[58380.443238] Stack:
[58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
[58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
[58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
[58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
[58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana
292cbd51ec Btrfs: fix invalid block group rbtree access after bg is removed
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.

However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.

Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Miao Xie
4245215d6a Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of device
replace, but at that time, btrfs didn't support device replace
on raid56, so we didn't fix the problem on the raid56 profile.
Currently, we implemented device replace for raid56, so we need
kick that problem out before we enable that function for raid56.

The fix method is very simple, we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:47 +08:00
Miao Xie
7603597690 Btrfs, replace: write raid56 parity into the replace target device
This function reused the code of parity scrub, and we just write
the right parity or corrected parity into the target device before
the parity scrub end.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie
2c8cdd6ee4 Btrfs, replace: write dirty pages into the replace target device
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in btrfs bio, and we sort those target
  device stripes in the array. And we keep the number of the target
  device stripes in the btrfs bio.
- Except write operation on RAID56, all the other operation don't
  take the target device stripes into account.
- When we do write operation, we read the data from the common devices
  and calculate the parity. Then write the dirty data and new parity
  out, at this time, we will find the relative replace target stripes
  and wirte the relative data into it.

Note: The function that copying old data on the source device to
the target device was implemented in the past, it is similar to
the other RAID type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:46 +08:00
Miao Xie
5a6ac9eacb Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksum in the same stripe.
  All the data which has checksum is COW data, and we are sure
  that it is not changed though we don't lock the stripe. because
  the space of that data just can be reclaimed after the current
  transction is committed, and then the fs can use it to store the
  other data, but when doing scrub, we hold the current transaction,
  that is that data can not be recovered, it is safe that read and check
  it out of the stripe lock.
- Lock the stripe
- Read out all the data without checksum and parity
  The data without checksum and the parity may be changed if we don't
  lock the stripe, so we need read it in the stripe lock context.
- Check the parity
- Re-calculate the new parity and write back it if the old parity
  is not right
- Unlock the stripe

If we can not read out the data or the data we read is corrupted,
we will try to repair it. If the repair fails. we will mark the
horizontal sub-stripe(pages on the same horizontal) as corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.

And in order to skip the horizontal sub-stripe that has no data, we
introduce a bitmap. If there is some data on the horizontal sub-stripe,
we will the relative bit to 1, and when we check and repair the
parity, we will skip those horizontal sub-stripes that the relative
bits is 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
1b94b5567e Btrfs, raid56: use a variant to record the operation type
We will introduce new operation type later, if we still use integer
variant as bool variant to record the operation type, we would add new
variant and increase the size of raid bio structure. It is not good,
by this patch, we define different number for different operation,
and we can just use a variant to record the operation type.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
af8e2d1df9 Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the
implementation is similar to the scrub on the other RAID such as
RAID1, the differentia is that we don't read the data from the
mirror, we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:45 +08:00
Miao Xie
b89e1b012c Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any variant of bbio and don't free it at
the end of IO request. So we introduced similar variants into the raid
bio, and don't access those bbio's variants any more.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2014-12-03 10:18:44 +08:00
Zhao Lei
6de6565075 Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
stripe_index's value was set again in latter line:
stripe_index = 0;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Zhao Lei
f90523d1aa Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
bbio_ret in this condition is always !NULL because previous code
already have a check-and-skip:
4908 if (!bbio_ret)
4909     goto out;

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-12-03 10:18:44 +08:00
Dmitry Monakhov
14516bb7bb ext4: fix suboptimal seek_{data,hole} extents traversial
It is ridiculous practice to scan inode block by block, this technique
applicable only for old indirect files. This takes significant amount
of time for really large files. Let's reuse ext4_fiemap which already
traverse inode-tree in most optimal meaner.

TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);

Original report: https://lkml.org/lkml/2014/10/16/620

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 18:08:53 -05:00
Dmitry Monakhov
d952d69e26 ext4: ext4_inline_data_fiemap should respect callers argument
Currently ext4_inline_data_fiemap ignores requested arguments (start
and len) which may lead endless loop if start != 0.  Also fix incorrect
extent length determination.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 16:11:20 -05:00
Dmitry Monakhov
5cc28a9eaa ext4: prevent fsreentrance deadlock for inline_data
ext4_da_convert_inline_data_to_extent() invokes
grab_cache_page_write_begin().  grab_cache_page_write_begin performs
memory allocation, so fs-reentrance should be prohibited because we
are inside journal transaction.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-02 16:09:50 -05:00
Changman Lee
7dda2af83b f2fs: more fast lookup for gc_inode list
If there are many inodes that have data blocks in victim segment,
it takes long time to find a inode in gc_inode list.
Let's use radix_tree to reduce lookup time.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-02 11:02:50 -08:00
Eric W. Biederman
4fed655c41 mnt: Clear mnt_expire during pivot_root
When inspecting the pivot_root and the current mount expiry logic I
realized that pivot_root fails to clear like mount move does.

Add the missing line in case someone does the interesting feat of
moving an expirable submount.  This gives a strong guarantee that root
of the filesystem tree will never expire.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:51 -06:00
Eric W. Biederman
381cacb12c mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
old->mnt_expiry should be ignored unless CL_EXPIRE is set.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:50 -06:00
Eric W. Biederman
8486a7882b mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
Clear MNT_LOCKED in the callers of copy_tree except copy_mnt_ns, and
collect_mounts.  In copy_mnt_ns it is necessary to create an exact
copy of a mount tree, so not clearing MNT_LOCKED is important.
Similarly collect_mounts is used to take a snapshot of the mount tree
for audit logging purposes and auditing using a faithful copy of the
tree is important.

This becomes particularly significant when we start setting MNT_LOCKED
on rootfs to prevent it from being unmounted.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:50 -06:00
Eric W. Biederman
da362b09e4 umount: Do not allow unmounting rootfs.
Andrew Vagin <avagin@parallels.com> writes:

> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sched.h>
> #include <unistd.h>
> #include <sys/mount.h>
>
> int main(int argc, char **argv)
> {
> 	int fd;
>
> 	fd = open("/proc/self/ns/mnt", O_RDONLY);
> 	if (fd < 0)
> 	   return 1;
> 	   while (1) {
> 	   	 if (umount2("/", MNT_DETACH) ||
> 		        setns(fd, CLONE_NEWNS))
> 					break;
> 					}
>
> 					return 0;
> }
>
> root@ubuntu:/home/avagin# gcc -Wall nsenter.c -o nsenter
> root@ubuntu:/home/avagin# strace ./nsenter
> execve("./nsenter", ["./nsenter"], [/* 22 vars */]) = 0
> ...
> open("/proc/self/ns/mnt", O_RDONLY)     = 3
> umount("/", MNT_DETACH)                 = 0
> setns(3, 131072)                        = 0
> umount("/", MNT_DETACH
>
causes:

> [  260.548301] ------------[ cut here ]------------
> [  260.550941] kernel BUG at /build/buildd/linux-3.13.0/fs/pnode.c:372!
> [  260.552068] invalid opcode: 0000 [#1] SMP
> [  260.552068] Modules linked in: xt_CHECKSUM iptable_mangle xt_tcpudp xt_addrtype xt_conntrack ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bridge stp llc dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison iptable_filter ip_tables x_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel binfmt_misc nfsd auth_rpcgss nfs_acl aesni_intel nfs lockd aes_x86_64 sunrpc fscache lrw gf128mul glue_helper ablk_helper cryptd serio_raw ppdev parport_pc lp parport btrfs xor raid6_pq libcrc32c psmouse floppy
> [  260.552068] CPU: 0 PID: 1723 Comm: nsenter Not tainted 3.13.0-30-generic #55-Ubuntu
> [  260.552068] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  260.552068] task: ffff8800376097f0 ti: ffff880074824000 task.ti: ffff880074824000
> [  260.552068] RIP: 0010:[<ffffffff811e9483>]  [<ffffffff811e9483>] propagate_umount+0x123/0x130
> [  260.552068] RSP: 0018:ffff880074825e98  EFLAGS: 00010246
> [  260.552068] RAX: ffff88007c741140 RBX: 0000000000000002 RCX: ffff88007c741190
> [  260.552068] RDX: ffff88007c741190 RSI: ffff880074825ec0 RDI: ffff880074825ec0
> [  260.552068] RBP: ffff880074825eb0 R08: 00000000000172e0 R09: ffff88007fc172e0
> [  260.552068] R10: ffffffff811cc642 R11: ffffea0001d59000 R12: ffff88007c741140
> [  260.552068] R13: ffff88007c741140 R14: ffff88007c741140 R15: 0000000000000000
> [  260.552068] FS:  00007fd5c7e41740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
> [  260.552068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  260.552068] CR2: 00007fd5c7968050 CR3: 0000000070124000 CR4: 00000000000406f0
> [  260.552068] Stack:
> [  260.552068]  0000000000000002 0000000000000002 ffff88007c631000 ffff880074825ed8
> [  260.552068]  ffffffff811dcfac ffff88007c741140 0000000000000002 ffff88007c741160
> [  260.552068]  ffff880074825f38 ffffffff811dd12b ffffffff811cc642 0000000075640000
> [  260.552068] Call Trace:
> [  260.552068]  [<ffffffff811dcfac>] umount_tree+0x20c/0x260
> [  260.552068]  [<ffffffff811dd12b>] do_umount+0x12b/0x300
> [  260.552068]  [<ffffffff811cc642>] ? final_putname+0x22/0x50
> [  260.552068]  [<ffffffff811cc849>] ? putname+0x29/0x40
> [  260.552068]  [<ffffffff811dd88c>] SyS_umount+0xdc/0x100
> [  260.552068]  [<ffffffff8172aeff>] tracesys+0xe1/0xe6
> [  260.552068] Code: 89 50 08 48 8b 50 08 48 89 02 49 89 45 08 e9 72 ff ff ff 0f 1f 44 00 00 4c 89 e6 4c 89 e7 e8 f5 f6 ff ff 48 89 c3 e9 39 ff ff ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 90 66 66 66 66 90 55 b8 01
> [  260.552068] RIP  [<ffffffff811e9483>] propagate_umount+0x123/0x130
> [  260.552068]  RSP <ffff880074825e98>
> [  260.611451] ---[ end trace 11c33d85f1d4c652 ]--

Which in practice is totally uninteresting.  Only the global root user can
do it, and it is just a stupid thing to do.

However that is no excuse to allow a silly way to oops the kernel.

We can avoid this silly problem by setting MNT_LOCKED on the rootfs
mount point and thus avoid needing any special cases in the unmount
code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:49 -06:00
Eric W. Biederman
b2f5d4dc38 umount: Disallow unprivileged mount force
Forced unmount affects not just the mount namespace but the underlying
superblock as well.  Restrict forced unmount to the global root user
for now.  Otherwise it becomes possible a user in a less privileged
mount namespace to force the shutdown of a superblock of a filesystem
in a more privileged mount namespace, allowing a DOS attack on root.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:48 -06:00
Eric W. Biederman
3e1866410f mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount
Now that remount is properly enforcing the rule that you can't remove
nodev at least sandstorm.io is breaking when performing a remount.

It turns out that there is an easy intuitive solution implicitly
add nodev on remount when nodev was implicitly added on mount.

Tested-by: Cedric Bosdonnat <cbosdonnat@suse.com>
Tested-by: Richard Weinberger <richard@nod.at>
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-12-02 10:46:39 -06:00
Linus Torvalds
3a18ca0613 Fix an ext4 metadata checksum regression introduced in v3.18-rc3.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUfTKdAAoJENNvdpvBGATwRhsQAJOWfDD1K57CFQQ0MbLdRfeA
 YzdTHlKIYpt4Dm3+Bc/LJskLrA0sn2vmxF/v0Jxr3F/agwfCODLMO8dIsB0nFX0E
 eC8fCx05titBMQLN4adWr59qQTeE1nssQdpA5SomPrJZr5pSabtj3ekFiHQJ+bWb
 9WNY737TxSPLK0ex9iDlAAp/AoxgF4K6zj/azsRY+mmlfM9+dFoprZWqWwgwl99m
 4LVx1waAnLQdU2Yj7ZYGReweFFTTOqGz4ds1GggymB3Z8Q873dVYO7vdbQWDFJgC
 TcAp8YbfrQC6/M/IaASZKVj6hwEPVMTgOs7dUeyfPtSaXBrW0WBGAhM5gFnQ6J+T
 DO4YHC+tH26GLsfBs9IZnHjAoeVZ93JFDKmxfclDs0AGY+0WgSyY8Bt6VJyoWR60
 RPbW15i/0oMSWEPxbuqQmFIqlcj5n9D80SEmvhpn7oJwkrrUMprcxcWTQN+Ca73e
 2EIOW0SHaLrkM7wpYjwlO4dgCxSZWg6QfHyznbuJcVKOw8KnDMuTEOjP2vBvwHwu
 Wax4EIGZw9XqZVI7a9Z1nd+mpUYi5KDgpS8Uo08Qz5QapWEYla3JPt76q3TwSCIz
 ExMwoRBUSMrpSoDMbyjmk4sh5ABTTWOf9SPmCdnVfzZ36EY0dJckeXj0jFqtyVdq
 p1bxjWPARBm1LfVBcQek
 =yIOT
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfix from Ted Ts'o:
 "Fix an ext4 metadata checksum regression introduced in v3.18-rc3"

* tag 'ext4_for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  jbd2: fix regression where we fail to initialize checksum seed when loading
2014-12-01 20:11:49 -08:00
Darrick J. Wong
32f3869184 jbd2: fix regression where we fail to initialize checksum seed when loading
When we're enabling journal features, we cannot use the predicate
jbd2_journal_has_csum_v2or3() because we haven't yet set the sb
feature flag fields!  Moreover, we just finished loading the shash
driver, so the test is unnecessary; calculate the seed always.

Without this patch, we fail to initialize the checksum seed the first
time we turn on journal_checksum, which means that all journal blocks
written during that first mount are corrupt.  Transactions written
after the second mount will be fine, since the feature flag will be
set in the journal superblock.  xfstests generic/{034,321,322} are the
regression tests.

(This is important for 3.18.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.coM>
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-12-01 21:57:06 -05:00
Changman Lee
9c01503f4d f2fs: cleanup redundant macro
We've already made fi and sbi for inode. Let's avoid duplicated work.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-01 14:16:50 -08:00
Chao Yu
cd34e2969b f2fs: fix to return correct error number in f2fs_write_begin
Fix the wrong error number in error path of f2fs_write_begin.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-12-01 13:56:02 -08:00
Dan Carpenter
818f2f57f2 nfsd: minor off by one checks in __write_versions()
My static checker complains that if "len == remaining" then it means we
have truncated the last character off the version string.

The intent of the code is that we print as many versions as we can
without truncating a version.  Then we put a newline at the end.  If the
newline can't fit we return -EINVAL.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-12-01 12:45:28 -07:00
Dave Chinner
c14fc01340 Merge branch 'xfs-coccinelle-cleanups' into for-next 2014-12-01 09:03:02 +11:00
kbuild test robot
d254aaec5d xfs: fix simple_return.cocci warning in xfs_bmse_shift_one
fs/xfs/libxfs/xfs_bmap.c:5591:1-6: WARNING: end returns can be simpified

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci

CC: Brian Foster <bfoster@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2014-12-01 08:42:52 +11:00
kbuild test robot
8300475ebf xfs: fix simple_return.cocci warning in xfs_file_readdir
fs/xfs/xfs_file.c:919:1-6: WARNING: end returns can be simpified and declaration on line 902 can be dropped

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:25:28 +11:00
kbuild test robot
b72091f2fb libxfs: fix simple_return.cocci warnings
fs/xfs/libxfs/xfs_ialloc.c:1141:1-6: WARNING: end returns can be simpified

 Simplify a trivial if-return sequence.  Possibly combine with a
 preceding function call.
Generated by: scripts/coccinelle/misc/simple_return.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:24:58 +11:00
Markus Elfring
d2a5e3c6fc xfs: remove unnecessary null checks
The functions xfs_blkdev_put() and xfs_qm_dqrele() test whether
their argument is NULL and then return immediately.  Thus the test
around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-12-01 08:24:20 +11:00
Chris Mason
2f19cad94c btrfs: zero out left over bytes after processing compression streams
Don Bailey noticed that our page zeroing for compression at end-io time
isn't complete.  This reworks a patch from Linus to push the zeroing
into the zlib and lzo specific functions instead of trying to handle the
corners inside btrfs_decompress_buf2page

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reported-by: Don A. Bailey <donb@securitymouse.com>
cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-30 09:33:51 -08:00
Geert Uytterhoeven
4740f49652 jffs2: Drop bogus if in comment
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2014-11-28 18:23:44 -08:00
Changman Lee
31a3268839 f2fs: cleanup if-statement of phase in gc_data_segment
Little cleanup to distinguish each phase easily

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
[Jaegeuk Kim: modify indentation for code readability]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-27 20:30:17 -08:00
Dave Chinner
216875a594 Merge branch 'xfs-consolidate-format-defs' into for-next 2014-11-28 14:52:16 +11:00
Dave Chinner
4bd47c1bf4 Merge branch 'xfs-misc-fixes-for-3.19-1' into for-next 2014-11-28 14:52:02 +11:00
Christoph Hellwig
508b6b3b73 xfs: merge xfs_inum.h into xfs_format.h
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:27:10 +11:00
Christoph Hellwig
bb58e6188a xfs: move most of xfs_sb.h to xfs_format.h
More on-disk format consolidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:27:09 +11:00
Christoph Hellwig
4fb6e8ade2 xfs: merge xfs_ag.h into xfs_format.h
More on-disk format consolidation.  A few declarations that weren't on-disk
format related move into better suitable spots.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:25:04 +11:00
Christoph Hellwig
5beda58bf2 xfs: move acl structures to xfs_format.h
Move the on-disk ACL format to xfs_format.h, so that repair can
use the common defintion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:24:37 +11:00
Christoph Hellwig
6d3ebaae7c xfs: merge xfs_dinode.h into xfs_format.h
More consolidatation for the on-disk format defintions.  Note that the
XFS_IS_REALTIME_INODE moves to xfs_linux.h instead as it is not related
to the on disk format, but depends on a CONFIG_ option.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:24:06 +11:00
Eric Sandeen
db52d09ecb xfs: catch invalid negative blknos in _xfs_buf_find()
Here blkno is a daddr_t, which is a __s64; it's possible to hold
a value which is negative, and thus pass the (blkno >= eofs)
test.  Then we try to do a xfs_perag_get() for a ridiculous
agno via xfs_daddr_to_agno(), and bad things happen when that
fails, and returns a null pag which is dereferenced shortly
thereafter.

Found via a user-supplied fuzzed image...

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:03:55 +11:00
Brian Foster
91ee575f2b xfs: allow lazy sb counter sync during filesystem freeze sequence
The expectation since the introduction the lazy superblock counters is
that the counters are synced and superblock logged appropriately as part
of the filesystem freeze sequence. This does not occur, however, due to
the logic in xfs_fs_writable() that prevents progress when the fs is in
any state other than SB_UNFROZEN.

While this is a bug, it has not been exposed to date because the last
thing XFS does during freeze is dirty the log. The log recovery process
recalculates the counters from AGI/AGF metadata to ensure everything is
correct. Therefore should a crash occur while an fs is frozen, the
subsequent log recovery puts everything back in order. See the following
commit for reference:

	92821e2b [XFS] Lazy Superblock Counters

We might not always want to rely on dirtying the log on a frozen fs.
Modify xfs_log_sbcount() to proceed when the filesystem is freezing but
not once the freeze process has completed. Modify xfs_fs_writable() to
accept the minimum freeze level for which modifications should be
blocked to support various codepaths.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:02:59 +11:00
Brian Foster
5d45ee1b41 xfs: fix error handling in xfs_qm_log_quotaoff()
The error handling in xfs_qm_log_quotaoff() has a couple problems. If
xfs_trans_commit() fails, we fall through to the error block and call
xfs_trans_cancel(). This is incorrect on commit failure. If
xfs_trans_reserve() fails, we jump to the error block, cancel the tp and
restore the superblock qflags to oldsbqflag. However, oldsbqflag has
been initialized to zero and not yet updated from the original flags so
we set the flags to zero.

Fix up the error handling in xfs_qm_log_quotaoff() to not restore flags
if they haven't been modified and not cancel the tp on commit failure.
Remove the flag restore code altogether because commit error is the only
failure condition and we don't know whether the transaction made it to
disk.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:00:53 +11:00
Brian Foster
062647a8b4 xfs: replace on-stack xfs_trans_res with pointer in xfs_create()
There's no need to store a full struct xfs_trans_res on the stack in
xfs_create() and copy the fields. Use a pointer to the appropriate
structures embedded in the xfs_mount.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 14:00:16 +11:00
Brian Foster
78c931b8be xfs: replace global xfslogd wq with per-mount wq
The xfslogd workqueue is a global, single-job workqueue for buffer ioend
processing. This means we allow for a single work item at a time for all
possible XFS mounts on a system. fsstress testing in loopback XFS over
XFS configurations has reproduced xfslogd deadlocks due to the single
threaded nature of the queue and dependencies introduced between the
separate XFS instances by online discard (-o discard).

Discard over a loopback device converts the discard request to a hole
punch (fallocate) on the underlying file. Online discard requests are
issued synchronously and from xfslogd context in XFS, hence the xfslogd
workqueue is blocked in the upper fs waiting on a hole punch request to
be servied in the lower fs. If the lower fs issues I/O that depends on
xfslogd to complete, both filesystems end up hung indefinitely. This is
reproduced reliabily by generic/013 on XFS->loop->XFS test devices with
the '-o discard' mount option.

Further, docker implementations appear to use this kind of configuration
for container instance filesystems by default (container fs->dm->
loop->base fs) and therefore are subject to this deadlock when running
on XFS.

Replace the global xfslogd workqueue with a per-mount variant. This
guarantees each mount access to a single worker and prevents deadlocks
due to inter-fs dependencies introduced by discard. Since the queue is
only responsible for buffer iodone processing at this point in time,
rename xfslogd to xfs-buf.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-11-28 13:59:58 +11:00
Phillip Lougher
62421645bb Squashfs: Add LZ4 compression configuration option
Add the glue code, and also update the documentation.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2014-11-27 18:48:44 +00:00
Phillip Lougher
9c06a46f15 Squashfs: add LZ4 compression support
Add support for reading file systems compressed with the
LZ4 compression algorithm.

This patch adds the LZ4 decompressor wrapper code.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2014-11-27 07:44:11 +00:00
Arend van Spriel
98210b7f73 debugfs: add helper function to create device related seq_file
This patch adds a helper function that simplifies adding a
so-called single_open sequence file for device drivers. The
calling device driver needs to provide a read function and
a device pointer. The field struct seq_file::private will
reference the device pointer upon call to the read function
so the driver can obtain his data from it and do its task
of providing the file content using seq_printf() calls and
alike. Using this helper function also gets rid of the need
to specify file operations per debugfs file.

Signed-off-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-26 19:38:37 -08:00
Trond Myklebust
1702562db4 NFS: Generic client side changes from Chuck
These patches fixes for iostats and SETCLIENTID in addition to cleaning
 up the nfs4_init_callback() function.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUdfhQAAoJENfLVL+wpUDr9x4QANWmG6xjEU7RuIBLalOoxit6
 eXnEDNZlwYp6NCkYktHVWaTqXdwq7fdGX+3p4eiwNg3C3SrJHkpWvjeFd4KT/SyP
 B/w3vYG3o5H01i3Mb5kCD2uW0gbS9soZXpIg+uHHOl43yZzneC0vPTLhQ/h+9zPc
 B32XcgQhvLIHR3LWrC/+uolsa31lcyya0W45PX+iCpzHF7i9qRBrNLODkTlw1hNQ
 eEnYIvVy5oW00zlHJUYiTHP3e+0EJn5PAngdYbqiboJ9mK7DbB0QDwqyvJbIT7ql
 WAip6cNcJnSv1eiVYqDwlR1ok8drK5X7yQCT3lcLzAMDznLsSAL1Itu0h2Ay3z61
 f8XCyTwI0izq0DbdrMcPoqPSitqyM8nkPElnOuitXwzEroPaG40OF67yss3+ixbl
 JeQZ+u35pnpCkKUaZdCK3Pn83StxmUaBcFx8eg30NBc0SN13Eiz6aZcGEperClrR
 RwMLDUhUtAMcMRunRRxiN9lHafPqqeDeJre7uky0p0sU9CsH+1n5qIKLmk+Ber2d
 ZS29TobdR7ktfjQ52XazMAIFzI7r1v4Zn7ziH/WRbvKoKw9ICcyoUGNy9CS5LyWu
 BxWCZAJmby5H1i7V7ituJK8TImb81L4aPW06hHX9k+0SoTrgFjas//4jZYfysJ9N
 PvR0MRNfMFhuBlsTj7Gy
 =PMdS
 -----END PGP SIGNATURE-----

Merge tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next

Pull pull additional NFS client changes for 3.19 from Anna Schumaker:
  "NFS: Generic client side changes from Chuck

  These patches fixes for iostats and SETCLIENTID in addition to cleaning
  up the nfs4_init_callback() function.

  Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>"

* tag 'nfs-cel-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  NFS: Clean up nfs4_init_callback()
  NFS: SETCLIENTID XDR buffer sizes are incorrect
  SUNRPC: serialize iostats updates
2014-11-26 17:34:14 -05:00
Michael Halcrow
942080643b eCryptfs: Remove buggy and unnecessary write in file name decode routine
Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
end of the allocated buffer during encrypted filename decoding. This
fix corrects the issue by getting rid of the unnecessary 0 write when
the current bit offset is 2.

Signed-off-by: Michael Halcrow <mhalcrow@google.com>
Reported-by: Dmitry Chernenkov <dmitryc@google.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org # v2.6.29+: 51ca58d eCryptfs: Filename Encryption: Encoding and encryption functions
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2014-11-26 15:55:02 -06:00
Linus Torvalds
b914c5b213 Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "These fix one mishandling of the case when security labels are
  configured out, and two races in the 4.1 backchannel code"

* 'for-3.18' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix slot wake up race in the nfsv4.1 callback code
  SUNRPC: Fix locking around callback channel reply receive
  nfsd: correctly define v4.2 support attributes
2014-11-25 19:05:41 -08:00
Linus Torvalds
277f850fbc Merge git://git.kvack.org/~bcrl/aio-fixes
Pull aio fix from Ben LaHaise:
 "Dirty page accounting fix for aio"

* git://git.kvack.org/~bcrl/aio-fixes:
  aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer
2014-11-25 18:55:44 -08:00
Jaegeuk Kim
95f5b0fc5e f2fs: fix to recover converted inline_data
If an inode has converted inline_data which was written to the disk, we should
set its inode flag for further fsync so that this inline_data can be recovered
from sudden power off.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 18:08:00 -08:00
Jaegeuk Kim
158c194c37 f2fs: make clean the page before writing
If a page is set to be written to the disk, we can make clean the page.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 17:33:31 -08:00
Changman Lee
80ec2e914d f2fs: no more dirty_nat_entires when flushing
After flushing dirty nat entries, it has to be no more dirty nat
entries.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 17:26:36 -08:00
Changman Lee
20d047c876 f2fs: check dirty_nat_cnt before flushing nat entries in journal
It's meaningless to check dirty_nat_cnt after re-dirtying nat entries in
journal. And although there are rooms for dirty nat entires if dirty_nat_cnt
is zero, it's also meaningless to check __has_cursum_space.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 17:26:34 -08:00
Jan Kara
d4f7610743 ext4: forbid journal_async_commit in data=ordered mode
Option journal_async_commit breaks gurantees of data=ordered mode as it
sends only a single cache flush after writing a transaction commit
block. Thus even though the transaction including the commit block is
fully stored on persistent storage, file data may still linger in drives
caches and will be lost on power failure. Since all checksums match on
journal recovery, we replay the transaction thus possibly exposing stale
user data.

To fix this data exposure issue, remove the possibility to use
journal_async_commit in data=ordered mode.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:19:17 -05:00
Theodore Ts'o
d9f39d1e44 jbd2: remove unnecessary NULL check before iput()
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:02:37 -05:00
Markus Elfring
bfcba2d035 ext4: Remove an unnecessary check for NULL before iput()
The iput() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 20:01:37 -05:00
Anna Schumaker
624bd5b7b6 nfs: Add DEALLOCATE support
This patch adds support for using the NFS v4.2 operation DEALLOCATE to
punch holes in a file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-25 16:38:32 -05:00
Anna Schumaker
f4ac1674f5 nfs: Add ALLOCATE support
This patch adds support for using the NFS v4.2 operation ALLOCATE to
preallocate data in a file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-25 16:38:32 -05:00
Namjae Jeon
31fc006b12 ext4: remove unneeded code in ext4_unlink
Setting retval to zero is not needed in ext4_unlink.
Remove unneeded code.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:34:38 -05:00
Eric Sandeen
b003b52496 ext4: don't count external journal blocks as overhead
This was fixed for ext3 with:

e6d8fb3 ext3: Count internal journal as bsddf overhead in ext3_statfs

but was never fixed for ext4.

With a large external journal and no used disk blocks, df comes
out negative without this, as journal blocks are added to the
overhead & subtracted from used blocks unconditionally.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:27:44 -05:00
Jan Kara
733ded2a80 ext4: remove never taken branch from ext4_ext_shift_path_extents()
path[depth].p_hdr can never be NULL for a path passed to us (and even if
it could, EXT_LAST_EXTENT() would make something != NULL from it). So
just remove the branch.

Coverity-id: 1196498
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:23:48 -05:00
Chuck Lever
c2ef47b7f5 NFS: Clean up nfs4_init_callback()
nfs4_init_callback() is never invoked for NFS versions other than 4.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 16:22:16 -05:00
Chuck Lever
6dd3436b9d NFS: SETCLIENTID XDR buffer sizes are incorrect
Use the correct calculation of the maximum size of a clientaddr4
when encoding and decoding SETCLIENTID operations. clientaddr4 is
defined in section 2.2.10 of RFC3530bis-31.

The usage in encode_setclientid_maxsz is missing the 4-byte length
in both strings, but is otherwise correct. decode_setclientid_maxsz
simply asks for a page of receive buffer space, which is
unnecessarily large (more than 4KB).

Note that a SETCLIENTID reply is either clientid+verifier, or
clientaddr4, depending on the returned NFS status. It doesn't
hurt to allocate enough space for both.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 16:22:16 -05:00
Darrick J. Wong
c6d3d56dd0 ext4: create nojournal_checksum mount option
Create a mount option to disable journal checksumming (because the
metadata_csum feature turns it on by default now), and fix remount not
to allow changing the journal checksumming option, since changing the
mount options has no effect on the journal.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:20:50 -05:00
Wang Shilong
58d86a50ee ext4: update comments regarding ext4_delete_inode()
ext4_delete_inode() has been renamed for a long time, update
comments for this.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 16:17:29 -05:00
Jaegeuk Kim
5f72739583 f2fs: fix deadlock during inline_data conversion
A deadlock can be occurred:
Thread 1]                             Thread 2]
 - f2fs_write_data_pages              - f2fs_write_begin
   - lock_page(page #0)
                                        - grab_cache_page(page #X)
                                        - get_node_page(inode_page)
                                        - grab_cache_page(page #0)
                                          : to convert inline_data
   - f2fs_write_data_page
     - f2fs_write_inline_data
       - get_node_page(inode_page)

In this case, trying to lock inode_page and page #0 causes deadlock.
In order to avoid this, this patch adds a rule for this locking policy,
which is that page #0 should be locked followed by inode_page lock.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 12:08:30 -08:00
Markus Elfring
ce3e6d25f3 f2fs: fix typos for the word "destroy" in jump labels
Two jump labels were adjusted in the implementation of the
create_node_manager_caches() function because these identifiers
contained typos.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-25 12:08:22 -08:00
Dmitry Monakhov
4fdb554318 ext4: cleanup GFP flags inside resize path
We must use GFP_NOFS instead GFP_KERNEL inside ext4_mb_add_groupinfo
and ext4_calculate_overhead() because they are called from inside a
journal transaction. Call trace:

ioctl
 ->ext4_group_add
   ->journal_start
   ->ext4_setup_new_descs
     ->ext4_mb_add_groupinfo -> GFP_KERNEL
   ->ext4_flex_group_add
     ->ext4_update_super
       ->ext4_calculate_overhead  -> GFP_KERNEL
   ->journal_stop

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 13:08:04 -05:00
Jan Kara
2be12de98a ext4: introduce aging to extent status tree
Introduce a simple aging to extent status tree. Each extent has a
REFERENCED bit which gets set when the extent is used. Shrinker then
skips entries with referenced bit set and clears the bit. Thus
frequently used extents have higher chances of staying in memory.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:55:24 -05:00
Jan Kara
624d0f1dd7 ext4: cleanup flag definitions for extent status tree
Currently flags for extent status tree are defined twice, once shifted
and once without a being shifted. Consolidate these definitions into one
place and make some computations automatic to make adding flags less
error prone. Compiler should be clever enough to figure out these are
constants and generate the same code.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:53:47 -05:00
Jan Kara
dd47592551 ext4: limit number of scanned extents in status tree shrinker
Currently we scan extent status trees of inodes until we reclaim nr_to_scan
extents. This can however require a lot of scanning when there are lots
of delayed extents (as those cannot be reclaimed).

Change shrinker to work as shrinkers are supposed to and *scan* only
nr_to_scan extents regardless of how many extents did we actually
reclaim. We however need to be careful and avoid scanning each status
tree from the beginning - that could lead to a situation where we would
not be able to reclaim anything at all when first nr_to_scan extents in
the tree are always unreclaimable. We remember with each inode offset
where we stopped scanning and continue from there when we next come
across the inode.

Note that we also need to update places calling __es_shrink() manually
to pass reasonable nr_to_scan to have a chance of reclaiming anything and
not just 1.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:51:23 -05:00
Jan Kara
b0dea4c165 ext4: move handling of list of shrinkable inodes into extent status code
Currently callers adding extents to extent status tree were responsible
for adding the inode to the list of inodes with freeable extents. This
is error prone and puts list handling in unnecessarily many places.

Just add inode to the list automatically when the first non-delay extent
is added to the tree and remove inode from the list when the last
non-delay extent is removed.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:49:25 -05:00
Zheng Liu
edaa53cac8 ext4: change LRU to round-robin in extent status tree shrinker
In this commit we discard the lru algorithm for inodes with extent
status tree because it takes significant effort to maintain a lru list
in extent status tree shrinker and the shrinker can take a long time to
scan this lru list in order to reclaim some objects.

We replace the lru ordering with a simple round-robin.  After that we
never need to keep a lru list.  That means that the list needn't be
sorted if the shrinker can not reclaim any objects in the first round.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:45:37 -05:00
Zheng Liu
2f8e0a7c6c ext4: cache extent hole in extent status tree for ext4_da_map_blocks()
Currently extent status tree doesn't cache extent hole when a write
looks up in extent tree to make sure whether a block has been allocated
or not.  In this case, we don't put extent hole in extent cache because
later this extent might be removed and a new delayed extent might be
added back.  But it will cause a defect when we do a lot of writes.  If
we don't put extent hole in extent cache, the following writes also need
to access extent tree to look at whether or not a block has been
allocated.  It brings a cache miss.  This commit fixes this defect.
Also if the inode doesn't have any extent, this extent hole will be
cached as well.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:44:37 -05:00
Jan Kara
cbd7584e6e ext4: fix block reservation for bigalloc filesystems
For bigalloc filesystems we have to check whether newly requested inode
block isn't already part of a cluster for which we already have delayed
allocation reservation. This check happens in ext4_ext_map_blocks() and
that function sets EXT4_MAP_FROM_CLUSTER if that's the case. However if
ext4_da_map_blocks() finds in extent cache information about the block,
we don't call into ext4_ext_map_blocks() and thus we always end up
getting new reservation even if the space for cluster is already
reserved. This results in overreservation and premature ENOSPC reports.

Fix the problem by checking for existing cluster reservation already in
ext4_da_map_blocks(). That simplifies the logic and actually allows us
to get rid of the EXT4_MAP_FROM_CLUSTER flag completely.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-11-25 11:41:49 -05:00
Filipe Manana
9ea24bbe17 Btrfs: fix snapshot inconsistency after a file write followed by truncate
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Filipe Manana
e5fa8f865b Btrfs: ensure send always works on roots without orphans
Move the logic from the snapshot creation ioctl into send. This avoids
doing the transaction commit if send isn't used, and ensures that if
a crash/reboot happens after the transaction commit that created the
snapshot and before the transaction commit that switched the commit
root, send will not get a commit root that differs from the main root
(that has orphan items).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Filipe Manana
758eb51e71 Btrfs: fix freeing used extent after removing empty block group
Due to ignoring errors returned by clear_extent_bits (at the moment only
-ENOMEM is possible), we can end up freeing an extent that is actually in
use (i.e. return the extent to the free space cache).

The sequence of steps that lead to this:

1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
   the goal of freeing empty block groups;

2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
   transaction (or starts a new one if none is running) and attempts to
   clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
   and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);

3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
   -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
   to delete the block group and the respective chunk, while pinned_extents
   remains with that bit set for the whole (or a part of the) range covered
   by the block group;

4) Later while the transaction is still running, the chunk ends up being reused
   for a new block group (maybe for different purpose, data or metadata), and
   extents belonging to the new block group are allocated for file data or btree
   nodes/leafs;

5) The current transaction is committed, meaning that we unpinned one or more
   extents from the new block group (through btrfs_finish_extent_commit() and
   unpin_extent_range()) which are now being used for new file data or new
   metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
   And unpinning means we returned the extents to the free space cache of the
   new block group, which implies those extents can be used for future allocations
   while they're still in use.

Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
object in unpin_extent_range() if a new block group didn't end up being allocated for
the same chunk (step 4 above).

Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Chris Mason
8f608de699 Btrfs: include vmalloc.h in check-integrity.c
Fengguang's build monster reported warnings on some arches because we
don't have vmalloc.h included

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: fengguang.wu@intel.com
2014-11-25 06:01:11 -08:00
Qu Wenruo
084b6e7c76 btrfs: Fix a lockdep warning when running xfstest.
The following lockdep warning is triggered during xfstests:

[ 1702.980872] =========================================================
[ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
[ 1702.981482] 3.18.0-rc1 #27 Not tainted
[ 1702.981781] ---------------------------------------------------------
[ 1702.982095] kswapd0/77 just changed the state of lock:
[ 1702.982415]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
[ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1702.983160]  (&fs_info->dev_replace.lock){+.+.+.}

and interrupts could create inverse lock ordering between them.

[ 1702.984675]
other info that might help us debug this:
[ 1702.985524] Chain exists of:
  &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock

[ 1702.986799]  Possible interrupt unsafe locking scenario:

[ 1702.987681]        CPU0                    CPU1
[ 1702.988137]        ----                    ----
[ 1702.988598]   lock(&fs_info->dev_replace.lock);
[ 1702.989069]                                local_irq_disable();
[ 1702.989534]                                lock(&delayed_node->mutex);
[ 1702.990038]                                lock(&found->groups_sem);
[ 1702.990494]   <Interrupt>
[ 1702.990938]     lock(&delayed_node->mutex);
[ 1702.991407]
 *** DEADLOCK ***

It is because the btrfs_kobj_{add/rm}_device() will call memory
allocation with GFP_KERNEL,
which may flush fs page cache to free space, waiting for it self to do
the commit, causing the deadlock.

To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
dev_replace lock range, also involing split the
btrfs_rm_dev_replace_srcdev() function into remove and free parts.

Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
called out of the lock range.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 05:55:38 -08:00
Chris Mason
ad27c0dab7 Merge branch 'dev/pending-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-11-25 05:45:30 -08:00
Li RongQing
e9f456ca50 nfs: define nfs_inc_fscache_stats and using it as possible
Define and use nfs_inc_fscache_stats when plus one, which can save to
pass one parameter.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 20:08:47 -05:00
Li RongQing
5a254d08b0 nfs: replace nfs_add_stats with nfs_inc_stats when add one
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 20:08:47 -05:00
Markus Elfring
fe0bf1185d NFS: Deletion of unnecessary checks before the function call "nfs_put_client"
The nfs_put_client() function tests whether its argument is NULL and then
returns immediately. Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 20:07:27 -05:00
Jeff Layton
10b89567db lockd: eliminate LOCKD_DEBUG
LOCKD_DEBUG is always the same value as CONFIG_SUNRPC_DEBUG, so we can
just use it instead.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:24:08 -05:00
Peng Tao
4bd5a980de nfs41: fix nfs4_proc_layoutget error handling
nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt
early so that it is safe to call .release in case nfs4_alloc_pages
fails.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Fixes: a47970ff78 ("NFSv4.1: Hold reference to layout hdr in layoutget")
Cc: stable@vger.kernel.org # 3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:14:54 -05:00
Weston Andros Adamson
cb1410c71e NFS: fix subtle change in COMMIT behavior
Recent work in the pgio layer made it possible for there to be more than one
request per page. This caused a subtle change in commit behavior, because
write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for
writeback against the number of requests on a commit list to choose when to
send a COMMIT in a non-blocking flush.

This is probably hard to hit in normal operation - you have to be using
rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page
aligned to have a noticeable change in behavior.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:00:42 -05:00
Christoph Hellwig
6a74c0c940 pnfs/blocklayout: fix end calculation in pnfs_num_cont_bytes
Use the number of pages in the pagecache mapping instead of the
number of pnfs requests which is only slightly related.

Reported-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:00:41 -05:00
Anna Schumaker
878ffa9f85 NFS: Use nfs_server_capable() for checknig NFS_CAP_SEEK
This should make the code easier to maintain in the future.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 12:49:13 -05:00
Jan Kara
c1b69b1ca1 nfs: Remove dead case from nfs4_map_errors()
NFS4ERR_ACCESS has number 13 and thus is matched and returned
immediately at the beginning of nfs4_map_errors() and there's no point
in checking it later.

Coverity-id: 733891
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 12:48:14 -05:00
Paul Burton
774c105ed8 binfmt_elf: allow arch code to examine PT_LOPROC ... PT_HIPROC headers
MIPS is introducing new variants of its O32 ABI which differ in their
handling of floating point, in order to enable a gradual transition
towards a world where mips32 binaries can take advantage of new hardware
features only available when configured for certain FP modes. In order
to do this ELF binaries are being augmented with a new section that
indicates, amongst other things, the FP mode requirements of the binary.
The presence & location of such a section is indicated by a program
header in the PT_LOPROC ... PT_HIPROC range.

In order to allow the MIPS architecture code to examine the program
header & section in question, pass all program headers in this range
to an architecture-specific arch_elf_pt_proc function. This function
may return an error if the header is deemed invalid or unsuitable for
the system, in which case that error will be returned from
load_elf_binary and upwards through the execve syscall.

A means is required for the architecture code to make a decision once
it is known that all such headers have been seen, but before it is too
late to return from an execve syscall. For this purpose the
arch_check_elf function is added, and called once, after all PT_LOPROC
to PT_HIPROC headers have been passed to arch_elf_pt_proc but before
the code which invoked execve has been lost. This enables the
architecture code to make a decision based upon all the headers present
in an ELF binary and its interpreter, as is required to forbid
conflicting FP ABI requirements between an ELF & its interpreter.

In order to allow data to be stored throughout the calls to the above
functions, struct arch_elf_state is introduced.

Finally a variant of the SET_PERSONALITY macro is introduced which
accepts a pointer to the struct arch_elf_state, allowing it to act
based upon state observed from the architecture specific program
headers.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7679/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-11-24 07:45:02 +01:00
Paul Burton
a9d9ef133f binfmt_elf: load interpreter program headers earlier
Load the program headers of an ELF interpreter early enough in
load_elf_binary that they can be examined before it's too late to return
an error from an exec syscall. This patch does not perform any such
checking, it merely lays the groundwork for a further patch to do so.

No functional change is intended.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7675/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-11-24 07:45:02 +01:00
Paul Burton
6a8d38945c binfmt_elf: Hoist ELF program header loading to a function
load_elf_binary & load_elf_interp both load program headers from an ELF
executable in the same way, duplicating the code. This patch introduces
a helper function (load_elf_phdrs) which performs this common task &
calls it from both load_elf_binary & load_elf_interp. In addition to
reducing code duplication, this is part of preparing to load the ELF
interpreter headers earlier such that they can be examined before it's
too late to return an error from an exec syscall.

Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Cc: linux-mips@linux-mips.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/7676/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2014-11-24 07:45:02 +01:00
Jaegeuk Kim
0341845efc f2fs: fix livelock calling f2fs_iget during f2fs_evict_inode
In f2fs_evict_inode,
 commit_inmemory_pages
   f2fs_gc
     f2fs_iget
       iget_locked
         -> wait for inode free

Here, if the inode is same as the one to be evicted, f2fs should wait forever.
Actually, we should not call f2fs_balance_fs during f2fs_evict_inode to avoid
this.

But, the commit_inmem_pages calls f2fs_balance_fs by default, even if
f2fs_evict_inode wants to free inmemory pages only.

Hence, this patch adds to trigger f2fs_balance_fs only when there is something
to write.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-11-23 21:51:57 -08:00