Commit Graph

39778 Commits

Author SHA1 Message Date
Ryusuke Konishi
283ee1482f nilfs2: fix deadlock of segment constructor during recovery
According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck
during recovery at mount time.  The code path that caused the deadlock was
as follows:

  nilfs_fill_super()
    load_nilfs()
      nilfs_salvage_orphan_logs()
        * Do roll-forwarding, attach segment constructor for recovery,
          and kick it.

        nilfs_segctor_thread()
          nilfs_segctor_thread_construct()
           * A lock is held with nilfs_transaction_lock()
             nilfs_segctor_do_construct()
               nilfs_segctor_drop_written_files()
                 iput()
                   iput_final()
                     write_inode_now()
                       writeback_single_inode()
                         __writeback_single_inode()
                           do_writepages()
                             nilfs_writepage()
                               nilfs_construct_dsync_segment()
                                 nilfs_transaction_lock() --> deadlock

This can happen if commit 7ef3ff2fea ("nilfs2: fix deadlock of segment
constructor over I_SYNC flag") is applied and roll-forward recovery was
performed at mount time.  The roll-forward recovery can happen if datasync
write is done and the file system crashes immediately after that.  For
instance, we can reproduce the issue with the following steps:

 < nilfs2 is mounted on /nilfs (device: /dev/sdb1) >
 # dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync
 # dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k
 count=1 && reboot -nfh
 < the system will immediately reboot >
 # mount -t nilfs2 /dev/sdb1 /nilfs

The deadlock occurs because iput() can run segment constructor through
writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags.  The
above commit changed segment constructor so that it calls iput()
asynchronously for inodes with i_nlink == 0, but that change was
imperfect.

This fixes the another deadlock by deferring iput() in segment constructor
even for the case that mount is not finished, that is, for the case that
MS_ACTIVE flag is not set.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reported-by: Yuxuan Shui <yshuiv7@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
Mark Fasheh
18d585f0f2 ocfs2: make append_dio an incompat feature
It turns out that making this feature ro_compat isn't quite enough to
prevent accidental corruption on mount from older kernels.  Ocfs2 (like
other file systems) will process orphaned inodes even when the user mounts
in 'ro' mode.  So for the case of a filesystem not knowing the append_dio
feature, mounting the filesystem could result in orphaned-for-dio files
being deleted, which we clearly don't want.

So instead, turn this into an incompat flag.

Btw, this is kind of my fault - initially I asked that we add a flag to
cover the feature and even suggested that we use an ro flag.  It wasn't
until I was looking through our commits for v4.0-rc1 that I realized we
actually want this to be incompat.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:07 -07:00
Linus Torvalds
84399bb075 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Outside of misc fixes, Filipe has a few fsync corners and we're
  pulling in one more of Josef's fixes from production use here"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
  Btrfs: fix data loss in the fast fsync path
  Btrfs: remove extra run_delayed_refs in update_cowonly_root
  Btrfs: incremental send, don't rename a directory too soon
  btrfs: fix lost return value due to variable shadowing
  Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
  Btrfs: fix off-by-one logic error in btrfs_realloc_node
  Btrfs: add missing inode update when punching hole
  Btrfs: abort the transaction if we fail to update the free space cache inode
  Btrfs: fix fsync race leading to ordered extent memory leaks
2015-03-06 13:52:54 -08:00
Linus Torvalds
7c5bde7ade File locking related fix for v4.0 (#3)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU+EEgAAoJEAAOaEEZVoIVNtgP/33VZs6/5rthhMhhFt5Gonwn
 2wKjrA8Tuskxg2Ij7RLSCnZNeoscF1Te7GNRQdalhO4mHoDzQx3zSHuV/STNlILg
 6wNSd0pVj4FjXQN8Kv/slJsKiUu2U33lQwXNiVbgPI4CAdRAVJqYZ/7EIgf7VciQ
 FojqR3/RTSW1CHuYcNL5CTkn9Pdm8cwqacXMcKBWK6t1ZiEBqaP0GD0UhpFsUeRF
 RPv0ba3a+iuf/ToxhKV1fGWjJUplBUR2FMCnEiUFCnOGNPYeknWKp8wGeRp8bYA0
 6Umpk/NLg8mVSzbdW3wxBQ25PRBNVBAmQutFD2aEjIMPXuKPt4IHe/os2SssbfbJ
 wyCZ4LEbogcbmOlBAePtlUloy2GM3F7lQN+6GUgiLAT/I0+1VCjESsBftZP7QKRT
 N6szoAAsIDNmXcaApYPZZk3JBoyiwHurbrhI23V74p91esVNFqWJaVTpAbn8TkD7
 u/hGL7Loi7sXst2g9XISXvcRkcGUKKXf727Ih9wYQx8N6McP1sDgSC9PNYHz2KLo
 Ha4pQo+1+t0tXGmTsGpHBouAsSntURjn5/vv8OHbi0G3hS5v06G/j0vkU3lRtMqv
 koomIyShwUORLJxo6oJ3yKXQdAnz0Q78LkfDdSyq/M17G1FvOkQhw42a/LJ/A1w1
 yzJ5XmwbkM5GcG3LVTLs
 =ND3k
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux

Pull file locking fix from Jeff Layton:
 "Just a single patch to fix a memory leak that Daniel Wagner discovered
  while doing some testing with leases"

* tag 'locks-v4.0-3' of git://git.samba.org/jlayton/linux:
  locks: fix fasync_struct memory leak in lease upgrade/downgrade handling
2015-03-06 10:31:38 -08:00
Linus Torvalds
1b1bd56191 NFS client bugfixes for Linux 4.0
Highlights include:
 
 - Fix a regression in the NFSv4 open state recovery code
 - Fix a regression in the NFSv4 close code
 - Fix regressions and side-effects of the loop-back mounted NFS fixes
   in 3.18, that cause the NFS read() syscall to return EBUSY.
 - Fix regressions around the readdirplus code and how it interacts with
   the VFS lazy unmount changes that went into v3.18.
 - Fix issues with out-of-order RPC call replies replacing updated
   attributes with stale ones (particularly after a truncate()).
 - Fix an underflow checking issue with RPC/RDMA credits
 - Fix a number of issues with the NFSv4 delegation return/free code.
 - Fix issues around stale NFSv4.1 leases when doing a mount
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU+SJ6AAoJEGcL54qWCgDyVgQQAKNsF/O2O9ip2uAHZRZvM6TS
 Ev8c3Spj/FmRI1tlCcGi1zZ8uCSwvPQz3uAN0vOTMocUWjokT5sAgN5yBIxHasem
 6YK7jxs9WHiP7MGyReaAFJwG/W6LZndnlNqPWPs9KiaWwKVXIsQ3uFm/Y0lr90Fi
 ew16DQm0DUd4Yvv42WJR9ay7UwUPvT7wmaGIVVK2hjQqr2lx02jspt5kfrC+vsMU
 OYU/0YDofb2ajs5krbah6tUHf3VDnSVrXP6if66IrukCM9S4AvowpnMJQ5QJALh+
 cPlqHDm2ZzuIecpqZEgYLM73wQ2q+KBXTlDLcgYg6LjnqBEivwO9RDn6tfCwKTcS
 tCFohQc9iDOj9rTZ9EQlQME6u/FdxWncpxyTsMyBk7FlcLsOQRqio/FXZhbyJGuH
 gvIdIW3fPseijtejpYkxgabe6JL9NRvjv3SnOay7xHs9Vn4tRkFF2mkkQDZOG6HT
 atkxQp8kB3m9gMoeAmTLdTcJkcFdk6AKnNrcyJaa1GW4msmMuGZtq/6vayKJsBZb
 OIw788bDSOVpcVR+6SAC24/dutcl+WJHlSJShqvIrTKejBxPCc7IVSeg63FJsWTO
 sxfXdUr3wVJ1ooDFCspeBj+3zAimIq7qDmRRs85ekgEtxrgdUldhg/VwbDjvLjmb
 whXFpiCS7Ii0fVYtZvjz
 =qMB7
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

   - Fix a regression in the NFSv4 open state recovery code
   - Fix a regression in the NFSv4 close code
   - Fix regressions and side-effects of the loop-back mounted NFS fixes
     in 3.18, that cause the NFS read() syscall to return EBUSY.
   - Fix regressions around the readdirplus code and how it interacts
     with the VFS lazy unmount changes that went into v3.18.
   - Fix issues with out-of-order RPC call replies replacing updated
     attributes with stale ones (particularly after a truncate()).
   - Fix an underflow checking issue with RPC/RDMA credits
   - Fix a number of issues with the NFSv4 delegation return/free code.
   - Fix issues around stale NFSv4.1 leases when doing a mount"

* tag 'nfs-for-4.0-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
  NFSv4.1: Clear the old state by our client id before establishing a new lease
  NFSv4: Fix a race in NFSv4.1 server trunking discovery
  NFS: Don't write enable new pages while an invalidation is proceeding
  NFS: Fix a regression in the read() syscall
  NFSv4: Ensure we skip delegations that are already being returned
  NFSv4: Pin the superblock while we're returning the delegation
  NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()
  NFSv4: Ensure that we don't reap a delegation that is being returned
  NFS: Fix stateid used for NFS v4 closes
  NFSv4: Don't call put_rpccred() under the rcu_read_lock()
  NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()
  NFSv3: Use the readdir fileid as the mounted-on-fileid
  NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()
  NFSv4: Set a barrier in the update_changeattr() helper
  NFS: Fix nfs_post_op_update_inode() to set an attribute barrier
  NFS: Remove size hack in nfs_inode_attrs_need_update()
  NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit
  NFS: Add attribute update barriers to NFS writebacks
  NFS: Set an attribute barrier on all updates
  NFS: Add attribute update barriers to nfs_setattr_update_inode()
  ...
2015-03-06 10:09:57 -08:00
Quentin Casasnovas
dd9ef135e3 Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
Improper arithmetics when calculting the address of the extended ref could
lead to an out of bounds memory read and kernel panic.

Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
cc: stable@vger.kernel.org # v3.7+
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:33 -08:00
Filipe Manana
3a8b36f378 Btrfs: fix data loss in the fast fsync path
When using the fast file fsync code path we can miss the fact that new
writes happened since the last file fsync and therefore return without
waiting for the IO to finish and write the new extents to the fsync log.

Here's an example scenario where the fsync will miss the fact that new
file data exists that wasn't yet durably persisted:

1. fs_info->last_trans_committed == N - 1 and current transaction is
   transaction N (fs_info->generation == N);

2. do a buffered write;

3. fsync our inode, this clears our inode's full sync flag, starts
   an ordered extent and waits for it to complete - when it completes
   at btrfs_finish_ordered_io(), the inode's last_trans is set to the
   value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
   btrfs_set_inode_last_trans);

4. transaction N is committed, so fs_info->last_trans_committed is now
   set to the value N and fs_info->generation remains with the value N;

5. do another buffered write, when this happens btrfs_file_write_iter
   sets our inode's last_trans to the value N + 1 (that is
   fs_info->generation + 1 == N + 1);

6. transaction N + 1 is started and fs_info->generation now has the
   value N + 1;

7. transaction N + 1 is committed, so fs_info->last_trans_committed
   is set to the value N + 1;

8. fsync our inode - because it doesn't have the full sync flag set,
   we only start the ordered extent, we don't wait for it to complete
   (only in a later phase) therefore its last_trans field has the
   value N + 1 set previously by btrfs_file_write_iter(), and so we
   have:

       inode->last_trans <= fs_info->last_trans_committed
           (N + 1)              (N + 1)

   Which made us not log the last buffered write and exit the fsync
   handler immediately, returning success (0) to user space and resulting
   in data loss after a crash.

This can actually be triggered deterministically and the following excerpt
from a testcase I made for xfstests triggers the issue. It moves a dummy
file across directories and then fsyncs the old parent directory - this
is just to trigger a transaction commit, so moving files around isn't
directly related to the issue but it was chosen because running 'sync' for
example does more than just committing the current transaction, as it
flushes/waits for all file data to be persisted. The issue can also happen
at random periods, since the transaction kthread periodicaly commits the
current transaction (about every 30 seconds by default).
The body of the test is:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our main test file 'foo', the one we check for data loss.
  # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
  # bit from its flags (btrfs inode specific flags).
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                  -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io

  # Now create one other file and 2 directories. We will move this second file
  # from one directory to the other later because it forces btrfs to commit its
  # currently open transaction if we fsync the old parent directory. This is
  # necessary to trigger the data loss bug that affected btrfs.
  mkdir $SCRATCH_MNT/testdir_1
  touch $SCRATCH_MNT/testdir_1/bar
  mkdir $SCRATCH_MNT/testdir_2

  # Make sure everything is durably persisted.
  sync

  # Write more 8Kb of data to our file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io

  # Move our 'bar' file into a new directory.
  mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar

  # Fsync our first directory. Because it had a file moved into some other
  # directory, this made btrfs commit the currently open transaction. This is
  # a condition necessary to trigger the data loss bug.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1

  # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
  # data we wrote previously to be persisted and available if a crash happens.
  # This did not happen with btrfs, because of the transaction commit that
  # happened when we fsynced the parent directory.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Now check that all data we wrote before are available.
  echo "File content after log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected golden output for the test, which is what we get with this
fix applied (or when running against ext3/4 and xfs), is:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0040000

Without this fix applied, the output shows the test file does not have
the second 8Kb extent that we successfully fsynced:

  wrote 8192/8192 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 8192/8192 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000

So fix this by skipping the fsync only if we're doing a full sync and
if the inode's last_trans is <= fs_info->last_trans_committed, or if
the inode is already in the log. Also remove setting the inode's
last_trans in btrfs_file_write_iter since it's useless/unreliable.

Also because btrfs_file_write_iter no longer sets inode->last_trans to
fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
bail out if last_trans is 0, otherwise something as simple as the following
example wouldn't log the second write on the last fsync:

  1. write to file

  2. fsync file

  3. fsync file
       |--> btrfs_inode_in_log() returns true and it set last_trans to 0

  4. write to file
       |--> btrfs_file_write_iter() no longers sets last_trans, so it
            remained with a value of 0
  5. fsync
       |--> inode->last_trans == 0, so it bails out without logging the
            second write

A test case for xfstests will be sent soon.

CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:32 -08:00
Josef Bacik
f5c0a12280 Btrfs: remove extra run_delayed_refs in update_cowonly_root
This got added with my dirty_bgs patch, it's not needed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-05 17:28:30 -08:00
Jeff Layton
0164bf0239 locks: fix fasync_struct memory leak in lease upgrade/downgrade handling
Commit 8634b51f6c (locks: convert lease handling to file_lock_context)
introduced a regression in the handling of lease upgrade/downgrades.

In the event that we already have a lease on a file and are going to
either upgrade or downgrade it, we skip doing any list insertion or
deletion and simply re-call lm_setup on the existing lease.

As of commit 8634b51f6c however, we end up calling lm_setup on the
lease that was passed in, instead of on the existing lease. This causes
us to leak the fasync_struct that was allocated in the event that there
was not already an existing one (as it always appeared that there
wasn't one).

Fixes: 8634b51f6c (locks: convert lease handling to file_lock_context)
Reported-and-Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-03-04 17:34:32 -05:00
Linus Torvalds
8a001af4bb Fixes for proper ioctl handling and an untriggerable buffer overflow
- The eCryptfs ioctl handling functions should only pass known-good ioctl
   commands to the lower filesystem
 - A static checker found a potential buffer overflow. Upon inspection, it is
   not triggerable due to input validation performed on the mount parameters.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJU94JWAAoJENaSAD2qAscK5z0P/3rrZv7w4rWnPeLfzddRCSpt
 QMCPGGXE2fo3nP9w6e7HirAQKg474CFdmNXbNzAKR08irQNRM9lMMEkUp8B9kwXN
 8Ms+lHPVuTZuBPXkqtpG/p47kAJdc1d9QePa0iU5PPp5GcdI/knPR/Md42NxUd4y
 kqMdgK7brO/y16i5aCC40CG2x+OPc5I8Xz/9MT9fm9+NTRzOxFhbLxis5LKUrQgj
 SnzkXn4Z1jsLt3y8OCFhrP9n9sHrKmcHpxBExa7GziADbIcw+nv/ugQy8Dvi8sHK
 zzO1G21uUlZXfMoWdv+DbwgU6subaT/D6NjcyVEhZ0ziYQDdMMOVvqDUFzwIV9W7
 WiIV2fYLxYlZHSv318BlYTT6XiDVveINUcLI2107cw4pBwMYUt9QY3QoKLQ7027o
 HhdX4Ys6chzwggWU0CRRI7W45/LTF0KeJ3g06tAsYVV6GFGoJAnZe7olHA6X7nGE
 s8rXzpT5zZqdcz7Y5ln5NtrzWBER91iBUDivafw6y7b0rqj6o+fEUk4vXNdsyuzZ
 y5Qn5dus1ImPLwtCAsfqZMjUUnNtOaBPd52k0sHIkRoY6W7GlXZNOjXA7ki+X5tX
 Jq4tm0n3fkInEhjJDQ/sxtPa+TcZGgcoyc3qNsX80Lob7QpaTvBczKFgBadNnMnK
 TKJEvJQuLIlAq3Lw5DGL
 =4fdp
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-4.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs fixes from Tyler Hicks:
 "Fixes for proper ioctl handling and an untriggerable buffer overflow

   - The eCryptfs ioctl handling functions should only pass known-good
     ioctl commands to the lower filesystem

   - A static checker found a potential buffer overflow.  Upon
     inspection, it is not triggerable due to input validation performed
     on the mount parameters"

* tag 'ecryptfs-4.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: don't pass fs-specific ioctl commands through
  eCryptfs: ensure copy to crypt_stat->cipher does not overrun
2015-03-04 14:19:48 -08:00
Trond Myklebust
e11259f920 NFSv4.1: Clear the old state by our client id before establishing a new lease
If the call to exchange-id returns with the EXCHGID4_FLAG_CONFIRMED_R flag
set, then that means our lease was established by a previous mount instance.
Ensure that we detect this situation, and that we clear the state held by
that mount.

Reported-by: Jorge Mora <Jorge.Mora@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-03 21:52:30 -05:00
Trond Myklebust
48d66b9749 NFSv4: Fix a race in NFSv4.1 server trunking discovery
We do not want to allow a race with another NFS mount to cause
nfs41_walk_client_list() to establish a lease on our nfs_client before
we're done checking for trunking.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-03 20:42:23 -05:00
Linus Torvalds
a6c5170d1e Merge branch 'for-4.0' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
 "Three miscellaneous bugfixes, most importantly the clp->cl_revoked
  bug, which we've seen several reports of people hitting"

* 'for-4.0' of git://linux-nfs.org/~bfields/linux:
  sunrpc: integer underflow in rsc_parse()
  nfsd: fix clp->cl_revoked list deletion causing softlock in nfsd
  svcrpc: fix memory leak in gssp_accept_sec_context_upcall
2015-03-03 15:52:50 -08:00
Trond Myklebust
ef070dcb39 NFS: Don't write enable new pages while an invalidation is proceeding
nfs_vm_page_mkwrite() should wait until the page cache invalidation
is finished. This is the second patch in a 2 patch series to deprecate
the NFS client's reliance on nfs_release_page() in the context of
nfs_invalidate_mapping().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-03 13:58:08 -05:00
Trond Myklebust
874f946376 NFS: Fix a regression in the read() syscall
When invalidating the page cache for a regular file, we want to first
sync all dirty data to disk and then call invalidate_inode_pages2().
The latter relies on nfs_launder_page() and nfs_release_page() to deal
respectively with dirty pages, and unstable written pages.

When commit 9590544694 ("NFS: avoid deadlocks with loop-back mounted
NFS filesystems.") changed the behaviour of nfs_release_page(), then it
made it possible for invalidate_inode_pages2() to fail with an EBUSY.
Unfortunately, that error is then propagated back to read().

Let's therefore work around the problem for now by protecting the call
to sync the data and invalidate_inode_pages2() so that they are atomic
w.r.t. the addition of new writes.
Later on, we can revisit whether or not we still need nfs_launder_page()
and nfs_release_page().

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-03 13:02:29 -05:00
Tyler Hicks
6d65261a09 eCryptfs: don't pass fs-specific ioctl commands through
eCryptfs can't be aware of what to expect when after passing an
arbitrary ioctl command through to the lower filesystem. The ioctl
command may trigger an action in the lower filesystem that is
incompatible with eCryptfs.

One specific example is when one attempts to use the Btrfs clone
ioctl command when the source file is in the Btrfs filesystem that
eCryptfs is mounted on top of and the destination fd is from a new file
created in the eCryptfs mount. The ioctl syscall incorrectly returns
success because the command is passed down to Btrfs which thinks that it
was able to do the clone operation. However, the result is an empty
eCryptfs file.

This patch allows the trim, {g,s}etflags, and {g,s}etversion ioctl
commands through and then copies up the inode metadata from the lower
inode to the eCryptfs inode to catch any changes made to the lower
inode's metadata. Those five ioctl commands are mostly common across all
filesystems but the whitelist may need to be further pruned in the
future.

https://bugzilla.kernel.org/show_bug.cgi?id=93691
https://launchpad.net/bugs/1305335

Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Rocko <rockorequin@hotmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: stable@vger.kernel.org # v2.6.36+: c43f7b8 eCryptfs: Handle ioctl calls with unlocked and compat functions
2015-03-03 02:03:56 -06:00
Trond Myklebust
ec3ca4e57e NFSv4: Ensure we skip delegations that are already being returned
In nfs_client_return_marked_delegations() and nfs_delegation_reap_unclaimed()
we want to optimise the loop traversal by skipping delegations that are
already in the process of being returned.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:09:15 -05:00
Trond Myklebust
9f0f8e12c4 NFSv4: Pin the superblock while we're returning the delegation
This patch ensures that the superblock doesn't go ahead and disappear
underneath us while the state manager thread is returning delegations.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:09:14 -05:00
Trond Myklebust
ade04647dd NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation()
Ensure that nfs_inode_set_delegation() doesn't inadvertently detach a
delegation that is already in the process of being returned.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:09:14 -05:00
Trond Myklebust
b04b22f4ca NFSv4: Ensure that we don't reap a delegation that is being returned
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:09:13 -05:00
Anna Schumaker
369d6b7f00 NFS: Fix stateid used for NFS v4 closes
After 566fcec60 the client uses the "current stateid" from the
nfs4_state structure to close a file.  This could potentially contain a
delegation stateid, which is disallowed by the protocol and causes
servers to return NFS4ERR_BAD_STATEID.  This patch restores the
(correct) behavior of sending the open stateid to close a file.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 566fcec60 (NFSv4: Fix an atomicity problem in CLOSE)
Signed-off-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-02 18:06:42 -05:00
Filipe Manana
84471e2429 Btrfs: incremental send, don't rename a directory too soon
There's one more case where we can't issue a rename operation for a
directory as soon as we process it. We used to delay directory renames
only if they have some ancestor directory with a higher inode number
that got renamed too, but there's another case where we need to delay
the rename too - when a directory A is renamed to the old name of a
directory B but that directory B has its rename delayed because it
has now (in the send root) an ancestor with a higher inode number that
was renamed. If we don't delay the directory rename in this case, the
receiving end of the send stream will attempt to rename A to the old
name of B before B got renamed to its new name, which results in a
"directory not empty" error. So fix this by delaying directory renames
for this case too.

Steps to reproduce:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/a
  $ mkdir /mnt/b
  $ mkdir /mnt/c
  $ touch /mnt/a/file

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1

  $ mv /mnt/c /mnt/x
  $ mv /mnt/a /mnt/x/y
  $ mv /mnt/b /mnt/a

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2

  $ btrfs send /mnt/snap1 -f /tmp/1.send
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt2
  $ btrfs receive /mnt2 -f /tmp/1.send
  $ btrfs receive /mnt2 -f /tmp/2.send
  ERROR: rename b -> a failed. Directory not empty

A test case for xfstests follows soon.

Reported-by: Ames Cornish <ames@cornishes.net>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
David Sterba
1932b7be97 btrfs: fix lost return value due to variable shadowing
A block-local variable stores error code but btrfs_get_blocks_direct may
not return it in the end as there's a ret defined in the function scope.

CC: <stable@vger.kernel.org>	# 3.6+
Fixes: d187663ef2 ("Btrfs: lock extents as we map them in DIO")
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana
5cdf83edb8 Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
The return value from btrfs_lookup_xattr() can be a pointer encoding an
error, therefore deal with it. This fixes commit 5f5bc6b1e2
("Btrfs: make xattr replace operations atomic").

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana
5dfe2be7ea Btrfs: fix off-by-one logic error in btrfs_realloc_node
The end_slot variable actually matches the number of pointers in the
node and not the last slot (which is 'nritems - 1'). Therefore in order
to check that the current slot in the for loop doesn't match the last
one, the correct logic is to check if 'i' is less than 'end_slot - 1'
and not 'end_slot - 2'.

Fix this and set end_slot to be 'nritems - 1', as it's less confusing
since the variable name implies it's inclusive rather then exclusive.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:45 -08:00
Filipe Manana
e8c1c76e80 Btrfs: add missing inode update when punching hole
When punching a file hole if we endup only zeroing parts of a page,
because the start offset isn't a multiple of the sector size or the
start offset and length fall within the same page, we were not updating
the inode item. This prevented an fsync from doing anything, if no other
file changes happened in the current transaction, because the fields
in btrfs_inode used to check if the inode needs to be fsync'ed weren't
updated.

This issue is easy to reproduce and the following excerpt from the
xfstest case I made shows how to trigger it:

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create our test file.
  $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  # Fsync the file, this makes btrfs update some btrfs inode specific fields
  # that are used to track if the inode needs to be written/updated to the fsync
  # log or not. After this fsync, the new values for those fields indicate that
  # a subsequent fsync does not need to touch the fsync log.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  # Force a commit of the current transaction. After this point, any operation
  # that modifies the data or metadata of our file, should update those fields in
  # the btrfs inode with values that make the next fsync operation write to the
  # fsync log.
  sync

  # Punch a hole in our file. This small range affects only 1 page.
  # This made the btrfs hole punching implementation write only some zeroes in
  # one page, but it did not update the btrfs inode fields used to determine if
  # the next fsync needs to write to the fsync log.
  $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo

  # Another variation of the previously mentioned case.
  $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo

  # Now fsync the file. This was a no-operation because the previous hole punch
  # operation didn't update the inode's fields mentioned before, so they remained
  # with the values they had after the first fsync - that is, they indicate that
  # it is not needed to write to fsync log.
  $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

  echo "File content before:"
  od -t x1 $SCRATCH_MNT/foo

  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey

  # Enable writes and mount the fs. This makes the fsync log replay code run.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey

  # Because the last fsync didn't do anything, here the file content matched what
  # it was after the first fsync, before the holes were punched, and not what it
  # was after the holes were punched.
  echo "File content after:"
  od -t x1 $SCRATCH_MNT/foo

This issue has been around since 2012, when the punch hole implementation
was added, commit 2aaa665581 ("Btrfs: add hole punching").

A test case for xfstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Josef Bacik
0c0ef4bc84 Btrfs: abort the transaction if we fail to update the free space cache inode
Our gluster boxes were hitting a problem where they'd run out of space when
updating the block group cache and therefore wouldn't be able to update the free
space inode.  This is a problem because this is how we invalidate the cache and
protect ourselves from errors further down the stack, so if this fails we have
to abort the transaction so we make sure we don't end up with stale free space
cache.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Filipe Manana
4d884fceaa Btrfs: fix fsync race leading to ordered extent memory leaks
We can have multiple fsync operations against the same file during the
same transaction and they can collect the same ordered extents while they
don't complete (still accessible from the inode's ordered tree). If this
happens, those ordered extents will never get their reference counts
decremented to 0, leading to memory leaks and inode leaks (an iput for an
ordered extent's inode is scheduled only when the ordered extent's refcount
drops to 0). The following sequence diagram explains this race:

         CPU 1                                         CPU 2

btrfs_sync_file()

                                                 btrfs_sync_file()

  mutex_lock(inode->i_mutex)
  btrfs_log_inode()
    btrfs_get_logged_extents()
      --> collects ordered extent X
      --> increments ordered
          extent X's refcount
    btrfs_submit_logged_extents()
  mutex_unlock(inode->i_mutex)

                                                   mutex_lock(inode->i_mutex)
  btrfs_sync_log()
     btrfs_wait_logged_extents()
       --> list_del_init(&ordered->log_list)
                                                     btrfs_log_inode()
                                                       btrfs_get_logged_extents()
                                                         --> Adds ordered extent X
                                                             to logged_list because
                                                             at this point:
                                                             list_empty(&ordered->log_list)
                                                             && test_bit(BTRFS_ORDERED_LOGGED,
                                                                         &ordered->flags) == 0
                                                         --> Increments ordered extent
                                                             X's refcount
       --> check if ordered extent's io is
           finished or not, start it if
           necessary and wait for it to finish
       --> sets bit BTRFS_ORDERED_LOGGED
           on ordered extent X's flags
           and adds it to trans->ordered
  btrfs_sync_log() finishes

                                                       btrfs_submit_logged_extents()
                                                     btrfs_log_inode() finishes
                                                   mutex_unlock(inode->i_mutex)

btrfs_sync_file() finishes

                                                   btrfs_sync_log()
                                                      btrfs_wait_logged_extents()
                                                        --> Sees ordered extent X has the
                                                            bit BTRFS_ORDERED_LOGGED set in
                                                            its flags
                                                        --> X's refcount is untouched
                                                   btrfs_sync_log() finishes

                                                 btrfs_sync_file() finishes

btrfs_commit_transaction()
  --> called by transaction kthread for e.g.
  btrfs_wait_pending_ordered()
    --> waits for ordered extent X to
        complete
    --> decrements ordered extent X's
        refcount by 1 only, corresponding
        to the increment done by the fsync
        task ran by CPU 1

In the scenario of the above diagram, after the transaction commit,
the ordered extent will remain with a refcount of 1 forever, leaking
the ordered extent structure and preventing the i_count of its inode
from ever decreasing to 0, since the delayed iput is scheduled only
when the ordered extent's refcount drops to 0, preventing the inode
from ever being evicted by the VFS.

Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
mean that an ordered extent is already being processed by an fsync call,
which will attach it to the current transaction, preventing it from being
collected by subsequent fsync operations against the same inode.

This race was introduced with the following change (added in 3.19 and
backported to stable 3.18 and 3.17):

  Btrfs: make sure logged extents complete in the current transaction V3
  commit 50d9aa99bd

I ran into this issue while running xfstests/generic/113 in a loop, which
failed about 1 out of 10 runs with the following warning in dmesg:

[ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
[ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
ppy e1000 scsi_mod [last unloaded: btrfs]
[ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2612.457709]  0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
[ 2612.459829]  0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
[ 2612.461564]  ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
[ 2612.463163] Call Trace:
[ 2612.463719]  [<ffffffff8142425e>] dump_stack+0x4c/0x65
[ 2612.464789]  [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb
[ 2612.466026]  [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs]
[ 2612.467247]  [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c
[ 2612.468416]  [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs]
[ 2612.469625]  [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
[ 2612.471251]  [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
[ 2612.472536]  [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26
[ 2612.473742]  [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs]
[ 2612.475477]  [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba
[ 2612.476695]  [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs]
[ 2612.477911]  [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef
[ 2612.479106]  [<ffffffff811540e2>] kill_anon_super+0x13/0x1e
[ 2612.480226]  [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
[ 2612.481471]  [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50
[ 2612.482686]  [<ffffffff811547a7>] deactivate_super+0x3f/0x43
[ 2612.483791]  [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78
[ 2612.484842]  [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14
[ 2612.485900]  [<ffffffff8105d019>] task_work_run+0x8f/0xbc
[ 2612.486960]  [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b
[ 2612.488083]  [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 2612.489333]  [<ffffffff8142a17f>] int_signal+0x12/0x17
[ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
[ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.  Have a nice day...

Kmemleak confirmed the ordered extent leak (and btrfs inode specific
structures such as delayed nodes):

$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff880154290db0 (size 576):
  comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
  hex dump (first 32 bytes):
    01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff  .@.........N....
    00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff  ..........)T....
  backtrace:
    [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
    [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
    [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
    [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
    [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
    [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
    [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
    [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff88014ef11db0 (size 576):
  comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
  hex dump (first 32 bytes):
    02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff  ...........N....
  backtrace:
    [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
    [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
    [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
    [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
    [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
    [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
    [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
    [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
    [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
    [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
    [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
    [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800336feda8 (size 584):
  comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
  hex dump (first 32 bytes):
    00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00  .@>........B....
    00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8114eb34>] create_object+0x172/0x29a
    [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41
    [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18
    [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198
    [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
    [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
    [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
    [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47
    [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36
    [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
    [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d
    [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs]
    [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e
    [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff
    [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12
    [<ffffffff81429e92>] system_call_fastpath+0x12/0x17

CC: <stable@vger.kernel.org> # 3.19, 3.18 and 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Trond Myklebust
7c0af9ffb7 NFSv4: Don't call put_rpccred() under the rcu_read_lock()
put_rpccred() can sleep.

Fixes: 8f649c3762 ("NFSv4: Fix the locking in nfs_inode_reclaim_delegation()")
Cc: stable@vger.kernel.org # 2.6.35+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-01 23:23:07 -05:00
Trond Myklebust
fa9233699c NFS: Don't require a filehandle to refresh the inode in nfs_prime_dcache()
If the server does not return a valid set of attributes that we can
use to either create a file or refresh the inode, then there is no
value in calling nfs_prime_dcache().

However if we're just refreshing the inode using the attributes that
the server returned, then it shouldn't matter whether or not we have
a filehandle, as long as we check the fsid+fileid combination.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-01 23:23:07 -05:00
Trond Myklebust
1ae04b2523 NFSv3: Use the readdir fileid as the mounted-on-fileid
When we call readdirplus, set the fileid normally returned by readdir
as the mounted-on-fileid, since that is commonly the case if there is
a mountpoint. To ensure that we get it right, we only set the flag if
the readdir fileid differs from the one returned in the readdirplus
attributes.

This again means that we can avoid the issues described in commit
2ef47eb1ae ("NFS: Fix use of nfs_attr_use_mounted_on_fileid()"),
which only fixed NFSv4.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-01 23:23:07 -05:00
Trond Myklebust
6c441c254e NFS: Don't invalidate a submounted dentry in nfs_prime_dcache()
If we're traversing a directory which contains a submounted filesystem,
or one that has a referral, the NFS server that is processing the READDIR
request will often return information for the underlying (mounted-on)
directory. It may, or may not, also return filehandle information.

If this happens, and the lookup in nfs_prime_dcache() returns the
dentry for the submounted directory, the filehandle comparison will
fail, and we call d_invalidate(). Post-commit 8ed936b567
("vfs: Lazily remove mounts on unlinked files and directories."), this
means the entire subtree is unmounted.

The following minimal patch addresses this problem by punting on
the invalidation if there is a submount.

Kudos to Neil Brown <neilb@suse.de> for having tracked down this
issue (see link).

Reported-by: Nix <nix@esperi.org.uk>
Link: http://lkml.kernel.org/r/87iofju9ht.fsf@spindle.srvr.nix
Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-01 23:23:06 -05:00
Trond Myklebust
3235b40303 NFSv4: Set a barrier in the update_changeattr() helper
Ensure that we don't regress the changes that were made to the
directory.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:23:06 -05:00
Trond Myklebust
92d64e47b6 NFS: Fix nfs_post_op_update_inode() to set an attribute barrier
nfs_post_op_update_inode() is called after a self-induced attribute
update. Ensure that it also sets the barrier.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:23:06 -05:00
Trond Myklebust
00fb4c9f84 NFS: Remove size hack in nfs_inode_attrs_need_update()
Prior to this patch, we used to always OK attribute updates that extended
the file size on the assumption that we might be performing writeback.
Now that we have attribute barriers to protect the writeback related updates,
we should remove this hack, as it can cause truncate() operations to
apparently be reverted if/when a readahead or getattr RPC call races
with our on-the-wire SETATTR.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:23:06 -05:00
Trond Myklebust
8f8ba1d739 NFSv4: Add attribute update barriers to delegreturn and pNFS layoutcommit
Ensure that other operations that race with delegreturn and layoutcommit
cannot revert the attribute updates that were made on the server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:23:06 -05:00
Trond Myklebust
a08a8cd375 NFS: Add attribute update barriers to NFS writebacks
Ensure that other operations that race with our write RPC calls
cannot revert the file size updates that were made on the server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:23:06 -05:00
Trond Myklebust
f506200346 NFS: Set an attribute barrier on all updates
Ensure that we update the attribute barrier even if there were no
invalidations, provided that this value is newer than the old one.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:23:06 -05:00
Trond Myklebust
f044636d97 NFS: Add attribute update barriers to nfs_setattr_update_inode()
Ensure that other operations which raced with our setattr RPC call
cannot revert the file attribute changes that were made on the server.
To do so, we artificially bump the attribute generation counter on
the inode so that all calls to nfs_fattr_init() that precede ours
will be dropped.

The motivation for the patch came from Chuck Lever's reports of readaheads
racing with truncate operations and causing the file size to be reverted.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:23:05 -05:00
Trond Myklebust
140e049c64 NFS: Add a helper to set attribute barriers
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:23:05 -05:00
Trond Myklebust
aa5accea40 NFS: Ensure that buffered writes wait for O_DIRECT writes to complete
The O_DIRECT code will grab the inode->i_mutex and flush out buffered
writes, before scheduling a read or a write. However there is no
equivalent in the buffered write code to wait for O_DIRECT to complete.

Fixes a reported issue in xfstests generic/133, when first performing an
O_DIRECT write followed by a buffered write.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-03-01 23:22:40 -05:00
Linus Torvalds
2aaeb784bf xfs: fixes for v4.0-rc2
This update contains:
 o ensure quota type is reset in on-disk dquots
 o fix missing partial EOF block data flush on truncate extension
 o fix transaction leak in error handling for new pnfs block layout
   support
 o add missing target_ip check to RENAME_EXCHANGE
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJU78w2AAoJEK3oKUf0dfod68IQALzcN8py4QxvmxVXf8F7+ymo
 PrUc/ZiP8EOS+q2wk4V0RgyoCAFA02pFjCEpWVm3PBdyfsd9DC12w7VYBlbDMO8f
 wApPots48NbqYVQA2+YLzC2+dgHwxLWzzJFyS6jDb/xtrVarHZtbhJU6hvl3a1gH
 8RwEW+mplMmIN8Qh7vxJ2/2K+97lfS2AW0jAnnOZKCsx98XWvSgeCk+3VszwZWjD
 obQn2WrvlfUSSERs0z2sygx5GxR/3Wnm5LrzpiX/+gH6LdPED53o6K/tKf5ncbmF
 maXkYUMxvTs3tOO9ZPohtL4Zc9JarPu2U6sKmMxULOaRgZLwmk6W2cyoCbdW2du5
 0ardLB89fUvGCJGMXojVtxZ6BX8IEoyhSDUX1qGF9/HFr0Rz5zIkeeqAWkj89+Cj
 VYvR/AmLBYwdaUPL+aHmG3P6B07u42n4650UQIVYw29rGEpxYOaBr7BAEYgyWFoM
 Omizf05rsz5aAxXCTjfUl+s9VsO6H0lNCjRyNs+QRIqkGf9rgxJGIAJuoh+bNNOm
 +WcId+5BPInuAy1YFP9Z02fe1NqIkSihTbL6daIlGIYralauXG+wyrsm9DaMsNSq
 VPul6HFMUwv2g5ECjvhiGZcvElOcBKcVQEUBJP3izFczP9o2i5NKcIOVFW/AxwTZ
 NW1qOYsLAQmD/hYpx1p2
 =kTai
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fixes from Dave Chinner:
 "These are fixes for regressions/bugs introduced in the 4.0 merge cycle
  and problems discovered during the merge window that need to be pushed
  back to stable kernels ASAP.

  This contains:
   - ensure quota type is reset in on-disk dquots
   - fix missing partial EOF block data flush on truncate extension
   - fix transaction leak in error handling for new pnfs block layout
     support
   - add missing target_ip check to RENAME_EXCHANGE"

* tag 'xfs-for-linus-4.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: cancel failed transaction in xfs_fs_commit_blocks()
  xfs: Ensure we have target_ip for RENAME_EXCHANGE
  xfs: ensure truncate forces zeroed blocks to disk
  xfs: Fix quota type in quota structures when reusing quota file
2015-02-28 10:06:33 -08:00
Ryusuke Konishi
957ed60b53 nilfs2: fix potential memory overrun on inode
Each inode of nilfs2 stores a root node of a b-tree, and it turned out to
have a memory overrun issue:

Each b-tree node of nilfs2 stores a set of key-value pairs and the number
of them (in "bn_nchildren" member of nilfs_btree_node struct), as well as
a few other "bn_*" members.

Since the value of "bn_nchildren" is used for operations on the key-values
within the b-tree node, it can cause memory access overrun if a large
number is incorrectly set to "bn_nchildren".

For instance, nilfs_btree_node_lookup() function determines the range of
binary search with it, and too large "bn_nchildren" leads
nilfs_btree_node_get_key() in that function to overrun.

As for intermediate b-tree nodes, this is prevented by a sanity check
performed when each node is read from a drive, however, no sanity check
has been done for root nodes stored in inodes.

This patch fixes the issue by adding missing sanity check against b-tree
root nodes so that it's called when on-memory inodes are read from ifile,
inode metadata file.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:51 -08:00
Trond Myklebust
be36e185bd NFSv4: nfs4_open_recover_helper() must set share access
The share access mode is now specified as an argument in the nfs4_opendata,
and so nfs4_open_recover_helper() needs to call nfs4_map_atomic_open_share()
in order to set it.

Fixes: 6ae373394c ("NFSv4.1: Ask for no delegation on OPEN if using O_DIRECT")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-27 17:10:40 -05:00
Andrew Elble
c876486be1 nfsd: fix clp->cl_revoked list deletion causing softlock in nfsd
commit 2d4a532d38 ("nfsd: ensure that clp->cl_revoked list is
protected by clp->cl_lock") removed the use of the reaplist to
clean out clp->cl_revoked. It failed to change list_entry() to
walk clp->cl_revoked.next instead of reaplist.next

Fixes: 2d4a532d38 ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
Cc: stable@vger.kernel.org
Reported-by: Eric Meddaugh <etmsys@rit.edu>
Tested-by: Eric Meddaugh <etmsys@rit.edu>
Signed-off-by: Andrew Elble <aweits@rit.edu>
Reviewed-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-02-26 15:32:24 -05:00
Linus Torvalds
7dac5cb1bc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "I'm still testing more fixes, but I wanted to get out the fix for the
  btrfs raid5/6 memory corruption I mentioned in my merge window pull"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix allocation size calculations in alloc_btrfs_bio
2015-02-26 10:34:24 -08:00
Miklos Szeredi
aa991b3b26 fuse: set stolen page uptodate
Regular pipe buffers' ->steal method (generic_pipe_buf_steal()) doesn't set
PG_uptodate.

Don't warn on this condition, just set the uptodate flag.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2015-02-26 11:45:47 +01:00
Miklos Szeredi
0d2783626a fuse: notify: don't move pages
fuse_try_move_page() is not prepared for replacing pages that have already
been read.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2015-02-26 11:45:47 +01:00
Colin Ian King
2a559a8bde eCryptfs: ensure copy to crypt_stat->cipher does not overrun
The patch 237fead619: "[PATCH] ecryptfs: fs/Makefile and
fs/Kconfig" from Oct 4, 2006, leads to the following static checker
warning:

  fs/ecryptfs/crypto.c:846 ecryptfs_new_file_context()
  error: off-by-one overflow 'crypt_stat->cipher' size 32.  rl = '0-32'

There is a mismatch between the size of ecryptfs_crypt_stat.cipher
and ecryptfs_mount_crypt_stat.global_default_cipher_name causing the
copy of the cipher name to cause a off-by-one string copy error. This
fix ensures the space reserved for this string is the same size including
the trailing zero at the end throughout ecryptfs.

This fix avoids increasing the size of ecryptfs_crypt_stat.cipher
and also ecryptfs_parse_tag_70_packet_silly_stack.cipher_string and instead
reduces the of ECRYPTFS_MAX_CIPHER_NAME_SIZE to 31 and includes the + 1 for
the end of string terminator.

NOTE: An overflow is not possible in practice since the value copied
into global_default_cipher_name is validated by
ecryptfs_code_for_cipher_string() at mount time. None of the allowed
cipher strings are long enough to cause the potential buffer overflow
fixed by this patch.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[tyhicks: Added the NOTE about the overflow not being triggerable]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2015-02-24 19:23:28 -06:00
Eric Sandeen
83d5f01858 xfs: cancel failed transaction in xfs_fs_commit_blocks()
If xfs_trans_reserve fails we don't cancel the transaction,
and we'll leak the allocated transaction pointer.

Spotted by Coverity.

Signed-off-by: Eric Sandeen <ssandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-24 10:15:18 +11:00
Eric Sandeen
fc921566f4 xfs: Ensure we have target_ip for RENAME_EXCHANGE
We shouldn't get here with RENAME_EXCHANGE set and no
target_ip, but let's be defensive, because xfs_cross_rename()
will dereference it.

Spotted by Coverity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-24 10:12:55 +11:00
Dave Chinner
5885ebda87 xfs: ensure truncate forces zeroed blocks to disk
A new fsync vs power fail test in xfstests indicated that XFS can
have unreliable data consistency when doing extending truncates that
require block zeroing. The blocks beyond EOF get zeroed in memory,
but we never force those changes to disk before we run the
transaction that extends the file size and exposes those blocks to
userspace. This can result in the blocks not being correctly zeroed
after a crash.

Because in-memory behaviour is correct, tools like fsx don't pick up
any coherency problems - it's not until the filesystem is shutdown
or the system crashes after writing the truncate transaction to the
journal but before the zeroed data in the page cache is flushed that
the issue is exposed.

Fix this by also flushing the dirty data in memory region between
the old size and new size when we've found blocks that need zeroing
in the truncate process.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23 22:37:08 +11:00
Jan Kara
dfcc70a8c8 xfs: Fix quota type in quota structures when reusing quota file
For filesystems without separate project quota inode field in the
superblock we just reuse project quota file for group quotas (and vice
versa) if project quota file is allocated and we need group quota file.
When we reuse the file, quota structures on disk suddenly have wrong
type stored in d_flags though. Nobody really cares about this (although
structure type reported to userspace was wrong as well) except
that after commit 14bf61ffe6 (quota: Switch ->get_dqblk() and
->set_dqblk() to use bytes as space units) assertion in
xfs_qm_scall_getquota() started to trigger on xfs/106 test (apparently I
was testing without XFS_DEBUG so I didn't notice when submitting the
above commit).

Fix the problem by properly resetting ddq->d_flags when running quotacheck
for a quota file.

CC: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-02-23 22:34:17 +11:00
Linus Torvalds
feaf222925 Ext4 bug fixes for 3.20. We also reserved code points for encryption
and read-only images (for which the implementation is mostly just the
 reserved code point for a read-only feature :-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJU6lssAAoJENNvdpvBGATwpsEQAOcpCqj0gp/istbsoFpl5v5K
 +BU2aPvR5CPLtUQz9MqrVF5/6zwDbHGN+GIB6CEmh/qHIVQAhhS4XR+opSc7qqUr
 fAQ1AhL5Oh8Dyn9DRy5Io8oRv+wo5lRdD7aG7SPiizCMRQ34JwJ2sWIAwbP2Ea7W
 Xg51v3LWEu+UpqpgY3YWBoJKHj4hXwFvTVOCHs94239Y2zlcg2c4WwbKPzkvPcV/
 TvvZOOctty+l3FOB2bqFj3VnvywQmNv8/OixKjSprxlR7nuQlhKaLTWCtRjFbND4
 J/rk2ls5Bl79dnMvyVfV5ghpmGYBf5kkXCP716YsQkRCZUfNVrTOPJrNHZtYilAb
 opRo2UjAyTWxZBvyssnCorHJZUdxlYeIuSTpaG0zUbR0Y6p/7qd31F5k41GbBCFf
 B0lV3IaiVnXk23S2jFVHGhrzoKdFqu30tY7LMaO4xyGVMigOZJyBu8TZ7Utj9HmW
 /4GfjlvYqlfB7p+6yBkDv/87hjdmfMWIw48A7xWCiIeguQhB79gwTV7uAHVtgfng
 h5RF2EH/fx5klbAZx9vlaAh3pGFBHbh9fkeBmW9qNm7glz7aMUuxQaSo6X8HrCAJ
 LrECgDGbuiOHnMYuzZRERZiqwLB7JT82C1xopGzefsE/i0kN1eMjITkfggjQ5whu
 caLPn49tAb9U8P6TsPeE
 =PF+t
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Ext4 bug fixes.

  We also reserved code points for encryption and read-only images (for
  which the implementation is mostly just the reserved code point for a
  read-only feature :-)"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix indirect punch hole corruption
  ext4: ignore journal checksum on remount; don't fail
  ext4: remove duplicate remount check for JOURNAL_CHECKSUM change
  ext4: fix mmap data corruption in nodelalloc mode when blocksize < pagesize
  ext4: support read-only images
  ext4: change to use setup_timer() instead of init_timer()
  ext4: reserve codepoints used by the ext4 encryption feature
  jbd2: complain about descriptor block checksum errors
2015-02-22 18:05:13 -08:00
Linus Torvalds
be5e6616dd Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted stuff from this cycle.  The big ones here are multilayer
  overlayfs from Miklos and beginning of sorting ->d_inode accesses out
  from David"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits)
  autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
  procfs: fix race between symlink removals and traversals
  debugfs: leave freeing a symlink body until inode eviction
  Documentation/filesystems/Locking: ->get_sb() is long gone
  trylock_super(): replacement for grab_super_passive()
  fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
  SELinux: Use d_is_positive() rather than testing dentry->d_inode
  Smack: Use d_is_positive() rather than testing dentry->d_inode
  TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR()
  Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode
  Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb
  VFS: Split DCACHE_FILE_TYPE into regular and special types
  VFS: Add a fallthrough flag for marking virtual dentries
  VFS: Add a whiteout dentry type
  VFS: Introduce inode-getting helpers for layered/unioned fs environments
  Infiniband: Fix potential NULL d_inode dereference
  posix_acl: fix reference leaks in posix_acl_create
  autofs4: Wrong format for printing dentry
  ...
2015-02-22 17:42:14 -08:00
Al Viro
0a280962dc autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
X-Coverup: just ask spender
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:43:34 -05:00
Al Viro
7e0e953bb0 procfs: fix race between symlink removals and traversals
use_pde()/unuse_pde() in ->follow_link()/->put_link() resp.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:43:12 -05:00
Al Viro
0db59e5929 debugfs: leave freeing a symlink body until inode eviction
As it is, we have debugfs_remove() racing with symlink traversals.
Supply ->evict_inode() and do freeing there - inode will remain
pinned until we are done with the symlink body.

And rip the idiocy with checking if dentry is positive right after
we'd verified debugfs_positive(), which is a stronger check...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:43 -05:00
Konstantin Khlebnikov
eb6ef3df4f trylock_super(): replacement for grab_super_passive()
I've noticed significant locking contention in memory reclaimer around
sb_lock inside grab_super_passive(). Grab_super_passive() is called from
two places: in icache/dcache shrinkers (function super_cache_scan) and
from writeback (function __writeback_inodes_wb). Both are required for
progress in memory allocator.

Grab_super_passive() acquires sb_lock to increment sb->s_count and check
sb->s_instances. It seems sb->s_umount locked for read is enough here:
super-block deactivation always runs under sb->s_umount locked for write.
Protecting super-block itself isn't a problem: in super_cache_scan() sb
is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
are still active. Inside writeback super-block comes from inode from bdi
writeback list under wb->list_lock.

This patch removes locking sb_lock and checks s_instances under s_umount:
generic_shutdown_super() unlinks it under sb->s_umount locked for write.
New variant is called trylock_super() and since it only locks semaphore,
callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
they're done.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:42 -05:00
David Howells
54f2a2f427 fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
Fanotify probably doesn't want to watch autodirs so make it use d_can_lookup()
rather than d_is_dir() when checking a dir watch and give an error on fake
directories.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:42 -05:00
David Howells
ce40fa78ef Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
Fix up the following scripted S_ISDIR/S_ISREG/S_ISLNK conversions (or lack
thereof) in cachefiles:

 (1) Cachefiles mostly wants to use d_can_lookup() rather than d_is_dir() as
     it doesn't want to deal with automounts in its cache.

 (2) Coccinelle didn't find S_IS* expressions in ASSERT() statements in
     cachefiles.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:41 -05:00
David Howells
e36cb0b89c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}

[AV: overlayfs parts skipped]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:41 -05:00
David Howells
44bdb5e5f6 VFS: Split DCACHE_FILE_TYPE into regular and special types
Split DCACHE_FILE_TYPE into DCACHE_REGULAR_TYPE (dentries representing regular
files) and DCACHE_SPECIAL_TYPE (representing blockdev, chardev, FIFO and
socket files).

d_is_reg() and d_is_special() are added to detect these subtypes and
d_is_file() is left as the union of the two.

This allows a number of places that use S_ISREG(dentry->d_inode->i_mode) to
use d_is_reg(dentry) instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:38 -05:00
David Howells
df1a085af1 VFS: Add a fallthrough flag for marking virtual dentries
Add a DCACHE_FALLTHRU flag to indicate that, in a layered filesystem, this is
a virtual dentry that covers another one in a lower layer that should be used
instead.  This may be recorded on medium if directory integration is stored
there.

The flag can be set with d_set_fallthru() and tested with d_is_fallthru().

Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:38 -05:00
Linus Torvalds
93aaa830fc xfs: pnfs block layout support for 3.20-rc1
This update contains the implementation of the PNFS server export
 methods that enable use of XFS filesystems as a block layout target.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJU58orAAoJEK3oKUf0dfodFyAQAKqC+Iez1rEMr0aW5WzEFjTO
 gHoBxQgfz/b2gMntPGbcmMnhRV4LL5/anjRMqU3R4uqTPigskI0+ylQakUKoKgZq
 yV1MnOeZvv4TIqK45uoesO3ractDdcL84HM7vLF/tlgvNMqDLpLiZlHl+1gEWig6
 fMXAcpsp7J7XhGsI5dRDtt5sEuWUUeqSvtiZlzponvLJz//J2JfOe/Z0UzkNddQr
 Ea7BA/ZQuiN2m3GgXykTljt1i7GuA2HCK0oLzgXpsIblrHoYyP67Clf8TnlG4RN3
 a4GsdlHd/0FRa0M28eHh5HND89giMiCDcJbESaR5lAiornwzFYaBF/2cj3M8Jbvr
 Rr9rhMrD2WRL1Z7Kgv8MDiOd9YpTS12VjSv7n5p4Y1H90USJQutaPYuYdAA2/SHn
 L4iXVJ5szgPKF6QLFAWubVYn/8NeSRU9VDVXrUb/pQsbbF/sfDtVzwQhouwJmQ2z
 II9nyNwuqev3Os0ODv22YQAGqRkpWN1u/S266AOr7xForCA9ZO31lAYbQ4YS1Gwe
 Wbvhw3NXRBqfI3ytm7faGnX9D6NaW/2xvkW2odoBH3AiS7mAYN+hzXi4QZgwuPej
 bbkEJsG4hcyEmUqmy/Bes+jNhiI6h48G9vKxBaurV07vV7kwoDzrYcZAt383sjtg
 k7kxPPdtQphr+7Ckudtg
 =ujZQ
 -----END PGP SIGNATURE-----

Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs pnfs block layout support from Dave Chinner:
 "This contains the changes to XFS needed to support the PNFS block
  layout server that you pulled in through Bruce's NFS server tree
  merge.

  I originally thought that I'd need to merge changes into the NFS
  server side, but Bruce had already picked them up and so this is
  purely changes to the fs/xfs/ codebase.

  Summary:

  This update contains the implementation of the PNFS server export
  methods that enable use of XFS filesystems as a block layout target"

* tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: recall pNFS layouts on conflicting access
  xfs: implement pNFS export operations
2015-02-21 14:09:38 -08:00
Linus Torvalds
24a52e412e NFS client updates for Linux 3.20
Highlights include:
 
 - Fix a use-after-free in decode_cb_sequence_args()
 - Fix a compile error when #undef CONFIG_PROC_FS
 - NFSv4.1 backchannel spinlocking issue
 - Cleanups in the NFS unstable write code requested by Linus
 - NFSv4.1 fix issues when the server denies our backchannel request
 - Cleanups in create_session and bind_conn_to_session
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU51zBAAoJEGcL54qWCgDyyvwQALp7vxXCTwC3/OSoPkhujd2C
 HGMoI2nVSl8CY0LFYxoT6auhefZCgQChmQXkGtlZwO+bkEr9rvJ6ii8lhuDR+JXF
 M0bAwOX+bNxUFkpqYYF3Q4Hi//tMJCIZdqUp2irtyFLL/qlNoN2ktoEYMIjMY5uO
 4L1fxj9KaVuMtFuqt3xeSSe41LaXxitKyefyVJfbqz5qbcPGiXzS7WKYAun4RyvM
 7rCir9kfnxxEX3+hc7xHxWeFJnW0jUMklrjNvnrubnpHE/fX2sUAnjX2hmx5bLg/
 4puxADzhPT4f3LdGDKXVWaULuTy20VksOJnB82TKJ3rLIELJixGZw88svOrFcGvt
 5CMI8BvOihwn8ov+sSj7Xedz4046btwA1YHkUcwPV3LZAlyx/FSq4nasO4Cn27yl
 OPdkcAL1YR5I83mEKA+8BOVXJuFx5vKhkwmMdReJkBmxsWbSwB/qp6caPD9DtuXI
 K0qJYWHMqN+Dv9npi0Q4WR6vmnzxV+Eq7Z2D9WPW0FL8nT3STE0eWIrNipyqOv+7
 bHptPxrrZej+i3c922bZ6hdaE2uAlwG8FPgEGhHFtm+s09RgDPDG7NiaeCutf9cQ
 9ub82Hlk9nLTqq5X3poUPV35RS6THnZcybhngNMy8F1cPSUTs+/H+l91sEW/Zhgw
 odMB/DEa9sRnZGX5rQXE
 =NieR
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull more NFS client updates from Trond Myklebust:
 "Highlights include:

   - Fix a use-after-free in decode_cb_sequence_args()
   - Fix a compile error when #undef CONFIG_PROC_FS
   - NFSv4.1 backchannel spinlocking issue
   - Cleanups in the NFS unstable write code requested by Linus
   - NFSv4.1 fix issues when the server denies our backchannel request
   - Cleanups in create_session and bind_conn_to_session"

* tag 'nfs-for-3.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.1: Clean up bind_conn_to_session
  NFSv4.1: Always set up a forward channel when binding the session
  NFSv4.1: Don't set up a backchannel if the server didn't agree to do so
  NFSv4.1: Clean up create_session
  pnfs: Refactor the *_layout_mark_request_commit to use pnfs_layout_mark_request_commit
  NFSv4: Kill unused nfs_inode->delegation_state field
  NFS: struct nfs_commit_info.lock must always point to inode->i_lock
  nfs: Can call nfs_clear_page_commit() instead
  nfs: Provide and use helper functions for marking a page as unstable
  SUNRPC: Always manipulate rpc_rqst::rq_bc_pa_list under xprt->bc_pa_lock
  SUNRPC: Fix a compile error when #undef CONFIG_PROC_FS
  NFSv4.1: Convert open-coded array allocation calls to kmalloc_array()
  NFSv4.1: Fix a kfree() of uninitialised pointers in decode_cb_sequence_args
2015-02-21 14:02:59 -08:00
Linus Torvalds
5fbe4c224c Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
 "This contains:

   - EFI fixes
   - a boot printout fix
   - ASLR/kASLR fixes
   - intel microcode driver fixes
   - other misc fixes

  Most of the linecount comes from an EFI revert"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm/ASLR: Avoid PAGE_SIZE redefinition for UML subarch
  x86/microcode/intel: Handle truncated microcode images more robustly
  x86/microcode/intel: Guard against stack overflow in the loader
  x86, mm/ASLR: Fix stack randomization on 64-bit systems
  x86/mm/init: Fix incorrect page size in init_memory_mapping() printks
  x86/mm/ASLR: Propagate base load address calculation
  Documentation/x86: Fix path in zero-page.txt
  x86/apic: Fix the devicetree build in certain configs
  Revert "efi/libstub: Call get_memory_map() to obtain map and desc sizes"
  x86/efi: Avoid triple faults during EFI mixed mode calls
2015-02-21 10:41:29 -08:00
Chris Mason
e57cf21e97 Btrfs: fix allocation size calculations in alloc_btrfs_bio
Since commit 8e5cfb55d3 (Btrfs: Make raid_map array be inlined in
btrfs_bio structure), the raid map array is allocated along with the
btrfs bio in alloc_btrfs_bio.  The calculation used to decide how much
we need to allocate was using the wrong parameter passed into the
allocation function.

The passed in real_stripes will be zero if a target replace operation
is not currently running.  We want to use total_stripes instead.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: David Sterba <dsterba@suse.cz>
Tested-by: David Sterba <dsterba@suse.cz>
2015-02-20 06:55:15 -08:00
Al Viro
ce7b9facdf Merge branch 'overlayfs-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into for-next 2015-02-20 04:58:52 -05:00
Omar Sandoval
fed0b588be posix_acl: fix reference leaks in posix_acl_create
get_acl gets a reference which we must release in the error cases.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-20 04:56:45 -05:00
Rasmus Villemoes
76bf3f6b1d autofs4: Wrong format for printing dentry
%pD for struct file*, %pd for struct dentry*.

Fixes: a455589f18 ("assorted conversions to %p[dD]")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-20 04:56:45 -05:00
Bastien Nocera
fcbc32bc6c coredump: Fix typo in comment
Signed-off-by: Bastien Nocera <hadess@hadess.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-20 04:56:44 -05:00
Kinglong Mee
acd88d4e1a fs/aio.c: Remove duplicate function name in pr_debug messages
Have defined pr_fmt as below in fs/aio.c, so remove duplicate
function name in pr_debug message.

#define pr_fmt(fmt) "%s: " fmt, __func__

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-20 04:56:44 -05:00
David Howells
112fc894a7 configfs: Fix potential NULL d_inode dereference
Code that does this:

		if (!(d_unhashed(dentry) && dentry->d_inode)) {
			...
			simple_unlink(parent->d_inode, dentry);
		}

is broken because:

    !(d_unhashed(dentry) && dentry->d_inode)

is equivalent to:

    !d_unhashed(dentry) || !dentry->d_inode

so it is possible to get into simple_unlink() with dentry->d_inode == NULL.

simple_unlink(), however, assumes dentry->d_inode cannot be NULL.

I think that what was meant is this:

    !d_unhashed(dentry) && dentry->d_inode

and that the logical-not operator or the final close-bracket was misplaced.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-20 04:56:43 -05:00
Al Viro
db671a8ecd don't bother with most of the bad_file_ops methods
Only ->open() should be there (always failing, of course).  We never
replace ->f_op of an already opened struct file, so there's no way
for any of those methods to be called.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-20 04:03:58 -05:00
Linus Torvalds
2b9fb532d4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This pull is mostly cleanups and fixes:

   - The raid5/6 cleanups from Zhao Lei fixup some long standing warts
     in the code and add improvements on top of the scrubbing support
     from 3.19.

   - Josef has round one of our ENOSPC fixes coming from large btrfs
     clusters here at FB.

   - Dave Sterba continues a long series of cleanups (thanks Dave), and
     Filipe continues hammering on corner cases in fsync and others

  This all was held up a little trying to track down a use-after-free in
  btrfs raid5/6.  It's not clear yet if this is just made easier to
  trigger with this pull or if its a new bug from the raid5/6 cleanups.
  Dave Sterba is the only one to trigger it so far, but he has a
  consistent way to reproduce, so we'll get it nailed shortly"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
  Btrfs: don't remove extents and xattrs when logging new names
  Btrfs: fix fsync data loss after adding hard link to inode
  Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
  Btrfs: account for large extents with enospc
  Btrfs: don't set and clear delalloc for O_DIRECT writes
  Btrfs: only adjust outstanding_extents when we do a short write
  btrfs: Fix out-of-space bug
  Btrfs: scrub, fix sleep in atomic context
  Btrfs: fix scheduler warning when syncing log
  Btrfs: Remove unnecessary placeholder in btrfs_err_code
  btrfs: cleanup init for list in free-space-cache
  btrfs: delete chunk allocation attemp when setting block group ro
  btrfs: clear bio reference after submit_one_bio()
  Btrfs: fix scrub race leading to use-after-free
  Btrfs: add missing cleanup on sysfs init failure
  Btrfs: fix race between transaction commit and empty block group removal
  btrfs: add more checks to btrfs_read_sys_array
  btrfs: cleanup, rename a few variables in btrfs_read_sys_array
  btrfs: add checks for sys_chunk_array sizes
  btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
  ...
2015-02-19 14:36:00 -08:00
Linus Torvalds
4533f6e27a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Sage Weil:
 "On the RBD side, there is a conversion to blk-mq from Christoph,
  several long-standing bug fixes from Ilya, and some cleanup from
  Rickard Strandqvist.

  On the CephFS side there is a long list of fixes from Zheng, including
  improved session handling, a few IO path fixes, some dcache management
  correctness fixes, and several blocking while !TASK_RUNNING fixes.

  The core code gets a few cleanups and Chaitanya has added support for
  TCP_NODELAY (which has been used on the server side for ages but we
  somehow missed on the kernel client).

  There is also an update to MAINTAINERS to fix up some email addresses
  and reflect that Ilya and Zheng are doing most of the maintenance for
  RBD and CephFS these days.  Do not be surprised to see a pull request
  come from one of them in the future if I am unavailable for some
  reason"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
  MAINTAINERS: update Ceph and RBD maintainers
  libceph: kfree() in put_osd() shouldn't depend on authorizer
  libceph: fix double __remove_osd() problem
  rbd: convert to blk-mq
  ceph: return error for traceless reply race
  ceph: fix dentry leaks
  ceph: re-send requests when MDS enters reconnecting stage
  ceph: show nocephx_require_signatures and notcp_nodelay options
  libceph: tcp_nodelay support
  rbd: do not treat standalone as flatten
  ceph: fix atomic_open snapdir
  ceph: properly mark empty directory as complete
  client: include kernel version in client metadata
  ceph: provide seperate {inode,file}_operations for snapdir
  ceph: fix request time stamp encoding
  ceph: fix reading inline data when i_size > PAGE_SIZE
  ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
  ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
  ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
  rbd: fix error paths in rbd_dev_refresh()
  ...
2015-02-19 14:14:42 -08:00
Ingo Molnar
a267b0a349 Merge branch 'tip-x86-kaslr' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp into x86/urgent
Pull ASLR and kASLR fixes from Borislav Petkov:

  - Add a global flag announcing KASLR state so that relevant code can do
    informed decisions based on its setting. (Jiri Kosina)

  - Fix a stack randomization entropy decrease bug. (Hector Marco-Gisbert)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-02-19 12:31:34 +01:00
Hector Marco-Gisbert
4e7c22d447 x86, mm/ASLR: Fix stack randomization on 64-bit systems
The issue is that the stack for processes is not properly randomized on
64 bit architectures due to an integer overflow.

The affected function is randomize_stack_top() in file
"fs/binfmt_elf.c":

  static unsigned long randomize_stack_top(unsigned long stack_top)
  {
           unsigned int random_variable = 0;

           if ((current->flags & PF_RANDOMIZE) &&
                   !(current->personality & ADDR_NO_RANDOMIZE)) {
                   random_variable = get_random_int() & STACK_RND_MASK;
                   random_variable <<= PAGE_SHIFT;
           }
           return PAGE_ALIGN(stack_top) + random_variable;
           return PAGE_ALIGN(stack_top) - random_variable;
  }

Note that, it declares the "random_variable" variable as "unsigned int".
Since the result of the shifting operation between STACK_RND_MASK (which
is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):

	  random_variable <<= PAGE_SHIFT;

then the two leftmost bits are dropped when storing the result in the
"random_variable". This variable shall be at least 34 bits long to hold
the (22+12) result.

These two dropped bits have an impact on the entropy of process stack.
Concretely, the total stack entropy is reduced by four: from 2^28 to
2^30 (One fourth of expected entropy).

This patch restores back the entropy by correcting the types involved
in the operations in the functions randomize_stack_top() and
stack_maxrandom_size().

The successful fix can be tested with:

  $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
  7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
  7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
  7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
  7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
  ...

Once corrected, the leading bytes should be between 7ffc and 7fff,
rather than always being 7fff.

Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Ismael Ripoll <iripoll@upv.es>
[ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: CVE-2015-1593
Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
Signed-off-by: Borislav Petkov <bp@suse.de>
2015-02-19 12:21:36 +01:00
Yan, Zheng
4d41cef279 ceph: return error for traceless reply race
When we receives traceless reply for request that created new inode,
we re-send a lookup request to MDS get information of the newly created
inode. (VFS expects FS' callback return an inode in create case)
This breaks one request into two requests. Other client may modify or
move to the new inode in the middle.

When the race happens, ceph_handle_notrace_create() unconditionally
links the dentry for 'create' operation to the inode returned by lookup.
This may confuse VFS when the inode is a directory (VFS does not allow
multiple linkages for directory inode).

This patch makes ceph_handle_notrace_create() when it detect a race.
This event should be rare and it happens only when we talk to old MDS.
Recent MDS does not send traceless reply for request that creates new
inode.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:40 +03:00
Yan, Zheng
5cba372c0f ceph: fix dentry leaks
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:40 +03:00
Yan, Zheng
3de22be677 ceph: re-send requests when MDS enters reconnecting stage
So that MDS can check if any request is already completed and process
completed requests in clientreplay stage. When completed requests are
processed in clientreplay stage, MDS can avoid sending traceless
replies.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:40 +03:00
Ilya Dryomov
2a0b61cefc ceph: show nocephx_require_signatures and notcp_nodelay options
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2015-02-19 13:31:40 +03:00
Yan, Zheng
bf91c31508 ceph: fix atomic_open snapdir
ceph_handle_snapdir() checks ceph_mdsc_do_request()'s return value
and creates snapdir inode if it's -ENOENT

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:39 +03:00
Yan, Zheng
2f92b3d0a9 ceph: properly mark empty directory as complete
ceph_add_cap() calls __check_cap_issue(), which clears directory
inode' complete flag. so we should set the complete flag for empty
directory should be set after calling ceph_add_cap().

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:39 +03:00
Yan, Zheng
a6a5ce4f0d client: include kernel version in client metadata
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:39 +03:00
Yan, Zheng
38c48b5f0a ceph: provide seperate {inode,file}_operations for snapdir
remove all unsupported operations from {inode,file}_operations.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:39 +03:00
Yan, Zheng
1f041a89b4 ceph: fix request time stamp encoding
struct timespec uses 'long' to present second and nanosecond. 'long'
is 64 bits on 64bits machine. ceph MDS expects time stamp to be
encoded as struct ceph_timespec, which uses 'u32' to present second
and nanosecond.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:39 +03:00
Yan, Zheng
fcc02d2a03 ceph: fix reading inline data when i_size > PAGE_SIZE
when inode has inline data but its size > PAGE_SIZE (it was truncated
to larger size), previous direct read code return -EIO. This patch adds
code to return zeros for data whose offset > PAGE_SIZE.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:39 +03:00
Yan, Zheng
86d8f67b26 ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
use an atomic variable to track number of sessions, this can avoid block
operation inside wait loops.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:38 +03:00
Yan, Zheng
c4d4a582c5 ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
we should not do block operation in wait_event_interruptible()'s condition
check function, but reading inline data can block. so move the read inline
data code to ceph_get_caps()

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:38 +03:00
Yan, Zheng
d3383a8e37 ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
check_cap_flush() calls mutex_lock(), which may block. So we can't
use it as condition check function for wait_event();

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:38 +03:00
Yan, Zheng
982d6011bc ceph: improve reference tracking for snaprealm
When snaprealm is created, its initial reference count is zero.
But in some rare cases, the newly created snaprealm is not referenced
by anyone. This causes snaprealm with zero reference count not freed.

The fix is set reference count of newly snaprealm to 1. The reference
is return the function who requests to create the snaprealm. When the
function finishes its job, it releases the reference.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:38 +03:00
Yan, Zheng
1487a688d8 ceph: properly zero data pages for file holes.
A bug is found in striped_read() of fs/ceph/file.c. striped_read() calls
ceph_zero_pape_vector_range().  The first argument, page_align + read + ret,
passed to ceph_zero_pape_vector_range() is wrong.

When a file has holes, this wrong parameter may cause memory corruption
either in kernal space or user space. Kernel space memory may be corrupted in
the case of non direct IO; user space memory may be corrupted in the case of
direct IO. In the latter case, the application doing direct IO may crash due
to memory corruption, as we have experienced.

The correct value should be initial_align + read + ret, where intial_align =
o_direct ? buf_align : io_align.  Compared with page_align, the current page
offest, initial_align is the initial page offest, which should be used to
calculate the page and offset in ceph_zero_pape_vector_range().

Reported-by: caifeng zhu <zhucaifeng@unissoft-nj.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:38 +03:00
Rickard Strandqvist
671762f807 ceph: acl: Remove unused function
Remove the function ceph_get_cached_acl() that is not used anywhere.

This was partially found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:38 +03:00
Yan, Zheng
03f4fcb028 ceph: handle SESSION_FORCE_RO message
mark session as readonly and wake up all cap waiters.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2015-02-19 13:31:37 +03:00
Trond Myklebust
71a097c6de NFSv4.1: Clean up bind_conn_to_session
We don't need to fake up an entire session in order retrieve the arguments.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-18 13:11:09 -08:00
Trond Myklebust
7e9f073887 NFSv4.1: Always set up a forward channel when binding the session
Currently, the client requests a back channel or a bidirectional
connection when binding a new TCP channel to an existing session.
Fix that to ask for a forward channel or bidirectional.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-18 12:30:52 -08:00
Trond Myklebust
b1c0df5fad NFSv4.1: Don't set up a backchannel if the server didn't agree to do so
If the server doesn't agree to out backchannel setup request, then
don't set one up.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-18 12:30:47 -08:00
Trond Myklebust
79969dd12e NFSv4.1: Clean up create_session
Don't decode directly into the shared struct session

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-18 12:28:50 -08:00
Linus Torvalds
b2b89ebfc0 File locking related fixes for v3.20 (pile #2)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU48n/AAoJEAAOaEEZVoIVtbsP/iWEnnP4ZIY8Bai32mQAVgdm
 C20aftlQvtrNWOf9SSjFIZGQDLeExk2RTZMbkJhCS4SkVjdB38mST/mBglFO5MLc
 xarz2FcAApOYAu6d2qkfze3KuCQHq4xPhDs0C2WLf0ENUOeE2nFAZcOccL2VyJvW
 RQF0AslWVhhvbaCnIpmDFx5SnL+yOuMcVJOMO5g3HPjbW8oaZWQuvjTCRxdAI2tk
 CZBZIfyve0KH6WSGHQkAlH5PU3myV3XHgZ4UHqM1nBLF0L2LyRARXGfnbzBcS+G9
 kgX/L7ohwI/VXG9MvD2IyQ7fpMyV60tHmDQBR3eqaxs4OKPD4p2c62LahGtUSxM7
 B9+WX6pypj14MQS96iVtQEHgqGDixQbmIjq+EslwvzqPZR77nYOPmDRP+sWsmok1
 tNRy8WizZPC45SO9gs7LzZQF1eFTMyalW5IZTh4UbwWRjGjJRtpdEmFSWyN6jLuL
 iJnhe39g+sQOqyPPcP6SxcZiCnLj0Y5utrDRwIMM03kKugfC80id+RDTw8I1uQ/p
 Bmch6FoGvn3jFB0O1OAxp6ZbB5KwdKBgNPfzpoK+D7kjKJSWH1tZkFpfSvINKx9g
 yxVahQkHVy9TFPY0uhA6j/IwNZ3c+wdRZ5lbpMKMS46LRvzGc3zNSCn5e6dWOBA2
 GS+K2xmkLo1pRuYv96f9
 =Gn2o
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.20-2' of git://git.samba.org/jlayton/linux

Pull file locking fixes from Jeff Layton:
 "A small set of patches to fix problems with the recent file locking
  changes that we discussed earlier this week"
"

* tag 'locks-v3.20-2' of git://git.samba.org/jlayton/linux:
  locks: fix list insertion when lock is split in two
  locks: remove conditional lock release in middle of flock_lock_file
  locks: only remove leases associated with the file being closed
  Revert "locks: keep a count of locks on the flctx lists"
2015-02-18 10:21:47 -08:00
Linus Torvalds
402521b8f7 MTD updates for 3.20-rc1
NAND:
 
  * Add new Hisilicon NAND driver for Hip04
  * Add default reboot handler, to ensure all outstanding erase transactions
    complete in time
  * jz4740: convert to use GPIO descriptor API
  * Atmel: add support for sama5d4
  * Change default bitflip threshold to 75% of correction strength
  * Miscellaneous cleanups and bugfixes
 
 SPI NOR:
 
  * Freescale QuadSPI:
    - Fix a few probe() and remove() issues
    - Add a MAINTAINERS entry for this driver
    - Tweak transfer size to increase read performance
    - Add suspend/resume support
  * Add Micron quad I/O support
  * ST FSM SPI: miscellaneous fixes
 
 JFFS2:
 
  * gracefully handle corrupted 'offset' field found on flash
 
 Other:
 
  * bcm47xxpart: add tweaks for a few new devices
  * mtdconcat: set return lengths properly for mtd_write_oob()
  * map_ram: enable use with mtdoops
  * maps: support fallback to ROM/UBI for write-protected NOR flash
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJU4qf2AAoJEFySrpd9RFgtmo4P/i7KD+Xx12SgBbO+ZUCqBJhh
 X+gorTFr0YpItdn53i1PA8t+WnnXi4BHY07Y8fCj/JL+lxzS+00156o+hsYAFWIl
 TVvjlFHxUYS/rh7plshd5kbEZunlXBOpWw2Qr4dSoIIuOChaRDm9eGNHJ75D/ImO
 Cr+83cyYAm0F+fCHavZKHUq/iFmpDcrt3vbPx/Rv51W+rs/HqPPUcKxt4iaL5Thk
 R0pkcaZHfJ+pkXfjkgRu/L35RLRVxRkycYvLlVSOyE/KqnzE1RRgFeHUYUiPeCem
 xUEoI0OqIYlR5LuKTt/NsBtz1W0Kcm3AcQDC5QliKnbGCwm9nbHAjqfraaZ4Ks2Z
 4YL/2pJCyJFT6NPjsiwiYkJOzJHvN8tLCSIQrXCtAKAkMn8YMHvWIEC/bVsAkpVq
 V3ke3gmZ8bY7sXyY+Fi5WVW4uxKCwSVtGiAw3i74v3z5hZZ818hkbtPc1J0CANiE
 iqbkLMJ5pvWuVT9V2qGlDqK1MDqNXNLXZgBfT9tJx/q5Ptitva79Ift4teRwery2
 5pD3uSaA3vJE2AGHKPfIyTDFqdDDUDCOWJIGbIKsYoKXSAmuOxuWKEhRMWeZMmjo
 o0ZOrhJqBNp4ZqvAxUddUOsGhRKNa3btPoB+IhAQG4+OBwxknsAY39BzPcBjKrkG
 iEKHgRDXXMe8W2wCalLw
 =+nRk
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20150216' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 "NAND:

   - Add new Hisilicon NAND driver for Hip04
   - Add default reboot handler, to ensure all outstanding erase
     transactions complete in time
   - jz4740: convert to use GPIO descriptor API
   - Atmel: add support for sama5d4
   - Change default bitflip threshold to 75% of correction strength
   - Miscellaneous cleanups and bugfixes

  SPI NOR:

   - Freescale QuadSPI:
   - Fix a few probe() and remove() issues
   - Add a MAINTAINERS entry for this driver
   - Tweak transfer size to increase read performance
   - Add suspend/resume support
   - Add Micron quad I/O support
   - ST FSM SPI: miscellaneous fixes

  JFFS2:

   - gracefully handle corrupted 'offset' field found on flash

  Other:

   - bcm47xxpart: add tweaks for a few new devices
   - mtdconcat: set return lengths properly for mtd_write_oob()
   - map_ram: enable use with mtdoops
   - maps: support fallback to ROM/UBI for write-protected NOR flash"

* tag 'for-linus-20150216' of git://git.infradead.org/linux-mtd: (46 commits)
  mtd: hisilicon: && vs & typo
  jffs2: fix handling of corrupted summary length
  mtd: hisilicon: add device tree binding documentation
  mtd: hisilicon: add a new NAND controller driver for hisilicon hip04 Soc
  mtd: avoid registering reboot notifier twice
  mtd: concat: set the return lengths properly
  mtd: kconfig: replace PPC_OF with PPC
  mtd: denali: remove unnecessary stubs
  mtd: nand: remove redundant local variable
  MAINTAINERS: add maintainer entry for FREESCALE QUAD SPI driver
  mtd: fsl-quadspi: improve read performance by increase AHB transfer size
  mtd: fsl-quadspi: Remove unnecessary 'map_failed' label
  mtd: fsl-quadspi: Remove unneeded success/error messages
  mtd: fsl-quadspi: Fix the error paths
  mtd: nand: omap: drop condition with no effect
  mtd: nand: jz4740: Convert to GPIO descriptor API
  mtd: nand: Request strength instead of bytes for soft BCH
  mtd: nand: default bitflip-reporting threshold to 75% of correction strength
  mtd: atmel_nand: introduce a new compatible string for sama5d4 chip
  mtd: atmel_nand: return max bitflips in all sectors in pmecc_correction()
  ...
2015-02-18 08:01:44 -08:00
Trond Myklebust
65d2918e71 Merge branch 'cleanups'
Merge cleanups requested by Linus.

* cleanups: (3 commits)
  pnfs: Refactor the *_layout_mark_request_commit to use pnfs_layout_mark_request_commit
  nfs: Can call nfs_clear_page_commit() instead
  nfs: Provide and use helper functions for marking a page as unstable
2015-02-18 07:28:37 -08:00
Tom Haynes
338d00cfef pnfs: Refactor the *_layout_mark_request_commit to use pnfs_layout_mark_request_commit
The File Layout's filelayout_mark_request_commit() is almost the
Flex File Layout's ff_layout_mark_request_commit(). And that can
be reduced by calling into nfs_request_add_commit_list().

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-18 07:20:35 -08:00
Al Viro
28444a2bde configfs_add_file: fold into its sole caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17 22:16:46 -05:00
Al Viro
1cf97d0d3a configfs: fold create_dir() into its only caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17 22:16:35 -05:00
Al Viro
c88b1e70ae configfs: configfs_create() init callback is never NULL and it never fails
... so make it return void and drop the check for it being non-NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17 22:15:47 -05:00
Linus Torvalds
533cf7aef2 Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "These are fixes for two bugs introduced during the merge window"

* 'for-3.20' of git://linux-nfs.org/~bfields/linux:
  nfsd4: fix v3-less build
  nfsd: fix comparison in fh_fsid_match()
2015-02-17 17:00:54 -08:00
Linus Torvalds
038911597e Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull lazytime mount option support from Al Viro:
 "Lazytime stuff from tytso"

* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4: add optimization for the lazytime mount option
  vfs: add find_inode_nowait() function
  vfs: add support for a lazytime mount option
2015-02-17 16:12:34 -08:00
Linus Torvalds
66dc830d14 Merge branch 'iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "More iov_iter work - missing counterpart of iov_iter_init() for
  bvec-backed ones and vfs_read_iter()/vfs_write_iter() - wrappers for
  sync calls of ->read_iter()/->write_iter()"

* 'iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: add vfs_iter_{read,write} helpers
  new helper: iov_iter_bvec()
2015-02-17 15:48:33 -08:00
Linus Torvalds
05016b0f0a Merge branch 'getname2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull getname/putname updates from Al Viro:
 "Rework of getname/getname_kernel/etc., mostly from Paul Moore.  Gets
  rid of quite a pile of kludges between namei and audit..."

* 'getname2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  audit: replace getname()/putname() hacks with reference counters
  audit: fix filename matching in __audit_inode() and __audit_inode_child()
  audit: enable filename recording via getname_kernel()
  simpler calling conventions for filename_mountpoint()
  fs: create proper filename objects using getname_kernel()
  fs: rework getname_kernel to handle up to PATH_MAX sized filenames
  cut down the number of do_path_lookup() callers
2015-02-17 15:27:47 -08:00
Linus Torvalds
c6b1de1b64 Merge branch 'debugfs_automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull debugfs patches from Al Viro:
 "debugfs patches, mostly to make it possible for something like tracefs
  to be transparently automounted on given directory in debugfs.

  New primitive in there is debugfs_create_automount(name, parent, func,
  arg), which creates a directory and makes its ->d_automount() return
  func(arg).  Another missing primitive was debugfs_create_file_size() -
  open-coded in quite a few places.  Dave's patch adds it and converts
  the open-code instances to calling it"

* 'debugfs_automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  debugfs: Provide a file creation function that also takes an initial size
  new primitive: debugfs_create_automount()
  debugfs: split end_creating() into success and failure cases
  debugfs: take mode-dependent parts of debugfs_get_inode() into callers
  fold debugfs_mknod() into callers
  fold debugfs_create() into caller
  fold debugfs_mkdir() into caller
  debugfs_mknod(): get rid useless arguments
  fold debugfs_link() into caller
  debugfs: kill __create_file()
  debugfs: split the beginning and the end of __create_file() off
  debugfs_{mkdir,create,link}(): get rid of redundant argument
2015-02-17 15:18:19 -08:00
Linus Torvalds
50652963ea Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc VFS updates from Al Viro:
 "This cycle a lot of stuff sits on topical branches, so I'll be sending
  more or less one pull request per branch.

  This is the first pile; more to follow in a few.  In this one are
  several misc commits from early in the cycle (before I went for
  separate branches), plus the rework of mntput/dput ordering on umount,
  switching to use of fs_pin instead of convoluted games in
  namespace_unlock()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch the IO-triggering parts of umount to fs_pin
  new fs_pin killing logics
  allow attaching fs_pin to a group not associated with some superblock
  get rid of the second argument of acct_kill()
  take count and rcu_head out of fs_pin
  dcache: let the dentry count go down to zero without taking d_lock
  pull bumping refcount into ->kill()
  kill pin_put()
  mode_t whack-a-mole: chelsio
  file->f_path.dentry is pinned down for as long as the file is open...
  get rid of lustre_dump_dentry()
  gut proc_register() a bit
  kill d_validate()
  ncpfs: get rid of d_validate() nonsense
  selinuxfs: don't open-code d_genocide()
2015-02-17 14:56:45 -08:00
Linus Torvalds
e2b74f232e Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - a pile of minor fs fixes and cleanups

 - kexec updates

 - random misc fixes in various places: vmcore, rbtree, eventfd, ipc, seccomp.

 - a series of python-based kgdb helper scripts

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits)
  seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNO
  samples/seccomp: improve label helper
  ipc,sem: use current->state helpers
  scripts/gdb: disable pagination while printing from breakpoint handler
  scripts/gdb: define maintainer
  scripts/gdb: convert CpuList to generator function
  scripts/gdb: convert ModuleList to generator function
  scripts/gdb: use a generator instead of iterator for task list
  scripts/gdb: ignore byte-compiled python files
  scripts/gdb: port to python3 / gdb7.7
  scripts/gdb: add basic documentation
  scripts/gdb: add lx-lsmod command
  scripts/gdb: add class to iterate over CPU masks
  scripts/gdb: add lx_current convenience function
  scripts/gdb: add internal helper and convenience function for per-cpu lookup
  scripts/gdb: add get_gdbserver_type helper
  scripts/gdb: add internal helper and convenience function to retrieve thread_info
  scripts/gdb: add is_target_arch helper
  scripts/gdb: add helper and convenience function to look up tasks
  scripts/gdb: add task iteration class
  ...
2015-02-17 14:35:02 -08:00
Fabian Frederick
0445f01a53 fs/affs/super.c: fix switch indentation
Fix checkpatch error:

  ERROR: switch and case should be at the same indent

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:53 -08:00
Fabian Frederick
0cdfe18ad5 fs/affs/inode.c: remove double extern affs_symlink_inode_operations
affs_symlink_inode_operations was already declared extern in affs.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Fabian Frederick
211c2af014 fs/affs/bitmap.c: remove unnecessary return
return is not needed at the end of function.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Fabian Frederick
b4478e3530 fs/affs/amigaffs.c: remove else after return
else is unnecessary after return -ENAMETOOLONG

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Fabian Frederick
f157853e40 fs/affs: define AFFSNAMEMAX to replace constant use
30 was used all over the place to compare name length against
AFFS maximum name length.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Fabian Frederick
eeb36f8e93 fs/affs: use unsigned int for string lengths
- Some min() were used with different types.

- Create a new variable in __affs_hash_dentry() to process
  affs_check_name()/min() return

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Fabian Frederick
4d29e571e1 fs/affs/super.c: destroy sbi mutex in affs_kill_sb()
Call mutex_destroy() on superblock mutex in affs_kill_sb() otherwise mutex
debugging code isn't able to detect that mutex is used after being freed.
(thanks to Jan Kara for complete definition).

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Fabian Frederick
92b20708f9 fs/affs/file.c: fix direct IO writes beyond EOF
Use the same fallback to normal IO in case of write
operations beyond EOF as fat direct IO. This patch fixes

fsx file -d -Z -r 4096 -w 4096

Report:
  129(129 mod 256): TRUNCATE DOWN from 0x3ff01 to 0xb3f6
  130(130 mod 256): WRITE    0x22000 thru 0x2dfff (0xc000 bytes) HOLE

Thanks to Jan for helping me on this problem.

The ideal solution suggested by Jan Kara would be to use
cont_write_begin() but affs direct_IO shouldn't be used a lot anyway...

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Fabian Frederick
afe305dcc9 fs/affs/file.c: replace if/BUG by BUG_ON
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Geert Uytterhoeven
08fe100d91 fs/affs: fix casting in printed messages
- "inode.i_ino" is "unsigned long",
  - "loff_t" is always "unsigned long long",
  - "sector_t" should be cast to "unsigned long long" for printing,
  - "u32" should not be cast to "unsigned int" for printing.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Chris Mason
e22553e2a2 eventfd: don't take the spinlock in eventfd_poll
The spinlock in eventfd_poll is trying to protect the count of events so
it can decide if it should return POLLIN, POLLERR, or POLLOUT.  But,
because of the way we drop the lock after calling poll_wait, and drop it
again before returning, we have the same pile of races with the lock as
we do with a single read of ctx->count().

This replaces the lock with a read barrier and single read.

eventfd_write does a single bump of ctx->count, so this should not add
new races with adding events.  eventfd_read is similar, it will do a
single decrement with the lock held, and so we're making the race with
concurrent readers slightly larger.

This spinlock is the top CPU user in kernel code during one of our
workloads.  Removing it gives us a ~2% boost.

[arnd@arndb.de: avoid unused variable warning]
[dan.carpenter@oracle.com: type bug in eventfd_poll()]
Signed-off-by: Chris Mason <clm@fb.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
WANG Chao
34b4776429 vmcore: fix PT_NOTE n_namesz, n_descsz overflow issue
When updating PT_NOTE header size (ie.  p_memsz), an overflow issue
happens with the following bogus note entry:

  n_namesz = 0xFFFFFFFF
  n_descsz = 0x0
  n_type   = 0x0

This kind of note entry should be dropped during updating p_memsz.  But
because n_namesz is 32bit, after (n_namesz + 3) & (~3), it's overflow to
0x0, the note entry size looks sane and reserved.

When userspace (eg.  crash utility) is trying to access such bogus note,
it could lead to an unexpected behavior (eg.  crash utility segment fault
because it's reading bogus address).

The source of bogus note hasn't been identified yet.  At least we could
drop the bogus note so user space wouldn't be surprised.

Signed-off-by: WANG Chao <chaowang@redhat.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Randy Wright <rwright@hp.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Rashika Kheria <rashika.kheria@gmail.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:52 -08:00
Fred Chou
d6bd428275 fs: fat: use MSDOS_SB macro to get msdos_sb_info
Use the MSDOS_SB macro to get msdos_sb_info, instead of coding it
directly.

Signed-off-by: Fred Chou <fred.chou.nd@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:51 -08:00
Fabian Frederick
714b71a3a9 fs/reiserfs/inode.c: replace 0 by NULL for pointers
Fix sparse warning:

  fs/reiserfs/inode.c:2769:19: warning: Using plain integer as NULL pointer

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:51 -08:00
Fabian Frederick
ed3ad79f87 fs/ufs/super.c: fix potential race condition
Let locking subsystem decide on mutex management.  As reported by Andrew
Morton this patch fixes a bug:

: lock_ufs() is assuming that on non-preempt uniprocessor, the calling
: code will run atomically up to the matching unlock_ufs().
:
: But that isn't true. The very first site I looked at (ufs_frag_map)
: does sb_bread() under lock_ufs().  And sb_bread() will call schedule(),
: very commonly.
:
: The ->mutex_owner stuff is a bit hacky but should work OK.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:51 -08:00
Fabian Frederick
61da3ae241 fs/ufs/super.c: remove unnecessary casting
Fix the following coccinelle warning:

  fs/ufs/super.c:1418:7-28: WARNING: casting value returned by memory allocation function to (struct ufs_inode_info *) is useless.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:51 -08:00
Fabian Frederick
b625032b10 fs/coda/dir.c: forward declaration clean-up
- Move operation structures to avoid forward declarations.

- Fix some checkpatch warnings:

WARNING: Missing a blank line after declarations
+		struct inode *host_inode = file_inode(host_file);
+		mutex_lock(&host_inode->i_mutex);

ERROR: that open brace { should be on the previous line
+const struct dentry_operations coda_dentry_operations =
+{

ERROR: that open brace { should be on the previous line
+const struct inode_operations coda_dir_inode_operations =
+{

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:50 -08:00
Fabian Frederick
111d639dd6 fs/befs/linuxvfs.c: remove unnecessary casting
Fix the following coccinelle warning:

  fs/befs/linuxvfs.c:278:14-36: WARNING: casting value returned by memory allocation function to (struct befs_inode_info *) is useless.

[akpm@linux-foundation.org: avoid 80-col ugliness]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-17 14:34:50 -08:00
Linus Torvalds
9cd77374f0 Merge branch 'parisc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc update from Helge Deller:
 "The major change in here is the removal of the old HP-UX compat code
  which should have made it possible to load and execute 32-bit HP-UX
  binaries on PA-RISC Linux.  Since it was never functional and since
  nobody cares about old 32-bit HPUX binaries any longer, it's now time
  to free up 3200 lines of kernel code (CONFIG_HPUX and
  CONFIG_BINFMT_SOM).

  Other than that we wire up the execveat() syscall, fix sparse errors
  and have some whitespace cleanups"

* 'parisc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  fs/binfmt_som: Drop kernel support for HP-UX SOM binaries
  parisc: Remove unused function
  parisc: macro whitespace fixes
  parisc/uaccess: fix sparse errors
  parisc: hpux - Remove HPUX syscall numbers
  parisc: hpux - Remove hpux gateway page
  parisc: hpux - Delete files in hpux subdirectory
  parisc: hpux - Do not compile hpux subdirectory
  parisc: hpux - Drop support for HP-UX binaries
  parisc: Add error checks when building up signal trampoline handler
  parisc: Wire up execveat syscall
2015-02-17 14:25:58 -08:00
Jeff Layton
2e2f756f81 locks: fix list insertion when lock is split in two
In the case where we're splitting a lock in two, the current code
the new "left" lock in the incorrect spot. It's inserted just
before "right" when it should instead be inserted just before the
new lock.

When we add a new lock, set "fl" to that value so that we can
add "left" before it.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-02-17 17:08:23 -05:00
Jeff Layton
267f112858 locks: remove conditional lock release in middle of flock_lock_file
As Linus pointed out:

    Say we have an existing flock, and now do a new one that conflicts. I
    see what looks like three separate bugs.

     - We go through the first loop, find a lock of another type, and
    delete it in preparation for replacing it

     - we *drop* the lock context spinlock.

     - BUG #1? So now there is no lock at all, and somebody can come in
    and see that unlocked state. Is that really valid?

     - another thread comes in while the first thread dropped the lock
    context lock, and wants to add its own lock. It doesn't see the
    deleted or pending locks, so it just adds it

     - the first thread gets the context spinlock again, and adds the lock
    that replaced the original

     - BUG #2? So now there are *two* locks on the thing, and the next
    time you do an unlock (or when you close the file), it will only
    remove/replace the first one.

...remove the "drop the spinlock" code in the middle of this function as
it has always been suspicious. This should eliminate the potential race
that can leave two locks for the same struct file on the list.

He also pointed out another thing as a bug -- namely that you
flock_lock_file removes the lock from the list unconditionally when
doing a lock upgrade, without knowing whether it'll be able to set the
new lock. Bruce pointed out that this is expected behavior and may help
prevent certain deadlock situations.

We may want to revisit that at some point, but it's probably best that
we do so in the context of a different patchset.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-02-17 15:23:09 -05:00
Jeff Layton
c4e136cda1 locks: only remove leases associated with the file being closed
We don't want to remove all leases just because one filp was closed.

Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-02-17 15:22:57 -05:00
David Howells
e59b4e9187 debugfs: Provide a file creation function that also takes an initial size
Provide a file creation function that also takes an initial size so that the
caller doesn't have to set i_size, thus meaning that we don't have to call
deal with ->d_inode in the callers.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17 12:21:51 -05:00
Helge Deller
35e88d5c22 fs/binfmt_som: Drop kernel support for HP-UX SOM binaries
The parisc arch has been the only user of HP-UX SOM binaries.

Support for HP-UX executables was never finished and since we now drop support
for the HP-UX compat layer anyway, it does not makes sense to keep the
BINFMT_SOM support.

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Helge Deller <deller@gmx.de>
2015-02-17 16:29:36 +01:00
Brian Norris
eb928d40a9 Merge JFFS2 updates from David Woodhouse 2015-02-16 18:05:26 -08:00
Joseph Qi
160cc26663 ocfs2: set append dio as a ro compat feature
Intruduce a bit OCFS2_FEATURE_RO_COMPAT_APPEND_DIO and check it in
write flow. If the bit is not set, fall back to the old way.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
4813962bee ocfs2: wait for orphan recovery first once append O_DIRECT write crash
If one node has crashed with orphan entry leftover, another node which do
append O_DIRECT write to the same file will override the
i_dio_orphaned_slot.  Then the old entry won't be cleaned forever.  If
this case happens, we let it wait for orphan recovery first.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
3a83b342c8 ocfs2: complete the rest request through buffer io
Complte the rest request thourgh buffer io after direct write performed.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
d943d59dd3 ocfs2: do not fallback to buffer I/O write if appending
Now we can do direct io and do not fallback to buffered IO any more in
case of append O_DIRECT write.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
49255dce65 ocfs2: allocate blocks in ocfs2_direct_IO_get_blocks
Allow blocks allocation in ocfs2_direct_IO_get_blocks.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
24c40b329e ocfs2: implement ocfs2_direct_IO_write
Implement ocfs2_direct_IO_write.  Add the inode to orphan dir first, and
then delete it once append O_DIRECT finished.

This is to make sure block allocation and inode size are consistent.

[akpm@linux-foundation.org: fix it for "block: Add discard flag to blkdev_issue_zeroout() function"]
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
ed460cffc2 ocfs2: add orphan recovery types in ocfs2_recover_orphans
Define two orphan recovery types, which indicates if need truncate file or
not.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Joseph Qi
06ee5c75b5 ocfs2: add functions to add and remove inode in orphan dir
Add functions to add inode to orphan dir and remove inode in orphan dir.
Here we do not call ocfs2_prepare_orphan_dir and ocfs2_orphan_add
directly.  Because append O_DIRECT will add inode to orphan two and may
result in more than one orphan entry for the same inode.

[akpm@linux-foundation.org: avoid dynamic stack allocation]
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Joseph Qi
026749a86e ocfs2: prepare some interfaces used in append direct io
Currently in case of append O_DIRECT write (block not allocated yet),
ocfs2 will fall back to buffered I/O.  This has some disadvantages.
Firstly, it is not the behavior as expected.  Secondly, it will consume
huge page cache, e.g.  in mass backup scenario.  Thirdly, modern
filesystems such as ext4 support this feature.

In this patch set, the direct I/O write doesn't fallback to buffer I/O
write any more because the allocate blocks are enabled in direct I/O now.

This patch (of 9):

Prepare some interfaces which will be used in append O_DIRECT write.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Matthew Wilcox
d92576f116 dax: does not work correctly with virtual aliasing caches
The DAX code accesses the underlying storage through the kernel's linear
mapping, which may not be cache-coherent with user mappings on ARM, MIPS
or SPARC.  Temporarily disable the DAX code until this problem is
resolved.

The original XIP code also had this problem, but it was never noticed.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Ross Zwisler
923ae0ff92 ext4: add DAX functionality
This is a port of the DAX functionality found in the current version of
ext2.

[matthew.r.wilcox@intel.com: heavily tweaked]
[akpm@linux-foundation.org: remap_pages went away]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00