Commit Graph

51779 Commits

Author SHA1 Message Date
Elena Reshetova
fee21fb587 lockd: convert nlm_host.h_count from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nlm_host.h_count  is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

**Important note for maintainers:

Some functions from refcount_t API defined in lib/refcount.c
have different memory ordering guarantees than their atomic
counterparts.
The full comparison can be seen in
https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
in state to be merged to the documentation tree.
Normally the differences should not matter since refcount_t provides
enough guarantees to satisfy the refcounting use cases, but in
some rare cases it might matter.
Please double check that you don't have some undocumented
memory guarantees for this variable usage.

For the nlm_host.h_count it might make a difference
in following places:
 - nlmsvc_release_host(): decrement in refcount_dec()
   provides RELEASE ordering, while original atomic_dec()
   was fully unordered. Since the change is for better, it
   should not matter.
 - nlmclnt_release_host(): decrement in refcount_dec_and_test() only
   provides RELEASE ordering and control dependency on success
   vs. fully ordered atomic counterpart. It doesn't seem to
   matter in this case since object freeing happens under mutex
   lock anyway.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Scott Mayhew
ba4a76f703 nfs/pnfs: fix nfs_direct_req ref leak when i/o falls back to the mds
Currently when falling back to doing I/O through the MDS (via
pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
without releasing the reference taken on the dreq
via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
nfs_direct_pgio_init.  It then takes another reference on the dreq via
nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
as a result the requester will become stuck in inode_dio_wait.  Once
that happens, other processes accessing the inode will become stuck as
well.

Ensure that pnfs_read_through_mds() and pnfs_write_through_mds() clean
up correctly by calling hdr->completion_ops->completion() instead of
calling hdr->release() directly.

This can be reproduced (sometimes) by performing "storage failover
takeover" commands on NetApp filer while doing direct I/O from a client.

This can also be reproduced using SystemTap to simulate a failure while
doing direct I/O from a client (from Dave Wysochanski
<dwysocha@redhat.com>):

stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'

Suggested-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: 1ca018d28d ("pNFS: Fix a memory leak when attempted pnfs fails")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Benjamin Coddington
b3dce6a2f0 pnfs/blocklayout: handle transient devices
PNFS block/SCSI layouts should gracefully handle cases where block devices
are not available when a layout is retrieved, or the block devices are
removed while the client holds a layout.

While setting up a layout segment, keep a record of an unavailable or
un-parsable block device in cache with a flag so that subsequent layouts do
not spam the server with GETDEVINFO.  We can reuse the current
NFS_DEVICEID_UNAVAILABLE handling with one variation: instead of reusing
the device, we will discard it and send a fresh GETDEVINFO after the
timeout, since the lookup and validation of the device occurs within the
GETDEVINFO response handling.

A lookup of a layout segment that references an unavailable device will
return a segment with the NFS_LSEG_UNAVAILABLE flag set.  This will allow
the pgio layer to mark the layout with the appropriate fail bit, which
forces subsequent IO to the MDS, and prevents spamming the server with
LAYOUTGET, LAYOUTRETURN.

Finally, when IO to a block device fails, look up the block device(s)
referenced by the pgio header, and mark them as unavailable.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Benjamin Coddington
d78471d32b pnfs/blocklayout: set PNFS_LAYOUTRETURN_ON_ERROR
If there's an error doing I/O to block device, and the client resends the
I/O to the MDS, the MDS must recall the layout from the client before
processing the I/O.  Let's preempt that exchange by returning the layout
before falling back to the MDS when there's an error.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Benjamin Coddington
ad6b0241c9 pnfs/blocklayout: Add module alias for LAYOUT4_SCSI
The blocklayout module contains the client support for both block and SCSI
layouts.  Add a module alias for the SCSI layout type so that the module
will be loaded for SCSI layouts.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Benjamin Coddington
e545735a32 NFS: remove unused offset arg in nfs_pgio_rpcsetup
nfs_pgio_rpcsetup() is always called with an offset of 0, so we should be
able to drop the arguement altogether.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
NeilBrown
dce2630c7d NFSv4: always set NFS_LOCK_LOST when a lock is lost.
There are 2 comments in the NFSv4 code which suggest that
SIGLOST should possibly be sent to a process.  In these
cases a lock has been lost.
The current practice is to set NFS_LOCK_LOST so that
read/write returns EIO when a lock is lost.
So change these comments to code when sets NFS_LOCK_LOST.

One case is when lock recovery after apparent server restart
fails with NFS4ERR_DENIED, NFS4ERR_RECLAIM_BAD, or
NFS4ERRO_RECLAIM_CONFLICT.  The other case is when a lock
attempt as part of lease recovery fails with NFS4ERR_DENIED.

In an ideal world, these should not happen.  However I have
a packet trace showing an NFSv4.1 session getting
NFS4ERR_BADSESSION after an extended network parition.  The
NFSv4.1 client treats this like server reboot until/unless
it get NFS4ERR_NO_GRACE, in which case it switches over to
"nograce" recovery mode.  In this network trace, the client
attempts to recover a lock and the server (incorrectly)
reports NFS4ERR_DENIED rather than NFS4ERR_NO_GRACE.  This
leads to the ineffective comment and the client then
continues to write using the OPEN stateid.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
NeilBrown
aaa1500894 nfs: remove dead code from nfs_encode_fh()
This code can never be used as the IS_AUTOMOUNT(inode)
case has already been handled.
So remove it to avoid confusion.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust
9ccee940bd Support statx() mask and query flags parameters
Support the query flags AT_STATX_FORCE_SYNC by forcing an attribute
revalidation, and AT_STATX_DONT_SYNC by returning cached attributes
only.

Use the mask to optimise away server revalidation for attributes
that are not being requested by the user.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust
8634ef5e05 NFS: Fix nfsstat breakage due to LOOKUPP
The LOOKUPP operation was inserted into the nfs4_procedures array
rather than being appended, which put /proc/net/rpc/nfs out of
whack, and broke the nfsstat utility.
Fix by moving the LOOKUPP operation to the end of the array, and
by ensuring that it keeps the same length whether or not NFSV4.1
and NFSv4.2 are compiled in.

Fixes: 5b5faaf6df ("nfs4: add NFSv4 LOOKUPP handlers")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust
82571552a0 NFSv4: Convert LOCKU to use nfs4_async_handle_exception()
Convert CLOSE so that it specifies the correct stateid and
inode for the error handling.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust
e0dba0128a NFSv4: Convert DELEGRETURN to use nfs4_handle_exception()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:29 -05:00
Trond Myklebust
b8b8d22109 NFSv4: Convert CLOSE to use nfs4_async_handle_exception()
Convert CLOSE so that it specifies the correct stateid, state and
inode for the error handling.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:28 -05:00
Trond Myklebust
7f1bda447c NFS: Add a cond_resched() to nfs_commit_release_pages()
The commit list can get very large, and so we need a cond_resched()
in nfs_commit_release_pages() in order to ensure we don't hog the CPU
for excessive periods of time.

Reported-by: Mike Galbraith <efault@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-14 23:06:28 -05:00
Linus Torvalds
75d4276e83 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - untangle sys_close() abuses in xt_bpf

 - deal with register_shrinker() failures in sget()

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"
  sget(): handle failures of register_shrinker()
  mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
2018-01-06 17:13:21 -08:00
Linus Torvalds
89876f275e for-4.15-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpPux0ACgkQxWXV+ddt
 WDs/ORAAgRtjm+OWBb80eV1xJIHGRPRaL6E4OZc6SA7DEA+oCpkkVzOHQz3PV2a2
 cAsIUvp9azZd41gzBMw8mIe4AQKLZpud+vEM7QYRlbZFtp3EWmZ1Jht4bJRxC+w7
 NjBIEx4MX2KiUeRizmo3iWBVW+RoaRVW1xvFo/k5QchhO8U74SNYzxTGVxd8S/C0
 ZanuTowdm71uCJJHkoNWArAsou40QCJOYK19WilRkrf6SGsUqc1zKArRKe2KF4GH
 Wyf4Qyp2fm8RRKLOlc9NcsVbVqVg4kBmUXbJPCvltCs+JiyfhX9hahweoHHH8kmH
 u/jR3CItVqX+Ft1WAtSpgRzxO0uGu6aVkIql0VHV6wIbGnFoJd9XQ6RPnT/awlOw
 1jx8RLOZtVehF6pjyoSngLppqCw/sYpV8QhF32dEFGentO3Wd7CVKTcMOH498dbN
 paNzcNEfnTFLbUmViOTXl8AS8VX+3PU2Mgn8W8UxcFYksoIpV9P/LBDS3iIGYMtL
 pFFC9fYeipBDOPg2NV4QfCE9ZSqm35c2kAV/hb1nmPtPz4W+Ya5v2y9RSjAU80f4
 Y8ZyePg6pjwWOp1dW+TZF0NE8ExzSvgnXAQOdZkiy4Ztc6OwTVhlwRfW1xFy2Py+
 riR87A7/mDbiR9IXHgzFZi6WjjVMHDifBKeEpu91cF9JrwJqMBc=
 =WIOv
 -----END PGP SIGNATURE-----

Merge tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We have two more fixes for 4.15, both aimed for stable.

  The leak fix is obvious, the second patch fixes a bug revealed by the
  refcount API, when it behaves differently than previous atomic_t and
  reports refs going from 0 to 1 in one case"

* tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
  btrfs: Fix flush bio leak
2018-01-05 13:02:46 -08:00
Linus Torvalds
12e971b652 Changes since last update:
- Fix resource cleanup of failed quota initialization
 - Fix integer overflow problems wrt s_maxbytes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJaS8yUAAoJEPh/dxk0SrTrrNMP+gLCitWenObhf6uA0Aysb3Vr
 EnhNFaqZA7RRLbQRwLESblvhExp9WTrtFmWOAFh1Q0ETBEIazIGkXfKDeOChxaCY
 LMPb83vQarZoV++HoiBeFbShf39dFw2ufGHyveZwvxk4kgYgQRFzIVZbRTg7CA/C
 nMLPZ9IBDBhEwnCVpH+gKJMcU6j5I9IIePwaEIKnB0o99fsEgZfnM0B4Wl0DRrzn
 nE6DOvkGZiNF4on1J2KgL2rB0r+VEyyMtBTCRs519rEaa8ACFUQDqEqoUIC92SnS
 pD/n9S2JwVH1dLX7cRoiMQcX/r4do83LlK0IvMswApMuNqYRQU6332lwosdgo7KQ
 8+antAlVKuqMAGNvhVWMy1DuaRO5gCqRwL1wpzebNHsw4eRsDD2MNkeLXbM2P2oL
 5OflIrPLMlLORlPtwbJclm8CcnQzQGMAa5yEDJcU1PIWH/urdRd+KqWQ+N0Zfj6m
 J3L4tXDY61hqwZ8BISe+/9iFDooGV/6Ri4mbez4UWiN6UfaKKokaFZzbo2n3VTb9
 Htx5KsrzslfGWAnoeIT9GnyFhT4te9IHT69jl2AorvxpmdXdfOI8TgrzS8TzuKGD
 N6TadC4IZGLLpww+rND6Bywdc8/garmFbck+/nVdMRwNAsZUE+m08OrNFMCqmYms
 p9jIA2tRh94Hu4Awi8hG
 =2rs/
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.15-fixes-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fixes from Darrick Wong:
 "I have just a few fixes for bugs and resource cleanup problems this
  week:

   - Fix resource cleanup of failed quota initialization

   - Fix integer overflow problems wrt s_maxbytes"

* tag 'xfs-4.15-fixes-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix s_maxbytes overflow problems
  xfs: quota: check result of register_shrinker()
  xfs: quota: fix missed destroy of qi_tree_lock
2018-01-05 12:59:32 -08:00
Andrea Arcangeli
0cbb4b4f4c userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails
The previous fix in commit 384632e67e ("userfaultfd: non-cooperative:
fix fork use after free") corrected the refcounting in case of
UFFD_EVENT_FORK failure for the fork userfault paths.

That still didn't clear the vma->vm_userfaultfd_ctx of the vmas that
were set to point to the aborted new uffd ctx earlier in
dup_userfaultfd.

Link: http://lkml.kernel.org/r/20171223002505.593-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-04 16:45:09 -08:00
Linus Torvalds
50d0f78f5c Merge branch 'afs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs/fscache fixes from David Howells:

 - Fix the default return of fscache_maybe_release_page() when a cache
   isn't in use - it prevents a filesystem from releasing pages. This
   can cause a system to OOM.

 - Fix a potential uninitialised variable in AFS.

 - Fix AFS unlink's handling of the nlink count. It needs to use the
   nlink manipulation functions so that inode structs of deleted inodes
   actually get scheduled for destruction.

 - Fix error handling in afs_write_end() so that the page gets unlocked
   and put if we can't fill the unwritten portion.

* 'afs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix missing error handling in afs_write_end()
  afs: Fix unlink
  afs: Potential uninitialized variable in afs_extract_data()
  fscache: Fix the default for fscache_maybe_release_page()
2018-01-03 10:58:56 -08:00
Kees Cook
e816c201ae exec: Weaken dumpability for secureexec
This is a logical revert of commit e37fdb785a ("exec: Use secureexec
for setting dumpability")

This weakens dumpability back to checking only for uid/gid changes in
current (which is useless), but userspace depends on dumpability not
being tied to secureexec.

  https://bugzilla.redhat.com/show_bug.cgi?id=1528633

Reported-by: Tom Horsley <horsley1953@gmail.com>
Fixes: e37fdb785a ("exec: Use secureexec for setting dumpability")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-03 10:13:36 -08:00
Darrick J. Wong
b4d8ad7fd3 xfs: fix s_maxbytes overflow problems
Fix some integer overflow problems if offset + count happen to be large
enough to cause an integer overflow.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-02 10:16:32 -08:00
Aliaksei Karaliou
3a3882ff26 xfs: quota: check result of register_shrinker()
xfs_qm_init_quotainfo() does not check result of register_shrinker()
which was tagged as __must_check recently, reported by sparse.

Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
[darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-02 10:16:32 -08:00
Aliaksei Karaliou
2196881566 xfs: quota: fix missed destroy of qi_tree_lock
xfs_qm_destroy_quotainfo() does not destroy quotainfo->qi_tree_lock
while destroys quotainfo->qi_quotaofflock.

Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-02 10:16:32 -08:00
Chris Mason
ec35e48b28 btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
refcounts have a generic implementation and an asm optimized one.  The
generic version has extra debugging to make sure that once a refcount
goes to zero, refcount_inc won't increase it.

The btrfs delayed inode code wasn't expecting this, and we're tripping
over the warnings when the generic refcounts are used.  We ended up with
this race:

Process A                                         Process B
                                                  btrfs_get_delayed_node()
						  spin_lock(root->inode_lock)
						  radix_tree_lookup()
__btrfs_release_delayed_node()
refcount_dec_and_test(&delayed_node->refs)
our refcount is now zero
						  refcount_add(2) <---
						  warning here, refcount
                                                  unchanged

spin_lock(root->inode_lock)
radix_tree_delete()

With the generic refcounts, we actually warn again when process B above
tries to release his refcount because refcount_add() turned into a
no-op.

We saw this in production on older kernels without the asm optimized
refcounts.

The fix used here is to use refcount_inc_not_zero() to detect when the
object is in the middle of being freed and return NULL.  This is almost
always the right answer anyway, since we usually end up pitching the
delayed_node if it didn't have fresh data in it.

This also changes __btrfs_release_delayed_node() to remove the extra
check for zero refcounts before radix tree deletion.
btrfs_get_delayed_node() was the only path that was allowing refcounts
to go from zero to one.

Fixes: 6de5f18e7b ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
CC: <stable@vger.kernel.org> # 4.12+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-02 18:00:14 +01:00
Nikolay Borisov
beed9263f4 btrfs: Fix flush bio leak
Commit e0ae999414 ("btrfs: preallocate device flush bio") reworked
the way the flush bio is allocated and used. Concretely it allocates
the bio in __alloc_device and then re-uses it multiple times with a
very simple endio routine that just calls complete() without consuming
a reference. Allocated bios by default come with a ref count of 1,
which is then consumed by the endio routine (or not, in which case they
should be bio_put by the caller). The way the impleementation works now
is that the flush bio has a refcount of 2 and we only ever bio_put it
once, leaving it to hang indefinitely. Fix this by removing the extra
bio_get in __alloc_device.

Fixes: e0ae999414 ("btrfs: preallocate device flush bio")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-02 18:00:13 +01:00
David Howells
afae457d87 afs: Fix missing error handling in afs_write_end()
afs_write_end() is missing page unlock and put if afs_fill_page() fails.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-01-02 10:02:19 +00:00
David Howells
440fbc3a8a afs: Fix unlink
Repeating creation and deletion of a file on an afs mount will run the box
out of memory, e.g.:

	dd if=/dev/zero of=/afs/scratch/m0 bs=$((1024*1024)) count=512
	rm /afs/scratch/m0

The problem seems to be that it's not properly decrementing the nlink count
so that the inode can be scrapped.

Note that this doesn't fix local creation followed by remote deletion.
That's harder to handle and will require a separate patch as we're not told
that the file has been deleted - only that the directory has changed.

Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-01-02 10:02:19 +00:00
Dan Carpenter
7888da9583 afs: Potential uninitialized variable in afs_extract_data()
Smatch warns that:

    fs/afs/rxrpc.c:922 afs_extract_data()
    error: uninitialized symbol 'remote_abort'.

Smatch is right that "remote_abort" might be uninitialized when we pass
it to afs_set_call_complete().  I don't know if that function uses the
uninitialized variable.  Anyway, the comment for rxrpc_kernel_recv_data(),
says that "*_abort should also be initialised to 0." and this patch does
that.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-01-02 10:02:19 +00:00
Linus Torvalds
fca0e39b2b Changes since last update:
- Fix a locking problem during xattr block conversion that could lead to
   the log checkpointing thread to try to write an incomplete buffer to
   disk, which leads to a corruption shutdown
 - Fix a null pointer dereference when removing delayed allocation extents
 - Remove post-eof speculative allocations when reflinking a block past
   current inode size so that we don't just leave them there and assert on
   inode reclaim
 - Relax an assert which didn't accurately reflect the way locking works
   and would trigger under heavy io load
 - Avoid infinite loop when cancelling copy on write extents after a
   writeback failure
 - Try to avoid copy on write transaction reservation overflows when
   remapping after a successful write
 - Fix various problems with the copy-on-write reservation automatic
   garbage collection not being cleaned up properly during a ro remount
 - Fix problems with rmap log items being processed in the wrong order,
   leading to corruption shutdowns
 - Fix problems with EFI recovery wherein the "remove any rmapping if
   present" mechanism wasn't actually doing anything, which would lead
   to corruption problems later when the extent is reallocated, leading
   to multiple rmaps for the same extent
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJaO+dwAAoJEPh/dxk0SrTrY8YP/R9AXH3Wt6S2QGGjZfXURa22
 /cioJKFl8hWay00ZT8Zcj4Pdx6R+stvausj5ECDvpdWZG+d28e61c1bxg+bqRYO5
 JWXikWnAa80RQ5uEjOXHoUjAgk6u6YYuQHEuHH/xA0nL4Cw98WLSzLjqk7ZU53rx
 P17dgUWWHta/w8OpxG9UG5pxvNW3VRitiyCMWxa2gzBPncHnCk3fu9lInpDzH9S+
 xakwCRtfiAykoOG/O5pnMg6vw5r6ENwK7DymxXgqF+Vv/HzgMbeJs+9UON2eACtp
 ECHGffN4pXpqWVcGDMs5cWCOfLUEjxCrotMLYpIrdZs5DptmOcOWpQpHWl4JiaXB
 rqAxx3D0Yo+00ENponM01un8UgCXF5gqsDGyTzn99aPpDVqxCJw1XmSdOXRhcnnF
 At2raUkXF+nbqaVwL3Y7ZJuOKs1hi3HpsYwwfvClR8cTFk/BaY6sQ4QnVR0Ggkg6
 8lZxeDb8VdoUjWO11sX1edwGtR8g+p3PSHiUFSnh1JsbP2I0R+TV+j5Y9rMotxFT
 Eq6+Ehp889GeSpEBCrDpMgNIABMjBxoi5JvOwXSUNhF5Rh/1Vf//7v31nXcyVlah
 a95IhCYfQLFMtaYaGr2ElvdO+Qs1+ppsD207I4H86XotjRkvD7U+mJoYm9EaujQX
 jgUDdZEsP5h5DX524VHU
 =i51V
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.15-fixes-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are some XFS fixes for 4.15-rc5. Apologies for the unusually
  large number of patches this late, but I wanted to make sure the
  corruption fixes were really ready to go.

  Changes since last update:

   - Fix a locking problem during xattr block conversion that could lead
     to the log checkpointing thread to try to write an incomplete
     buffer to disk, which leads to a corruption shutdown

   - Fix a null pointer dereference when removing delayed allocation
     extents

   - Remove post-eof speculative allocations when reflinking a block
     past current inode size so that we don't just leave them there and
     assert on inode reclaim

   - Relax an assert which didn't accurately reflect the way locking
     works and would trigger under heavy io load

   - Avoid infinite loop when cancelling copy on write extents after a
     writeback failure

   - Try to avoid copy on write transaction reservation overflows when
     remapping after a successful write

   - Fix various problems with the copy-on-write reservation automatic
     garbage collection not being cleaned up properly during a ro
     remount

   - Fix problems with rmap log items being processed in the wrong
     order, leading to corruption shutdowns

   - Fix problems with EFI recovery wherein the "remove any rmapping if
     present" mechanism wasn't actually doing anything, which would lead
     to corruption problems later when the extent is reallocated,
     leading to multiple rmaps for the same extent"

* tag 'xfs-4.15-fixes-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: only skip rmap owner checks for unknown-owner rmap removal
  xfs: always honor OWN_UNKNOWN rmap removal requests
  xfs: queue deferred rmap ops for cow staging extent alloc/free in the right order
  xfs: set cowblocks tag for direct cow writes too
  xfs: remove leftover CoW reservations when remounting ro
  xfs: don't be so eager to clear the cowblocks tag on truncate
  xfs: track cowblocks separately in i_flags
  xfs: allow CoW remap transactions to use reserve blocks
  xfs: avoid infinite loop when cancelling CoW blocks after writeback failure
  xfs: relax is_reflink_inode assert in xfs_reflink_find_cow_mapping
  xfs: remove dest file's post-eof preallocations before reflinking
  xfs: move xfs_iext_insert tracepoint to report useful information
  xfs: account for null transactions in bunmapi
  xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of an attribute
  xfs: add the ability to join a held buffer to a defer_ops
2017-12-22 12:27:27 -08:00
Darrick J. Wong
68c58e9b9a xfs: only skip rmap owner checks for unknown-owner rmap removal
For rmap removal, refactor the rmap owner checks into a separate
function, then skip the checks if we are performing an unknown-owner
removal.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:48:38 -08:00
Darrick J. Wong
33df3a9cf9 xfs: always honor OWN_UNKNOWN rmap removal requests
Calling xfs_rmap_free with an unknown owner is supposed to remove any
rmaps covering that range regardless of owner.  This is used by the EFI
recovery code to say "we're freeing this, it mustn't be owned by
anything anymore", but for whatever reason xfs_free_ag_extent filters
them out.

Therefore, remove the filter and make xfs_rmap_unmap actually treat it
as a wildcard owner -- free anything that's already there, and if
there's no owner at all then that's fine too.

There are two existing callers of bmap_add_free that take care the rmap
deferred ops themselves and use OWN_UNKNOWN to skip the EFI-based rmap
cleanup; convert these to use OWN_NULL (via helpers), and now we really
require that an RUI (if any) gets added to the defer ops before any EFI.

Lastly, now that xfs_free_extent filters out OWN_NULL rmap free requests,
growfs will have to consult directly with the rmap to ensure that there
aren't any rmaps in the grown region.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:48:38 -08:00
Darrick J. Wong
0525e952dc xfs: queue deferred rmap ops for cow staging extent alloc/free in the right order
Under the deferred rmap operation scheme, there's a certain order in
which the rmap deferred ops have to be queued to maintain integrity
during log replay.  For alloc/map operations that order is cui -> rui;
for free/unmap operations that order is cui -> rui -> efi.  However, the
initial refcount code got the ordering wrong in the free side of things
because it queued refcount free op and an EFI and the refcount free op
queued a rmap free op, resulting in the order cui -> efi -> rui.

If we fail before the efd finishes, the efi recovery will try to do a
wildcard rmap removal and the subsequent rui will fail to find the rmap
and blow up.  This didn't ever happen due to other screws up in handling
unknown owner rmap removals, but those other screw ups broke recovery in
other ways, so fix the ordering to follow the intended rules.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:48:38 -08:00
Darrick J. Wong
86d692bfad xfs: set cowblocks tag for direct cow writes too
If a user performs a direct CoW write, we end up loading the CoW fork
with preallocated extents.  Therefore, we must set the cowblocks tag so
that they can be cleared out if we run low on space.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:47:37 -08:00
Darrick J. Wong
10ddf64e42 xfs: remove leftover CoW reservations when remounting ro
When we're remounting the filesystem readonly, remove all CoW
preallocations prior to going ro.  If the fs goes down after the ro
remount, we never clean up the staging extents, which means xfs_check
will trip over them on a subsequent run.  Practically speaking, the next
mount will clean them up too, so this is unlikely to be seen.  Since we
shut down the cowblocks cleaner on remount-ro, we also have to make sure
we start it back up if/when we remount-rw.

Found by adding clonerange to fsstress and running xfs/017.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:47:32 -08:00
Darrick J. Wong
363e59baa4 xfs: don't be so eager to clear the cowblocks tag on truncate
Currently, xfs_itruncate_extents clears the cowblocks tag if i_cnextents
is zero.  This is wrong, since i_cnextents only tracks real extents in
the CoW fork, which means that we could have some delayed CoW
reservations still in there that will now never get cleaned.

Fix a further bug where we /don't/ clear the reflink iflag if there are
any attribute blocks -- really, it's only safe to clear the reflink flag
if there are no data fork extents and no cow fork extents.

Found by adding clonerange to fsstress in xfs/017.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:47:28 -08:00
Darrick J. Wong
91aae6be41 xfs: track cowblocks separately in i_flags
The EOFBLOCKS/COWBLOCKS tags are totally separate things, so track them
with separate i_flags.  Right now we're abusing IEOFBLOCKS for both,
which is totally bogus because we won't tag the inode with COWBLOCKS if
IEOFBLOCKS was set by a previous tagging of the inode with EOFBLOCKS.
Found by wiring up clonerange to fsstress in xfs/017.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-20 17:11:48 -08:00
Al Viro
9ee332d99e sget(): handle failures of register_shrinker()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-18 15:05:07 -05:00
Kees Cook
779f4e1c6c Revert "exec: avoid RLIMIT_STACK races with prlimit()"
This reverts commit 04e35f4495.

SELinux runs with secureexec for all non-"noatsecure" domain transitions,
which means lots of processes end up hitting the stack hard-limit change
that was introduced in order to fix a race with prlimit(). That race fix
will need to be redesigned.

Reported-by: Laura Abbott <labbott@redhat.com>
Reported-by: Tomáš Trnka <trnka@scm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-17 14:26:25 -08:00
Arnd Bergmann
b9f5fb1800 cramfs: fix MTD dependency
With CONFIG_MTD=m and CONFIG_CRAMFS=y, we now get a link failure:

  fs/cramfs/inode.o: In function `cramfs_mount': inode.c:(.text+0x220): undefined reference to `mount_mtd'
  fs/cramfs/inode.o: In function `cramfs_mtd_fill_super':
  inode.c:(.text+0x6d8): undefined reference to `mtd_point'
  inode.c:(.text+0xae4): undefined reference to `mtd_unpoint'

This adds a more specific Kconfig dependency to avoid the broken
configuration.

Alternatively we could make CRAMFS itself depend on "MTD || !MTD" with a
similar result.

Fixes: 99c18ce580 ("cramfs: direct memory access support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-17 12:20:58 -08:00
Linus Torvalds
73d080d374 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "The alloc_super() one is a regression in this merge window, lazytime
  thing is older..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Handle lazytime in do_mount()
  alloc_super(): do ->s_umount initialization earlier
2017-12-17 12:18:35 -08:00
Linus Torvalds
1c6b942d7d Fix a regression which caused us to fail to interpret symlinks in very
ancient ext3 file system images.  Also fix two xfstests failures, one
 of which could cause a OOPS, plus an additional bug fix caught by fuzz
 testing.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlo1y3EACgkQ8vlZVpUN
 gaNFOQf/bMf6ynai1dGGRwef+UcT874NZ2Hqm+UqI6pxusz0ZeKWm8HWfPfg31Fa
 o+OnUsZ7NXFBIHyfXKFJzdOgutjZ5eY0vMu+NrlyBdd6W+ZcHwn1PvQsLapFYvqK
 Rt+8nWTKqtnksSfh0vyODmUYgItOULOPPepjnIPm/Pd0DinJwo0GY/8MzLkz4SpX
 g6R60ou0ToEYNqBXAKIBnZ4aq8KWMtCMGcD270U5eAm/63Pt4riRwJbjITxZPAH1
 wKzivP4Ce5ce8W2g2/6mFFlBFWvtlB491T+BsgHUEv3OLze+kYS2PcxQthhEmBR8
 zeZ2o2/0tTxejE//cyJ4gCe3fYGRDg==
 =xqLC
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a regression which caused us to fail to interpret symlinks in very
  ancient ext3 file system images.

  Also fix two xfstests failures, one of which could cause an OOPS, plus
  an additional bug fix caught by fuzz testing"

* tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix crash when a directory's i_size is too small
  ext4: add missing error check in __ext4_new_inode()
  ext4: fix fdatasync(2) after fallocate(2) operation
  ext4: support fast symlinks from ext3 file systems
2017-12-17 12:14:33 -08:00
Linus Torvalds
d025fbf1a2 NFS client fixes for Linux 4.15-rc4
Stable bugfixes:
 - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
        commit in the case that there were no commit requests.
 - SUNRPC: Fix a race in the receive code path
 
 Other fixes:
 - NFS: Fix a deadlock in nfs client initialization
 - xprtrdma: Fix a performance regression for small IOs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlo0PdMACgkQ18tUv7Cl
 QOvlUg/+KoXWXNwItHIyyegYgRXcAPpaCtdnCjjOP6R9HEJ+clnLcaqDxdDKVWQ/
 oDvEcQcsBpywbUi7vVrvdar4mofwuyjXPpbcZPlDP1Ru4yyAlyylftwIuQW/nzdd
 vX2tZaVf+B9y1XvSD5NI+2EKWmp7MVrPdNhYxAB39TQZnAAvYDFHhywtZ0UR7vJt
 7YVcZoPtKUhg15jhCOr73eaCT0884/tlgedfd6DkDGR6bCtSQC2PySfqq9Lnnl/1
 ruDzzcgTARzSEzvta/uyBRspOLBHeeBhTdQUp79lMfekC4+68Tx6DFWnydIUttuE
 G7LphN6hfbJLF20U/ENb2H8v10WZsKvGEuxM+fp5PXGcIMSlX4qoJUe/egJFiiSL
 IaikgibvfiKmYSJvwdxTlOcr793X2Ej19HNciNjJQp4pviDOdZixgtGvVVHJBmh6
 LYzE5q9jgbW9wQXwTTeWHp/nyqL80NslX0UARYnS2Ua0B96GRCESXqCUFtxK6tKR
 wbYiHzKc4dOfSxpNlKI+FlX63m5oSAmTEii3ODsWZjObbwYHNX2Zqj2cVFiSLCpv
 ZXgmpNL+tL2zBWxPvn6rzYhpaXo++PqlHK7vv2QVBI6XM2J8ztpj5Wr5zneRoJaE
 ejk8nw/mR43bfdQuUGZRKh/Z+FTqL0/2WbDgJMXl09c+zRz7J2c=
 =XhEC
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "This has two stable bugfixes, one to fix a BUG_ON() when
  nfs_commit_inode() is called with no outstanding commit requests and
  another to fix a race in the SUNRPC receive codepath.

  Additionally, there are also fixes for an NFS client deadlock and an
  xprtrdma performance regression.

  Summary:

  Stable bugfixes:
   - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
     commit in the case that there were no commit requests.
   - SUNRPC: Fix a race in the receive code path

  Other fixes:
   - NFS: Fix a deadlock in nfs client initialization
   - xprtrdma: Fix a performance regression for small IOs"

* tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Fix a race in the receive code path
  nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
  xprtrdma: Spread reply processing over more CPUs
  nfs: fix a deadlock in nfs client initialization
2017-12-16 13:12:53 -08:00
Linus Torvalds
f6f3732162 Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths"
This reverts commits 5c9d2d5c26, c7da82b894, and e7fe7b5cae.

We'll probably need to revisit this, but basically we should not
complicate the get_user_pages_fast() case, and checking the actual page
table protection key bits will require more care anyway, since the
protection keys depend on the exact state of the VM in question.

Particularly when doing a "remote" page lookup (ie in somebody elses VM,
not your own), you need to be much more careful than this was.  Dave
Hansen says:

 "So, the underlying bug here is that we now a get_user_pages_remote()
  and then go ahead and do the p*_access_permitted() checks against the
  current PKRU. This was introduced recently with the addition of the
  new p??_access_permitted() calls.

  We have checks in the VMA path for the "remote" gups and we avoid
  consulting PKRU for them. This got missed in the pkeys selftests
  because I did a ptrace read, but not a *write*. I also didn't
  explicitly test it against something where a COW needed to be done"

It's also not entirely clear that it makes sense to check the protection
key bits at this level at all.  But one possible eventual solution is to
make the get_user_pages_fast() case just abort if it sees protection key
bits set, which makes us fall back to the regular get_user_pages() case,
which then has a vma and can do the check there if we want to.

We'll see.

Somewhat related to this all: what we _do_ want to do some day is to
check the PAGE_USER bit - it should obviously always be set for user
pages, but it would be a good check to have back.  Because we have no
generic way to test for it, we lost it as part of moving over from the
architecture-specific x86 GUP implementation to the generic one in
commit e585513b76 ("x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation").

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-15 18:53:22 -08:00
Linus Torvalds
dd3d66b838 CephFS inode trimming fix from Zheng, marked for stable.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJaM/Y/AAoJEEp/3jgCEfOLSu0H/iFhQS+7rnyPcb3P8/YR785H
 IMPNWv8hg4UU6MDWC3lIliAPypAkaMLuEKOZvBRsLCW5esbOTlCP7w4bmO/YCI66
 DF0JfA4AV5yXIVMAtjP2EK3sFz0eCrK6S3XP3cT+x3K5qI6zwNN3Yvj78NFcvCOz
 IBgxrlhpu7/DfBsorhKEAEHXaYE+NKJNlcGBIisvM0BNC9dcm7ufTkP7pP6mRJC0
 GjjYqh8HMe45AvvIaE7o976M1GKexEDNsncHM8VlxuwkC5hz0SNAg73J7iwcDfUe
 hqfLeHcvTOrPQ0oB4Xz0Nh6cJ7tIv3gYZ941awhmH6XZCWgZhrBaLyipIenXEHM=
 =xpe2
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "CephFS inode trimming fix from Zheng, marked for stable"

* tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client:
  ceph: drop negative child dentries before try pruning inode's alias
2017-12-15 12:48:27 -08:00
Linus Torvalds
227701e0e7 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:

 - fix incomplete syncing of filesystem

 - fix regression in readdir on ovl over 9p

 - only follow redirects when needed

 - misc fixes and cleanups

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix overlay: warning prefix
  ovl: Use PTR_ERR_OR_ZERO()
  ovl: Sync upper dirty data when syncing overlayfs
  ovl: update ctx->pos on impure dir iteration
  ovl: Pass ovl_get_nlink() parameters in right order
  ovl: don't follow redirects if redirect_dir=off
2017-12-15 12:46:48 -08:00
Scott Mayhew
dc4fd9ab01 nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
If there were no commit requests, then nfs_commit_inode() should not
wait on the commit or mark the inode dirty, otherwise the following
BUG_ON can be triggered:

[ 1917.130762] kernel BUG at fs/inode.c:578!
[ 1917.130766] Oops: Exception in kernel mode, sig: 5 [#1]
[ 1917.130768] SMP NR_CPUS=2048 NUMA pSeries
[ 1917.130772] Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc sg nx_crypto pseries_rng ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[ 1917.130805] CPU: 2 PID: 14923 Comm: umount.nfs4 Tainted: G               ------------ T 3.10.0-768.el7.ppc64 #1
[ 1917.130810] task: c0000005ecd88040 ti: c00000004cea0000 task.ti: c00000004cea0000
[ 1917.130813] NIP: c000000000354178 LR: c000000000354160 CTR: c00000000012db80
[ 1917.130816] REGS: c00000004cea3720 TRAP: 0700   Tainted: G               ------------ T  (3.10.0-768.el7.ppc64)
[ 1917.130820] MSR: 8000000100029032 <SF,EE,ME,IR,DR,RI>  CR: 22002822  XER: 20000000
[ 1917.130828] CFAR: c00000000011f594 SOFTE: 1
GPR00: c000000000354160 c00000004cea39a0 c0000000014c4700 c0000000018cc750
GPR04: 000000000000c750 80c0000000000000 0600000000000000 04eeb76bea749a03
GPR08: 0000000000000034 c0000000018cc758 0000000000000001 d000000005e619e8
GPR12: c00000000012db80 c000000007b31200 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 c000000000dfc3ec 0000000000000000 c0000005eefc02c0
GPR28: d0000000079dbd50 c0000005b94a02c0 c0000005b94a0250 c0000005b94a01c8
[ 1917.130867] NIP [c000000000354178] .evict+0x1c8/0x350
[ 1917.130871] LR [c000000000354160] .evict+0x1b0/0x350
[ 1917.130873] Call Trace:
[ 1917.130876] [c00000004cea39a0] [c000000000354160] .evict+0x1b0/0x350 (unreliable)
[ 1917.130880] [c00000004cea3a30] [c0000000003558cc] .evict_inodes+0x13c/0x270
[ 1917.130884] [c00000004cea3af0] [c000000000327d20] .kill_anon_super+0x70/0x1e0
[ 1917.130896] [c00000004cea3b80] [d000000005e43e30] .nfs_kill_super+0x20/0x60 [nfs]
[ 1917.130900] [c00000004cea3c00] [c000000000328a20] .deactivate_locked_super+0xa0/0x1b0
[ 1917.130903] [c00000004cea3c80] [c00000000035ba54] .cleanup_mnt+0xd4/0x180
[ 1917.130907] [c00000004cea3d10] [c000000000119034] .task_work_run+0x114/0x150
[ 1917.130912] [c00000004cea3db0] [c00000000001ba6c] .do_notify_resume+0xcc/0x100
[ 1917.130916] [c00000004cea3e30] [c00000000000a7b0] .ret_from_except_lite+0x5c/0x60
[ 1917.130919] Instruction dump:
[ 1917.130921] 7fc3f378 486734b5 60000000 387f00a0 38800003 4bdcb365 60000000 e95f00a0
[ 1917.130927] 694a0060 7d4a0074 794ad182 694a0001 <0b0a0000> 892d02a4 2f890000 40de0134

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org # 4.5+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:50 -05:00
Scott Mayhew
c156618e15 nfs: fix a deadlock in nfs client initialization
The following deadlock can occur between a process waiting for a client
to initialize in while walking the client list during nfsv4 server trunking
detection and another process waiting for the nfs_clid_init_mutex so it
can initialize that client:

Process 1                               Process 2
---------                               ---------
spin_lock(&nn->nfs_client_lock);
list_add_tail(&CLIENTA->cl_share_link,
        &nn->nfs_client_list);
spin_unlock(&nn->nfs_client_lock);
                                        spin_lock(&nn->nfs_client_lock);
                                        list_add_tail(&CLIENTB->cl_share_link,
                                                &nn->nfs_client_list);
                                        spin_unlock(&nn->nfs_client_lock);
                                        mutex_lock(&nfs_clid_init_mutex);
                                        nfs41_walk_client_list(clp, result, cred);
                                        nfs_wait_client_init_complete(CLIENTA);
(waiting for nfs_clid_init_mutex)

Make sure nfs_match_client() only evaluates clients that have completed
initialization in order to prevent that deadlock.

This patch also fixes v4.0 trunking behavior by not marking the client
NFS_CS_READY until the clientid has been confirmed.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:49 -05:00
Linus Torvalds
18d40eae7f Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "17 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  arch: define weak abort()
  mm, oom_reaper: fix memory corruption
  kernel: make groups_sort calling a responsibility group_info allocators
  mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()'
  tools/slabinfo-gnuplot: force to use bash shell
  kcov: fix comparison callback signature
  mm/slab.c: do not hash pointers when debugging slab
  mm/page_alloc.c: avoid excessive IRQ disabled times in free_unref_page_list()
  mm/memory.c: mark wp_huge_pmd() inline to prevent build failure
  scripts/faddr2line: fix CROSS_COMPILE unset error
  Documentation/vm/zswap.txt: update with same-value filled page feature
  exec: avoid gcc-8 warning for get_task_comm
  autofs: fix careless error in recent commit
  string.h: workaround for increased stack usage
  mm/kmemleak.c: make cond_resched() rate-limiting more efficient
  lib/rbtree,drm/mm: add rbtree_replace_node_cached()
  include/linux/idr.h: add #include <linux/bug.h>
2017-12-14 16:35:20 -08:00
Thiago Rafael Becker
bdcf0a423e kernel: make groups_sort calling a responsibility group_info allocators
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.

This patch:
 - Make groups_sort globally visible.
 - Move the call to groups_sort to the modifiers of group_info
 - Remove the call to groups_sort from set_groups

Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14 16:00:49 -08:00
Arnd Bergmann
3756f6401c exec: avoid gcc-8 warning for get_task_comm
gcc-8 warns about using strncpy() with the source size as the limit:

  fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]

This is indeed slightly suspicious, as it protects us from source
arguments without NUL-termination, but does not guarantee that the
destination is terminated.

This keeps the strncpy() to ensure we have properly padded target
buffer, but ensures that we use the correct length, by passing the
actual length of the destination buffer as well as adding a build-time
check to ensure it is exactly TASK_COMM_LEN.

There are only 23 callsites which I all reviewed to ensure this is
currently the case.  We could get away with doing only the check or
passing the right length, but it doesn't hurt to do both.

Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14 16:00:48 -08:00