Commit Graph

57974 Commits

Author SHA1 Message Date
Trond Myklebust
b422df915c lockd: Store the lockd client credential in struct nlm_host
When we create a new lockd client, we want to be able to pass the
correct credential of the process that created the struct nlm_host.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 17:51:23 -04:00
Trond Myklebust
3b7eb5e35d NFS: When mounting, don't share filesystems between different user namespaces
If two different containers that share the same network namespace attempt
to mount the same filesystem, we should not allow them to share the same
super block if they do not share the same user namespace, since the
user mappings on the wire will need to differ.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 17:39:42 -04:00
Trond Myklebust
c207db2f5d NFS: Convert NFSv2 to use the container user namespace
When mapping NFS identities, we want to substitute for the uids and
gids on the wire as we would for the AUTH_UNIX creds.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 17:26:37 -04:00
Trond Myklebust
58002399da NFSv4: Convert the NFS client idmapper to use the container user namespace
When mapping NFS identities using the NFSv4 idmapper, we want to substitute
for the uids and gids that would normally go on the wire as part of a
NFSv3 request. So we use the same mapping in the NFSv4 upcall as we
use in the NFSv3 RPC call (i.e. the mapping stored in the rpc_clnt cred).

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 17:10:53 -04:00
Trond Myklebust
264d948ce7 NFS: Convert NFSv3 to use the container user namespace
When mapping NFS identities, we want to substitute for the uids and
gids on the wire as we would for the AUTH_UNIX creds.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 16:54:27 -04:00
Trond Myklebust
1a58e8a0e5 NFS: Store the credential of the mount process in the nfs_server
Store the credential of the mount process so that we can determine
information such as the user namespace.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 16:11:54 -04:00
Trond Myklebust
79caa5fad4 SUNRPC: Cache cred of process creating the rpc_client
When converting kuids to AUTH_UNIX creds, etc we will want to use the
same user namespace as the process that created the rpc client.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-26 16:00:48 -04:00
Xiaoli Feng
ce96e888fe Fix nfs4.2 return -EINVAL when do dedupe operation
dedupe_file_range operations is combiled into remap_file_range.
But in nfs42_remap_file_range, it's skiped for dedupe operations.
Before this patch:
  # dd if=/dev/zero of=nfs/file bs=1M count=1
  # xfs_io -c "dedupe nfs/file 4k 64k 4k" nfs/file
  XFS_IOC_FILE_EXTENT_SAME: Invalid argument
After this patch:
  # dd if=/dev/zero of=nfs/file bs=1M count=1
  # xfs_io -c "dedupe nfs/file 4k 64k 4k" nfs/file
  deduped 4096/4096 bytes at offset 65536
  4 KiB, 1 ops; 0.0046 sec (865.988 KiB/sec and 216.4971 ops/sec)

Signed-off-by: Xiaoli Feng <fengxiaoli0714@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:15 -04:00
Trond Myklebust
c79d183ebb NFS: Remove redundant open context from nfs_page
The lock context already references and tracks the open context, so
take the opportunity to save some space in struct nfs_page.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:15 -04:00
Trond Myklebust
9fcd5960e8 NFS: Add a helper to return a pointer to the open context of a struct nfs_page
Add a helper for when we remove the explicit pointer to the open
context.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:15 -04:00
Trond Myklebust
154945112d NFS: Ensure that all nfs lock contexts have a valid open context
Force the lock context to keep a reference to the parent open
context so that we can guarantee the validity of the latter.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:15 -04:00
Trond Myklebust
0688e64bc6 NFS: Allow signal interruption of NFS4ERR_DELAYed operations
If the server is unable to immediately execute an RPC call, and returns
an NFS4ERR_DELAY then we can assume it is safe to interrupt the operation
in order to handle ordinary signals. This allows the application to
service timer interrupts that would otherwise have to wait until the
server is again able to respond.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
33344e0f7e pNFS: Add tracking to limit the number of pNFS retries
When the client is reading or writing using pNFS, and hits an error
on the DS, then it typically sends a LAYOUTERROR and/or LAYOUTRETURN
to the MDS, before redirtying the failed pages, and going for a new
round of reads/writebacks. The problem is that if the server has no
way to fix the DS, then we may need a way to interrupt this loop
after a set number of attempts have been made.
This patch adds an optional module parameter that allows the admin
to specify how many times to retry the read/writeback process before
failing with a fatal error.
The default behaviour is to retry forever.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
28b1d3f5a7 NFS: Remove unused argument from nfs_create_request()
All the callers of nfs_create_request() are now creating page group
heads, so we can remove the redundant 'last' page argument.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
c917cfaf9b NFS: Fix up NFS I/O subrequest creation
We require all NFS I/O subrequests to duplicate the lock context as well
as the open context.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
6fbda89b25 NFS: Replace custom error reporting mechanism with generic one
Replace the NFS custom error reporting mechanism with the generic
mapping_set_error().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
aded8d7b54 NFS: Don't inadvertently clear writeback errors
vfs_fsync() has the side effect of clearing unreported writeback errors,
so we need to make sure that we do not abuse it in situations where
applications might not normally expect us to report those errors.

The solution is to replace calls to vfs_fsync() with calls to nfs_wb_all().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
22876f540b NFS: Don't call generic_error_remove_page() while holding locks
The NFS read code can trigger writeback while holding the page lock.
If an error then triggers a call to nfs_write_error_remove_page(),
we can deadlock.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
14bebe3c90 NFS: Don't interrupt file writeout due to fatal errors
When flushing out dirty pages, the fact that we may hit fatal errors
is not a reason to stop writeback. Those errors are reported through
fsync(), not through the flush mechanism.

Fixes: a6598813a4 ("NFS: Don't write back further requests if there...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
91a575e1a9 NFS: Add a mount option "softerr" to allow clients to see ETIMEDOUT errors
Add a mount option that exposes the ETIMEDOUT errors that occur during
soft timeouts to the application. This allows aware applications to
distinguish between server disk IO errors and client timeout errors.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
11982a7c0f NFS: Consider ETIMEDOUT to be a fatal error
When we introduce the 'softerr' mount option, we will see the RPC
layer returning ETIMEDOUT errors if the server is unresponsive. We
want to consider those errors to be fatal on par with the EIO errors
that are returned by ordinary 'soft' timeouts..

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:14 -04:00
Trond Myklebust
6b2e685627 SUNRPC: Add function rpc_sleep_on_timeout()
Clean up the RPC task sleep interfaces by replacing the task->tk_timeout
'hidden parameter' to rpc_sleep_on() with a new function that takes an
absolute timeout.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:13 -04:00
Trond Myklebust
8357a9b60f SUNRPC: Remove unused argument 'action' from rpc_sleep_on_priority()
None of the callers set the 'action' argument, so let's just remove it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:12 -04:00
Trond Myklebust
ae67bd3821 SUNRPC: Fix up task signalling
The RPC_TASK_KILLED flag should really not be set from another context
because it can clobber data in the struct task when task->tk_flags is
changed non-atomically.
Let's therefore swap out RPC_TASK_KILLED with an atomic flag, and add
a function to set that flag and safely wake up the task.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-04-25 14:18:12 -04:00
Linus Torvalds
38a2ca2cac for-linus-20190420
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAly7HukQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpla6D/9y7YyAyKDgv/pVQgAlDYaGSXufvrK5iK/f
 uFdSWPvGuWMbx+xy/4hfSX1pV9ZRv1aRJeFOkL/qVyr+4izKrgevwj+6Kl3/mCUO
 dhiqF76bnaGXNQC6YDn1IgZp2Za+WGpeNlEhwcg20Ve11U7DVBhcL/n/6NYphtUG
 V7ZFoVw+yjOO9GvkUeHx24HIQdC0JrABMoXYldl/tX3H9WjB3d1ncZDS45TuemXJ
 lwm/S4nyaNaDzLnO7Hv51u3tCFpuaJbcgBdKuZB/oSWhU68D26/6peW+8qAvN+ec
 htibFrK6KPRQCLNMEaV2njZEyprkL/BZJz4YukwmWB8GAtsuquy3Q3wJPounGmm5
 7fCG/T1asqkurwhVcHOC07R6+d8AT5ARyJn3QYFmoYCIoSwObu6xhZHHAv7Ct1Xn
 lrU4it0WkYbTXVI1l4CaRUtshCIQTZwr2EsgppjAsBc1+V2KgtbxR1wkQq2q9tQZ
 Fa/2KTv9Y1+7FUOf09LEvTbuUgZn4I6u4E07QwY4miFsQSEUufirfHZ5t62lIgA9
 3YzUrlVQSP1PbG8IP4aCSX2o+dxhL1Js6ukdZAxM6w9RtjLqWI3zTImSSMJobjna
 SF53kkpv1xuJYT+Z1YmNGbMauzLs/HhCB9ww56TUuQYW/rTDASqFc48l7+vsfPrZ
 sTEkShVGOw==
 =d9Ws
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190420' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A set of small fixes that should go into this series. This contains:

   - Removal of unused queue member (Hou)

   - Overflow bvec fix (Ming)

   - Various little io_uring tweaks (me)
       - kthread parking
       - Only call cpu_possible() for verified CPU
       - Drop unused 'file' argument to io_file_put()
       - io_uring_enter vs io_uring_register deadlock fix
       - CQ overflow fix

   - BFQ internal depth update fix (me)"

* tag 'for-linus-20190420' of git://git.kernel.dk/linux-block:
  block: make sure that bvec length can't be overflow
  block: kill all_q_node in request_queue
  io_uring: fix CQ overflow condition
  io_uring: fix possible deadlock between io_uring_{enter,register}
  io_uring: drop io_file_put() 'file' argument
  bfq: update internal depth state when queue depth changes
  io_uring: only test SQPOLL cpu after we've verified it
  io_uring: park SQPOLL thread if it's percpu
2019-04-20 12:20:58 -07:00
Andrea Arcangeli
04f5866e41 coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it.  Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.

For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs.  Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.

Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.

In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-19 09:46:05 -07:00
Linus Torvalds
2a852fd1ac AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXLGSOfu3V2unywtrAQKrLA/+MJfgZC9ejNsXqjnnqq48LtBf4B3FhYrO
 qdPZMKPuy5XxSWBIIGpXtgK7PspkpAD1NCe7qpvibQoa1SJqwa2gc46aN7+gzp3e
 HETkb4UEXNV5yXSjEA+YOQVa0MiLG2yL6satlqhN4KqoDB5Rj9KoaPFwT79K4SIU
 1iME9Q//tA2M7SU8B1FWqyx5kd350A38KoXmHOux7x8K1ZyjEB2DbyKqZBtn6Y/T
 nfH9ZZioep1JbPFwOuG5NYQH0hP2jUj8ogDnrzzWzn1rQQTWs6wVfigwAh8tc/Dv
 eDKA8rV9WtZT97ExOG6cQOe0UZsM3NFhnb4P/oSCpl+jtr9WHO3pDmJPLToCr0BA
 8xXW7vA+Ejx15dNz4Snnz9yj3FS0UeURSFyhn3g2AFQMK4v9Dd3ZZKOgjR9Czi98
 f1UNh768LKY+siOHmoOhgZNlrI8bgamCmXOucPayGzrQ/HVHqAwR6PiUGCzaH+vk
 +LfDC/Z/ezSqrfQiH/Ji5lLkQkhmYiUTcePn5xdi+kZGWqHwl9qGplVAFUVafs5O
 Hz+H2F7i8F6Oc4v5hzCHp5aDUP2oEKmruaXJimN+O9SOkUPi72aG8STRpBFnBqf7
 77WxNVG/XDFL6I8HukkgRMRePFDXi+2NwGOjB9k8PpAqyA1Vi6sSkCXVkwgAhLRS
 NCaOR2iDaqE=
 =cqBe
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:

 - Stop using the deprecated get_seconds().

 - Don't make tracepoint strings const as the section they go in isn't
   read-only.

 - Differentiate failure due to unmarshalling from other failure cases.
   We shouldn't abort with RXGEN_CC/SS_UNMARSHAL if it's not due to
   unmarshalling.

 - Add a missing unlock_page().

 - Fix the interaction between receiving a notification from a server
   that it has invalidated all outstanding callback promises and a
   client call that we're in the middle of making that will get a new
   promise.

* tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix in-progess ops to ignore server-level callback invalidation
  afs: Unlock pages for __pagevec_release()
  afs: Differentiate abort due to unmarshalling from other errors
  afs: Avoid section confusion in CM_NAME
  afs: avoid deprecated get_seconds()
2019-04-18 08:10:22 -07:00
Linus Torvalds
e53f31bffe five small SMB3 fixes, all also for stable - an important fix for an oplock (lease) bug, a handle leak, and 3 bugs spotted by KASAN
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAly2PPMACgkQiiy9cAdy
 T1EbQQwArLGfousJE2mkQZIdAlefjeyCLJdFxOLktA3dd3EbmgOtxg8FSBt5W04X
 UqQ6i2cGL46Uecp+EaOTy2o0o3lc7eRGA83+0CYY+JLNLSS9g+RNgK2kIkcnj3h8
 N/fovUg3yhdLO0tmh13FcwBDAYPqjSd4iija1icBXSkkZdp592ims3zJaGunHMGn
 s237jaLPzx7xxqNaGDD/MtQtpPDzZzPlVCszTnO3anGaSHB1R7HNhR9twWTF/mmz
 bfhLiOyquzzt53XyaTmtpnzMua5X+Rhii2Pq315Pb+71JQ5TBYWRIOx/8QKwdGS+
 9EiiiEgqYfwmcDl1KPnm1ppTF2++mYkFBFq43cgosagDaL8YXdqtCIrsVkTxXnmv
 +bKhiXpriPzsy5INuRgfWmODrbl2QLRY1qbS4I/lA0+uZxFu7palwmSBkMOgcorn
 0TnejxsajTtYBpkzYKOuiRN7z4ukvR4FJd6xaxqSkyX+Jlda4wnwPTRVyBxOcL06
 EaAxdi7X
 =vsJs
 -----END PGP SIGNATURE-----

Merge tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb3 fixes from Steve French:
 "Five small SMB3 fixes, all also for stable - an important fix for an
  oplock (lease) bug, a handle leak, and three bugs spotted by KASAN"

* tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: keep FileInfo handle live during oplock break
  cifs: fix handle leak in smb2_query_symlink()
  cifs: Fix lease buffer length error
  cifs: Fix use-after-free in SMB2_read
  cifs: Fix use-after-free in SMB2_write
2019-04-17 13:36:45 -07:00
Jens Axboe
74f464e970 io_uring: fix CQ overflow condition
This is a leftover from when the rings initially were not free flowing,
and hence a test for tail + 1 == head would indicate full. Since we now
let them wrap instead of mask them with the size, we need to check if
they drift more than the ring size from each other.

This fixes a case where we'd overwrite CQ ring entries, if the user
failed to reap completions. Both cases would ultimately result in lost
completions as the application violated the depth it asked for. The only
difference is that before this fix we'd return invalid entries for the
overflowed completions, instead of properly flagging it in the
cq_ring->overflow variable.

Reported-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-17 11:41:49 -06:00
Linus Torvalds
2a3a028fc6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Handle init flow failures properly in iwlwifi driver, from Shahar S
    Matityahu.

 2) mac80211 TXQs need to be unscheduled on powersave start, from Felix
    Fietkau.

 3) SKB memory accounting fix in A-MDSU aggregation, from Felix Fietkau.

 4) Increase RCU lock hold time in mlx5 FPGA code, from Saeed Mahameed.

 5) Avoid checksum complete with XDP in mlx5, also from Saeed.

 6) Fix netdev feature clobbering in ibmvnic driver, from Thomas Falcon.

 7) Partial sent TLS record leak fix from Jakub Kicinski.

 8) Reject zero size iova range in vhost, from Jason Wang.

 9) Allow pending work to complete before clcsock release from Karsten
    Graul.

10) Fix XDP handling max MTU in thunderx, from Matteo Croce.

11) A lot of protocols look at the sa_family field of a sockaddr before
    validating it's length is large enough, from Tetsuo Handa.

12) Don't write to free'd pointer in qede ptp error path, from Colin Ian
    King.

13) Have to recompile IP options in ipv4_link_failure because it can be
    invoked from ARP, from Stephen Suryaputra.

14) Doorbell handling fixes in qed from Denis Bolotin.

15) Revert net-sysfs kobject register leak fix, it causes new problems.
    From Wang Hai.

16) Spectre v1 fix in ATM code, from Gustavo A. R. Silva.

17) Fix put of BROPT_VLAN_STATS_PER_PORT in bridging code, from Nikolay
    Aleksandrov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (111 commits)
  socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW
  tcp: tcp_grow_window() needs to respect tcp_space()
  ocelot: Clean up stats update deferred work
  ocelot: Don't sleep in atomic context (irqs_disabled())
  net: bridge: fix netlink export of vlan_stats_per_port option
  qed: fix spelling mistake "faspath" -> "fastpath"
  tipc: set sysctl_tipc_rmem and named_timeout right range
  tipc: fix link established but not in session
  net: Fix missing meta data in skb with vlan packet
  net: atm: Fix potential Spectre v1 vulnerabilities
  net/core: work around section mismatch warning for ptp_classifier
  net: bridge: fix per-port af_packet sockets
  bnx2x: fix spelling mistake "dicline" -> "decline"
  route: Avoid crash from dereferencing NULL rt->from
  MAINTAINERS: normalize Woojung Huh's email address
  bonding: fix event handling for stacked bonds
  Revert "net-sysfs: Fix memory leak in netdev_register_kobject"
  rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check
  qed: Fix the DORQ's attentions handling
  qed: Fix missing DORQ attentions
  ...
2019-04-17 09:57:45 -07:00
Aurelien Aptel
b98749cac4 CIFS: keep FileInfo handle live during oplock break
In the oplock break handler, writing pending changes from pages puts
the FileInfo handle. If the refcount reaches zero it closes the handle
and waits for any oplock break handler to return, thus causing a deadlock.

To prevent this situation:

* We add a wait flag to cifsFileInfo_put() to decide whether we should
  wait for running/pending oplock break handlers

* We keep an additionnal reference of the SMB FileInfo handle so that
  for the rest of the handler putting the handle won't close it.
  - The ref is bumped everytime we queue the handler via the
    cifs_queue_oplock_break() helper.
  - The ref is decremented at the end of the handler

This bug was triggered by xfstest 464.

Also important fix to address the various reports of
oops in smb2_push_mandatory_locks

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2019-04-16 09:38:38 -05:00
Ronnie Sahlberg
e6d0fb7b34 cifs: fix handle leak in smb2_query_symlink()
If we enter smb2_query_symlink() for something that is not a symlink
and where the SMB2_open() would succeed we would never end up
closing this handle and would thus leak a handle on the server.

Fix this by immediately calling SMB2_close() on successfull open.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-16 09:38:26 -05:00
ZhangXiaoxu
b57a55e220 cifs: Fix lease buffer length error
There is a KASAN slab-out-of-bounds:
BUG: KASAN: slab-out-of-bounds in _copy_from_iter_full+0x783/0xaa0
Read of size 80 at addr ffff88810c35e180 by task mount.cifs/539

CPU: 1 PID: 539 Comm: mount.cifs Not tainted 4.19 #10
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
            rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
Call Trace:
 dump_stack+0xdd/0x12a
 print_address_description+0xa7/0x540
 kasan_report+0x1ff/0x550
 check_memory_region+0x2f1/0x310
 memcpy+0x2f/0x80
 _copy_from_iter_full+0x783/0xaa0
 tcp_sendmsg_locked+0x1840/0x4140
 tcp_sendmsg+0x37/0x60
 inet_sendmsg+0x18c/0x490
 sock_sendmsg+0xae/0x130
 smb_send_kvec+0x29c/0x520
 __smb_send_rqst+0x3ef/0xc60
 smb_send_rqst+0x25a/0x2e0
 compound_send_recv+0x9e8/0x2af0
 cifs_send_recv+0x24/0x30
 SMB2_open+0x35e/0x1620
 open_shroot+0x27b/0x490
 smb2_open_op_close+0x4e1/0x590
 smb2_query_path_info+0x2ac/0x650
 cifs_get_inode_info+0x1058/0x28f0
 cifs_root_iget+0x3bb/0xf80
 cifs_smb3_do_mount+0xe00/0x14c0
 cifs_do_mount+0x15/0x20
 mount_fs+0x5e/0x290
 vfs_kern_mount+0x88/0x460
 do_mount+0x398/0x31e0
 ksys_mount+0xc6/0x150
 __x64_sys_mount+0xea/0x190
 do_syscall_64+0x122/0x590
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

It can be reproduced by the following step:
  1. samba configured with: server max protocol = SMB2_10
  2. mount -o vers=default

When parse the mount version parameter, the 'ops' and 'vals'
was setted to smb30,  if negotiate result is smb21, just
update the 'ops' to smb21, but the 'vals' is still smb30.
When add lease context, the iov_base is allocated with smb21
ops, but the iov_len is initiallited with the smb30. Because
the iov_len is longer than iov_base, when send the message,
copy array out of bounds.

we need to keep the 'ops' and 'vals' consistent.

Fixes: 9764c02fcb ("SMB3: Add support for multidialect negotiate (SMB2.1 and later)")
Fixes: d5c7076b77 ("smb3: add smb3.1.1 to default dialect list")

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-16 09:38:23 -05:00
ZhangXiaoxu
088aaf17aa cifs: Fix use-after-free in SMB2_read
There is a KASAN use-after-free:
BUG: KASAN: use-after-free in SMB2_read+0x1136/0x1190
Read of size 8 at addr ffff8880b4e45e50 by task ln/1009

Should not release the 'req' because it will use in the trace.

Fixes: eccb4422cf ("smb3: Add ftrace tracepoints for improved SMB3 debugging")

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> 4.18+
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-16 09:38:21 -05:00
ZhangXiaoxu
6a3eb33606 cifs: Fix use-after-free in SMB2_write
There is a KASAN use-after-free:
BUG: KASAN: use-after-free in SMB2_write+0x1342/0x1580
Read of size 8 at addr ffff8880b6a8e450 by task ln/4196

Should not release the 'req' because it will use in the trace.

Fixes: eccb4422cf ("smb3: Add ftrace tracepoints for improved SMB3 debugging")

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org> 4.18+
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-04-16 09:38:18 -05:00
Linus Torvalds
5512320c9f fsdax fix 5.1-rc6
- Avoid a crash scenario with architectures like powerpc that require
   'pgtable_deposit' for the zero page.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJctNryAAoJEB7SkWpmfYgCzcMP/37LJbb4SYNwnDIW4BF33ril
 ZwtPeJJVTR56Ojo+Dy1v9084zeyhUHHewz0Oqx15dm6k/N5SS19yKNFKQDOK+4OC
 zbaWD5UOtllU3RQ2ORUOUoqNGF278+h4VVVQMntVaHhdt5f120tgHXxmKoB5Z5zH
 Gcy0vZNHoJ5lVYfKjKYG0b0/dWWOD1ZEjTkZjTa4DjhVSQcFauN8DxJ4hSyumYqs
 HDnZZt44RTTUS5W3BTlhuaSEcZaDOznmyj1HmKXNg3ghxguKACho4xhA7xFKqT8O
 03WZxDBFnOXZb3yfKpHB6RclkJgrtmD5U5GStzl5SobLPb2E/TPQzCRhZ/kcFPZ8
 RE2JkgdGl8gqCDRqRsC/tbF3dETO66vxUyf5utNv0ttBk7qLMwTGTKm3VQz7Xvu2
 SLkwv6Rlw4UT6ML8nd2kNhf8xRkaLl6j1B6zWDy7wEoFPXWW+My0PPpsJZcbTeza
 eib2ood7AlPHRU0/mW2ZrGHGabbS6kNGeQlod9U5sikkE7ZA/LwzyFl4b/uCqYNP
 NKGQdz0iHVcq8lFPXEmZ7vP2krd6uUWIv9KaiwmjBBMf9w3ZAzS85c7HFAZD0zgC
 tTHm6stMhpdS3ndyIxMBf0sL7AB/Q9BH7jJwDK/P8QObovezW2zZ4CPx/gQYJ2XU
 LTeCJmQh3xcCpI3f/eka
 =ijtJ
 -----END PGP SIGNATURE-----

Merge tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull fsdax fix from Dan Williams:
 "A single filesystem-dax fix. It has been lingering in -next for a long
  while and there are no other fsdax fixes on the horizon:

   - Avoid a crash scenario with architectures like powerpc that require
     'pgtable_deposit' for the zero page"

* tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  fs/dax: Deposit pagetable even when installing zero page
2019-04-15 15:10:20 -07:00
Jens Axboe
b19062a567 io_uring: fix possible deadlock between io_uring_{enter,register}
If we have multiple threads, one doing io_uring_enter() while the other
is doing io_uring_register(), we can run into a deadlock between the
two. io_uring_register() must wait for existing users of the io_uring
instance to exit. But it does so while holding the io_uring mutex.
Callers of io_uring_enter() may need this mutex to make progress (and
eventually exit). If we wait for users to exit in io_uring_register(),
we can't do so with the io_uring mutex held without potentially risking
a deadlock.

Drop the io_uring mutex while waiting for existing callers to exit. This
is safe and guaranteed to make forward progress, since we already killed
the percpu ref before doing so. Hence later callers of io_uring_enter()
will be rejected.

Reported-by: syzbot+16dc03452dee970a0c3e@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-15 10:49:38 -06:00
Linus Torvalds
6b3a707736 Merge branch 'page-refs' (page ref overflow)
Merge page ref overflow branch.

Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).

Admittedly it's not exactly easy.  To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers.  Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).

Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication.  So let's just do that.

* branch page-refs:
  fs: prevent page refcount overflow in pipe_buf_get
  mm: prevent get_user_pages() from overflowing page refcount
  mm: add 'try_get_page()' helper function
  mm: make page ref count overflow check tighter and more explicit
2019-04-14 15:09:40 -07:00
Matthew Wilcox
15fab63e1e fs: prevent page refcount overflow in pipe_buf_get
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount.  All
callers converted to handle a failure.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-14 10:00:04 -07:00
Jens Axboe
3d6770fbd9 io_uring: drop io_file_put() 'file' argument
Since the fget/fput handling was reworked in commit 09bb839434, we
never call io_file_put() with state == NULL (and hence file != NULL)
anymore. Remove that case.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13 19:08:22 -06:00
Jens Axboe
917257daa0 io_uring: only test SQPOLL cpu after we've verified it
We currently call cpu_possible() even if we don't use the CPU. Move the
test under the SQ_AFF branch, which is the only place where we'll use
the value. Do the cpu_possible() test AFTER we've limited it to a max
of NR_CPUS. This avoids triggering the following warning:

WARNING: CPU: 1 PID: 7600 at include/linux/cpumask.h:121 cpu_max_bits_warn

if CONFIG_DEBUG_PER_CPU_MAPS is enabled.

While in there, also move the SQ thread idle period assignment inside
SETUP_SQPOLL, as we don't use it otherwise either.

Reported-by: syzbot+cd714a07c6de2bc34293@syzkaller.appspotmail.com
Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13 19:08:22 -06:00
Jens Axboe
0605863246 io_uring: park SQPOLL thread if it's percpu
kthread expects this, or we can throw a warning on exit:

WARNING: CPU: 0 PID: 7822 at kernel/kthread.c:399
__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 7822 Comm: syz-executor030 Not tainted 5.1.0-rc4-next-20190412
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2cb/0x72b kernel/panic.c:214
  __warn.cold+0x20/0x46 kernel/panic.c:576
  report_bug+0x263/0x2b0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
RIP: 0010:__kthread_bind_mask+0x3b/0xc0 kernel/kthread.c:399
Code: 48 89 fb e8 f7 ab 24 00 4c 89 e6 48 89 df e8 ac e1 02 00 31 ff 49 89
c4 48 89 c6 e8 7f ad 24 00 4d 85 e4 75 15 e8 d5 ab 24 00 <0f> 0b e8 ce ab
24 00 5b 41 5c 41 5d 41 5e 5d c3 e8 c0 ab 24 00 4c
RSP: 0018:ffff8880a89bfbb8 EFLAGS: 00010293
RAX: ffff88808ca7a280 RBX: ffff8880a98e4380 RCX: ffffffff814bdd11
RDX: 0000000000000000 RSI: ffffffff814bdd1b RDI: 0000000000000007
RBP: ffff8880a89bfbd8 R08: ffff88808ca7a280 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff87691148 R14: ffff8880a98e43a0 R15: ffffffff81c91e10
  __kthread_bind kernel/kthread.c:412 [inline]
  kthread_unpark+0x123/0x160 kernel/kthread.c:480
  kthread_stop+0xfa/0x6c0 kernel/kthread.c:556
  io_sq_thread_stop fs/io_uring.c:2057 [inline]
  io_sq_thread_stop fs/io_uring.c:2052 [inline]
  io_finish_async+0xab/0x180 fs/io_uring.c:2064
  io_ring_ctx_free fs/io_uring.c:2534 [inline]
  io_ring_ctx_wait_and_kill+0x133/0x510 fs/io_uring.c:2591
  io_uring_release+0x42/0x50 fs/io_uring.c:2599
  __fput+0x2e5/0x8d0 fs/file_table.c:278
  ____fput+0x16/0x20 fs/file_table.c:309
  task_work_run+0x14a/0x1c0 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x90a/0x2fa0 kernel/exit.c:876
  do_group_exit+0x135/0x370 kernel/exit.c:980
  __do_sys_exit_group kernel/exit.c:991 [inline]
  __se_sys_exit_group kernel/exit.c:989 [inline]
  __x64_sys_exit_group+0x44/0x50 kernel/exit.c:989
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: syzbot+6d4a92619eb0ad08602b@syzkaller.appspotmail.com
Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-13 19:08:22 -06:00
Linus Torvalds
4443f8e6ac for-linus-20190412
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyw354QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiMyEAC4THUReCrTuv9oFRNg5uILVYIq51nP8dw7
 XamC7A92jPXd6vl/QVjmvLwT34/Y2XvX0t62RBsk849CEjgGYTeF1/qI3tMkpN7c
 huupab3aYM/Rrv4i1KSPQu6iIto3DYqfmREaGJJ1Ikbu/CKDuUGyEo+Z4wrKUPon
 GWnE8QMS2fdc764eVzKKqB+GryaEiHmeD1N4NnPs+nla14ysueUvJUikkTt/Laef
 h7nOmz9mrqE6u1xVHNpo0TlW0oJdLfaDIL9ghwHFJXqvriTh8Tg2tEHpXI6vSTTt
 StnPbTA1s1uhHs4rWYl8J0UXSZnRRp0Ep8jCvqEb9CJ23uHCNyGEoy/R7q+x2quf
 T+ruolMXY7IIJP30ZMHar374YfajJdw7EH/565nlbLnjSBXhqjmc07kQ7mIYvpg6
 JgureSdDwOOHpfrJgVq5es48ndt5HBYUBPzkvVGTgkeSJkMydkkM1qZeYEnai105
 8EnUFusRUnYZtb73HBPjKS7i0BZZvZlI1oKYHabiMtajqcKyvwDP2tTmhqXYLDLY
 9uloW0u2B0lddfzCb9hTYZOroNWfifo4vuSU5DHvnJoKvf4z3auDxaFD9N8fGn6S
 aZsRjMCpFqFd0YEnZPbsctgPg2Licrs02uPntlzBTJ0ByH20pX4OepYrvgQk3vao
 tOQ1jRYMKw==
 =cISy
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Set of fixes that should go into this round. This pull is larger than
  I'd like at this time, but there's really no specific reason for that.
  Some are fixes for issues that went into this merge window, others are
  not. Anyway, this contains:

   - Hardware queue limiting for virtio-blk/scsi (Dongli)

   - Multi-page bvec fixes for lightnvm pblk

   - Multi-bio dio error fix (Jason)

   - Remove the cache hint from the io_uring tool side, since we didn't
     move forward with that (me)

   - Make io_uring SETUP_SQPOLL root restricted (me)

   - Fix leak of page in error handling for pc requests (Jérôme)

   - Fix BFQ regression introduced in this merge window (Paolo)

   - Fix break logic for bio segment iteration (Ming)

   - Fix NVMe cancel request error handling (Ming)

   - NVMe pull request with two fixes (Christoph):
       - fix the initial CSN for nvme-fc (James)
       - handle log page offsets properly in the target (Keith)"

* tag 'for-linus-20190412' of git://git.kernel.dk/linux-block:
  block: fix the return errno for direct IO
  nvmet: fix discover log page when offsets are used
  nvme-fc: correct csn initialization and increments on error
  block: do not leak memory in bio_copy_user_iov()
  lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
  nvme: cancel request synchronously
  blk-mq: introduce blk_mq_complete_request_sync()
  scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
  virtio-blk: limit number of hw queues by nr_cpu_ids
  block, bfq: fix use after free in bfq_bfqq_expire
  io_uring: restrict IORING_SETUP_SQPOLL to root
  tools/io_uring: remove IOCQE_FLAG_CACHEHIT
  block: don't use for-inside-for in bio_for_each_segment_all
2019-04-13 16:23:16 -07:00
Linus Torvalds
b60bc0665e NFS client bugfixes for Linux 5.1
Highlights include:
 
 Stable fixes:
 - Fix a deadlock in close() due to incorrect draining of RDMA queues
 
 Bugfixes:
 - Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
   as it is causing stack overflows
 - Fix a regression where NFSv4 getacl and fs_locations stopped working
 - Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
 - Fix xfstests failures due to incorrect copy_file_range() return values
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJcsfeVAAoJEA4mA3inWBJcPjAQAIPERRVWjg7xRz6CJzt2yoM1
 ApPj965DCnC9bGcGAH2U+TbCWJOi3lJwaZOPTL0ut/Tcv9PpKETRqk+rrjUcFRy1
 1b1HH16GivprOmHgCRyqo5Qj2ZiaGNpY3tJfxl/6eIiSpHKPZLa4zY+q2KfK/YNI
 SOVyNU0Gq08p4AiKr3CG5VVZGdNgRMrnzBYJqeTh1zZ7erWE2nJoE+pmvcLhZR0w
 uxshbTWbJT21KLEI+PXTyGtFkz5jNaKy4Ts07MRBJdQjDv73MUW8CcqFZicSjtqx
 zdKYa1VH9pEOjFOs57xGELSnYRdB00Vgd9/b6MqKyWH8iJzXFbgjEusMWiU45aeF
 NLg9ySSU8LeY93SxV66CHG57NIgHqwZu6P+lO3efRzuHgEGceDsz0WwDF2KNIZlm
 /vOmbk0I+woneFUeNDWAXD9/ETUJ8RCNk1/b1UlbkUL7aD5WSLDp1bKPifk/WA6E
 Mtgwmqz1Vso3cIPglWcAgsfEAYJZSJVDMfRIhm2dy7vVU0nfW12I00G8BShgr8f7
 mxAxd/V+1/Q9ftPENgC9z5LWKYQjfjksnYRHXW1m5c92Yoe9TF0yiNyDmT5hBR6w
 MvUN2j3yeQBqk6JHZxtH/mmdSRD0o5kxvFrEqMj1PpP8X8DpWupQA8SZKnHq0wlj
 8Q7LRum+wmhbiKCmZ+1F
 =vRPB
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fix:

   - Fix a deadlock in close() due to incorrect draining of RDMA queues

  Bugfixes:

   - Revert "SUNRPC: Micro-optimise when the task is known not to be
     sleeping" as it is causing stack overflows

   - Fix a regression where NFSv4 getacl and fs_locations stopped
     working

   - Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.

   - Fix xfstests failures due to incorrect copy_file_range() return
     values"

* tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
  NFSv4.1 fix incorrect return value in copy_file_range
  xprtrdma: Fix helper that drains the transport
  NFS: Fix handling of reply page vector
  NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
2019-04-13 14:47:06 -07:00
David Howells
eeba1e9cf3 afs: Fix in-progess ops to ignore server-level callback invalidation
The in-kernel afs filesystem client counts the number of server-level
callback invalidation events (CB.InitCallBackState* RPC operations) that it
receives from the server.  This is stored in cb_s_break in various
structures, including afs_server and afs_vnode.

If an inode is examined by afs_validate(), say, the afs_server copy is
compared, along with other break counters, to those in afs_vnode, and if
one or more of the counters do not match, it is considered that the
server's callback promise is broken.  At points where this happens,
AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be
refetched from the server.

afs_validate() issues an FS.FetchStatus operation to get updated metadata -
and based on the updated data_version may invalidate the pagecache too.

However, the break counters are also used to determine whether to note a
new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag)
and whether to cache the permit data included in the YFSFetchStatus record
by the server.


The problem comes when the server sends us a CB.InitCallBackState op.  The
first such instance doesn't cause cb_s_break to be incremented, but rather
causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours
after last use and all the volumes have been automatically unmounted and
the server has forgotten about the client[*], this *will* likely cause an
increment.

 [*] There are other circumstances too, such as the server restarting or
     needing to make space in its callback table.

Note that the server won't send us a CB.InitCallBackState op until we talk
to it again.

So what happens is:

 (1) A mount for a new volume is attempted, a inode is created for the root
     vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set
     immediately, as we don't have a nominated server to talk to yet - and
     we may iterate through a few to find one.

 (2) Before the operation happens, afs_fetch_status(), say, notes in the
     cursor (fc.cb_break) the break counter sum from the vnode, volume and
     server counters, but the server->cb_s_break is currently 0.

 (3) We send FS.FetchStatus to the server.  The server sends us back
     CB.InitCallBackState.  We increment server->cb_s_break.

 (4) Our FS.FetchStatus completes.  The reply includes a callback record.

 (5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether
     the callback promise was broken by checking the break counter sum from
     step (2) against the current sum.

     This fails because of step (3), so we don't set the callback record
     and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.

This does not preclude the syscall from progressing, and we don't loop here
rechecking the status, but rather assume it's good enough for one round
only and will need to be rechecked next time.

 (6) afs_validate() it triggered on the vnode, probably called from
     d_revalidate() checking the parent directory.

 (7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't
     update vnode->cb_s_break and assumes the vnode to be invalid.

 (8) afs_validate() needs to calls afs_fetch_status().  Go back to step (2)
     and repeat, every time the vnode is validated.

This primarily affects volume root dir vnodes.  Everything subsequent to
those inherit an already incremented cb_s_break upon mounting.


The issue is that we assume that the callback record and the cached permit
information in a reply from the server can't be trusted after getting a
server break - but this is wrong since the server makes sure things are
done in the right order, holding up our ops if necessary[*].

 [*] There is an extremely unlikely scenario where a reply from before the
     CB.InitCallBackState could get its delivery deferred till after - at
     which point we think we have a promise when we don't.  This, however,
     requires unlucky mass packet loss to one call.

AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from
a server we've never contacted before, but this should be unnecessary.
It's also further insulated from the problem on an initial mount by
querying the server first with FS.GetCapabilities, which triggers the
CB.InitCallBackState.


Fix this by

 (1) Remove AFS_SERVER_FL_NEW.

 (2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the
     calculation.

 (3) In afs_cb_is_broken(), don't include cb_s_break in the check.


Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-13 08:37:37 +01:00
Marc Dionne
21bd68f196 afs: Unlock pages for __pagevec_release()
__pagevec_release() complains loudly if any page in the vector is still
locked.  The pages need to be locked for generic_error_remove_page(), but
that function doesn't actually unlock them.

Unlock the pages afterwards.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jonathan Billings <jsbillin@umich.edu>
2019-04-13 08:37:37 +01:00
David Howells
8022c4b95c afs: Differentiate abort due to unmarshalling from other errors
Differentiate an abort due to an unmarshalling error from an abort due to
other errors, such as ENETUNREACH.  It doesn't make sense to set abort code
RXGEN_*_UNMARSHAL in such a case, so use RX_USER_ABORT instead.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-13 08:37:37 +01:00
Andi Kleen
d2abfa86ff afs: Avoid section confusion in CM_NAME
__tracepoint_str cannot be const because the tracepoint_str
section is not read-only. Remove the stray const.

Cc: dhowells@redhat.com
Cc: viro@zeniv.linux.org.uk
Signed-off-by: Andi Kleen <ak@linux.intel.com>
2019-04-13 08:37:36 +01:00
Arnd Bergmann
ba25b81e3a afs: avoid deprecated get_seconds()
get_seconds() has a limited range on 32-bit architectures and is
deprecated because of that. While AFS uses the same limits for
its inode timestamps on the wire protocol, let's just use the
simpler current_time() as we do for other file systems.

This will still zero out the 'tv_nsec' field of the timestamps
internally.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-13 08:37:36 +01:00
Marc Dionne
f7f1dd3162 afs: Check for rxrpc call completion in wait loop
Check the state of the rxrpc call backing an afs call in each iteration of
the call wait loop in case the rxrpc call has already been terminated at
the rxrpc layer.

Interrupt the wait loop and mark the afs call as complete if the rxrpc
layer call is complete.

There were cases where rxrpc errors were not passed up to afs, which could
result in this loop waiting forever for an afs call to transition to
AFS_CALL_COMPLETE while the rx call was already complete.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00