Commit Graph

3833 Commits

Author SHA1 Message Date
Linus Torvalds
12bf0b632e NFS client bugfixes for Linux 5.7
Highlights include:
 
 Stable fixes:
 - nfs: fix NULL deference in nfs4_get_valid_delegation
 
 Bugfixes:
 - Fix corruption of the return value in cachefiles_read_or_alloc_pages()
 - Fix several fscache cookie issues
 - Fix a fscache queuing race that can trigger a BUG_ON
 - NFS: Fix 2 use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
 - SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
 - SUNRPC: Fix a race when tearing down the rpc client debugfs directory
 - SUNRPC: Signalled ASYNC tasks need to exit
 - NFSv3: fix rpc receive buffer size for MOUNT call
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6/AkgACgkQZwvnipYK
 APICLQ/9ENY+mQmVMSbtw2VGlBphS+GLN44k70NgmjAaI9n8f3ILZyLl0/iHEPFU
 ZaCWhPWEs76olLfCWwoQFzISIUTHWVUow8mJn7gTh4mq8n9UFv8zgUxtJ3HTfEHt
 rtujG9xsK0Oa351UPJWE5yC7PvFzMohcc1vVD8WQeGJQ3sMBVHTuOuPatIv3vK6s
 8MflKTbtz/gUkcbWMCk9ljMPXr6/Ksgu9GZDnDFAYZBfFkwx//RNmq6K+z1Ru15s
 tkmPPZGNMCfKblrnUXUmPt78wxSExmWrXroSMNas2fyeOXgPL0ogNx8vfdFcFsxs
 sHpMkF97+npntNj1y5om6GrdU4SRYUFqXT7pNqV4wuGguOfFELIXrJIQuCDaKrGD
 ApEoo9UDisGCLdqs738ascZFHZiTQoy6drbpR8moalqhYkTI7Al/pPxHnozGDYsJ
 +wElaFXZX2hlPc6ih1q54RcB+D4qswDC9QudArKc9hJEKPv+SsmiVhBG/f+X+Jca
 M19UJGWZvRtY8L+0yJdG22O9Hwo0zSK917gtOZNgkwtkkKgzjj2kcNHTK2jGBEuR
 pqEIQCreUH8Le0WR9cPeJeYc2/HeCEWDHrf/q+gFRClaZMe+0Sfu3z3pBH3v0T9q
 qoZ1VvCDUhggsZyJZebcPTCL+ghqOXFJajHLlcA6BSdGJ/iXq2M=
 =lqFu
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - nfs: fix NULL deference in nfs4_get_valid_delegation

  Bugfixes:
   - Fix corruption of the return value in cachefiles_read_or_alloc_pages()
   - Fix several fscache cookie issues
   - Fix a fscache queuing race that can trigger a BUG_ON
   - NFS: Fix two use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
   - SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
   - SUNRPC: Fix a race when tearing down the rpc client debugfs directory
   - SUNRPC: Signalled ASYNC tasks need to exit
   - NFSv3: fix rpc receive buffer size for MOUNT call"

* tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv3: fix rpc receive buffer size for MOUNT call
  SUNRPC: 'Directory with parent 'rpc_clnt' already present!'
  NFS/pnfs: Don't use RPC_TASK_CRED_NOREF with pnfs
  NFS: Don't use RPC_TASK_CRED_NOREF with delegreturn
  SUNRPC: Signalled ASYNC tasks need to exit
  nfs: fix NULL deference in nfs4_get_valid_delegation
  SUNRPC: fix use-after-free in rpc_free_client_work()
  cachefiles: Fix race between read_waiter and read_copier involving op->to_do
  NFSv4: Fix fscache cookie aux_data to ensure change_attr is included
  NFS: Fix fscache super_cookie allocation
  NFS: Fix fscache super_cookie index_key from changing after umount
  cachefiles: Fix corruption of the return value in cachefiles_read_or_alloc_pages()
2020-05-15 14:03:13 -07:00
J. Bruce Fields
933496e9cc SUNRPC: 'Directory with parent 'rpc_clnt' already present!'
Each rpc_client has a cl_clid which is allocated from a global ida, and
a debugfs directory which is named after cl_clid.

We're releasing the cl_clid before we free the debugfs directory named
after it.  As soon as the cl_clid is released, that value is available
for another newly created client.

That leaves a window where another client may attempt to create a new
debugfs directory with the same name as the not-yet-deleted debugfs
directory from the dying client.  Symptoms are log messages like

	Directory 4 with parent 'rpc_clnt' already present!

Fixes: 7c4310ff56 "SUNRPC: defer slow parts of rpc_free_client() to a workqueue."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-14 16:31:09 -04:00
Linus Torvalds
152036d137 Fixes:
- Resolve a data integrity problem with NFSD that I inadvertently
 introduced last year. The change I made makes the NFS server's
 duplicate reply cache ineffective when krb5i or krb5p are in use,
 thus allowing the replay of non-idempotent NFS requests such as
 RENAME, SETATTR, or even WRITEs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJerCuyAAoJEDNqszNvZn+XxvAQAJmUW5412OO7mkI2IW5PDP71
 ZnBAuTs4UpLBgp1VpS3ai0LYnOX9o8WLqolzGuFxGfK69ZZdh7U7fzX2aEytoTSP
 KkW3dNo+NzRppWOhMBEfMBLnAu22YF+F689RvwEqd0C1AgGugaFfzlF1ECrJVpA7
 g1WVhTi0ihfArhzSWTWO4LiuwjRd5TNF8gEci2j3DuHn1Hp6BagbKOv0rFdgK99X
 BbK8IaEalBUjtpGAPgRU/WY/WznzhgARVeOX7Rh/P/zFdFB1G1M4kycaadBk6uaU
 SHbdWBwDsYatDNuhZUI3Wv2g+DQ5LJRrjNNesLRot+kC3XD12sBCMsSI3owoz7Jt
 u0s48YmOJO8uWi4kDenR9XV8bAaDmX7R/+XGZm1lethNrpBKat9EIrqSHNvqAXZ4
 b3cC8/A/aCcOrWXtZnWqvJdqjx2EgL6DbcpaFheaPEekRofuiyOaAbXdlJQvzcwY
 Sv4EC4ymABpQRg0si+Sya5Int7bZ9ryLZTSCMiLA+L1TnoW26XjMlGAaRqYi7Tx7
 Qg4Bt400IIDE0FlE/76vE7b7YWQj7GfErA6moIyDio5AInRU9sHDFyB8iCfdpKxh
 ajNl1NuEO/FSoXOGQvOo1uHD0vKvNVK21T6vQsRCT1f6JXtpiwTn6eLX4Wn9YLdI
 iKqg2YXfdCbJnAuoxzGi
 =hT3x
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.7-rc-2' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull nfsd fixes from Chuck Lever:
 "Resolve a data integrity problem with NFSD that I inadvertently
  introduced last year.

  The change I made makes the NFS server's duplicate reply cache
  ineffective when krb5i or krb5p are in use, thus allowing the replay
  of non-idempotent NFS requests such as RENAME, SETATTR, or even
  WRITEs"

* tag 'nfsd-5.7-rc-2' of git://git.linux-nfs.org/projects/cel/cel-2.6:
  SUNRPC: Revert 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
  SUNRPC: Fix GSS privacy computation of auth->au_ralign
  SUNRPC: Add "@len" parameter to gss_unwrap()
2020-05-11 12:04:52 -07:00
Chuck Lever
ce99aa62e1 SUNRPC: Signalled ASYNC tasks need to exit
Ensure that signalled ASYNC rpc_tasks exit immediately instead of
spinning until a timeout (or forever).

To avoid checking for the signal flag on every scheduler iteration,
the check is instead introduced in the client's finite state
machine.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: ae67bd3821 ("SUNRPC: Fix up task signalling")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-11 14:06:50 -04:00
NeilBrown
31e9a7f353 SUNRPC: fix use-after-free in rpc_free_client_work()
Parts of rpc_free_client() were recently moved to
a separate rpc_free_clent_work().  This introduced
a use-after-free as rpc_clnt_remove_pipedir() calls
rpc_net_ns(), and that uses clnt->cl_xprt which has already
been freed.
So move the call to xprt_put() after the call to
rpc_clnt_remove_pipedir().

Reported-by: syzbot+22b5ef302c7c40d94ea8@syzkaller.appspotmail.com
Fixes: 7c4310ff56 ("SUNRPC: defer slow parts of rpc_free_client() to a workqueue.")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-10 19:44:56 -04:00
Linus Torvalds
29a47f456d NFS client bugfixes for Linux 5.7
Highlights include:
 
 Stable fixes
 - fix handling of backchannel binding in BIND_CONN_TO_SESSION
 
 Bugfixes
 - Fix a credential use-after-free issue in pnfs_roc()
 - Fix potential posix_acl refcnt leak in nfs3_set_acl
 - defer slow parts of rpc_free_client() to a workqueue
 - Fix an Oopsable race in __nfs_list_for_each_server()
 - Fix trace point use-after-free race
 - Regression: the RDMA client no longer responds to server disconnect requests
 - Fix return values of xdr_stream_encode_item_{present, absent}
 - _pnfs_return_layout() must always wait for layoutreturn completion
 
 Cleanups
 - Remove unreachable error conditions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6tczsACgkQZwvnipYK
 APKHWg//QGx2Tolj5dh2jBHa47A5/SYnJxCZAA0/fWdwRtFkW3HyyGne1jU86do2
 SMAVpBpri1WJPt5d3DH66gu4l4UxG1h84s7QP4lGfSa85EmtLh+LoZQCZRqYoDOo
 JAMzWctELu1TUpaa1N5Dhg/qMtMy6ulRMWgzTLqB9a/pQa3onugTK6W7xiut2prj
 PBfFq7N9XXmPboSeGV9bR4L8XKSbTCLEt3U1F2zAGU7UUINvDfpjEXq7BHYCewKL
 ObPW6EWZksyna16H8i/xGWoKgE4JFVjMwQAP7UdDBi+FW9RI6UpTBoR6z9N748j0
 jEocDbI21wgnwmtrVTbzsYm6ttHl4D4egoNxn7m5zjxTU4Ba/RQG2aaHUGFOYpJj
 1FI1f6V1Y5v4mJajdsEH+pGW/4vK/4YMR+7YHJ/hYU/WiXjLf7onIIifdWt4SQdo
 lvZbGcx6IAHYUA4lI7hkcvrK4bbqAnPLFq28nlUWEID5q5D+nA1ZR9iN0FToviDy
 FYyhQzyfD1kt98SV1DjWUqvDDd6IB64iDZTXGmtWvj6c2nbezGiFffvtzUL5LFxY
 QfI8lkpmUyt1EiWlZWhtOh4zsiM5yMZkJB/3RJv3RMmswizSSAHdgCKWhdLpX0bl
 TG1L8yEmcTc5ANS37EhlpcBNbfYw7oIF/OXuReTSRoMQl5hxjfY=
 =w0zk
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - fix handling of backchannel binding in BIND_CONN_TO_SESSION

  Bugfixes:
   - Fix a credential use-after-free issue in pnfs_roc()
   - Fix potential posix_acl refcnt leak in nfs3_set_acl
   - defer slow parts of rpc_free_client() to a workqueue
   - Fix an Oopsable race in __nfs_list_for_each_server()
   - Fix trace point use-after-free race
   - Regression: the RDMA client no longer responds to server disconnect
     requests
   - Fix return values of xdr_stream_encode_item_{present, absent}
   - _pnfs_return_layout() must always wait for layoutreturn completion

  Cleanups:
   - Remove unreachable error conditions"

* tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix a race in __nfs_list_for_each_server()
  NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION
  SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
  NFSv4: Remove unreachable error condition due to rpc_run_task()
  SUNRPC: Remove unreachable error condition
  xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
  xprtrdma: Fix trace point use-after-free race
  xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
  nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
  NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
  NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
2020-05-02 11:24:01 -07:00
Trond Myklebust
8e2912c7c6 NFSoRDMA Client Fixes for Linux 5.7
Bugfixes:
 - Restore wake-up-all to rpcrdma_cm_event_handler()
   - Otherwise the client won't respond to server disconnect requests
 - Fix tracepoint use-after-free race
 - Fix usage of xdr_stream_encode_item_{present, absent}
   - These functions return a size on success, and not 0
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl6jVawACgkQ18tUv7Cl
 QOvklhAAmg0MImljNfQ6FmMHQokxtLlOuw7P8HQsARrwxO+cJjSx1vriCI0EyE+T
 yB386Bd+tGUurDZf6k+TjctQlVNSfz7Pv2gJTj/rQEvZFM2AopOBjbovKnNOnOTj
 aPnNOkshZzxZk9JYh9RN3zmyZHtlldimnvLloUrO3nZ63iUGuF1nBkQ9TWuG/CSz
 XuE1bUJgsJkdG07+I1yLrn9MAOdgzuJB2TJ5hL2GjN0BIrJ3jzweXVYXrLFikZUD
 vRTyf1K/zcDOVKPn/Aw4NSPNeTBlsP0Ain9zY7C2cuEXasu1gbEXYmmq/qvrF6kt
 N9TIDqqDZgDRA4Z3nfCJZOXyNwZ3nyEx0+S+8lK58w6khXRYiD21bxrA1eSB3aIS
 LzqJIoto/2/cl8DjNbuyvB7znjlthxYPcOo6cCeJOQ7jPWi6IDDZ6t9mwnXhSU3C
 +GdUyulWJPfx9hj6HlDbrXy6ESlCIFYobHGKVqYu7ZvaIxVVSdlfAIR64PsD7edX
 m3NzNcH1PJNKmTFvyyQDuMgwxauVLaAAs2A0Vvnzt8Vq8uSxEiySVUXOjH4+DnNm
 NNfDXGq7yQj9NqAsqlCe+zkcBT7Wy7ds+y05tipWD92Lu2mvOE+3Lht5H5GnIyWl
 1Lq/eUmrBBlxYRK0ir/RRI3aBYvuO4AK/g81ZzGQZAiWAL319MI=
 =jre3
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-5.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

NFSoRDMA Client Fixes for Linux 5.7

Bugfixes:
- Restore wake-up-all to rpcrdma_cm_event_handler()
  - Otherwise the client won't respond to server disconnect requests
- Fix tracepoint use-after-free race
- Fix usage of xdr_stream_encode_item_{present, absent}
  - These functions return a size on success, and not 0

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-28 15:59:03 -04:00
NeilBrown
7c4310ff56 SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
The rpciod workqueue is on the write-out path for freeing dirty memory,
so it is important that it never block waiting for memory to be
allocated - this can lead to a deadlock.

rpc_execute() - which is often called by an rpciod work item - calls
rcp_task_release_client() which can lead to rpc_free_client().

rpc_free_client() makes two calls which could potentially block wating
for memory allocation.

rpc_clnt_debugfs_unregister() calls into debugfs and will block while
any of the debugfs files are being accessed.  In particular it can block
while any of the 'open' methods are being called and all of these use
malloc for one thing or another.  So this can deadlock if the memory
allocation waits for NFS to complete some writes via rpciod.

rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't
obvious that memory allocations can happen while the lock it held, it is
safer to assume they might and to not let rpciod call
rpc_clnt_remove_pipedir().

So this patch moves these two calls (together with the final kfree() and
rpciod_down()) into a work-item to be run from the system work-queue.
rpciod can continue its important work, and the final stages of the free
can happen whenever they happen.

I have seen this deadlock on a 4.12 based kernel where debugfs used
synchronize_srcu() when removing objects.  synchronize_srcu() requires a
workqueue and there were no free workther threads and none could be
allocated.  While debugsfs no longer uses SRCU, I believe the deadlock
is still possible.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-28 15:58:38 -04:00
Chuck Lever
0a8e7b7d08 SUNRPC: Revert 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
I've noticed that when krb5i or krb5p security is in use,
retransmitted requests are missing the server's duplicate reply
cache. The computed checksum on the retransmitted request does not
match the cached checksum, resulting in the server performing the
retransmitted request again instead of returning the cached reply.

The assumptions made when removing xdr_buf_trim() were not correct.
In the send paths, the upper layer has already set the segment
lengths correctly, and shorting the buffer's content is simply a
matter of reducing buf->len.

xdr_buf_trim() is the right answer in the receive/unwrap path on
both the client and the server. The buffer segment lengths have to
be shortened one-by-one.

On the server side in particular, head.iov_len needs to be updated
correctly to enable nfsd_cache_csum() to work correctly. The simple
buf->len computation doesn't do that, and that results in
checksumming stale data in the buffer.

The problem isn't noticed until there's significant instability of
the RPC transport. At that point, the reliability of retransmit
detection on the server becomes crucial.

Fixes: 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-27 10:58:30 -04:00
Chuck Lever
a7e429a6fa SUNRPC: Fix GSS privacy computation of auth->au_ralign
When the au_ralign field was added to gss_unwrap_resp_priv, the
wrong calculation was used. Setting au_rslack == au_ralign is
probably correct for kerberos_v1 privacy, but kerberos_v2 privacy
adds additional GSS data after the clear text RPC message.
au_ralign needs to be smaller than au_rslack in that fairly common
case.

When xdr_buf_trim() is restored to gss_unwrap_kerberos_v2(), it does
exactly what I feared it would: it trims off part of the clear text
RPC message. However, that's because rpc_prepare_reply_pages() does
not set up the rq_rcv_buf's tail correctly because au_ralign is too
large.

Fixing the au_ralign computation also corrects the alignment of
rq_rcv_buf->pages so that the client does not have to shift reply
data payloads after they are received.

Fixes: 35e77d21ba ("SUNRPC: Add rpc_auth::au_ralign field")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-27 10:58:30 -04:00
Chuck Lever
31c9590ae4 SUNRPC: Add "@len" parameter to gss_unwrap()
Refactor: This is a pre-requisite to fixing the client-side ralign
computation in gss_unwrap_resp_priv().

The length value is passed in explicitly rather that as the value
of buf->len. This will subsequently allow gss_unwrap_kerberos_v1()
to compute a slack and align value, instead of computing it in
gss_unwrap_resp_priv().

Fixes: 35e77d21ba ("SUNRPC: Add rpc_auth::au_ralign field")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-27 10:58:30 -04:00
Xiyu Yang
efe57fd58e SUNRPC: Remove unreachable error condition
rpc_clnt_test_and_add_xprt() invokes rpc_call_null_helper(), which
return the value of rpc_run_task() to "task". Since rpc_run_task() is
impossible to return an ERR pointer, there is no need to add the
IS_ERR() condition on "task" here. So we need to remove it.

Fixes: 7f55489058 ("SUNRPC: Allow addition of new transports to a struct rpc_clnt")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-21 19:11:59 -04:00
Chuck Lever
48a124e383 xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
These new helpers do not return 0 on success, they return the
encoded size. Thus they are not a drop-in replacement for the
old helpers.

Fixes: 5c266df527 ("SUNRPC: Add encoders for list item discriminators")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-20 10:45:01 -04:00
Chuck Lever
bdb2ce8281 xprtrdma: Fix trace point use-after-free race
It's not safe to use resources pointed to by the @send_wr of
ib_post_send() _after_ that function returns. Those resources are
typically freed by the Send completion handler, which can run before
ib_post_send() returns.

Thus the trace points currently around ib_post_send() in the
client's RPC/RDMA transport are a hazard, even when they are
disabled. Rearrange them so that they touch the Work Request only
_before_ ib_post_send() is invoked.

Fixes: ab03eff58e ("xprtrdma: Add trace points in RPC Call transmit paths")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-20 10:44:01 -04:00
Chuck Lever
58bd6656f8 xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
Commit e28ce90083 ("xprtrdma: kmalloc rpcrdma_ep separate from
rpcrdma_xprt") erroneously removed a xprt_force_disconnect()
call from the "transport disconnect" path. The result was that the
client no longer responded to server-side disconnect requests.

Restore that call.

Fixes: e28ce90083 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-20 10:41:00 -04:00
Chuck Lever
23cf1ee1f1 svcrdma: Fix leak of svc_rdma_recv_ctxt objects
Utilize the xpo_release_rqst transport method to ensure that each
rqstp's svc_rdma_recv_ctxt object is released even when the server
cannot return a Reply for that rqstp.

Without this fix, each RPC whose Reply cannot be sent leaks one
svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped
Receive buffer, and any pages that might be part of the Reply
message.

The leak is infrequent unless the network fabric is unreliable or
Kerberos is in use, as GSS sequence window overruns, which result
in connection loss, are more common on fast transports.

Fixes: 3a88092ee3 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17 12:40:38 -04:00
Chuck Lever
e28b4fc652 svcrdma: Fix trace point use-after-free race
I hit this while testing nfsd-5.7 with kernel memory debugging
enabled on my server:

Mar 30 13:21:45 klimt kernel: BUG: unable to handle page fault for address: ffff8887e6c279a8
Mar 30 13:21:45 klimt kernel: #PF: supervisor read access in kernel mode
Mar 30 13:21:45 klimt kernel: #PF: error_code(0x0000) - not-present page
Mar 30 13:21:45 klimt kernel: PGD 3601067 P4D 3601067 PUD 87c519067 PMD 87c3e2067 PTE 800ffff8193d8060
Mar 30 13:21:45 klimt kernel: Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
Mar 30 13:21:45 klimt kernel: CPU: 2 PID: 1933 Comm: nfsd Not tainted 5.6.0-rc6-00040-g881e87a3c6f9 #1591
Mar 30 13:21:45 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
Mar 30 13:21:45 klimt kernel: RIP: 0010:svc_rdma_post_chunk_ctxt+0xab/0x284 [rpcrdma]
Mar 30 13:21:45 klimt kernel: Code: c1 83 34 02 00 00 29 d0 85 c0 7e 72 48 8b bb a0 02 00 00 48 8d 54 24 08 4c 89 e6 48 8b 07 48 8b 40 20 e8 5a 5c 2b e1 41 89 c6 <8b> 45 20 89 44 24 04 8b 05 02 e9 01 00 85 c0 7e 33 e9 5e 01 00 00
Mar 30 13:21:45 klimt kernel: RSP: 0018:ffffc90000dfbdd8 EFLAGS: 00010286
Mar 30 13:21:45 klimt kernel: RAX: 0000000000000000 RBX: ffff8887db8db400 RCX: 0000000000000030
Mar 30 13:21:45 klimt kernel: RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000246
Mar 30 13:21:45 klimt kernel: RBP: ffff8887e6c27988 R08: 0000000000000000 R09: 0000000000000004
Mar 30 13:21:45 klimt kernel: R10: ffffc90000dfbdd8 R11: 00c068ef00000000 R12: ffff8887eb4e4a80
Mar 30 13:21:45 klimt kernel: R13: ffff8887db8db634 R14: 0000000000000000 R15: ffff8887fc931000
Mar 30 13:21:45 klimt kernel: FS:  0000000000000000(0000) GS:ffff88885bd00000(0000) knlGS:0000000000000000
Mar 30 13:21:45 klimt kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 CR3: 000000081b72e002 CR4: 00000000001606e0
Mar 30 13:21:45 klimt kernel: Call Trace:
Mar 30 13:21:45 klimt kernel: ? svc_rdma_vec_to_sg+0x7f/0x7f [rpcrdma]
Mar 30 13:21:45 klimt kernel: svc_rdma_send_write_chunk+0x59/0xce [rpcrdma]
Mar 30 13:21:45 klimt kernel: svc_rdma_sendto+0xf9/0x3ae [rpcrdma]
Mar 30 13:21:45 klimt kernel: ? nfsd_destroy+0x51/0x51 [nfsd]
Mar 30 13:21:45 klimt kernel: svc_send+0x105/0x1e3 [sunrpc]
Mar 30 13:21:45 klimt kernel: nfsd+0xf2/0x149 [nfsd]
Mar 30 13:21:45 klimt kernel: kthread+0xf6/0xfb
Mar 30 13:21:45 klimt kernel: ? kthread_queue_delayed_work+0x74/0x74
Mar 30 13:21:45 klimt kernel: ret_from_fork+0x3a/0x50
Mar 30 13:21:45 klimt kernel: Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue ib_umad ib_ipoib mlx4_ib sb_edac x86_pkg_temp_thermal iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd pcspkr rpcrdma i2c_i801 rdma_ucm lpc_ich mfd_core ib_iser rdma_cm iw_cm ib_cm mei_me raid0 libiscsi mei sg scsi_transport_iscsi ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c mlx4_en sd_mod sr_mod cdrom mlx4_core crc32c_intel igb nvme i2c_algo_bit ahci i2c_core libahci nvme_core dca libata t10_pi qedr dm_mirror dm_region_hash dm_log dm_mod dax qede qed crc8 ib_uverbs ib_core
Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8
Mar 30 13:21:45 klimt kernel: ---[ end trace 87971d2ad3429424 ]---

It's absolutely not safe to use resources pointed to by the @send_wr
argument of ib_post_send() _after_ that function returns. Those
resources are typically freed by the Send completion handler, which
can run before ib_post_send() returns.

Thus the trace points currently around ib_post_send() in the
server's RPC/RDMA transport are a hazard, even when they are
disabled. Rearrange them so that they touch the Work Request only
_before_ ib_post_send() is invoked.

Fixes: bd2abef333 ("svcrdma: Trace key RDMA API events")
Fixes: 4201c74647 ("svcrdma: Introduce svc_rdma_send_ctxt")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17 12:40:38 -04:00
Chuck Lever
6221f1d9b6 SUNRPC: Fix backchannel RPC soft lockups
Currently, after the forward channel connection goes away,
backchannel operations are causing soft lockups on the server
because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
Such backchannel Calls are aggressively retried until the client
reconnects.

Backchannel Calls should use RPC_TASK_NOCONNECT rather than
RPC_TASK_SOFTCONN. If there is no forward connection, the server is
not capable of establishing a connection back to the client, thus
that backchannel request should fail before the server attempts to
send it. Commit 58255a4e3c ("NFSD: NFSv4 callback client should
use RPC_TASK_SOFTCONN") was merged several years before
RPC_TASK_NOCONNECT was available.

Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
callback connection depends on the first callback RPC to initiate
a connection to the client. Thus NFSv4.0 needs to continue to use
RPC_TASK_SOFTCONN.

Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> # v4.20+
2020-04-17 12:40:31 -04:00
Yihao Wu
43e33924c3 SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purge
Deleting list entry within hlist_for_each_entry_safe is not safe unless
next pointer (tmp) is protected too. It's not, because once hash_lock
is released, cache_clean may delete the entry that tmp points to. Then
cache_purge can walk to a deleted entry and tries to double free it.

Fix this bug by holding only the deleted entry's reference.

Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: NeilBrown <neilb@suse.de>
[ cel: removed unused variable ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-13 10:28:21 -04:00
Linus Torvalds
04de788e61 NFS client updates for Linux 5.7
Highlights include:
 
 Stable fixes:
 - Fix a page leak in nfs_destroy_unlinked_subrequests()
 - Fix use-after-free issues in nfs_pageio_add_request()
 - Fix new mount code constant_table array definitions
 - finish_automount() requires us to hold 2 refs to the mount record
 
 Features:
 - Improve the accuracy of telldir/seekdir by using 64-bit cookies when
   possible.
 - Allow one RDMA active connection and several zombie connections to
   prevent blocking if the remote server is unresponsive.
 - Limit the size of the NFS access cache by default
 - Reduce the number of references to credentials that are taken by NFS
 - pNFS files and flexfiles drivers now support per-layout segment
   COMMIT lists.
 - Enable partial-file layout segments in the pNFS/flexfiles driver.
 - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
 - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
   the DS using the layouterror mechanism.
 
 Bugfixes and cleanups:
 - SUNRPC: Fix krb5p regressions
 - Don't specify NFS version in "UDP not supported" error
 - nfsroot: set tcp as the default transport protocol
 - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
 - alloc_nfs_open_context() must use the file cred when available
 - Fix locking when dereferencing the delegation cred
 - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
 - Various clean ups of the NFS O_DIRECT commit code
 - Clean up RDMA connect/disconnect
 - Replace zero-length arrays with C99-style flexible arrays
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6LhhsACgkQZwvnipYK
 APIOJxAAiQOgmIg1CV4mrlcVhkwy09N5JAia6AENtoTmwm08nAYg5Y8REb9uX46a
 /MJsM2WG8hBCgI6eYmRY8LTr4Ft9rTQEJM9DRMuwQREXwMWwBhUv/QakCeqY1lHE
 lyB1z4hj5XKeUoN/OcfALC/GXFFf56A0UyN05nMzeCkBTdd3+qu+hW8Ge1wkAXcr
 f0pyLbzdFZlJuTmI4tr8F93g9p3ezuFBuEroT7XPIVJylAdZVumHqnOnz/Mvb99x
 rNTsX2dc44GhSAfRnTzPumU3MT6BOLvUzNH1xzdiqKzJrbOnG8WjFodrGr3JWpfp
 HkeyYQxJ+Hnfb2LiZBjvMQE8M7kVMZ1jVbrGJEbCxfSqgTly8lOHboqAeKsFaReK
 LStnusizdA1LHQVZxPdvn+oL49RDxnzm9dY+DkrXK1qT0GE+icN1CyTyLLfkSCp8
 tYvZSJ/qPk5BNZegqH1nBqXkMDkOJ4eEA7+luXDmajRkdRrZ3IWY2M1DpMEoueJ2
 j/zoj/NFr1oErU4o7PV9oolA1Euhn1L3wIDuzsbVtjySmbXJNQTtaVVRFpGw3SsZ
 7rbqi4BB0SzOooNhQ4q8mLNi4qT7bl/3D04eL8UVzEM73plexhQ8XiOEz/VrIRX7
 L9viXH49g4DHQ0rZIaWefxFueqpgbNvQwnlLZl2uQotG9hwhTts=
 =YUcP
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - Fix a page leak in nfs_destroy_unlinked_subrequests()

   - Fix use-after-free issues in nfs_pageio_add_request()

   - Fix new mount code constant_table array definitions

   - finish_automount() requires us to hold 2 refs to the mount record

  Features:
   - Improve the accuracy of telldir/seekdir by using 64-bit cookies
     when possible.

   - Allow one RDMA active connection and several zombie connections to
     prevent blocking if the remote server is unresponsive.

   - Limit the size of the NFS access cache by default

   - Reduce the number of references to credentials that are taken by
     NFS

   - pNFS files and flexfiles drivers now support per-layout segment
     COMMIT lists.

   - Enable partial-file layout segments in the pNFS/flexfiles driver.

   - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type

   - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
     the DS using the layouterror mechanism.

  Bugfixes and cleanups:
   - SUNRPC: Fix krb5p regressions

   - Don't specify NFS version in "UDP not supported" error

   - nfsroot: set tcp as the default transport protocol

   - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()

   - alloc_nfs_open_context() must use the file cred when available

   - Fix locking when dereferencing the delegation cred

   - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails

   - Various clean ups of the NFS O_DIRECT commit code

   - Clean up RDMA connect/disconnect

   - Replace zero-length arrays with C99-style flexible arrays"

* tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits)
  NFS: Clean up process of marking inode stale.
  SUNRPC: Don't start a timer on an already queued rpc task
  NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()
  NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode()
  NFS: Beware when dereferencing the delegation cred
  NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout
  NFS: finish_automount() requires us to hold 2 refs to the mount record
  NFS: Fix a few constant_table array definitions
  NFS: Try to join page groups before an O_DIRECT retransmission
  NFS: Refactor nfs_lock_and_join_requests()
  NFS: Reverse the submission order of requests in __nfs_pageio_add_request()
  NFS: Clean up nfs_lock_and_join_requests()
  NFS: Remove the redundant function nfs_pgio_has_mirroring()
  NFS: Fix memory leaks in nfs_pageio_stop_mirroring()
  NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()
  NFS: Fix use-after-free issues in nfs_pageio_add_request()
  NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
  NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()
  NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio()
  pNFS/flexfiles: Specify the layout segment range in LAYOUTGET
  ...
2020-04-07 13:51:39 -07:00
Trond Myklebust
1fab7dc477 SUNRPC: Don't start a timer on an already queued rpc task
Move the test for whether a task is already queued to prevent
corruption of the timer list in __rpc_sleep_on_priority_timeout().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-04 19:59:27 -04:00
Trond Myklebust
f764a1e1cb NFSoRDMA Client Updates for Linux 5.7
New Features:
 - Allow one active connection and several zombie connections to prevent
   blocking if the remote server is unresponsive.
 
 Bugfixes and Cleanups:
 - Enhance MR-related trace points
 - Refactor connection set-up and disconnect functions
 - Make Protection Domains per-connection instead of per-transport
 - Merge struct rpcrdma_ia into rpcrdma_ep
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5+GKEACgkQ18tUv7Cl
 QOvJVBAA4MNSvzo0856X414uUMe1YK1HXblqYbNWSe0bRsfaakp8mscMFmd4/IuG
 YSOlQPYcTErdfeEelS1ayMZA3qy5Ke62hKFFfv0fk11lN3VoT4XPiIqY6Ry2OJaR
 7XoYyH3YKJxgqOYhQwSo4rvFRNtPKbvEtcQn8uLXnMclomlZ1mli4F9rYs23SvLL
 Jbq1q/SKx0w+GAOUj418dYLjMnQU7L+TotplTBGzE7mkhBR0JarBiKZfLxbg2/md
 Fva1EwY4Tgef+4ofVr5KP8raGVGl6zy8TIZezlijeUv4rmk9JudrYjeUNqyCWufw
 ffkrM+wPr8kQViQoei5ByEvVG2pQK9nfxccx6rxiIOoG8Mb1aK8Wh75e2kjfZZeY
 bsEx8wSfRvI54ukk/NfGpP21OP1DDKU4ET6hsW+b6onqmDfYPZ/LinUvrvIHietQ
 IsMOG+QE+G1mXFiQR1P9dZcIwTNvE/SyUoaykbeslcYCkQt5yEk25UK+38TornbV
 hBxrm9KBd25RUu8prFJadmLBjPx3tLLd451xxppCyrIlpw0b+Q4yHlF7Jr75DYBT
 xjW5Xpr6tJZuoybbSyT9B/tiV4gIfyJZCU5S95nFlluKZ0gQd2jIGj/JagNWppke
 XSEqRR/Ja0OnB7yDz60gQrrZnjw/Em2jx7xbHIXFK+rs45dkdFc=
 =piIM
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-5.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

NFSoRDMA Client Updates for Linux 5.7

New Features:
- Allow one active connection and several zombie connections to prevent
  blocking if the remote server is unresponsive.

Bugfixes and Cleanups:
- Enhance MR-related trace points
- Refactor connection set-up and disconnect functions
- Make Protection Domains per-connection instead of per-transport
- Merge struct rpcrdma_ia into rpcrdma_ep
2020-03-28 12:01:17 -04:00
Chuck Lever
1a33d8a284 svcrdma: Fix leak of transport addresses
Kernel memory leak detected:

unreferenced object 0xffff888849cdf480 (size 8):
  comm "kworker/u8:3", pid 2086, jiffies 4297898756 (age 4269.856s)
  hex dump (first 8 bytes):
    30 00 cd 49 88 88 ff ff                          0..I....
  backtrace:
    [<00000000acfc370b>] __kmalloc_track_caller+0x137/0x183
    [<00000000a2724354>] kstrdup+0x2b/0x43
    [<0000000082964f84>] xprt_rdma_format_addresses+0x114/0x17d [rpcrdma]
    [<00000000dfa6ed00>] xprt_setup_rdma_bc+0xc0/0x10c [rpcrdma]
    [<0000000073051a83>] xprt_create_transport+0x3f/0x1a0 [sunrpc]
    [<0000000053531a8e>] rpc_create+0x118/0x1cd [sunrpc]
    [<000000003a51b5f8>] setup_callback_client+0x1a5/0x27d [nfsd]
    [<000000001bd410af>] nfsd4_process_cb_update.isra.7+0x16c/0x1ac [nfsd]
    [<000000007f4bbd56>] nfsd4_run_cb_work+0x4c/0xbd [nfsd]
    [<0000000055c5586b>] process_one_work+0x1b2/0x2fe
    [<00000000b1e3e8ef>] worker_thread+0x1a6/0x25a
    [<000000005205fb78>] kthread+0xf6/0xfb
    [<000000006d2dc057>] ret_from_fork+0x3a/0x50

Introduce a call to xprt_rdma_free_addresses() similar to the way
that the TCP backchannel releases a transport's peer address
strings.

Fixes: 5d252f90a8 ("svcrdma: Add class for RDMA backwards direction transport")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-27 12:25:06 -04:00
Christophe JAILLET
b25b60d7bf SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
'maxlen' is the total size of the destination buffer. There is only one
caller and this value is 256.

When we compute the size already used and what we would like to add in
the buffer, the trailling NULL character is not taken into account.
However, this trailling character will be added by the 'strcat' once we
have checked that we have enough place.

So, there is a off-by-one issue and 1 byte of the stack could be
erroneously overwridden.

Take into account the trailling NULL, when checking if there is enough
place in the destination buffer.


While at it, also replace a 'sprintf' by a safer 'snprintf', check for
output truncation and avoid a superfluous 'strlen'.

Fixes: dc9a16e49d ("svc: Add /proc/sys/sunrpc/transport files")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
[ cel: very minor fix to documenting comment
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-27 12:22:57 -04:00
Chuck Lever
e28ce90083 xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt
Change the rpcrdma_xprt_disconnect() function so that it no longer
waits for the DISCONNECTED event.  This prevents blocking if the
remote is unresponsive.

In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is
detached. Upon return from rpcrdma_xprt_disconnect(), the transport
(r_xprt) is ready immediately for a new connection.

The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now
handled almost identically.

However, because the lifetimes of rpcrdma_xprt structures and
rpcrdma_ep structures are now independent, creating an rpcrdma_ep
needs to take a module ref count. The ep now owns most of the
hardware resources for a transport.

Also, a kref is needed to ensure that rpcrdma_ep sticks around
long enough for the cm_event_handler to finish.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
745b734c9b xprtrdma: Extract sockaddr from struct rdma_cm_id
rpcrdma_cm_event_handler() is always passed an @id pointer that is
valid. However, in a subsequent patch, we won't be able to extract
an r_xprt in every case. So instead of using the r_xprt's
presentation address strings, extract them from struct rdma_cm_id.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
93aa8e0a9d xprtrdma: Merge struct rpcrdma_ia into struct rpcrdma_ep
I eventually want to allocate rpcrdma_ep separately from struct
rpcrdma_xprt so that on occasion there can be more than one ep per
xprt.

The new struct rpcrdma_ep will contain all the fields currently in
rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings
for the connection, in addition to per-connection settings
negotiated with the remote.

Take this opportunity to rename the existing ep fields from rep_* to
re_* to disambiguate these from struct rpcrdma_rep.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
d6ccebf956 xprtrdma: Disconnect on flushed completion
Completion errors after a disconnect often occur much sooner than a
CM_DISCONNECT event. Use this to try to detect connection loss more
quickly.

Note that other kernel ULPs do take care to disconnect explicitly
when a WR is flushed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
897b7be9bc xprtrdma: Remove rpcrdma_ia::ri_flags
Clean up:
The upper layer serializes calls to xprt_rdma_close, so there is no
need for an atomic bit operation, saving 8 bytes in rpcrdma_ia.

This enables merging rpcrdma_ia_remove directly into the disconnect
logic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
81fe0c57f4 xprtrdma: Invoke rpcrdma_ia_open in the connect worker
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now
responsible for allocating all per-connection hardware resources.

With this clean-up, all three arms of the switch statement in
rpcrdma_ep_connect are exactly the same now, thus the switch can be
removed.

Because device removal behaves a little differently than
disconnection, there is a little more work to be done before
rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So
it is not quite symmetrical with rpcrdma_ep_create() yet.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
9ba373ee24 xprtrdma: Allocate Protection Domain in rpcrdma_ep_create()
Make a Protection Domain (PD) a per-connection resource rather than
a per-transport resource. In other words, when the connection
terminates, the PD is destroyed.

Thus there is one less HW resource that remains allocated to a
transport after a connection is closed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
9144a803df xprtrdma: Refactor rpcrdma_ep_connect() and rpcrdma_ep_disconnect()
Clean up: Simplify the synopses of functions in the connect and
disconnect paths in preparation for combining the rpcrdma_ia and
struct rpcrdma_ep structures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
97d0de8812 xprtrdma: Clean up the post_send path
Clean up: Simplify the synopses of functions in the post_send path
by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
253a51622f xprtrdma: Refactor frwr_init_mr()
Clean up: prepare for combining the rpcrdma_ia and rpcrdma_ep
structures. Take the opportunity to rename the function to be
consistent with the "subsystem _ object _ verb" naming scheme.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
85cd8e2b78 xprtrdma: Invoke rpcrdma_ep_create() in the connect worker
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and
rpcrdma_ep_destroy().

rpcrdma_ep_create will be invoked at connect time instead of at
transport set-up time. It will be responsible for allocating per-
connection resources. In this patch it allocates the CQs and
creates a QP. More to come.

rpcrdma_ep_destroy() is the inverse functionality that is
invoked at disconnect time. It will be responsible for releasing
the CQs and QP.

These changes should be safe to do because both connect and
disconnect is guaranteed to be serialized by the transport send
lock.

This takes us another step closer to resolving the address and route
only at connect time so that connection failover to another device
will work correctly.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
62a89501a3 xprtrdma: Enhance MR-related trace points
Two changes:
- Show the number of SG entries that were mapped. This helps debug
  DMA-related problems.
- Record the MR's resource ID instead of its memory address. This
  groups each MR with its associated rdma-tool output, and reduces
  needless exposure of memory addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Olga Kornievskaia
df513a7711 SUNRPC: fix krb5p mount to provide large enough buffer in rq_rcvsize
Ever since commit 2c94b8eca1 ("SUNRPC: Use au_rslack when computing
reply buffer size"). It changed how "req->rq_rcvsize" is calculated. It
used to use au_cslack value which was nice and large and changed it to
au_rslack value which turns out to be too small.

Since 5.1, v3 mount with sec=krb5p fails against an Ontap server
because client's receive buffer it too small.

For gss krb5p, we need to account for the mic token in the verifier,
and the wrap token in the wrap token.

RFC 4121 defines:
mic token
Octet no   Name        Description
         --------------------------------------------------------------
         0..1     TOK_ID     Identification field.  Tokens emitted by
                             GSS_GetMIC() contain the hex value 04 04
                             expressed in big-endian order in this
                             field.
         2        Flags      Attributes field, as described in section
                             4.2.2.
         3..7     Filler     Contains five octets of hex value FF.
         8..15    SND_SEQ    Sequence number field in clear text,
                             expressed in big-endian order.
         16..last SGN_CKSUM  Checksum of the "to-be-signed" data and
                             octet 0..15, as described in section 4.2.4.

that's 16bytes (GSS_KRB5_TOK_HDR_LEN) + chksum

wrap token
Octet no   Name        Description
         --------------------------------------------------------------
          0..1     TOK_ID    Identification field.  Tokens emitted by
                             GSS_Wrap() contain the hex value 05 04
                             expressed in big-endian order in this
                             field.
          2        Flags     Attributes field, as described in section
                             4.2.2.
          3        Filler    Contains the hex value FF.
          4..5     EC        Contains the "extra count" field, in big-
                             endian order as described in section 4.2.3.
          6..7     RRC       Contains the "right rotation count" in big-
                             endian order, as described in section
                             4.2.5.
          8..15    SND_SEQ   Sequence number field in clear text,
                             expressed in big-endian order.
          16..last Data      Encrypted data for Wrap tokens with
                             confidentiality, or plaintext data followed
                             by the checksum for Wrap tokens without
                             confidentiality, as described in section
                             4.2.4.

Also 16bytes of header (GSS_KRB5_TOK_HDR_LEN), encrypted data, and cksum
(other things like padding)

RFC 3961 defines known cksum sizes:
Checksum type              sumtype        checksum         section or
                                value            size         reference
   ---------------------------------------------------------------------
   CRC32                            1               4           6.1.3
   rsa-md4                          2              16           6.1.2
   rsa-md4-des                      3              24           6.2.5
   des-mac                          4              16           6.2.7
   des-mac-k                        5               8           6.2.8
   rsa-md4-des-k                    6              16           6.2.6
   rsa-md5                          7              16           6.1.1
   rsa-md5-des                      8              24           6.2.4
   rsa-md5-des3                     9              24             ??
   sha1 (unkeyed)                  10              20             ??
   hmac-sha1-des3-kd               12              20            6.3
   hmac-sha1-des3                  13              20             ??
   sha1 (unkeyed)                  14              20             ??
   hmac-sha1-96-aes128             15              20         [KRB5-AES]
   hmac-sha1-96-aes256             16              20         [KRB5-AES]
   [reserved]                  0x8003               ?         [GSS-KRB5]

Linux kernel now mainly supports type 15,16 so max cksum size is 20bytes.
(GSS_KRB5_MAX_CKSUM_LEN)

Re-use already existing define of GSS_KRB5_MAX_SLACK_NEEDED that's used
for encoding the gss_wrap tokens (same tokens are used in reply).

Fixes: 2c94b8eca1 ("SUNRPC: Use au_rslack when computing reply buffer size")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-26 10:51:01 -04:00
Trond Myklebust
78a947f50a sunrpc: Add tracing for cache events
Add basic tracing for debugging the sunrpc cache events.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:34 -04:00
Trond Myklebust
277f27e2f2 SUNRPC/cache: Allow garbage collection of invalid cache entries
If the cache entry never gets initialised, we want the garbage
collector to be able to evict it. Otherwise if the upcall daemon
fails to initialise the entry, we end up never expiring it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[ cel: resolved a merge conflict ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:34 -04:00
Trond Myklebust
65286b883c nfsd: export upcalls must not return ESTALE when mountd is down
If the rpc.mountd daemon goes down, then that should not cause all
exports to start failing with ESTALE errors. Let's explicitly
distinguish between the cache upcall cases that need to time out,
and those that do not.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
da1661b93b SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends
xprt_sock_sendmsg uses the more efficient iov_iter-enabled kernel
socket API, and is a pre-requisite for server send-side support for
TLS.

Note that svc_process no longer needs to reserve a word for the
stream record marker, since the TCP transport now provides the
record marker automatically in a separate buffer.

The dprintk() in svc_send_common is also removed. It didn't seem
crucial for field troubleshooting. If more is needed there, a trace
point could be added in xprt_sock_sendmsg().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
9e55eef4ab SUNRPC: Refactor xs_sendpages()
Re-locate xs_sendpages() so that it can be shared with server code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
0dabe948f2 svcrdma: Avoid DMA mapping small RPC Replies
On some platforms, DMA mapping part of a page is more costly than
copying bytes. Indeed, not involving the I/O MMU can help the
RPC/RDMA transport scale better for tiny I/Os across more RDMA
devices. This is because interaction with the I/O MMU is eliminated
for each of these small I/Os. Without the explicit unmapping, the
NIC no longer needs to do a costly internal TLB shoot down for
buffers that are just a handful of bytes.

Since pull-up is now a more a frequent operation, I've introduced a
trace point in the pull-up path. It can be used for debugging or
user-space tools that count pull-up frequency.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
aee4b74a3f svcrdma: Fix double sync of transport header buffer
Performance optimization: Avoid syncing the transport buffer twice
when Reply buffer pull-up is necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
6fd5034db4 svcrdma: Refactor chunk list encoders
Same idea as the receive-side changes I did a while back: use
xdr_stream helpers rather than open-coding the XDR chunk list
encoders. This builds the Reply transport header from beginning to
end without backtracking.

As additional clean-ups, fill in documenting comments for the XDR
encoders and sprinkle some trace points in the new encoding
functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
5c266df527 SUNRPC: Add encoders for list item discriminators
Clean up. These are taken from the client-side RPC/RDMA transport
to a more global header file so they can be used elsewhere.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
a406c563e8 svcrdma: Rename svcrdma_encode trace points in send routines
These trace points are misnamed:

	trace_svcrdma_encode_wseg
	trace_svcrdma_encode_write
	trace_svcrdma_encode_reply
	trace_svcrdma_encode_rseg
	trace_svcrdma_encode_read
	trace_svcrdma_encode_pzr

Because they actually trace posting on the Send Queue. Let's rename
them so that I can add trace points in the chunk list encoders that
actually do trace chunk list encoding events.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
db9602e404 svcrdma: Update synopsis of svc_rdma_send_reply_msg()
Preparing for subsequent patches, no behavior change expected.

Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into those lower-level functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
4554755ed8 svcrdma: Update synopsis of svc_rdma_map_reply_msg()
Preparing for subsequent patches, no behavior change expected.

Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into those lower-level functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
6fa5785e78 svcrdma: Update synopsis of svc_rdma_send_reply_chunk()
Preparing for subsequent patches, no behavior change expected.

Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into the lower-level send
functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00