Commit Graph

365 Commits

Author SHA1 Message Date
Jeff Layton
d67fa4d85a sunrpc: turn swapper_enable/disable functions into rpc_xprt_ops
RDMA xprts don't have a sock_xprt, but an rdma_xprt, so the
xs_swapper_enable/disable functions will likely oops when fed an RDMA
xprt. Turn these functions into rpc_xprt_ops so that that doesn't
occur. For now the RDMA versions are no-ops that just return -EINVAL
on an attempt to swapon.

Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10 18:26:26 -04:00
Chuck Lever
ffe1f0df58 rpcrdma: Merge svcrdma and xprtrdma modules into one
Bi-directional RPC support means code in svcrdma.ko invokes a bit of
code in xprtrdma.ko, and vice versa. To avoid loader/linker loops,
merge the server and client side modules together into a single
module.

When backchannel capabilities are added, the combined module will
register all needed transport capabilities so that Upper Layer
consumers automatically have everything needed to create a
bi-directional transport connection.

Module aliases are added for backwards compatibility with user
space, which still may expect svcrdma.ko or xprtrdma.ko to be
present.

This commit reverts commit 2e8c12e1b7 ("xprtrdma: add separate
Kconfig options for NFSoRDMA client and server support") and
provides a single CONFIG option for enabling the new module.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-04 16:56:02 -04:00
Chuck Lever
0380a3f375 svcrdma: Add a separate "max data segs macro for svcrdma
The server and client maximum are architecturally independent.
Allow changing one without affecting the other.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-04 16:56:01 -04:00
Chuck Lever
b7e0b9a965 svcrdma: Replace GFP_KERNEL in a loop with GFP_NOFAIL
At the 2015 LSF/MM, it was requested that memory allocation
call sites that request GFP_KERNEL allocations in a loop should be
annotated with __GFP_NOFAIL.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-04 16:56:00 -04:00
Chuck Lever
30b7e246a6 svcrdma: Keep rpcrdma_msg fields in network byte-order
Fields in struct rpcrdma_msg are __be32. Don't byte-swap these
fields when decoding RPC calls and then swap them back for the
reply. For the most part, they can be left alone.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-04 16:55:59 -04:00
Chuck Lever
70747c25a7 svcrdma: Fix byte-swapping in svc_rdma_sendto.c
In send_write_chunks(), we have:

	for (xdr_off = rqstp->rq_res.head[0].iov_len, chunk_no = 0;
	     xfer_len && chunk_no < arg_ary->wc_nchunks;
	     chunk_no++) {
		 . . .
	}

Note that arg_ary->wc_nchunk is in network byte-order. For the
comparison to work correctly, both have to be in native byte-order.

In send_reply_chunks, we have:

	write_len = min(xfer_len, htonl(ch->rs_length));

xfer_len is in native byte-order, and ch->rs_length is in
network byte-order. be32_to_cpu() is the correct byte swap
for ch->rs_length.

As an additional clean up, replace ntohl() with be32_to_cpu() in
a few other places.

This appears to address a problem with large rsize hangs while
using PHYSICAL memory registration. I suspect that is the only
registration mode that uses more than one chunk element.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=248
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-04 16:55:58 -04:00
Chuck Lever
da7049f834 svcrdma: Remove svc_rdma_xdr_decode_deferred_req()
svc_rdma_xdr_decode_deferred_req() indexes an array with an
un-byte-swapped value off the wire. Fortunately this function
isn't used anywhere, so simply remove it.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-06-03 15:15:23 -04:00
Doug Ledford
175e8efe69 Merge branches 'bart-srp', 'generic-errors', 'ira-cleanups' and 'mwang-v8' into k.o/for-4.2 2015-05-20 16:12:40 -04:00
Ira Weiny
5d9fb04406 IB/core: Change rdma_protocol_iboe to roce
After discussion upstream, it was agreed to transition the usage of iboe
in the kernel to roce.  This keeps our terminology consistent with what
was finalized in the IBTA Annex 16 and IBTA Annex 17 publications.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-20 15:58:19 -04:00
Sagi Grimberg
76357c715f xprtrdma, svcrdma: Switch to generic logging helpers
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <anna.schumaker@netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-18 13:44:23 -04:00
Michael Wang
bc0f1d7153 IB/Verbs: Use management helper rdma_cap_read_multi_sge()
Introduce helper rdma_cap_read_multi_sge() to help us check if the port of an
IB device support RDMA Read Multiple Scatter-Gather Entries.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-18 13:35:05 -04:00
Michael Wang
3de2c31ce7 IB/Verbs: Reform IB-ulp xprtrdma
Use raw management helpers to reform IB-ulp xprtrdma.

Signed-off-by: Michael Wang <yun.wang@profitbricks.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Sean Hefty <sean.hefty@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-05-18 13:35:04 -04:00
Linus Torvalds
59953fba87 NFS client updates for Linux 4.1
Highlights include:
 
 Stable patches:
 - Fix a regression in /proc/self/mountstats
 - Fix the pNFS flexfiles O_DIRECT support
 - Fix high load average due to callback thread sleeping
 
 Bugfixes:
 - Various patches to fix the pNFS layoutcommit support
 - Do not cache pNFS deviceids unless server notifications are enabled
 - Fix a SUNRPC transport reconnection regression
 - make debugfs file creation failure non-fatal in SUNRPC
 - Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints
 - Fix locking around NFSv4.2 fallocate() support
 - Truncating NFSv4 file opens should also sync O_DIRECT writes
 - Prevent infinite loop in rpcrdma_ep_create()
 
 Features:
 - Various improvements to the RDMA transport code's handling of memory
   registration
 - Various code cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVOmT6AAoJEGcL54qWCgDyrhYQAMPKXB55jrdOR/7UVSF/xPML
 7OjMGHvBnTn/y0pNIyLyS1PjTZZsD/WZjoW9EFGpTv727qQNVoFxFRLNUcgi3NoL
 1YledCkLf7Q32aqod93SRRFPc9hzBoKhOZpOzBuWaAviyAB3KLi70DWAq9qRReYM
 prXUQQjpW5FLU+B2ifaVc2RCnu/rZ2c02YdR2XdtkBaAJxuhB2vR8IY1evwjCv3R
 5zyLDd9zSDDoArdpUzM97cxZPcYRSqbOwgTKvaaRnDDq/mKbKMZaqmEfjblwzNFt
 b43FbveJzZ3hlPADIpmaiMHjRTbxWjIKc9K1sOF2FPfcuPe2yM3DMAxDegUkEveS
 7fkbv/qRZ30NqfchGanX/pmBlLOcdI76qe/bwhN19wCnw48O1eeHi1HK8rWGhU+E
 qcrRZ3ZS2ufP/MVBuhauy0qU9Q4wcEtm7NGGP1231ZtmfjHKyBa4pLirNfG1AlJt
 dK7tBrknVx+WVm/UddJp/fEsxbP0+fki6TwzioHUSWcz8rDVYF6PFT/QPM54SX2h
 0oqwvu6d/uShpkVRm+fbje8FHmUxKdgqDsCYX2fNjWskh1oXSPsItvjqmTmTlE0i
 EBmBwVwI0uB1ZQ3PrJLadhRcO3ZJmLQ5gNj456dstvWy6UQds1xyIQ/DgvmlzxWO
 E9t0l18xHGRwbndsDa8f
 =j5dP
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Another set of mainly bugfixes and a couple of cleanups.  No new
  functionality in this round.

  Highlights include:

  Stable patches:
   - Fix a regression in /proc/self/mountstats
   - Fix the pNFS flexfiles O_DIRECT support
   - Fix high load average due to callback thread sleeping

  Bugfixes:
   - Various patches to fix the pNFS layoutcommit support
   - Do not cache pNFS deviceids unless server notifications are enabled
   - Fix a SUNRPC transport reconnection regression
   - make debugfs file creation failure non-fatal in SUNRPC
   - Another fix for circular directory warnings on NFSv4 "junctioned"
     mountpoints
   - Fix locking around NFSv4.2 fallocate() support
   - Truncating NFSv4 file opens should also sync O_DIRECT writes
   - Prevent infinite loop in rpcrdma_ep_create()

  Features:
   - Various improvements to the RDMA transport code's handling of
     memory registration
   - Various code cleanups"

* tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
  fs/nfs: fix new compiler warning about boolean in switch
  nfs: Remove unneeded casts in nfs
  NFS: Don't attempt to decode missing directory entries
  Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
  NFS: Rename idmap.c to nfs4idmap.c
  NFS: Move nfs_idmap.h into fs/nfs/
  NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
  NFS: Add a stub for GETDEVICELIST
  nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
  nfs: fix DIO good bytes calculation
  nfs: Fetch MOUNTED_ON_FILEID when updating an inode
  sunrpc: make debugfs file creation failure non-fatal
  nfs: fix high load average due to callback thread sleeping
  NFS: Reduce time spent holding the i_mutex during fallocate()
  NFS: Don't zap caches on fallocate()
  xprtrdma: Make rpcrdma_{un}map_one() into inline functions
  xprtrdma: Handle non-SEND completions via a callout
  xprtrdma: Add "open" memreg op
  xprtrdma: Add "destroy MRs" memreg op
  xprtrdma: Add "reset MRs" memreg op
  ...
2015-04-26 17:33:59 -07:00
Linus Torvalds
d0bbe0dd35 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "Usual trivial tree updates.  Nothing outstanding -- mostly printk()
  and comment fixes and unused identifier removals"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  goldfish: goldfish_tty_probe() is not using 'i' any more
  powerpc: Fix comment in smu.h
  qla2xxx: Fix printks in ql_log message
  lib: correct link to the original source for div64_u64
  si2168, tda10071, m88ds3103: Fix firmware wording
  usb: storage: Fix printk in isd200_log_config()
  qla2xxx: Fix printk in qla25xx_setup_mode
  init/main: fix reset_device comment
  ipwireless: missing assignment
  goldfish: remove unreachable line of code
  coredump: Fix do_coredump() comment
  stacktrace.h: remove duplicate declaration task_struct
  smpboot.h: Remove unused function prototype
  treewide: Fix typo in printk messages
  treewide: Fix typo in printk messages
  mod_devicetable: fix comment for match_flags
2015-04-14 09:50:27 -07:00
Chuck Lever
d654788e98 xprtrdma: Make rpcrdma_{un}map_one() into inline functions
These functions are called in a loop for each page transferred via
RDMA READ or WRITE. Extract loop invariants and inline them to
reduce CPU overhead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:53 -04:00
Chuck Lever
e46ac34c3c xprtrdma: Handle non-SEND completions via a callout
Allow each memory registration mode to plug in a callout that handles
the completion of a memory registration operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:53 -04:00
Chuck Lever
3968cb5850 xprtrdma: Add "open" memreg op
The open op determines the size of various transport data structures
based on device capabilities and memory registration mode.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:53 -04:00
Chuck Lever
4561f347d4 xprtrdma: Add "destroy MRs" memreg op
Memory Region objects associated with a transport instance are
destroyed before the instance is shutdown and destroyed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:53 -04:00
Chuck Lever
31a701a947 xprtrdma: Add "reset MRs" memreg op
This method is invoked when a transport instance is about to be
reconnected. Each Memory Region object is reset to its initial
state.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:53 -04:00
Chuck Lever
91e70e70e4 xprtrdma: Add "init MRs" memreg op
This method is used when setting up a new transport instance to
create a pool of Memory Region objects that will be used to register
memory during operation.

Memory Regions are not needed for "physical" registration, since
->prepare and ->release are no-ops for that mode.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Chuck Lever
6814baead8 xprtrdma: Add a "deregister_external" op for each memreg mode
There is very little common processing among the different external
memory deregistration functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Chuck Lever
9c1b4d775f xprtrdma: Add a "register_external" op for each memreg mode
There is very little common processing among the different external
memory registration functions. Have rpcrdma_create_chunks() call
the registration method directly. This removes a stack frame and a
switch statement from the external registration path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Chuck Lever
1c9351ee0e xprtrdma: Add a "max_payload" op for each memreg mode
The max_payload computation is generalized to ensure that the
payload maximum is the lesser of RPC_MAX_DATA_SEGS and the number of
data segments that can be transmitted in an inline buffer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Chuck Lever
a0ce85f595 xprtrdma: Add vector of ops for each memory registration strategy
Instead of employing switch() statements, let's use the typical
Linux kernel idiom for handling behavioral variation: virtual
functions.

Start by defining a vector of operations for each supported memory
registration mode, and by adding a source file for each mode.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Chuck Lever
41f9702896 xprtrdma: Prevent infinite loop in rpcrdma_ep_create()
If a provider advertizes a zero max_fast_reg_page_list_len, FRWR
depth detection loops forever. Instead of just failing the mount,
try other memory registration modes.

Fixes: 0fc6c4e7bb ("xprtrdma: mind the device's max fast . . .")
Reported-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Chuck Lever
805272406a xprtrdma: Byte-align FRWR registration
The RPC/RDMA transport's FRWR registration logic registers whole
pages. This means areas in the first and last pages that are not
involved in the RDMA I/O are needlessly exposed to the server.

Buffered I/O is typically page-aligned, so not a problem there. But
for direct I/O, which can be byte-aligned, and for reply chunks,
which are nearly always smaller than a page, the transport could
expose memory outside the I/O buffer.

FRWR allows byte-aligned memory registration, so let's use it as
it was intended.

Reported-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Chuck Lever
e23779451e xprtrdma: Perform a full marshal on retransmit
Commit 6ab59945f2 ("xprtrdma: Update rkeys after transport
reconnect" added logic in the ->send_request path to update the
chunk list when an RPC/RDMA request is retransmitted.

Note that rpc_xdr_encode() resets and re-encodes the entire RPC
send buffer for each retransmit of an RPC. The RPC send buffer
is not preserved from the previous transmission of an RPC.

Revert 6ab59945f2, and instead, just force each request to be
fully marshaled every time through ->send_request. This should
preserve the fix from 6ab59945f2, while also performing pullup
during retransmits.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Chuck Lever
0dd39cae26 xprtrdma: Display IPv6 addresses and port numbers correctly
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com>
Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com>
Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-03-31 09:52:52 -04:00
Masanari Iida
d939be3add treewide: Fix typo in printk messages
This patch fix spelling typo in printk messages.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-03-06 23:05:39 +01:00
Trond Myklebust
7a27eae362 NFS: RDMA Client Sparse Fix #2
This patch fixes another sparse fix found by Dan Carpenter's tool.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJU66HvAAoJENfLVL+wpUDrJdYP/Ap2xiMpEmO6Zl4h/MiDJKYs
 GIL3o+Ayc+wzlga2qBuNMfFBqzochCL2xqgS/9bvA0z/OT3ilq+FKZRr8FuDsmb+
 5XtIS49Kvo2mjWNpnh8GMoCbBoQfkBMJLETtR0KhhrMBNX5sidBPi2cSeZJxqnAw
 9LIc3Od9jyZSLh2je3gJCKxh90CVWgjs9OfJookfksNyVMnwK+TRWjo1FInEfkW6
 2fMFrC20xBg0qWWgtZL73ib8ojaIwv5MxnyFNgsiv/xcDWShZhdaMizBlqjDiNUq
 co9S1skJ9RTgInoDLeWvKoKkc7JjkLTpSzqsZx+Kke75p/D41Sah215jp0yyz2tb
 fEi+lZ8q1f5iorjamw5UDQPAuNcPx/eBzQNmkFGbue2xDP/kOXErb5AtnfMpDqav
 FXX/C3Vf6BxeAVB+8VifcdKEchE6cN9St6SBUludIisdmQWLLjSxazTAG7MXZdYW
 X0dxJ7D2hcwGEbjYKdNOn6a8+N3m1++XiJdPc1KSN71EO6z0xqjQM+3PytXCXaGI
 t3dKCDoEM/Ee1+4VxVa02OjW62ApNDd8APrCrn+8zMglG0eFwPmocNx/U42WanXd
 bGnEI9qd6mvn22eJrrOc55gShaDjovhXRmvvc2E8tHKVnaMXdfG8o5CjtB6LPXbW
 6deK+m1/LxkzFhRvMySs
 =ooZn
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-4.0-3' of git://git.linux-nfs.org/projects/anna/nfs-rdma

NFS: RDMA Client Sparse Fix #2

This patch fixes another sparse fix found by Dan Carpenter's tool.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

* tag 'nfs-rdma-for-4.0-3' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  xprtrdma: Store RDMA credits in unsigned variables
2015-02-23 19:16:43 -05:00
Chuck Lever
9b1dcbc8cf xprtrdma: Store RDMA credits in unsigned variables
Dan Carpenter's static checker pointed out:

   net/sunrpc/xprtrdma/rpc_rdma.c:879 rpcrdma_reply_handler()
   warn: can 'credits' be negative?

"credits" is defined as an int. The credits value comes from the
server as a 32-bit unsigned integer.

A malicious or broken server can plant a large unsigned integer in
that field which would result in an underflow in the following
logic, potentially triggering a deadlock of the mount point by
blocking the client from issuing more RPC requests.

net/sunrpc/xprtrdma/rpc_rdma.c:

  876          credits = be32_to_cpu(headerp->rm_credit);
  877          if (credits == 0)
  878                  credits = 1;    /* don't deadlock */
  879          else if (credits > r_xprt->rx_buf.rb_max_requests)
  880                  credits = r_xprt->rx_buf.rb_max_requests;
  881
  882          cwnd = xprt->cwnd;
  883          xprt->cwnd = credits << RPC_CWNDSHIFT;
  884          if (xprt->cwnd > cwnd)
  885                  xprt_release_rqst_cong(rqst->rq_task);

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: eba8ff660b ("xprtrdma: Move credit update to RPC . . .")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-02-23 16:54:04 -05:00
Linus Torvalds
61845143fe Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "The main change is the pNFS block server support from Christoph, which
  allows an NFS client connected to shared disk to do block IO to the
  shared disk in place of NFS reads and writes.  This also requires xfs
  patches, which should arrive soon through the xfs tree, barring
  unexpected problems.  Support for other filesystems is also possible
  if there's interest.

  Thanks also to Chuck Lever for continuing work to get NFS/RDMA into
  shape"

* 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits)
  nfsd: default NFSv4.2 to on
  nfsd: pNFS block layout driver
  exportfs: add methods for block layout exports
  nfsd: add trace events
  nfsd: update documentation for pNFS support
  nfsd: implement pNFS layout recalls
  nfsd: implement pNFS operations
  nfsd: make find_any_file available outside nfs4state.c
  nfsd: make find/get/put file available outside nfs4state.c
  nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c
  nfsd: add fh_fsid_match helper
  nfsd: move nfsd_fh_match to nfsfh.h
  fs: add FL_LAYOUT lease type
  fs: track fl_owner for leases
  nfs: add LAYOUT_TYPE_MAX enum value
  nfsd: factor out a helper to decode nfstime4 values
  sunrpc/lockd: fix references to the BKL
  nfsd: fix year-2038 nfs4 state problem
  svcrdma: Handle additional inline content
  svcrdma: Move read list XDR round-up logic
  ...
2015-02-12 10:39:41 -08:00
Chuck Lever
b625a61698 xprtrdma: Address sparse complaint in rpcr_to_rdmar()
With "make ARCH=x86_64 allmodconfig make C=1 CF=-D__CHECK_ENDIAN__":

linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: warning: incorrect
  type in initializer (different base types)
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30: expected restricted
  __be32 [usertype] *buffer
linux-2.6/net/sunrpc/xprtrdma/xprt_rdma.h:273:30:    got unsigned int
  [usertype] *rq_buffer

As far as I can tell this is a false positive.

Reported-by: kbuild-all@01.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-02-05 15:38:29 -05:00
Chuck Lever
a0a1d50cd1 xprtrdma: Update the GFP flags used in xprt_rdma_allocate()
Reflect the more conservative approach used in the socket transport's
version of this transport method. An RPC buffer allocation should
avoid forcing not just FS activity, but any I/O.

In particular, two recent changes missed updating xprtrdma:

 - Commit c6c8fe79a8 ("net, sunrpc: suppress allocation warning ...")
 - Commit a564b8f039 ("nfs: enable swap on NFS")

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 12:18:48 -05:00
Chuck Lever
df515ca7b3 xprtrdma: Clean up after adding regbuf management
rpcrdma_{de}register_internal() are used only in verbs.c now.

MAX_RPCRDMAHDR is no longer used and can be removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
c05fbb5a59 xprtrdma: Allocate zero pad separately from rpcrdma_buffer
Use the new rpcrdma_alloc_regbuf() API to shrink the amount of
contiguous memory needed for a buffer pool by moving the zero
pad buffer into a regbuf.

This is for consistency with the other uses of internally
registered memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
6b1184cd4f xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep
The rr_base field is currently the buffer where RPC replies land.

An RPC/RDMA reply header lands in this buffer. In some cases an RPC
reply header also lands in this buffer, just after the RPC/RDMA
header.

The inline threshold is an agreed-on size limit for RDMA SEND
operations that pass from server and client. The sum of the
RPC/RDMA reply header size and the RPC reply header size must be
less than this threshold.

The largest RDMA RECV that the client should have to handle is the
size of the inline threshold. The receive buffer should thus be the
size of the inline threshold, and not related to RPCRDMA_MAX_SEGS.

RPC replies received via RDMA WRITE (long replies) are caught in
rq_rcv_buf, which is the second half of the RPC send buffer. Ie,
such replies are not involved in any way with rr_base.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
85275c874e xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req
The rl_base field is currently the buffer where each RPC/RDMA call
header is built.

The inline threshold is an agreed-on size limit to for RDMA SEND
operations that pass between client and server. The sum of the
RPC/RDMA header size and the RPC header size must be less than or
equal to this threshold.

Increasing the r/wsize maximum will require MAX_SEGS to grow
significantly, but the inline threshold size won't change (both
sides agree on it). The server's inline threshold doesn't change.

Since an RPC/RDMA header can never be larger than the inline
threshold, make all RPC/RDMA header buffers the size of the
inline threshold.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
0ca77dc372 xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req
Because internal memory registration is an expensive and synchronous
operation, xprtrdma pre-registers send and receive buffers at mount
time, and then re-uses them for each RPC.

A "hardway" allocation is a memory allocation and registration that
replaces a send buffer during the processing of an RPC. Hardway must
be done if the RPC send buffer is too small to accommodate an RPC's
call and reply headers.

For xprtrdma, each RPC send buffer is currently part of struct
rpcrdma_req so that xprt_rdma_free(), which is passed nothing but
the address of an RPC send buffer, can find its matching struct
rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof.

That means that hardway currently has to replace a whole rpcrmda_req
when it replaces an RPC send buffer. This is often a fairly hefty
chunk of contiguous memory due to the size of the rl_segments array
and the fact that both the send and receive buffers are part of
struct rpcrdma_req.

Some obscure re-use of fields in rpcrdma_req is done so that
xprt_rdma_free() can detect replaced rpcrdma_req structs, and
restore the original.

This commit breaks apart the RPC send buffer and struct rpcrdma_req
so that increasing the size of the rl_segments array does not change
the alignment of each RPC send buffer. (Increasing rl_segments is
needed to bump up the maximum r/wsize for NFS/RDMA).

This change opens up some interesting possibilities for improving
the design of xprt_rdma_allocate().

xprt_rdma_allocate() is now the one place where RPC send buffers
are allocated or re-allocated, and they are now always left in place
by xprt_rdma_free().

A large re-allocation that includes both the rl_segments array and
the RPC send buffer is no longer needed. Send buffer re-allocation
becomes quite rare. Good send buffer alignment is guaranteed no
matter what the size of the rl_segments array is.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
9128c3e794 xprtrdma: Add struct rpcrdma_regbuf and helpers
There are several spots that allocate a buffer via kmalloc (usually
contiguously with another data structure) and then register that
buffer internally. I'd like to split the buffers out of these data
structures to allow the data structures to scale.

Start by adding functions that can kmalloc and register a buffer,
and can manage/preserve the buffer's associated ib_sge and ib_mr
fields.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:49 -05:00
Chuck Lever
1392402c40 xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()
Move the details of how to create and destroy rpcrdma_req and
rpcrdma_rep structures into helper functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
ac920d04a7 xprtrdma: Simplify synopsis of rpcrdma_buffer_create()
Clean up: There is one call site for rpcrdma_buffer_create(). All of
the arguments there are fields of an rpcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
ce1ab9ab47 xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stack
Reduce stack footprint of the connection upcall handler function.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
7bc7972cdd xprtrdma: Take struct ib_device_attr off the stack
Device attributes are large, and are used in more than one place.
Stash a copy in dynamically allocated memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
5ae711a246 xprtrdma: Free the pd if ib_query_qp() fails
If ib_query_qp() fails or the memory registration mode isn't
supported, don't leak the PD. An orphaned IB/core resource will
cause IB module removal to hang.

Fixes: bd7ed1d133 ("RPC/RDMA: check selected memory registration ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
afadc468eb xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt
Clean up: The rep_func field always refers to rpcrdma_conn_func().
rep_func should have been removed by commit b45ccfd25d ("xprtrdma:
Remove MEMWINDOWS registration modes").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
eba8ff660b xprtrdma: Move credit update to RPC reply handler
Reduce work in the receive CQ handler, which can be run at hardware
interrupt level, by moving the RPC/RDMA credit update logic to the
RPC reply handler.

This has some additional benefits: More header sanity checking is
done before trusting the incoming credit value, and the receive CQ
handler no longer touches the RPC/RDMA header (the CPU stalls while
waiting for the header contents to be brought into the cache).

This further extends work begun by commit e7ce710a88 ("xprtrdma:
Avoid deadlock when credit window is reset").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
3eb3581066 xprtrdma: Remove rl_mr field, and the mr_chunk union
Clean up: Since commit 0ac531c183 ("xprtrdma: Remove REGISTER
memory registration mode"), the rl_mr pointer is no longer used
anywhere.

After removal, there's only a single member of the mr_chunk union,
so mr_chunk can be removed as well, in favor of a single pointer
field.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
5d410ba061 xprtrdma: Remove rpcrdma_ep::rep_ia
Clean up: This field is not used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
5abefb861f xprtrdma: Rename "xprt" and "rdma_connect" fields in struct rpcrdma_xprt
Clean up: Use consistent field names in struct rpcrdma_xprt.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
f2846481b4 xprtrdma: Clean up hdrlen
Clean up: Replace naked integers with a documenting macro.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
052151a979 xprtrdma: Display XIDs in host byte order
xprtsock.c and the backchannel code display XIDs in host byte order.
Follow suit in xprtrdma.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
284f4902a6 xprtrdma: Modernize htonl and ntohl
Clean up: Replace htonl and ntohl with the be32 equivalents.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:48 -05:00
Chuck Lever
8502427ccd xprtrdma: human-readable completion status
Make it easier to grep the system log for specific error conditions.

The wc.opcode field is not included because opcode numbers are
sparse, and because wc.opcode is not necessarily valid when
completion reports an error.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2015-01-30 10:47:47 -05:00
Chuck Lever
a97c331f9a svcrdma: Handle additional inline content
Most NFS RPCs place their large payload argument at the end of the
RPC header (eg, NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
sends the complete RPC header inline, and the payload argument in
the read list. Data in the read list is the last part of the XDR
stream.

One important case is not like this, however. NFSv4 COMPOUND is a
counted array of operations. A WRITE operation, with its large data
payload, can appear in the middle of the compound's operations
array. Thus NFSv4 WRITE compounds can have header content after the
WRITE payload.

The Linux client, for example, performs an NFSv4 WRITE like this:

  { PUTFH, WRITE, GETATTR }

Though RFC 5667 is not precise about this, the proper way to convey
this compound is to place the GETATTR inline, _after_ the front of
the RPC header. The receiver inserts the read list payload into the
XDR stream after the initial WRITE arguments, and before the GETATTR
operation, thanks to the value of the read list "position" field.

The Linux client currently sends the GETATTR at the end of the
RPC/RDMA read list, which is incorrect. It will be corrected in the
future.

The Linux server currently rejects NFSv4 compounds with inline
content after the read list. For the above NFSv4 WRITE compound, the
NFS compound header indicates there are three operations, but the
server finds nonsense when it looks in the XDR stream for the third
operation, and the compound fails with OP_ILLEGAL.

Move trailing inline content to the end of the XDR buffer's page
list. This presents incoming NFSv4 WRITE compounds to NFSD in the
same way the socket transport does.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:49 -05:00
Chuck Lever
fcbeced5b4 svcrdma: Move read list XDR round-up logic
This is a pre-requisite for a subsequent patch.

Read list XDR round-up needs to be done _before_ additional inline
content is copied to the end of the XDR buffer's page list. Move
the logic added by commit e560e3b510 ("svcrdma: Add zero padding
if the client doesn't send it").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:48 -05:00
Chuck Lever
0b056c224b svcrdma: Support RDMA_NOMSG requests
Currently the Linux server can not decode RDMA_NOMSG type requests.
Operations whose length exceeds the fixed size of RDMA SEND buffers,
like large NFSv4 CREATE(NF4LNK) operations, must be conveyed via
RDMA_NOMSG.

For an RDMA_MSG type request, the client sends the RPC/RDMA, RPC
headers, and some or all of the NFS arguments via RDMA SEND.

For an RDMA_NOMSG type request, the client sends just the RPC/RDMA
header via RDMA SEND. The request's read list contains elements for
the entire RPC message, including the RPC header.

NFSD expects the RPC/RMDA header and RPC header to be contiguous in
page zero of the XDR buffer. Add logic in the RDMA READ path to make
the read list contents land where the server prefers, when the
incoming message is a type RDMA_NOMSG message.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:47 -05:00
Chuck Lever
61edbcb7c7 svcrdma: rc_position sanity checking
An RPC/RDMA client may send large RPC arguments via a read
list. This is a list of scatter/gather elements which convey
RPC call arguments too large to fit in a small RDMA SEND.

Each entry in the read list has a "position" field, whose value is
the byte offset in the XDR stream where the data in that entry is to
be inserted. Entries which share the same "position" value make up
the same RPC argument. The receiver inserts entries with the same
position field value in list order into the XDR stream.

Currently the Linux NFS/RDMA server cannot handle receiving read
chunks in more than one position, mostly because no current client
sends read lists with elements in more than one position. As a
sanity check, ensure that all received chunks have the same
"rc_position."

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:47 -05:00
Chuck Lever
e54524111f svcrdma: Plant reader function in struct svcxprt_rdma
The RDMA reader function doesn't change once an svcxprt_rdma is
instantiated. Instead of checking sc_devcap during every incoming
RPC, set the reader function once when the connection is accepted.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:46 -05:00
Chuck Lever
e5523bd281 svcrdma: Find rmsgp more reliably
xdr_start() can return the wrong rmsgp address if an assumption
about how the xdr_buf was constructed changes.  When it gets it
wrong, the client receives a reply that has gibberish in the
RPC/RDMA header, preventing it from matching a waiting RPC request.

Instead, make (and document) just one assumption: that the RDMA
header for the client's RPC call is at the start of the first page
in rq_pages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:45 -05:00
Chuck Lever
3fe04ee9f9 svcrdma: Scrub BUG_ON() and WARN_ON() call sites
Current convention is to avoid using BUG_ON() in places where an
oops could cause complete system failure.

Replace BUG_ON() call sites in svcrdma with an assertion error
message and allow execution to continue safely.

Some BUG_ON() calls are removed because they have never fired in
production (that we are aware of).

Some WARN_ON() calls are also replaced where a back trace is not
helpful; e.g., in a workqueue task.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:45 -05:00
Chuck Lever
2397aa8b51 svcrdma: Clean up read chunk counting
The byte_count argument is not used, and the function is called
only from one place.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:44 -05:00
Chuck Lever
83f2bedfc6 svcrdma: Remove unused variable
Nit: remove an unused variable to squelch a compiler warning.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:43 -05:00
Chuck Lever
597561bf6a svcrdma: Clean up dprintk
Nit: Fix inconsistent white space in dprintk messages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-01-15 15:01:43 -05:00
Trond Myklebust
ea5264138d NFS: Client side changes for RDMA
These patches various bugfixes and cleanups for using NFS over RDMA, including
 better error handling and performance improvements by using pad optimization.
 
 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJUdPv2AAoJENfLVL+wpUDrJkMQAKjtPZHLMcj+eHm4f1ZKLJxy
 GSrUZV21TU9tL0NVE/5An8US6hoLwHpNXsW8o+gHTAeGRyiCmIaNXGd1Ql/4PYRH
 zfzdNXoaJAh1N5iXX11fF3gOWqx/SolqzO2xLDVETK/3lAvq0VwMYoMElBQB6qQW
 8sN3z8yVuz/9Ia9oGIFhqu1B6dcKPHkQDMtmsElGxeEX/+9yEg4HUKx+kZDtV0Uj
 8/JM8Jh1FKRCQT/P6INkRItdY5KaSJGFc43BkC/8lbugfxa5XCyu/m/qMr9FJsDV
 nM6rwaiVcmR/mvD3fL82+Jg/M+P9VUHQ1/Az0sV9G+fEoHH/1Mey3LfMzNpUmf9v
 bykrPRuzXkPPQgN1VnjSaF2RF+CWwV9Nme1VVXM/zj8gHX1mcmQF/wPRxDuLjCrt
 EObAFsvHOwDTZZmYp9bG5kc6IvwvT8aeeVQMJ4q4PSGD3w8AtoIyJDn+Ee0LFD1K
 Zw0oZpTJpI4t7DVxGBSdo2wZWuMU/UKqGqGtGJ+ljXfTRuuq968Q5j5ujaA9vf0v
 C9igYTU8hq4teMzhZrfR1jtTWoSS+5zamb1KtvAZy8gsht2PQVgE9xka2k/AV8uE
 ul/w5HU4OV+QIrHNbiu7BE8B2Ags6smpdHMqn9fqLBwvG+JEwbWqk1zeTsajxzq+
 hkvKkkMq6JjDbsDf96Yk
 =YIru
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next

Pull NFS client RDMA changes for 3.19 from Anna Schumaker:
 "NFS: Client side changes for RDMA

  These patches various bugfixes and cleanups for using NFS over RDMA, including
  better error handling and performance improvements by using pad optimization.

  Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>"

* tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma:
  xprtrdma: Display async errors
  xprtrdma: Enable pad optimization
  xprtrdma: Re-write rpcrdma_flush_cqs()
  xprtrdma: Refactor tasklet scheduling
  xprtrdma: unmap all FMRs during transport disconnect
  xprtrdma: Cap req_cqinit
  xprtrdma: Return an errno from rpcrdma_register_external()
2014-11-26 17:37:13 -05:00
Chuck Lever
7ff11de1ba xprtrdma: Display async errors
An async error upcall is a hard error, and should be reported in
the system log.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 13:39:20 -05:00
Chuck Lever
d5440e27d3 xprtrdma: Enable pad optimization
The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
pad optimization was enabled. That bug was fixed by commit
e560e3b510 ("svcrdma: Add zero padding if the client doesn't send
it").

We can now enable pad optimization on the client, which helps
performance and is supported now by both Linux and Solaris servers.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 13:39:20 -05:00
Chuck Lever
5c166bef4f xprtrdma: Re-write rpcrdma_flush_cqs()
Currently rpcrdma_flush_cqs() attempts to avoid code duplication,
and simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall.

1. rpcrdma_flush_cqs() can run concurrently with provider upcalls.
   Both flush_cqs() and the upcalls were invoking ib_poll_cq() in
   different threads using the same wc buffers (ep->rep_recv_wcs
   and ep->rep_send_wcs), added by commit 1c00dd0776 ("xprtrmda:
   Reduce calls to ib_poll_cq() in completion handlers").

   During transport disconnect processing, this sometimes resulted
   in the same reply getting added to the rpcrdma_tasklets_g list
   more than once, which corrupted the list.

2. The upcall functions drain only a limited number of CQEs,
   thanks to the poll budget added by commit 8301a2c047
   ("xprtrdma: Limit work done by completion handler").

Fixes: a7bc211ac9 ("xprtrdma: On disconnect, don't ignore ... ")
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=276
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 13:39:20 -05:00
Chuck Lever
f1a03b76fe xprtrdma: Refactor tasklet scheduling
Restore the separate function that schedules the reply handling
tasklet. I need to call it from two different paths.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 13:39:20 -05:00
Chuck Lever
467c9674bc xprtrdma: unmap all FMRs during transport disconnect
When using RPCRDMA_MTHCAFMR memory registration, after a few
transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to
return EINVAL because the provider has exhausted its map pool.

Make sure that all FMRs are unmapped during transport disconnect,
and that ->send_request remarshals them during an RPC retransmit.
This resets the transport's MRs to ensure that none are leaked
during a disconnect.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 13:39:20 -05:00
Chuck Lever
e7104a2a96 xprtrdma: Cap req_cqinit
Recent work made FRMR registration and invalidation completions
unsignaled. This greatly reduces the adapter interrupt rate.

Every so often, however, a posted send Work Request is allowed to
signal. Otherwise, the provider's Work Queue will wrap and the
workload will hang.

The number of Work Requests that are allowed to remain unsignaled is
determined by the value of req_cqinit. Currently, this is set to the
size of the send Work Queue divided by two, minus 1.

For FRMR, the send Work Queue is the maximum number of concurrent
RPCs (currently 32) times the maximum number of Work Requests an
RPC might use (currently 7, though some adapters may need more).

For mlx4, this is 224 entries. This leaves completion signaling
disabled for 111 send Work Requests.

Some providers hold back dispatching Work Requests until a CQE is
generated.  If completions are disabled, then no CQEs are generated
for quite some time, and that can stall the Work Queue.

I've seen this occur running xfstests generic/113 over NFSv4, where
eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM
because the Work Queue has overflowed. The connection is dropped
and re-established.

Cap the rep_cqinit setting so completions are not left turned off
for too long.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 13:39:20 -05:00
Chuck Lever
92b98361f1 xprtrdma: Return an errno from rpcrdma_register_external()
The RPC/RDMA send_request method and the chunk registration code
expects an errno from the registration function. This allows
the upper layers to distinguish between a recoverable failure
(for example, temporary memory exhaustion) and a hard failure
(for example, a bug in the registration logic).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-11-25 13:39:20 -05:00
Jeff Layton
f895b252d4 sunrpc: eliminate RPC_DEBUG
It's always set to whatever CONFIG_SUNRPC_DEBUG is, so just use that.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24 17:31:46 -05:00
Linus Torvalds
6dea0737bc Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "Highlights:

   - support the NFSv4.2 SEEK operation (allowing clients to support
     SEEK_HOLE/SEEK_DATA), thanks to Anna.
   - end the grace period early in a number of cases, mitigating a
     long-standing annoyance, thanks to Jeff
   - improve SMP scalability, thanks to Trond"

* 'for-3.18' of git://linux-nfs.org/~bfields/linux: (55 commits)
  nfsd: eliminate "to_delegation" define
  NFSD: Implement SEEK
  NFSD: Add generic v4.2 infrastructure
  svcrdma: advertise the correct max payload
  nfsd: introduce nfsd4_callback_ops
  nfsd: split nfsd4_callback initialization and use
  nfsd: introduce a generic nfsd4_cb
  nfsd: remove nfsd4_callback.cb_op
  nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence
  nfsd: fix nfsd4_cb_recall_done error handling
  nfsd4: clarify how grace period ends
  nfsd4: stop grace_time update at end of grace period
  nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients
  nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls
  nfsd: serialize nfsdcltrack upcalls for a particular client
  nfsd: pass extra info in env vars to upcalls to allow for early grace period end
  nfsd: add a v4_end_grace file to /proc/fs/nfsd
  lockd: add a /proc/fs/lockd/nlm_end_grace file
  nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE
  nfsd: remove redundant boot_time parm from grace_done client tracking op
  ...
2014-10-08 12:51:44 -04:00
Steve Wise
7e5be28827 svcrdma: advertise the correct max payload
Svcrdma currently advertises 1MB, which is too large.  The correct value
is the minimum of RPCSVC_MAXPAYLOAD and the max scatter-gather allowed
in an NFSRDMA IO chunk * the host page size. This bug is usually benign
because the Linux X64 NFSRDMA client correctly limits the payload size to
the correct value (64*4096 = 256KB).  But if the Linux client is PPC64
with a 64KB page size, then the client will indeed use a payload size
that will overflow the server.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-29 14:35:18 -04:00
NeilBrown
1aff525629 NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
 - it doesn't hurt for kswapd to block occasionally.  If it doesn't
   want to block it would clear __GFP_WAIT.  The current_is_kswapd()
   was only added to avoid deadlocks and we have a new approach for
   that.
 - memory allocation in the SUNRPC layer can very rarely try to
   ->releasepage() a page it is trying to handle.  The deadlock
   is removed as nfs_release_page() doesn't block indefinitely.

So we don't need to set PF_FSTRANS for sunrpc network operations any
more.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:47 -04:00
Linus Torvalds
06b8ab5528 NFS client updates for Linux 3.17
Highlights include:
 
 - Stable fix for a bug in nfs3_list_one_acl()
 - Speed up NFS path walks by supporting LOOKUP_RCU
 - More read/write code cleanups
 - pNFS fixes for layout return on close
 - Fixes for the RCU handling in the rpcsec_gss code
 - More NFS/RDMA fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT65zoAAoJEGcL54qWCgDyvq8QAJ+OKuC5dpngrZ13i4ZJIcK1
 TJSkWCr44FhYPlrmkLCntsGX6C0376oFEtJ5uqloqK0+/QtvwRNVSQMKaJopKIVY
 mR4En0WwpigxVQdW2lgto6bfOhzMVO+llVdmicEVrU8eeSThATxGNv7rxRzWorvL
 RX3TwBkWSc0kLtPi66VRFQ1z+gg5I0kngyyhsKnLOaHHtpTYP2JDZlRPRkokXPUg
 nmNedmC3JrFFkarroFIfYr54Qit2GW/eI2zVhOwHGCb45j4b2wntZ6wr7LpUdv3A
 OGDBzw59cTpcx3Hij9CFvLYVV9IJJHBNd2MJqdQRtgWFfs+aTkZdk4uilUJCIzZh
 f4BujQAlm/4X1HbPxsSvkCRKga7mesGM7e0sBDPHC1vu0mSaY1cakcj2kQLTpbQ7
 gqa1cR3pZ+4shCq37cLwWU0w1yElYe1c4otjSCttPCrAjXbXJZSFzYnHm8DwKROR
 t+yEDRL5BIXPu1nEtSnD2+xTQ3vUIYXooZWEmqLKgRtBTtPmgSn9Vd8P1OQXmMNo
 VJyFXyjNx5WH06Wbc/jLzQ1/cyhuPmJWWyWMJlVROyv+FXk9DJUFBZuTkpMrIPcF
 NlBXLV1GnA7PzMD9Xt9bwqteERZl6fOUDJLWS9P74kTk5c2kD+m+GaqC/rBTKKXc
 ivr2s7aIDV48jhnwBSVL
 =KE07
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - stable fix for a bug in nfs3_list_one_acl()
   - speed up NFS path walks by supporting LOOKUP_RCU
   - more read/write code cleanups
   - pNFS fixes for layout return on close
   - fixes for the RCU handling in the rpcsec_gss code
   - more NFS/RDMA fixes"

* tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
  nfs: reject changes to resvport and sharecache during remount
  NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error
  SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred
  NFS: fix two problems in lookup_revalidate in RCU-walk
  NFS: allow lockless access to access_cache
  NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU
  NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU
  NFS: support RCU_WALK in nfs_permission()
  sunrpc/auth: allow lockless (rcu) lookup of credential cache.
  NFS: prepare for RCU-walk support but pushing tests later in code.
  NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used.
  NFS: add checks for returned value of try_module_get()
  nfs: clear_request_commit while holding i_lock
  pnfs: add pnfs_put_lseg_async
  pnfs: find swapped pages on pnfs commit lists too
  nfs: fix comment and add warn_on for PG_INODE_REF
  nfs: check wait_on_bit_lock err in page_group_lock
  sunrpc: remove "ec" argument from encrypt_v2 operation
  sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c
  sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c
  ...
2014-08-13 18:13:19 -06:00
Steve Wise
d1e458fe67 svcrdma: remove rdma_create_qp() failure recovery logic
In svc_rdma_accept(), if rdma_create_qp() fails, there is useless
logic to try and call rdma_create_qp() again with reduced sge depths.
The assumption, I guess, was that perhaps the initial sge depths
chosen were too big.  However they initial depths are selected based
on the rdma device attribute max_sge returned from ib_query_device().
If rdma_create_qp() fails, it would not be because the max_send_sge and
max_recv_sge values passed in exceed the device's max.  So just remove
this code.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 16:09:21 -04:00
Chuck Lever
8079fb785e xprtrdma: Handle additional connection events
Commit 38ca83a5 added RDMA_CM_EVENT_TIMEWAIT_EXIT. But that status
is relevant only for consumers that re-use their QPs on new
connections. xprtrdma creates a fresh QP on reconnection, so that
event should be explicitly ignored.

Squelch the alarming "unexpected CM event" message.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:59 -04:00
Chuck Lever
a779ca5fa7 xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro
Clean up.

RPCRDMA_PERSISTENT_REGISTRATION was a compile-time switch between
RPCRDMA_REGISTER mode and RPCRDMA_ALLPHYSICAL mode.  Since
RPCRDMA_REGISTER has been removed, there's no need for the extra
conditional compilation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:59 -04:00
Chuck Lever
282191cb72 xprtrdma: Make rpcrdma_ep_disconnect() return void
Clean up: The return code is used only for dprintk's that are
already redundant.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:58 -04:00
Chuck Lever
bb96193d91 xprtrdma: Schedule reply tasklet once per upcall
Minor optimization: grab rpcrdma_tk_lock_g and disable hard IRQs
just once after clearing the receive completion queue.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:58 -04:00
Chuck Lever
2e84522c2e xprtrdma: Allocate each struct rpcrdma_mw separately
Currently rpcrdma_buffer_create() allocates struct rpcrdma_mw's as
a single contiguous area of memory. It amounts to quite a bit of
memory, and there's no requirement for these to be carved from a
single piece of contiguous memory.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:57 -04:00
Chuck Lever
f590e878c5 xprtrdma: Rename frmr_wr
Clean up: Name frmr_wr after the opcode of the Work Request,
consistent with the send and local invalidation paths.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:57 -04:00
Chuck Lever
dab7e3b8da xprtrdma: Disable completions for LOCAL_INV Work Requests
Instead of relying on a completion to change the state of an FRMR
to FRMR_IS_INVALID, set it in advance. If an error occurs, a completion
will fire anyway and mark the FRMR FRMR_IS_STALE.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:57 -04:00
Chuck Lever
050557220e xprtrdma: Disable completions for FAST_REG_MR Work Requests
Instead of relying on a completion to change the state of an FRMR
to FRMR_IS_VALID, set it in advance. If an error occurs, a completion
will fire anyway and mark the FRMR FRMR_IS_STALE.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:56 -04:00
Chuck Lever
440ddad51b xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external()
Any FRMR arriving in rpcrdma_register_frmr_external() is now
guaranteed to be either invalid, or to be targeted by a queued
LOCAL_INV that will invalidate it before the adapter processes
the FAST_REG_MR being built here.

The problem with current arrangement of chaining a LOCAL_INV to the
FAST_REG_MR is that if the transport is not connected, the LOCAL_INV
is flushed and the FAST_REG_MR is flushed. This leaves the FRMR
valid with the old rkey. But rpcrdma_register_frmr_external() has
already bumped the in-memory rkey.

Next time through rpcrdma_register_frmr_external(), a LOCAL_INV and
FAST_REG_MR is attempted again because the FRMR is still valid. But
the rkey no longer matches the hardware's rkey, and a memory
management operation error occurs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:56 -04:00
Chuck Lever
ddb6bebcc6 xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work Request
When a LOCAL_INV Work Request is flushed, it leaves an FRMR in the
VALID state. This FRMR can be returned by rpcrdma_buffer_get(), and
must be knocked down in rpcrdma_register_frmr_external() before it
can be re-used.

Instead, capture these in rpcrdma_buffer_get(), and reset them.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:55 -04:00
Chuck Lever
9f9d802a28 xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnect
FAST_REG_MR Work Requests update a Memory Region's rkey. Rkey's are
used to block unwanted access to the memory controlled by an MR. The
rkey is passed to the receiver (the NFS server, in our case), and is
also used by xprtrdma to invalidate the MR when the RPC is complete.

When a FAST_REG_MR Work Request is flushed after a transport
disconnect, xprtrdma cannot tell whether the WR actually hit the
adapter or not. So it is indeterminant at that point whether the
existing rkey is still valid.

After the transport connection is re-established, the next
FAST_REG_MR or LOCAL_INV Work Request against that MR can sometimes
fail because the rkey value does not match what xprtrdma expects.

The only reliable way to recover in this case is to deregister and
register the MR before it is used again. These operations can be
done only in a process context, so handle it in the transport
connect worker.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:55 -04:00
Chuck Lever
c2922c0235 xprtrdma: Properly handle exhaustion of the rb_mws list
If the rb_mws list is exhausted, clean up and return NULL so that
call_allocate() will delay and try again.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:55 -04:00
Chuck Lever
3111d72c7c xprtrdma: Chain together all MWs in same buffer pool
During connection loss recovery, need to visit every MW in a
buffer pool. Any MW that is in use by an RPC will not be on the
rb_mws list.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:54 -04:00
Chuck Lever
c93e986a29 xprtrdma: Back off rkey when FAST_REG_MR fails
If posting a FAST_REG_MR Work Reqeust fails, revert the rkey update
to avoid subsequent IB_WC_MW_BIND_ERR completions.

Suggested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:54 -04:00
Chuck Lever
0dbb4108a6 xprtrdma: Unclutter struct rpcrdma_mr_seg
Clean ups:
 - make it obvious that the rl_mw field is a pointer -- allocated
   separately, not as part of struct rpcrdma_mr_seg
 - promote "struct {} frmr;" to a named type
 - promote the state enum to a named type
 - name the MW state field the same way other fields in
   rpcrdma_mw are named

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:54 -04:00
Chuck Lever
539431a437 xprtrdma: Don't invalidate FRMRs if registration fails
If FRMR registration fails, it's likely to transition the QP to the
error state. Or, registration may have failed because the QP is
_already_ in ERROR.

Thus calling rpcrdma_deregister_external() in
rpcrdma_create_chunks() is useless in FRMR mode: the LOCAL_INVs just
get flushed.

It is safe to leave existing registrations: when FRMR registration
is tried again, rpcrdma_register_frmr_external() checks if each FRMR
is already/still VALID, and knocks it down first if it is.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:53 -04:00
Chuck Lever
a7bc211ac9 xprtrdma: On disconnect, don't ignore pending CQEs
xprtrdma is currently throwing away queued completions during
a reconnect. RPC replies posted just before connection loss, or
successful completions that change the state of an FRMR, can be
missed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:53 -04:00
Chuck Lever
6ab59945f2 xprtrdma: Update rkeys after transport reconnect
Various reports of:

  rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0
		ep ffff8800bfd3e848

Ensure that rkeys in already-marshalled RPC/RDMA headers are
refreshed after the QP has been replaced by a reconnect.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=249
Suggested-by: Selvin Xavier <Selvin.Xavier@Emulex.Com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:53 -04:00
Chuck Lever
43e9598817 xprtrdma: Limit data payload size for ALLPHYSICAL
When the client uses physical memory registration, each page in the
payload gets its own array entry in the RPC/RDMA header's chunk list.

Therefore, don't advertise a maximum payload size that would require
more array entries than can fit in the RPC buffer where RPC/RDMA
headers are built.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=248
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:52 -04:00
Chuck Lever
73806c8832 xprtrdma: Protect ia->ri_id when unmapping/invalidating MRs
Ensure ia->ri_id remains valid while invoking dma_unmap_page() or
posting LOCAL_INV during a transport reconnect. Otherwise,
ia->ri_id->device or ia->ri_id->qp is NULL, which triggers a panic.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=259
Fixes: ec62f40 'xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting'
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:52 -04:00
Chuck Lever
5fc83f470d xprtrdma: Fix panic in rpcrdma_register_frmr_external()
seg1->mr_nsegs is not yet initialized when it is used to unmap
segments during an error exit. Use the same unmapping logic for
all error exits.

"if (frmr_wr.wr.fast_reg.length < len) {" used to be a BUG_ON check.
The broken code will never be executed under normal operation.

Fixes: c977dea (xprtrdma: Remove BUG_ON() call sites)
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Shirley Ma <shirley.ma@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@emulex.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2014-07-31 16:22:52 -04:00
Chuck Lever
e560e3b510 svcrdma: Add zero padding if the client doesn't send it
See RFC 5666 section 3.7: clients don't have to send zero XDR
padding.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=246
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-22 16:40:21 -04:00