Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull main InfiniBand/RDMA updates from Roland Dreier:
- add iWARP port mapper to avoid conflicts between RDMA and normal
stack TCP connections.
- fixes for i386 / x86-64 structure padding differences (ABI
compatibility for 32-on-64) from Yann Droneaud.
- a pile of SRP initiator fixes from Bart Van Assche.
- fixes for a writeback / memory allocation deadlock with NFS over
IPoIB connected mode from Jiri Kosina.
- the usual fixes and cleanups to mlx4, mlx5, cxgb4 and other low-level
drivers.
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (61 commits)
RDMA/cxgb4: Add support for iWARP Port Mapper user space service
RDMA/nes: Add support for iWARP Port Mapper user space service
RDMA/core: Add support for iWARP Port Mapper user space service
IB/mlx4: Fix gfp passing in create_qp_common()
IB/umad: Fix use-after-free on close
IB/core: Fix kobject leak on device register error flow
RDMA/cxgb4: add missing padding at end of struct c4iw_alloc_ucontext_resp
mlx4_core: Fix GFP flags parameters to be gfp_t
IB/core: Fix port kobject deletion during error flow
IB/core: Remove unneeded kobject_get/put calls
IB/core: Fix sparse warnings about redeclared functions
IB/mad: Fix sparse warning about gfp_t use
IB/mlx4: Implement IB_QP_CREATE_USE_GFP_NOIO
IB: Add a QP creation flag to use GFP_NOIO allocations
IB: Return error for unsupported QP creation flags
IB: Allow build of hw/ and ulp/ subdirectories independently
mlx4_core: Move handling of MLX4_QP_ST_MLX to proper switch statement
RDMA/cxgb4: Add missing padding at end of struct c4iw_create_cq_resp
IB/srp: Avoid problems if a header uses pr_fmt
IB/umad: Fix error handling
...
There are two kzalloc() calls that were not converted to use the gfp
value passed to create_qp_common(), and that still use a hardcoded
GFP_KERNEL, after 40f2287bd5 ("IB/mlx4: Implement
IB_QP_CREATE_USE_GFP_NOIO"). Fix this by passing the gfp value down
properly.
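As a rough sketch of the shape of the fix (illustrative, not the exact
diff):

    /* honor the gfp the caller passed down instead of GFP_KERNEL */
    qp = kzalloc(sizeof(struct mlx4_ib_qp), gfp);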
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Modify the various routines used to allocate memory resources which
serve QPs in mlx4 to take an input GFP directive. Have the Ethernet
driver use GFP_KERNEL in its QP allocations as done prior to this
commit, and the IB driver use GFP_NOIO when the IB verbs
IB_QP_CREATE_USE_GFP_NOIO QP creation flag is provided.
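A minimal sketch of how the GFP directive might be chosen at QP
creation time (the exact plumbing in the driver differs):

    gfp_t gfp = (init_attr->create_flags & IB_QP_CREATE_USE_GFP_NOIO) ?
            GFP_NOIO : GFP_KERNEL;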
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This commit adds the infrastructure for enabling selected VFs to
operate SMI (QP0) MADs without restriction.
Additionally, for these enabled VFs, their QP0 proxy and tunnel QPs
are MLX QPs. As such, they operate over VL15. Therefore, they are
not affected by "credit" problems or changes in the VLArb table (which
may shut down VL0).
Non-enabled VFs may only create UD proxy QP0 QPs (which are forced by
the hypervisor to send packets using the q-key it assigns and places
in the qp-context). Thus, non-enabled VFs will not pose a security
risk. The hypervisor discards any privileged MADs it receives from
these non-enabled VFs.
By default, all VFs are NOT enabled, and must explicitly be enabled
by the administrator.
The sysfs interface which operates the VF enablement infrastructure
is provided in the next commit.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Currently, VFs in SR-IOV mode are denied QP0 access. The main reason
for this decision is security, since Subnet Management Datagrams
(SMPs) are not restricted by network partitioning and may affect the
physical network topology. Moreover, even the SM may be denied access
from portions of the network by setting management keys unknown to the
SM.
However, it is desirable to grant SMI access to certain privileged
VFs, so that certain network management activities may be conducted
within virtual machines instead of the hypervisor.
This commit does the following:
1. Create QP0 tunnel QPs for all VFs.
2. Discard SMI MADs sent-from/received-for non-privileged VFs in the
hypervisor MAD multiplex/demultiplex logic. SMI MADs from/for
privileged VFs are allowed to pass.
3. MAD_IFC wrapper changes/fixes. For non-privileged VFs, only
host-view MAD_IFC commands are allowed, and only for SMI LID-Routed
GET MADs. For privileged VFs, there are no restrictions.
This commit does not allow privileged VFs as yet. To determine if a VF
is privileged, it calls function mlx4_vf_smi_enabled(). This function
returns 0 unconditionally for now.
The next two commits allow defining and activating privileged VFs.
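A sketch of the stub (signature assumed from the description):

    int mlx4_vf_smi_enabled(struct mlx4_dev *dev, int slave, int port)
    {
            return 0;       /* no VF is SMI-privileged until enabled */
    }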
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When we receive a netdev event indicating a netdev change and/or
a netdev address change, we must change the MAC index used by the
proxy QP1 (in the QP context), otherwise RoCE CM packets sent by the
VF will not carry the same source MAC address as the non-CM packets.
We use the UPDATE_QP command to perform this change.
In order to avoid modifying a QP context based on a netdev event
while the driver attempts to destroy this QP (e.g. when either the
mlx4_ib or ib_mad module is unloaded), we use mutex locking in both
flows.
Since the relevant mlx4 proxy GSI QP is created indirectly by the
mad module when it creates its GSI QP, the mlx4 driver didn't need to
keep track of that QP prior to this change.
Now that QP modifications are needed for this QP from within the
driver, we add a reference to it.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband updates from Roland Dreier:
"Main batch of InfiniBand/RDMA changes for 3.15:
- The biggest change is core API extensions and mlx5 low-level driver
support for handling DIF/DIX-style protection information, and the
addition of PI support to the iSER initiator. Target support will
be arriving shortly through the SCSI target tree.
- A nice simplification to the "umem" memory pinning library now that
we have chained sg lists. Kudos to Yishai Hadas for realizing our
code didn't have to be so crazy.
- Another nice simplification to the sg wrappers used by qib, ipath
and ehca to handle their mapping of memory to adapter.
- The usual batch of fixes to bugs found by static checkers etc.
from intrepid people like Dan Carpenter and Yann Droneaud.
- A large batch of cxgb4, ocrdma, qib driver updates"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (102 commits)
RDMA/ocrdma: Unregister inet notifier when unloading ocrdma
RDMA/ocrdma: Fix warnings about pointer <-> integer casts
RDMA/ocrdma: Code clean-up
RDMA/ocrdma: Display FW version
RDMA/ocrdma: Query controller information
RDMA/ocrdma: Support non-embedded mailbox commands
RDMA/ocrdma: Handle CQ overrun error
RDMA/ocrdma: Display proper value for max_mw
RDMA/ocrdma: Use non-zero tag in SRQ posting
RDMA/ocrdma: Memory leak fix in ocrdma_dereg_mr()
RDMA/ocrdma: Increment abi version count
RDMA/ocrdma: Update version string
be2net: Add abi version between be2net and ocrdma
RDMA/ocrdma: ABI versioning between ocrdma and be2net
RDMA/ocrdma: Allow DPP QP creation
RDMA/ocrdma: Read ASIC_ID register to select asic_gen
RDMA/ocrdma: SQ and RQ doorbell offset clean up
RDMA/ocrdma: EQ full catastrophe avoidance
RDMA/cxgb4: Disable DSGL use by default
RDMA/cxgb4: rx_data() needs to hold the ep mutex
...
Fix the following warning for the mlx4 driver:
$ make M=drivers/infiniband C=2 CF=-D__CHECK_ENDIAN__
drivers/infiniband/hw/mlx4/qp.c:1885:31: warning: restricted __be16 degrades to integer
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Since there is no connection between the MAC/VLAN and the GID
when using IP-based addressing, the proxy QP1 (running on the
slave) must pass the source-mac, destination-mac, and vlan_id
information separately from the GID. Additionally, the Host
must pass the remote source-mac and vlan_id back to the slave.
This is achieved as follows:
Outgoing MADs:
1. Source MAC: obtained from the CQ completion structure
(struct ib_wc, smac field).
2. Destination MAC: obtained from the tunnel header.
3. vlan_id: obtained from the tunnel header.
Incoming MADs:
1. The source (i.e., remote) MAC and vlan_id are passed in
the tunnel header to the proxy QP1.
VST mode support:
For outgoing MADs, the vlan_id obtained from the header is
discarded, and the vlan_id specified by the Hypervisor is used
instead.
For incoming MADs, the incoming vlan_id (in the wc) is discarded, and the
"invalid" vlan (0xffff) is substituted when forwarding to the slave.
Signed-off-by: Moni Shoua <monis@mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IB side of RoCE requires the MAC table index of the
MAC address used by its QPs.
To obtain the real MAC index, the IB side registers the
MAC (increasing its ref count, and also returning the
real MAC index) during the modify-qp sequence.
This protects against the ETH side deleting or modifying
that MAC table entry while the QP is active.
Note that until the modify-qp command returns success,
the MAC and VLAN information only has "candidate" status.
If the modify-qp succeeds, the "candidate" info is promoted
to the operational MAC/VLAN info for the qp. If the modify fails,
the candidate MAC/VLAN is unregistered, and the old qp info
is preserved.
The patch is a bit complex, because there are multiple qp
transitions where the primary-path information may be
modified: INIT-to-RTR, and SQD-to-SQD.
Similarly for the alternate path information.
Therefore the code must handle cases where path information
has already been entered into the QP context by previous
qp transitions.
For the MAC address, the success logic is as follows:
1. If there was no previous MAC, simply move the candidate
MAC information to the operational information, and reset
the candidate MAC info.
2. If there was a previous MAC, unregister it. Then move
the MAC information from candidate to operational, and
reset the candidate info (as in 1. above).
The MAC address failure logic is the same for all cases:
- Unregister the candidate MAC, and reset the candidate MAC info.
For Vlan registration, the logic is similar.
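A sketch of the MAC success path (field and helper names assumed for
illustration):

    if (qp->pri.smac)       /* case 2: a previous MAC was registered */
            mlx4_unregister_mac(dev->dev, qp->port, qp->pri.smac);
    qp->pri.smac       = qp->pri.candidate_smac;    /* promote */
    qp->pri.smac_index = qp->pri.candidate_smac_index;
    qp->pri.candidate_smac = 0;                     /* reset */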
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This requires the following modifications:
1. Fix build_mlx4_header to properly fill in the ETH fields
2. Adjust mux and demux QP1 flow to support RoCE.
This commit still assumes only one GID per slave for RoCE.
The commit enabling multiple GIDs is a subsequent commit, and
is done separately because of its complexity.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IP-based RoCE GIDs don't store Ethernet L2 parameters, MAC and VLAN.
Therefore, we need to extract them from the CQE and place them in
struct ib_wc (to be used for cases where they were taken from the GID).
Also, when modifying a QP or building address handle, instead of
parsing the dgid to get the MAC and VLAN, take them from the address
handle attributes.
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds support for Ethernet L2 attributes in the
verbs/cm/cma structures.
When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and
priority in a similar manner to how the IB L2 (and the L4 PKEY)
attributes are used.
Thus, those attributes were added to the following structures:
* ib_ah_attr - added dmac
* ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
* ib_wc - added smac, vlan_id
* ib_sa_path_rec - added smac, dmac, vlan_id
* cm_av - added smac and vlan_id
For the path record structure, extra care was taken to avoid the new
fields when packing it into wire format, so we don't break the IB CM
and SA wire protocol.
On the active side, the CM fills its internal structures from the
path provided by the ULP. We add code there to take the ETH L2
attributes and place them into the CM Address Handle (struct cm_av).
On the passive side, the CM fills its internal structures from the WC
associated with the REQ message. We add code there to take the ETH L2
attributes from the WC.
When the HW driver provides the required ETH L2 attributes in the WC,
it sets the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core
code checks for the presence of these flags, and in their absence does
address resolution from the ib_init_ah_from_wc() helper function.
ib_modify_qp_is_ok is also updated to consider the link layer. Some
parameters are mandatory for Ethernet link layer, while they are
irrelevant for IB. Vendor drivers are modified to support the new
function signature.
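A sketch of the adjusted check, assuming the link layer is the
parameter added to the signature:

    ll = rdma_port_get_link_layer(ibqp->device, attr->port_num);
    if (!ib_modify_qp_is_ok(cur_state, new_state, ibqp->qp_type,
                            attr_mask, ll))
            return -EINVAL;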
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds support for steerable (NETIF) QP creation. When we
create the device, we allocate a range of steerable QPs.
Afterward, when a QP is created with the NETIF flag, it is allocated
from this range. Allocation is managed by a bitmap allocator.
Internal steering rules for those QPs are automatically generated on
their creation.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
When the link type is Ethernet, setting the link type in the QP
context will enable TCP/IP stateless offloads (checksum, LSO, RSS) for
RAW PACKET Ethernet QPs. For IB UD QPs this worked OK since the value
assumed by the firmware for IB link layer is zero.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Fix the asymmetric behavior w.r.t VLAN insertion/stripping for RAW
PACKET QPs -- we don't insert on send and need not strip on receive.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
* Implement memory windows binding in mlx4_ib_post_send.
* Implement mlx4_ib_bind_mw by deferring to mlx4_ib_post_send.
* Rename MLX4_WQE_FMR_PERM_* flags to MLX4_WQE_FMR_AND_BIND_PERM_*,
indicating that they are used both for fast registration work
requests, and for memory window bind work requests.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Building qp.o triggers this gcc warning:
drivers/infiniband/hw/mlx4/qp.c: In function ‘mlx4_ib_post_send’:
drivers/infiniband/hw/mlx4/qp.c:1862:62: warning: ‘vlan’ may be used uninitialized in this function [-Wmaybe-uninitialized]
drivers/infiniband/hw/mlx4/qp.c:1752:6: note: ‘vlan’ was declared here
Looking at the code it is clear 'vlan' is only set and used if 'is_eth'
is non-zero. But by initializing 'vlan' to 0xffff, on
gcc (Ubuntu 4.7.2-22ubuntu1) 4.7.2
on x86-64 at least, we fix the warning, and the compiler was already
setting 'vlan' to 0 in the generated code, so there's no real downside.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
[ Get rid of unnecessary move of 'is_vlan' initialization. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Matches the way they're used, and actually lets at least x86-64 generate
better code:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-38 (-38)
function old new delta
mlx4_ib_post_send 4416 4378 -38
Signed-off-by: Roland Dreier <roland@purestorage.com>
Remove unused fields from the local invalidate WQE segment structure.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Previously, the structure of a guest's proxy QPs followed the
structure of the PPF special qps (qp0 port 1, qp0 port 2, qp1 port 1,
qp1 port 2, ...). The guest then did offset calculations on the
sqp_base qp number that the PPF passed to it in QUERY_FUNC_CAP().
This is now changed so that the guest does no offset calculations
regarding proxy or tunnel QPs to use. This change frees the PPF from
needing to adhere to a specific order in allocating proxy and tunnel
QPs.
Now QUERY_FUNC_CAP provides each port individually with its proxy
qp0, proxy qp1, tunnel qp0, and tunnel qp1 QP numbers, and these are
used directly where required (with no offset calculations).
To accomplish this change, several fields were added to the phys_caps
structure for use by the PPF and by non-SR-IOV mode:
base_sqpn -- in non-sriov mode, this was formerly sqp_start.
base_proxy_sqpn -- the first physical proxy qp number -- used by PPF
base_tunnel_sqpn -- the first physical tunnel qp number -- used by PPF.
The current code in the PPF still adheres to the previous layout of
sqps, proxy-sqps and tunnel-sqps. However, the PPF can change this
layout without affecting VF or (paravirtualized) PF code.
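Sketched from the description above, the added fields look roughly
like:

    struct mlx4_phys_caps {
            u32 base_sqpn;          /* non-sriov: formerly sqp_start */
            u32 base_proxy_sqpn;    /* first physical proxy qp number */
            u32 base_tunnel_sqpn;   /* first physical tunnel qp number */
    };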
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
1. Introduce the basic SR-IOV paravirtualization context objects for
multiplexing and demultiplexing MADs.
2. Introduce support for the new proxy and tunnel QP types.
This patch introduces the objects required by the master for managing
QP paravirtualization for guests.
struct mlx4_ib_sriov is created by the master only.
It is a container for the following:
1. All the info required by the PPF to multiplex and de-multiplex MADs
(including those from the PF). (struct mlx4_ib_demux_ctx demux)
2. All the info required to manage alias GUIDs (i.e., the GUID at
index 0 that each guest perceives; this is not the GUID actually at
index 0, but rather the GUID at index [<VF number>] in the physical
table).
3. structures which are used to manage CM paravirtualization
4. structures for managing the real special QPs when running in SR-IOV
mode. The real SQPs are controlled by the PPF in this case. All
SQPs created and controlled by the ib core layer are proxy SQPs.
struct mlx4_ib_demux_ctx contains the information per port needed
to manage paravirtualization:
1. All multicast paravirt info
2. All tunnel-qp paravirt info for the port.
3. GUID-table and GUID-prefix for the port
4. work queues.
struct mlx4_ib_demux_pv_ctx contains all the info for managing the
paravirtualized QPs for one slave/port.
struct mlx4_ib_demux_pv_qp contains the info needed to run an individual
QP (either tunnel qp or real SQP).
Note: We made use of the 2 most significant bits in enum
mlx4_ib_qp_flags (based on enum ib_qp_create_flags in ib_verbs.h).
We need these bits in the low-level driver for internal purposes.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Unlike other parts of the mlx4_ib code, the function build_mlx_header()
doesn't check if the iboe netdev of the given port is valid before
dereferencing it, which can cause a crash if the ethernet interface
has already been taken down.
Fix this by checking for a valid netdev pointer before using it to get
the port MAC address.
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Merge tag 'rdma-for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA changes from Roland Dreier:
- Updates to the qib low-level driver
- First chunk of changes for SR-IOV support for mlx4 IB
- RDMA CM support for IPv6-only binding
- Other misc cleanups and fixes
Fix up some add-add conflicts in include/linux/mlx4/device.h and
drivers/net/ethernet/mellanox/mlx4/main.c
* tag 'rdma-for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (30 commits)
IB/qib: checkpatch fixes
IB/qib: Add congestion control agent implementation
IB/qib: Reduce sdma_lock contention
IB/qib: Fix an incorrect log message
IB/qib: Fix QP RCU sparse warnings
mlx4: Put physical GID and P_Key table sizes in mlx4_phys_caps struct and paravirtualize them
mlx4_core: Allow guests to have IB ports
mlx4_core: Implement mechanism for reserved Q_Keys
net/mlx4_core: Free ICM table in case of error
IB/cm: Destroy idr as part of the module init error flow
mlx4_core: Remove double function declarations
IB/mlx4: Fill the masked_atomic_cap attribute in query device
IB/mthca: Fill in sq_sig_type in query QP
IB/mthca: Warning about event for non-existent QPs should show event type
IB/qib: Fix sparse RCU warnings in qib_keys.c
net/mlx4_core: Initialize IB port capabilities for all slaves
mlx4: Use port management change event instead of smp_snoop
IB/qib: RCU locking for MR validation
IB/qib: Avoid returning EBUSY from MR deregister
IB/qib: Fix UC MR refs for immediate operations
...
Define pr_fmt and add some pr_debug prints.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The driver is modified to support three operation modes.
If supported by firmware, use the device managed flow steering API,
which we call device managed steering mode. Otherwise, if the
firmware supports the B0 steering mode, use it; and finally, if
neither is supported, use the A0 steering mode.
When the steering mode is device managed, the code is modified
such that L2 based rules set by the mlx4_en driver for Ethernet
unicast and multicast, and the IB stack multicast attach calls
done through the mlx4_ib driver are all routed to use the device
managed API.
When attaching a rule using the device managed flow steering API,
the firmware returns a 64-bit registration id, which must be
provided during detach.
Currently the firmware is always programmed during HCA initialization
to use standard L2 hashing. Future work should be done to allow
configuring the flow-steering hash function with common,
non-proprietary means.
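A sketch of the mode selection (capability flag names assumed):

    if (dev_cap->flags2 & MLX4_DEV_CAP_FLAG2_FS_EN)
            dev->caps.steering_mode = MLX4_STEERING_MODE_DEVICE_MANAGED;
    else if (dev->caps.flags & MLX4_DEV_CAP_FLAG_VEP_UC_STEER &&
             dev->caps.flags & MLX4_DEV_CAP_FLAG_VEP_MC_STEER)
            dev->caps.steering_mode = MLX4_STEERING_MODE_B0;
    else
            dev->caps.steering_mode = MLX4_STEERING_MODE_A0;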
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Limit the max number of WQEs per QP reported when querying the
device, so that ib_create_qp() will not fail for a QP size that the
device claimed to support due to additional headroom WQEs being
allocated.
2. Limit qp resources accepted for ib_create_qp() to the limits
reported in ib_query_device(). In kernel space, make sure that the
limits returned to the caller following qp creation also lie within
the reported device limits. For userspace, report as before, and do
adjustment in libmlx4 (so as not to break ABI).
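For point 1, the reported limit is reduced by the headroom, roughly
(a sketch; macro name assumed):

    props->max_qp_wr = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;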
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Sagi Grimberg <sagig@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
[ Replace one more printk_once() with pr_info_once(). - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Otherwise CM packets going over MLX QP1 get fixed scheduling priority 0.
We want CM packets to get the same scheduling priority, and therefore
map to the same SQ (Schedule Queue) and eventually TC (Traffic Class),
as the application requested for the actual QP used for the connection.
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Implement raw packet QPs for Ethernet ports using the MLX transport (as
done by the mlx4_en Ethernet netdevice driver).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
If the opcode of a work request exceeds the range of valid opcodes,
return the pointer to the offending work request.
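A sketch of the check in the post-send loop:

    if (wr->opcode >= ARRAY_SIZE(mlx4_ib_opcode)) {
            err = -EINVAL;
            *bad_wr = wr;
            goto out;
    }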
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For IBoE, SLs 0-7 are mapped to Ethernet 802.1Q user priority bits
(pbits) which are part of the VLAN tag, SLs 8-15 are reserved.
Under Ethernet, the ConnectX firmware treats (decode/encode) the four
bit SL field in various constructs such as QPC / UD WQE / CQE as PPP0
and not as 0PPP. This correlates well to the fact that within the
vlan tag the pbits are located in bits 15-13 and not 12-14.
The current code wasn't consistent around that area - the
encoding was correct for the IBoE QPC.path.schedule_queue field,
but was wrong for IBoE CQEs and when MLX header was built.
These inconsistencies resulted in wrong SL <--> wire 802.1Q pbits
mapping, which is fixed by using SL <--> PPP0 all around the place.
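For example, on the receive side the SL is now taken from the top
three bits, consistent with the PPP0 layout (a sketch; field name
assumed):

    wc->sl = be16_to_cpu(cqe->sl_vid) >> 13;    /* PPP0: bits 15-13 */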
Signed-off-by: Oren Duer <oren@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
There's no need to set the vlan-related fields in an IBoE send WQE
control segment:
- the vlan to be used by a UD QP is set in the datagram segment.
- for GSI (CM) QP, all the headers down to 8021q and MAC are built by
the software anyway.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Support the creation of XRC INI and TGT QPs. To handle the case where
a CQ or PD is not provided, we allocate them internally with the xrcd.
Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Allocate flow counter per Ethernet/IBoE port, and attach this counter
to all the QPs created on that port. Based on patch by Eli Cohen
<eli@mellanox.co.il>.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.co.il>
Signed-off-by: Roland Dreier <roland@purestorage.com>
We must fully update the control segment before marking it as valid,
so that hardware doesn't start executing it before we're ready.
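The resulting pattern in the post-send path looks roughly like:

    wmb();          /* make all control segment writes visible first */
    ctrl->owner_opcode = mlx4_ib_opcode[wr->opcode] |
            (ind & qp->sq.wqe_cnt ? cpu_to_be32(1 << 31) : 0);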
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Move VLAN control bit setting to before wmb(). - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This patch allows IBoE traffic to be encapsulated in 802.1Q tagged
VLAN frames. The VLAN tag is encoded in the GID and derived from it
by a simple computation.
The netdev notifier callback is modified to catch VLAN device
addition/removal and the port's GID table is updated to reflect the
change, so that for each netdevice there is an entry in the GID table.
When the port's GID table is exhausted, GID entries will not be added.
Only children of the main interfaces can add to the GID table; if a
VLAN interface is added on another VLAN interface (e.g. "vconfig add
eth2.6 8"), then that interfaces will not add an entry to the GID
table.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add 802.1q VLAN support to IBoE. The VLAN tag is encoded within the
GID derived from a link local address in the following way:
GID[11] and GID[12] contain the VLAN ID when the GID contains a VLAN.
The 3-bit user priority field of the packets is identical to the 3
bits of the SL.
In the case of rdma_cm apps, the TOS field is used to generate the SL
by shifting right by 5 bits, effectively taking the 3 most significant
bits of the TOS field.
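A sketch of both derivations (GID byte layout as described above):

    u16 vid = be16_to_cpup((__be16 *)&gid.raw[11]); /* GID bytes 11-12 */
    u8  sl  = tos >> 5;                             /* top 3 bits of TOS */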
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for IBoE to mlx4_ib. The bulk of the code is handling the
new address vector fields; mlx4 needs the MAC address of a remote node
to include it in a WQE (for datagrams) or in the QP context (for
connected QPs). Address resolution is done by assuming all unicast
GIDs are link-local IPv6 addresses.
Multicast group attach/detach needs to update the NIC's multicast
filters; but since attaching a QP to a multicast group can be done
before the QP is bound to a port, for IBoE we need to keep track of
all multicast groups that a QP is attached to before it transitions
from INIT to RTR (since it does not have a port in the INIT state).
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Many things cleaned up and otherwise monkeyed with; hope I didn't
introduce too many bugs. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for packing IBoE packet headers.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
[ Clean up and fix ib_ud_header_init() a bit. - Roland ]
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for masked atomic operations (masked compare and swap,
masked fetch and add).
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include
those headers directly instead of assuming availability. As this
conversion needs to touch a large number of source files, the
following script is used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there, i.e. if only gfp is used,
gfp.h; if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers, which should be easily discoverable on most builds of the
specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
ib_ud_header_init() first clears header and then fills up the various
fields. Later on, it tests header->immediate_present, which it has
already cleared, so the condition is always false. Fix this by adding
an immediate_present parameter and setting header->immediate_present
as is done with grh_present. Also remove unused calculation of
header_len.
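A sketch of the adjusted call (argument order assumed):

    ib_ud_header_init(payload_bytes, grh_present, immediate_present,
                      &sqp->ud_header);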
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
struct ib_qp already holds a pointer to the ib device. No need to dive to the
hw device object to retrieve it.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
In mlx4_ib_post_recv(), we should check the queue for overflow using
recv_cq instead of send_cq (current code looks like a copy-and-paste
mistake).
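The corrected check, roughly:

    if (mlx4_wq_overflow(&qp->rq, nreq, qp->ibqp.recv_cq)) {
            err = -ENOMEM;
            *bad_wr = wr;
            goto out;
    }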
Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code has a limitation: an LSO header is not allowed to cross a
64 byte boundary. This patch removes this limitation by setting the
WQE RR for large headers thus allowing LSO headers of any size. The
extra buffer reserved for MLX4_IB_QP_LSO QPs has been doubled, from 64
to 128 bytes, assuming this is a reasonable upper limit for header
length. Also, this patch will cause IB_DEVICE_UD_TSO to be set only
for HCA FW versions that set MLX4_DEV_CAP_FLAG_BLH; e.g. FW version
2.6.000 and higher.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
There is no such flag DE - the field is reserved and should be zero.
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
mlx4_ib_lock_cqs()/mlx4_ib_unlock_cqs() are helper functions that
lock/unlock both CQs attached to a QP in the proper order to avoid
AB-BA deadlocks. Annotate this so sparse can understand what's going
on (and warn us if we misuse these functions).
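A sketch of the annotated helper (field names assumed):

    static void mlx4_ib_lock_cqs(struct mlx4_ib_cq *send_cq,
                                 struct mlx4_ib_cq *recv_cq)
            __acquires(&send_cq->lock) __acquires(&recv_cq->lock)
    {
            if (send_cq == recv_cq) {
                    spin_lock_irq(&send_cq->lock);
                    __acquire(&recv_cq->lock);
            } else if (send_cq->mcq.cqn < recv_cq->mcq.cqn) {
                    spin_lock_irq(&send_cq->lock);
                    spin_lock_nested(&recv_cq->lock, SINGLE_DEPTH_NESTING);
            } else {
                    spin_lock_irq(&recv_cq->lock);
                    spin_lock_nested(&send_cq->lock, SINGLE_DEPTH_NESTING);
            }
    }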
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The ConnectX Programmer's Reference Manual states that the "SO" bit
must be set when posting Fast Register and Local Invalidate send work
requests. When this bit is set, the work request will be executed
only after all previous work requests on the send queue have been
executed. (If the bit is not set, Fast Register and Local Invalidate
WQEs may begin execution too early, which violates the defined
semantics for these operations.)
This fixes the issue with NFS/RDMA reported in
<http://lists.openfabrics.org/pipermail/general/2009-April/059253.html>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Cc: <stable@kernel.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The low-level mlx4 driver modified the page-list addresses for fast
register work requests to big-endian at post_send time, and set a
"present" bit. This caused problems later when the consumer attempted
to unmap the pages using the page list (using the list addresses, which
were assumed to still be in CPU-endian order). Fix the mlx4 driver to
allocate two buffers and use a private buffer for the hardware-format
bus addresses.
This patch fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1571>,
an NFS/RDMA server crash. The cause of the crash was found by Vu Pham
of Mellanox. The fix is along the lines suggested by Steve Wise in
comment #21 in bug 1571.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The base versions handle constant folding just fine, so use them
directly. The replacements are OK in the include/ files as they are
not exported to userspace so we don't need the __ prefixed versions.
This patch does not affect code generation at all.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The current work request posting code writes the LSO segment before
writing any data segments. This leaves a window where the LSO segment
overwrites the stamping in one cacheline that the HCA prefetches
before the rest of the cacheline is filled with the correct data
segments. When the HCA processes this work request, a local
protection error may result.
Fix this by saving the LSO header size field off and writing it only
after all data segments are written. This fix is a cleaned-up version
of a patch from Jack Morgenstein <jackm@dev.mellanox.co.il>.
This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1383>.
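The essence of the fix, as a sketch (variable names assumed):

    wmb();                  /* all data segments are written first */
    *lso_wqe = lso_hdr_sz;  /* only now reveal the LSO header size */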
Reported-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
To allow allocating an aligned range of consecutive QP numbers, add an
interface to reserve an aligned range of QP numbers and have the QP
allocation function always take a QP number.
This will be used for RSS support in the mlx4_en Ethernet driver and
also potentially by IPoIB RSS support.
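A sketch of the new interface (signatures as implied by the
description):

    int  mlx4_qp_reserve_range(struct mlx4_dev *dev, int cnt, int align,
                               int *base);
    void mlx4_qp_release_range(struct mlx4_dev *dev, int base_qpn, int cnt);
    int  mlx4_qp_alloc(struct mlx4_dev *dev, int qpn, struct mlx4_qp *qp);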
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Set RLKEY bit in the HW context for kernel QPs so that kernel QPs can
use the reserved L_Key for memory reference.
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Byte swap the addresses in the page list for fast register work requests
to big endian to match what the HCA expects. Also, the addresses must
have the "present" bit set so that the HCA knows it can access them.
Otherwise the HCA will fault the first time it accesses the memory
region.
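A sketch of the entry format (names assumed):

    pages[i] = cpu_to_be64(page_list[i] | MLX4_MTT_FLAG_PRESENT);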
Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code limits the max message size to 2K for UD QPs, while MTU
might be as big as 4K. This patch sets the maximum message size to
4K, which is needed for UD to work correctly on fabrics with a 4K MTU.
Signed-off-by: Alex Naslednikov <xalex@mellanox.co.il>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Update existing Mellanox copyright lines to 2008, and add such lines
to files where they are missing.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for the following operations to mlx4 when device firmware
supports them:
- Send with invalidate and local invalidate send queue work requests;
- Allocate/free fast register MRs;
- Allocate/free fast register MR page lists;
- Fast register MR send queue work requests;
- Local DMA L_Key.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Current code uses kmalloc() and then just does a bitwise OR operation on
qp->flags in create_qp_common(), which means that qp->flags may
potentially have some unintended bits set. This patch uses kzalloc()
and avoids further explicit clearing of structure members, which also
shrinks the code:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-65 (-65)
function old new delta
create_qp_common 2024 1959 -65
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add support for handling the IB_QP_CREATE_MULTICAST_BLOCK_LOOPBACK
flag by using the per-multicast group loopback blocking feature of
mlx4 hardware.
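A sketch of passing the flag down at multicast attach time (argument
list assumed):

    err = mlx4_multicast_attach(mdev->dev, &mqp->mqp, gid->raw,
                                !!(mqp->flags &
                                   MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK));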
Signed-off-by: Ron Livne <ronli@voltaire.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Commit 65adfa91 ("IB/mlx4: Fix RESET to RESET and RESET to ERROR
transitions") added some extra code to handle a QP state transition
from RESET to ERROR. However, the latest 1.2.1 version of the IB spec
has clarified that this transition is actually not allowed, so we can
remove this extra code again.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX returns the max message size it supports through the
QUERY_DEV_CAP firmware command. When modifying a QP to RTR, the max
message size for the QP must be specified. This value must not exceed
the value declared through QUERY_DEV_CAP. The current code ignores
the max allowed size and unconditionally sets the value to 2^31. This
patch sets all QPs to the max value allowed as returned from firmware.
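A sketch of the change (the low 5 bits of mtu_msgmax hold the log2 of
the max message size):

    context->mtu_msgmax = (attr->path_mtu << 5) |
                          ilog2(dev->dev->caps.max_msg_sz);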
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The idea is that for QPs with fixed-size work requests (e.g. selective
signaling QPs), before stamping the WQE, we read the value of the DS
field, which gives the effective size of the descriptor as used in the
previous post. Then we stamp only that area, since the rest of the
descriptor is already stamped.
When initializing the send queue buffer, make sure the DS field is
initialized to the max descriptor size so that the subsequent stamping
will be done on the entire descriptor area.
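A sketch of the optimized stamping (names assumed; DS is the low 6
bits of the fence_size byte, in 16-byte units):

    int ds = ctrl->fence_size & 0x3f;
    stamp_send_wqe(qp, ind, ds << 4);   /* stamp only the used area */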
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When creating a kernel QP where the consumer asked for a send queue
with lots of scatter/gather entries, set_kernel_sq_size() incorrectly
returned an error if the send queue stride is larger than the
hardware's maximum send work request descriptor size. This is not a
problem; the only issue is to make sure that the actual descriptors
used do not overflow the maximum descriptor size, so check this instead.
Clamp the returned max_send_sge value to be no bigger than what
query_device returns for the max_sge to avoid confusing hapless users,
even if the hardware is capable of handling a few more s/g entries.
This bug caused NFS/RDMA mounts to fail when the server adapter used
the mlx4 driver.
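The clamping described above looks roughly like (a sketch):

    cap->max_send_sge = min(qp->sq.max_gs,
                            min(dev->dev->caps.max_sq_sg,
                                dev->dev->caps.max_rq_sg));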
Signed-off-by: Roland Dreier <rolandd@cisco.com>
drivers/infiniband/hw/mlx4/qp.c: In function 'mlx4_ib_post_send':
drivers/infiniband/hw/mlx4/qp.c:1460: warning: 'seglen' may be used uninitialized in this function
This is the dopey gcc-doesn't-know-that-foo(&var)-writes-to-var problem.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new parameter, dmasync, to the ib_umem_get() prototype. Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In addition to mlx4_ib, there will be ethernet and FC consumers of
mlx4_core, so move the code for managing kernel doorbells into the
core module to avoid having to duplicate this multiple times.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
If the QP was moved to another state (such as SQE) by the hardware,
then after this change the user won't have to set the IBV_QP_CUR_STATE
mask in order to execute a modify QP call that recovers from this state.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a new IB_WR_SEND_WITH_INV send opcode that can be used to mark a
"send with invalidate" work request as defined in the iWARP verbs and
the InfiniBand base memory management extensions. Also put "imm_data"
and a new "invalidate_rkey" member in a new "ex" union in struct
ib_send_wr. The invalidate_rkey member can be used to pass in an
R_Key/STag to be invalidated. Add this new union to struct
ib_uverbs_send_wr. Add code to copy the invalidate_rkey field in
ib_uverbs_post_send().
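The new union, as described:

    union {
            __be32 imm_data;
            u32    invalidate_rkey;
    } ex;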
Fix up low-level drivers to deal with the change to struct ib_send_wr,
and just remove the imm_data initialization from net/sunrpc/xprtrdma/,
since that code never does any send with immediate operations.
Also, move the existing IB_DEVICE_SEND_W_INV flag to a new bit, since
the iWARP drivers currently in the tree set the bit. The amso1100
driver at least will silently fail to honor the IB_SEND_INVALIDATE bit
if passed in as part of userspace send requests (since it does not
implement kernel bypass work request queueing). Remove the flag from
all existing drivers that set it until we know which ones are OK.
The value chosen for the new flag is not consecutive, to avoid
clashing with flags defined in the XRC patches, which are not merged
yet but which are already in use and are likely to be merged soon.
This resurrects a patch sent long ago by Mikkel Hagen <mhagen@iol.unh.edu>.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Rather than have build_mlx_header() return a negative value on failure
and the length of the segments it builds on success, add a pointer
parameter to return the length and return 0 on success. This matches
the calling convention used for build_lso_seg() and generates slightly
smaller code -- eg, on 64-bit x86:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-22 (-22)
function old new delta
mlx4_ib_post_send 2023 2001 -22
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add a create_flags member to struct ib_qp_init_attr that will allow a
kernel verbs consumer to pass special flags when creating a QP.
Add a flag value for telling low-level drivers that a QP will be used
for IPoIB UD LSO. The create_flags member will also be useful for XRC
and ehca low-latency QP support.
Since no create_flags handling is implemented yet, add code to all
low-level drivers to return -EINVAL if create_flags is non-zero.
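The per-driver rejection is simply:

    if (init_attr->create_flags)
            return ERR_PTR(-EINVAL);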
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX devices support checksum generation and verification of TCP
and UDP packets for UD IPoIB messages. This patch checks if the HCA
supports this and sets the IB_DEVICE_UD_IP_CSUM capability flag if it
does. It implements support for handling the IB_SEND_IP_CSUM send
flag and setting the csum_ok field in receive work completions.
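A sketch of the send side (control segment flag names assumed):

    if (wr->send_flags & IB_SEND_IP_CSUM)
            ctrl->srcrb_flags |=
                    cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
                                MLX4_WQE_CTRL_TCP_UDP_CSUM);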
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
Signed-off-by: Ali Ayub <ali@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
ConnectX HCA supports shrinking WQEs, so that a single work request
can be made of multiple units of wqe_shift. This way, WRs can differ
in size, and do not have to be a power of 2 in size, saving memory and
speeding up send WR posting. Unfortunately, if we do this then the
wqe_index field in CQEs can't be used to look up the WR ID anymore, so
our implementation does this only if selective signaling is off.
Further, on 32-bit platforms, we can't use vmap() to make the QP
buffer virtually contiguous. Thus we have to use constant-sized WRs to
make sure a WR is always fully within a single page-sized chunk.
Finally, we use WRs with the NOP opcode to avoid wrapping around the
queue buffer in the middle of posting a WR, and we set the
NoErrorCompletion bit to avoid getting completions with error for NOP
WRs. However, NEC is only supported starting with firmware 2.2.232,
so we use constant-sized WRs for older firmware. And, since MLX QPs
only support SEND, we use constant-sized WRs in this case.
When stamping during NOP posting, do the stamping only after setting
the NOP WQE valid bit.
Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
We use struct mlx4_buf for kernel QP, CQ and SRQ buffers, and the code
to look up an entry is duplicated in get_cqe_from_buf() and the QP and
SRQ versions of get_wqe(). Factor this out into mlx4_buf_offset().
This will also make it easier to switch over to using vmap() for buffers.
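A sketch of the factored-out helper (field names assumed):

    void *mlx4_buf_offset(struct mlx4_buf *buf, int offset)
    {
            if (buf->nbufs == 1)
                    return buf->u.direct.buf + offset;
            else
                    return buf->u.page_list[offset >> PAGE_SHIFT].buf +
                            (offset & (PAGE_SIZE - 1));
    }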
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Because of a typo, mlx4_ib_post_send() takes the same lock rq.lock as
mlx4_ib_post_recv(). Correct the code so the intended sq.lock is
taken when posting a send.
Noticed by Yossi Leybovitch and pointed out by Jack Morgenstein from
Mellanox.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Add sanity checks to send queue sizes passed in from userspace. The
minimum sq stride value below is taken from the MT25408 PRM (section
11.10, Table 306, log_sq_stride definition).
Without this check, userspace can submit arbitrarily large/small
values for the number of WQEs and the stride, which can crash the
kernel.
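A sketch of the checks (names assumed):

    if (ucmd->log_sq_stride >
        ilog2(roundup_pow_of_two(dev->dev->caps.max_sq_desc_sz)) ||
        ucmd->log_sq_stride < MLX4_IB_MIN_SQ_STRIDE)
            return -EINVAL;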
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Use a __set_data_seg() helper in mlx4_ib_post_recv() too; in addition
to making the code easier to read, this also allows gcc to generate
better code -- on x86_64:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-8 (-8)
function old new delta
mlx4_ib_post_recv 359 351 -8
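A sketch of such a helper:

    static void __set_data_seg(struct mlx4_wqe_data_seg *dseg,
                               struct ib_sge *sg)
    {
            dseg->byte_count = cpu_to_be32(sg->length);
            dseg->lkey       = cpu_to_be32(sg->lkey);
            dseg->addr       = cpu_to_be64(sg->addr);
    }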
Signed-off-by: Roland Dreier <rolandd@cisco.com>
This is an addendum to commit 0e6e7416 ("IB/mlx4: Handle new FW
requirement for send request prefetching"). We also need to handle
prefetch marking properly for S/G segments, or else the HCA may end up
processing S/G segments that are not fully written and end up sending
the wrong data. This can actually cause data corruption in practice,
especially on systems with relatively slow CPUs (where the HCA is more
likely to prefetch while the CPU is in the middle of writing a work
request into memory).
We write S/G segments in reverse order into the WQE, in order to
guarantee that the first dword of all cachelines containing S/G
segments is written last (overwriting the headroom invalidation
pattern). The entire cacheline will thus contain valid data when the
invalidation pattern is overwritten.
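A sketch of the reverse-order loop in the post-send path:

    dseg = wqe;
    dseg += wr->num_sge - 1;
    for (i = wr->num_sge - 1; i >= 0; --i, --dseg)
            set_data_seg(dseg, wr->sg_list + i);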
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
The error handling code at err_wrid in create_qp_common() does not
handle a userspace QP attached to an SRQ correctly, since it ends up
in the else clause of the if statement. This means it tries to
kfree() the uninitialized qp->sq.wrid and qp->rq.wrid pointers. Fix
this so we only free the wrid arrays for kernel QPs.
Pointed out by Michael S. Tsirkin <mst@dev.mellanox.co.il>.
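A sketch of the corrected error path:

    if (!pd->uobject) {     /* only kernel QPs own the wrid arrays */
            kfree(qp->sq.wrid);
            kfree(qp->rq.wrid);
    }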
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Temporarily allocated struct mlx4_qp_context *context is leaked by
several error paths. The patch takes advantage of the return value
'err' being preinitialized to -EINVAL.
Spotted by Coverity (CID 1768).
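A sketch of the unified exit path:

    out:
            kfree(context);
            return err;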
Signed-off-by: Florin Malita <fmalita@gmail.com>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Factor code to set remote address, atomic and datagram segments out of
mlx4_ib_post_send() into small helper functions. This doesn't change
the generated code in any significant way, and makes the source easier
on the eyes.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Factor code to set data segment entries out of mlx4_ib_post_send()
into set_data_seg(). This cleans up the code and lets the compiler do
a better job -- on x86_64:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-16 (-16)
function old new delta
mlx4_ib_post_send 1598 1582 -16
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Return the receive queue sizes for both userspace QPs and kernel QPs
(not just kernel QPs) from mlx4_ib_query_qp(). Also zero the send
queue sizes for userspace QPs to avoid a possible information leak,
and set the max_inline_data for kernel QPs to 0 since inline sends are
not supported for kernel QPs.
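A sketch of the query-QP behavior described above (field names
assumed):

    if (!ibqp->uobject) {
            qp_attr->cap.max_send_wr     = qp->sq.wqe_cnt;
            qp_attr->cap.max_send_sge    = qp->sq.max_gs;
            qp_attr->cap.max_inline_data = 0;   /* no kernel inline sends */
    } else {
            qp_attr->cap.max_send_wr  = 0;      /* avoid an info leak */
            qp_attr->cap.max_send_sge = 0;
    }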
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When clearing the ib_ah_attr parameter in to_ib_ah_attr(), use sizeof
*ib_ah_attr instead of sizeof *path. This is the same bug as was
fixed for mthca in 99d4f22e ("IB/mthca: Use correct structure size in
call to memset()"), but the code was cut and pasted into mlx4 before the
fix was merged.
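The fix, in essence:

    memset(ib_ah_attr, 0, sizeof *ib_ah_attr);  /* was: sizeof *path */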
Signed-off-by: Dotan Barak <dotanb@mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
When a QP is in the INIT state, the sched_queue field hasn't been given
to the firmware yet, so the firmware cannot return the value when the QP
is queried. To handle this, use the port number that is saved in the
driver's QP data structure.
Found by Dotan Barak and Yaron Gepstein of Mellanox.
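A sketch of the fallback (field names assumed):

    if (qp->state == IB_QPS_INIT)
            qp_attr->port_num = qp->port;   /* sched_queue not valid yet */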
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Correct the mask used to get the flow label, since the field is 20 bits,
not 24 bits.
Found by Dotan Barak and Yaron Gepstein of Mellanox.
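A sketch of the corrected mask (field name assumed; the flow label is
20 bits wide):

    ib_ah_attr->grh.flow_label =
            be32_to_cpu(path->tclass_flowlabel) & 0xfffff;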
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd@cisco.com>