linux/drivers/infiniband/core
Dennis Dalessandro ba7d8117f3 IB/core, ipoib: Do not overreact to SM LID change event
When IPoIB receives an SM LID change event, it reacts by flushing its
path record cache and rejoining multicast groups. This is the same
behavior it performs when it receives a reregistration event. This
behavior is unnecessary as an SM may have database backup or
synchronization mechanisms which permit the SM location or LID to change
without loss of multicast membership and without impact to path records.

Both opensm and the OPA FM issue reregistration events if a new SM is
started (or restarted with a new config) or an SM event occurs which
results in loss of multicast membership records by the SM (such as
opensm failover) or the SM encounters new nodes with Active ports (such
as after joining 2 fabrics by connecting switches via ISLs). Hence this
event can be depended on as the trigger for IPoIB cache and multicast
flushing.

It appears that some drivers, such as qib, and hfi1 issue the
IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not.
Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed
that Mellanox EDR HCAs do not generate SM change events and that opensm
does generate reregistration.

An SM LID change event is generated by the mentioned drivers to reflect
that sm_lid and/or sm_sl in the local port info has changed. The intent
of this event is to permit applications and ULPs which have a local copy
of this information (or an address handle using it) to update their
information.

The intent is that the reregistration event (caused by the SM via a bit
in Set(PortInfo)) be used to inform nodes that they need to rejoin
multicast groups, resubscribe for notices and potentially update path
records.

When an SM migrates or fails over, a SM LID change event can occur. In
response IPoIB discards path records and multicast membership and loses
connectivity until these records are restored via SA requests. In very
large fabrics, it may take minutes for the SM to be ready and for the SA
responses to be supplied.  This can result in undesirable and
unnecessary IPoIB connectivity impacts. It also can result in an
unnecessary storm of SA queries from all nodes in a cluster potentially
followed by yet another storm if the SM issues the reregistration
request.

The fact the Mellanox HCAs do not even generate this event, is further
evidence that on modern IB fabrics there will be no ill side effects
from the proposed changes below to reduce the reaction by 3 kernel
components to this event. So these changes should be benign for Mellanox
IB fabrics and will benefit OPA fabrics while also making ib_core and
ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event
APIs.

Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib.
IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do
anything on SM LID change. IPoIB makes use of other ib_core components
to issue SA requests for it and those components correctly track SM LID
and SM LID changes.

Also in ib_core multicast handling,  remove the test for
IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the
error state, which will trigger rejoins. This code is used by IPoIB as
well as the connection manager and other clients of multicast groups.
This kernel module centralizes group membership status and joins since a
node can only join a given group once but multiple ULPs or applications
may want to join the same group. It makes use of the sa_query.c
component in ib_core, which correctly trackes SM LID and SL. This
component does not track SM LID nor SL itself and hence need not react
to their changes.

Similarly in the ib_core cache code remove the handling for the
IB_EVENT_SM_CHANGE.  In this function. The ib_cache_update function
which is ultimately called is updating local copies of the pkey table,
gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As
such it does not need to be called on an SM LID change. It technically
also does not need to be called on a reregistration. The LID_CHANGE,
PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR,
PORT_ACTICE) should be sufficient triggers.

It is worth noting that the alternative of simply having the hfi1 and
qib drivers not generate the SM LID change event was explored. While
this would duplicate what Mellanox drivers do now, it is not the correct
behavior and removes the ability for an SM to migrate without requiring
reregistration. Since both opensm and OPA SM have mechanisms to backup
or synchronize registration information, it is desirable to let them
perform SM migrations (with LID or SL changes) without requiring
reregistration when they deem it appropriate.

Suggested-by: Todd Rimmer <todd.rimmer@intel.com>
Tested-by: Michael Brooks <michael.brooks@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Todd Rimmer <todd.rimmer@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07 16:06:03 -03:00
..
addr.c RDMA/cma: Use rdma_read_gid_attr_ndev_rcu to access netdev 2019-05-03 11:10:03 -03:00
agent.c RDMA: Mark if destroy address handle is in a sleepable context 2018-12-19 16:28:03 -07:00
agent.h
cache.c IB/core, ipoib: Do not overreact to SM LID change event 2019-05-07 16:06:03 -03:00
cgroup.c IB/core: Simplify rdma cgroup registration 2019-01-18 13:43:10 -07:00
cm_msgs.h RDMA: Use __packed annotation instead of __attribute__ ((packed)) 2019-03-25 21:14:12 -03:00
cm.c IB/cm: Reduce dependency on gid attribute ndev check 2019-05-03 11:10:02 -03:00
cma_configfs.c RDMA/cma: Move cma module specific functions to cma_priv.h 2018-11-22 11:57:33 -07:00
cma_priv.h IB/cma: Define option to set ack timeout and pack tos_set 2019-02-08 16:14:21 -07:00
cma.c RDMA/cma: Use rdma_read_gid_attr_ndev_rcu to access netdev 2019-05-03 11:10:03 -03:00
core_priv.h RDMA/core: Do not invoke init_port on compat devices 2019-05-03 10:41:23 -03:00
cq.c IB: Pass only ib_udata in function prototypes 2019-04-01 15:00:47 -03:00
device.c RDMA/device: Don't fire uevent before device is fully initialized 2019-05-07 13:02:43 -03:00
fmr_pool.c RDMA: Start use ib_device_ops 2018-12-12 07:40:16 -07:00
iwcm.c RDMA: Get rid of iw_cm_verbs 2019-05-03 10:56:56 -03:00
iwcm.h
iwpm_msg.c RDMA/iwpm: move kdoc comments to functions 2019-02-05 15:40:41 -07:00
iwpm_util.c RDMA/iwpm: Remove set but not used variable 'msg_seq' 2019-02-14 14:47:39 -07:00
iwpm_util.h RDMA/IWPM: Support no port mapping requirements 2019-02-04 16:26:02 -07:00
mad_priv.h RDMA: Use __packed annotation instead of __attribute__ ((packed)) 2019-03-25 21:14:12 -03:00
mad_rmpp.c RDMA: Mark if destroy address handle is in a sleepable context 2018-12-19 16:28:03 -07:00
mad_rmpp.h
mad.c IB/MAD: Add SMP details to MAD tracing 2019-03-27 15:52:01 -03:00
Makefile IB/{core,uverbs}: Move ib_umem_xxx functions from ib_core to ib_uverbs 2019-01-10 17:06:44 -07:00
mr_pool.c
multicast.c IB/core, ipoib: Do not overreact to SM LID change event 2019-05-07 16:06:03 -03:00
netlink.c RDMA/cma: Remove CM_ID statistics provided by rdma-cm module 2019-02-05 15:30:33 -07:00
nldev.c RDMA/core: Add a netlink command to change net namespace of rdma device 2019-04-22 14:44:58 -03:00
opa_smi.h RDMA: Start use ib_device_ops 2018-12-12 07:40:16 -07:00
packer.c
rdma_core.c uverbs: Convert idr to XArray 2019-04-25 12:27:11 -03:00
rdma_core.h IB: Pass uverbs_attr_bundle down uobject destroy path 2019-04-01 14:55:36 -03:00
restrack.c XArray updates for 5.1-rc1 2019-03-11 20:06:18 -07:00
restrack.h RDMA/restrack: Prepare restrack_root to addition of extra fields per-type 2019-02-19 10:13:38 -07:00
roce_gid_mgmt.c IB/core: Fix oops in netdev_next_upper_dev_rcu() 2018-12-12 12:14:49 -05:00
rw.c IB/core: Remove ib_sg_dma_address() and ib_sg_dma_len() 2019-02-04 14:34:07 -07:00
sa_query.c ib core: Convert query_idr to XArray 2019-03-26 11:47:05 -03:00
sa.h
security.c RDMA/device: Consolidate ib_device per_port data into one place 2019-02-19 10:13:39 -07:00
smi.c
smi.h RDMA: Start use ib_device_ops 2018-12-12 07:40:16 -07:00
sysfs.c RDMA: Add EFA related definitions 2019-05-06 13:47:50 -03:00
ucm.c ucm: Convert ctx_id_table to XArray 2019-03-26 11:50:29 -03:00
ucma.c IB/cma: Define option to set ack timeout and pack tos_set 2019-02-08 16:14:21 -07:00
ud_header.c
umem_odp.c RDMA/umem: Remove hugetlb flag 2019-05-06 13:08:11 -03:00
umem.c RDMA/umem: Remove hugetlb flag 2019-05-06 13:08:11 -03:00
user_mad.c RDMA: Check net namespace access for uverbs, umad, cma and nldev 2019-03-28 14:52:02 -03:00
uverbs_cmd.c IB/core: Set qp->real_qp before it may be accessed 2019-05-03 10:17:45 -03:00
uverbs_ioctl.c RDMA/uverbs: Initialize udata struct on destroy flows 2019-05-02 17:07:02 -03:00
uverbs_main.c Merge branch 'rdma_mmap' into rdma.git for-next 2019-04-24 16:20:34 -03:00
uverbs_marshall.c
uverbs_std_types_counters.c IB: When attrs.udata/ufile is available use that instead of uobject 2019-04-08 13:05:25 -03:00
uverbs_std_types_cq.c IB: When attrs.udata/ufile is available use that instead of uobject 2019-04-08 13:05:25 -03:00
uverbs_std_types_device.c IB/uverbs: Fix ioctl query port to consider device disassociation 2019-01-25 11:58:06 -07:00
uverbs_std_types_dm.c IB: When attrs.udata/ufile is available use that instead of uobject 2019-04-08 13:05:25 -03:00
uverbs_std_types_flow_action.c IB: When attrs.udata/ufile is available use that instead of uobject 2019-04-08 13:05:25 -03:00
uverbs_std_types_mr.c IB: Pass uverbs_attr_bundle down ib_x destroy path 2019-04-01 14:57:35 -03:00
uverbs_std_types.c IB: Remove 'uobject->context' dependency in object destroy APIs 2019-04-01 14:59:35 -03:00
uverbs_uapi.c IB/mlx5: Introduce MLX5_IB_OBJECT_DEVX_ASYNC_CMD_FD 2019-01-29 13:32:43 -07:00
uverbs.h uverbs: Convert idr to XArray 2019-04-25 12:27:11 -03:00
verbs.c RDMA: Add EFA related definitions 2019-05-06 13:47:50 -03:00