linux/drivers/infiniband/hw/hfi1
Kaike Wan b25e8e85e7 RDMA/hfi1: Correct an interlock issue for TID RDMA WRITE request
The following message occurs when running an AI application with TID RDMA
enabled:

hfi1 0000:7f:00.0: hfi1_0: [QP74] hfi1_tid_timeout 4084
hfi1 0000:7f:00.0: hfi1_0: [QP70] hfi1_tid_timeout 4084

The issue happens when a TID RDMA WRITE request is followed by an
IB_WR_RDMA_WRITE_WITH_IMM request: the latter can be completed first on
the responder side. As a result, no ACK packet for the latter can be
sent because the TID RDMA WRITE request is still being processed on the
responder side.

When the TID RDMA WRITE request is eventually completed, the requester
will wait for the IB_WR_RDMA_WRITE_WITH_IMM request to be acknowledged.

If the next request is another TID RDMA WRITE request, no TID RDMA WRITE
DATA packet could be sent because the preceding IB_WR_RDMA_WRITE_WITH_IMM
request is not completed yet.

Consequently, the IB_WR_RDMA_WRITE_WITH_IMM request will be retried, but
the retries will be ignored on the responder side because the responder
thinks the request has already been completed. Eventually the retries
will be exhausted and the QP will be put into the error state on the
requester side. On the responder side, the TID resource timer will
eventually expire because no TID RDMA WRITE DATA packets will be received
for the second TID RDMA WRITE request. There is also a risk of
write-after-write memory corruption due to the issue.

Fix this by adding a requester-side interlock to prevent any potential
data corruption and TID RDMA protocol errors.
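
The idea of a requester-side interlock can be illustrated with a minimal
userspace sketch. This is not the driver's code: the opcode names,
struct send_wqe, and must_interlock() are hypothetical, and the sketch
assumes the requester tracks how many segments of the preceding TID RDMA
WRITE request have been acknowledged.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcodes; these are not the kernel's enum ib_wr_opcode. */
enum wr_opcode {
	WR_SEND,
	WR_RDMA_WRITE,
	WR_RDMA_WRITE_WITH_IMM,
	WR_TID_RDMA_WRITE,
};

/*
 * Hypothetical send work request. The segment counters are only
 * meaningful when opcode is WR_TID_RDMA_WRITE.
 */
struct send_wqe {
	enum wr_opcode opcode;
	unsigned int total_segs;	/* segments in the request */
	unsigned int acked_segs;	/* segments acknowledged so far */
};

/*
 * Return true when cur must be held back on the requester because the
 * preceding TID RDMA WRITE request has not been fully acknowledged yet.
 */
static bool must_interlock(const struct send_wqe *prev,
			   const struct send_wqe *cur)
{
	assert(prev->acked_segs <= prev->total_segs);

	switch (cur->opcode) {
	case WR_SEND:
	case WR_RDMA_WRITE:
	case WR_RDMA_WRITE_WITH_IMM:
		/* Do not let these overtake an unfinished TID RDMA WRITE. */
		return prev->opcode == WR_TID_RDMA_WRITE &&
		       prev->acked_segs != prev->total_segs;
	default:
		return false;
	}
}
```

With a preceding TID RDMA WRITE request that has 2 of 4 segments
acknowledged, must_interlock() returns true for a following
WR_RDMA_WRITE_WITH_IMM request; once all 4 segments are acknowledged it
returns false and the request may be sent.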

Fixes: a0b34f75ec ("IB/hfi1: Add interlock between a TID RDMA request and other requests")
Link: https://lore.kernel.org/r/20200811174931.191210.84093.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org> # 5.4.x+
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-08-20 08:31:41 -03:00
affinity.c IB/hfi1: Add interrupt handler functions for accelerated ipoib 2020-05-21 11:23:56 -03:00
affinity.h IB/hfi1: Add interrupt handler functions for accelerated ipoib 2020-05-21 11:23:56 -03:00
aspm.c
aspm.h
chip_registers.h IB/hfi1: Add RcvShortLengthErrCnt to hfi1stats 2020-01-10 10:57:17 -04:00
chip.c IB/hfi1: Remove unnecessary fall-through markings 2020-07-16 15:00:02 -03:00
chip.h IB/hfi1: Add interrupt handler functions for accelerated ipoib 2020-05-21 11:23:56 -03:00
common.h IB/hfi1: Enable the transmit side of the datagram ipoib netdev 2020-05-21 11:23:58 -03:00
debugfs.c IB/hfi1: Fix module use count flaw due to leftover module put calls 2020-06-24 15:54:08 -03:00
debugfs.h
device.c
device.h
driver.c IB/hfi1: Add packet histogram trace event 2020-05-21 11:23:57 -03:00
efivar.c infiniband: hfi1: Use EFI GetVariable only when available 2020-02-23 21:59:42 +01:00
efivar.h
eprom.c
eprom.h
exp_rcv.c
exp_rcv.h
fault.c IB/hfi1: Use scnprintf() for avoiding potential buffer overflow 2020-03-26 15:06:14 -03:00
fault.h
file_ops.c IB/hfi1: Remove module parameter for KDETH qpns 2020-05-21 11:23:54 -03:00
firmware.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
hfi.h IB/hfi1: Activate the dummy netdev 2020-05-21 11:23:56 -03:00
init.c IB/hfi1: Do not destroy link_wq when the device is shut down 2020-07-02 13:54:50 -03:00
intr.c
iowait.c IB/hfi1: Don't cancel unused work item 2020-01-03 16:41:51 -04:00
iowait.h RDMA/hfi1: Fix trivial mis-spelling of 'descriptor' 2020-06-15 15:56:54 -03:00
ipoib_main.c IB/hfi1: Add rx functions for dummy netdev 2020-05-21 11:23:56 -03:00
ipoib_rx.c IB/hfi1: Activate the dummy netdev 2020-05-21 11:23:56 -03:00
ipoib_tx.c IB/hfi1: Add atomic triggered sleep/wakeup 2020-06-24 16:13:38 -03:00
ipoib.h IB/hfi1: Add atomic triggered sleep/wakeup 2020-06-24 16:13:38 -03:00
Kconfig treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
mad.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
mad.h RDMA: Replace zero-length array with flexible-array member 2020-02-20 13:33:51 -04:00
Makefile IB/hfi1: Add functions to receive accelerated ipoib packets 2020-05-21 11:23:56 -03:00
mmu_rb.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
mmu_rb.h
msix.c IB/hfi1: Activate the dummy netdev 2020-05-21 11:23:56 -03:00
msix.h IB/hfi1: Activate the dummy netdev 2020-05-21 11:23:56 -03:00
netdev_rx.c IB/hfi1: Restore kfree in dummy_netdev cleanup 2020-06-24 15:54:08 -03:00
netdev.h IB/hfi1: Activate the dummy netdev 2020-05-21 11:23:56 -03:00
opa_compat.h
opfn.c
opfn.h
pcie.c IB/hfi1: Convert PCIBIOS_* errors to generic -E* errors 2020-06-30 13:27:14 -03:00
pio_copy.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
pio.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
pio.h RDMA: Replace zero-length array with flexible-array member 2020-02-20 13:33:51 -04:00
platform.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
platform.h
qp.c RDMA 5.9 merge window pull request 2020-08-06 16:43:36 -07:00
qp.h RDMA/hfi1: Remove hfi1_create_qp declaration 2020-06-22 14:49:27 -03:00
qsfp.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
qsfp.h
rc.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
rc.h
ruc.c
sdma_txreq.h
sdma.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
sdma.h RDMA: Replace zero-length array with flexible-array member 2020-02-20 13:33:51 -04:00
sysfs.c IB/hfi1: Call kobject_put() when kobject_init_and_add() fails 2020-03-27 13:13:36 -03:00
tid_rdma.c RDMA/hfi1: Correct an interlock issue for TID RDMA WRITE request 2020-08-20 08:31:41 -03:00
tid_rdma.h IB/hfi1: Calculate flow weight based on QP MTU for TID RDMA 2019-11-06 13:15:36 -04:00
trace_ctxts.h IB/hfi1: Add packet histogram trace event 2020-05-21 11:23:57 -03:00
trace_dbg.h
trace_ibhdrs.h
trace_iowait.h
trace_misc.h
trace_mmu.h
trace_rc.h
trace_rx.h IB/hfi1: Add fast and slow handlers for receive context 2020-01-10 10:57:16 -04:00
trace_tid.h ftrace: Rework event_create_dir() 2019-11-27 07:44:25 +01:00
trace_tx.h ftrace: Rework event_create_dir() 2019-11-27 07:44:25 +01:00
trace.c IB/hfi1: Add packet histogram trace event 2020-05-21 11:23:57 -03:00
trace.h
uc.c IB/hfi1: Use fallthrough pseudo-keyword 2020-07-24 16:59:55 -03:00
ud.c
user_exp_rcv.c hfi1: get rid of pointless access_ok() 2020-05-29 11:06:32 -04:00
user_exp_rcv.h RDMA: Replace zero-length array with flexible-array member 2020-02-20 13:33:51 -04:00
user_pages.c mm, tree-wide: rename put_user_page*() to unpin_user_page*() 2020-01-31 10:30:38 -08:00
user_sdma.c IB/hfi1: Fix another case where pq is left on waitlist 2020-05-12 11:47:48 -03:00
user_sdma.h
verbs_txreq.c
verbs_txreq.h RDMA/hfi1: Fix trivial mis-spelling of 'descriptor' 2020-06-15 15:56:54 -03:00
verbs.c RDMA: Remove 'max_map_per_fmr' 2020-06-02 20:32:54 -03:00
verbs.h treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
vnic_main.c IB/hfi1: Fix hfi1_netdev_rx_init() error handling 2020-06-02 20:32:54 -03:00
vnic_sdma.c
vnic.h IB/hfi1: Activate the dummy netdev 2020-05-21 11:23:56 -03:00