linux/drivers/infiniband/sw/rxe
Alexandru Moise 1661d3b0e2 nvmet,rxe: defer ip datagram sending to tasklet
This addresses 3 separate problems:

1. When using NVME over Fabrics we may end up sending IP
packets in interrupt context, we should defer this work
to a tasklet.

[   50.939957] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x1f/0xa0
[   50.942602] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G        W         4.17.0-rc3-ARCH+ #104
[   50.945466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
[   50.948163] RIP: 0010:__local_bh_enable_ip+0x1f/0xa0
[   50.949631] RSP: 0018:ffff88009c183900 EFLAGS: 00010006
[   50.951029] RAX: 0000000080010403 RBX: 0000000000000200 RCX: 0000000000000001
[   50.952636] RDX: 0000000000000000 RSI: 0000000000000200 RDI: ffffffff817e04ec
[   50.954278] RBP: ffff88009c183910 R08: 0000000000000001 R09: 0000000000000614
[   50.956000] R10: ffffea00021d5500 R11: 0000000000000001 R12: ffffffff817e04ec
[   50.957779] R13: 0000000000000000 R14: ffff88009566f400 R15: ffff8800956c7000
[   50.959402] FS:  0000000000000000(0000) GS:ffff88009c180000(0000) knlGS:0000000000000000
[   50.961552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.963798] CR2: 000055c4ec0ccac0 CR3: 0000000002209001 CR4: 00000000000606e0
[   50.966121] Call Trace:
[   50.966845]  <IRQ>
[   50.967497]  __dev_queue_xmit+0x62d/0x690
[   50.968722]  dev_queue_xmit+0x10/0x20
[   50.969894]  neigh_resolve_output+0x173/0x190
[   50.971244]  ip_finish_output2+0x2b8/0x370
[   50.972527]  ip_finish_output+0x1d2/0x220
[   50.973785]  ? ip_finish_output+0x1d2/0x220
[   50.975010]  ip_output+0xd4/0x100
[   50.975903]  ip_local_out+0x3b/0x50
[   50.976823]  rxe_send+0x74/0x120
[   50.977702]  rxe_requester+0xe3b/0x10b0
[   50.978881]  ? ip_local_deliver_finish+0xd1/0xe0
[   50.980260]  rxe_do_task+0x85/0x100
[   50.981386]  rxe_run_task+0x2f/0x40
[   50.982470]  rxe_post_send+0x51a/0x550
[   50.983591]  nvmet_rdma_queue_response+0x10a/0x170
[   50.985024]  __nvmet_req_complete+0x95/0xa0
[   50.986287]  nvmet_req_complete+0x15/0x60
[   50.987469]  nvmet_bio_done+0x2d/0x40
[   50.988564]  bio_endio+0x12c/0x140
[   50.989654]  blk_update_request+0x185/0x2a0
[   50.990947]  blk_mq_end_request+0x1e/0x80
[   50.991997]  nvme_complete_rq+0x1cc/0x1e0
[   50.993171]  nvme_pci_complete_rq+0x117/0x120
[   50.994355]  __blk_mq_complete_request+0x15e/0x180
[   50.995988]  blk_mq_complete_request+0x6f/0xa0
[   50.997304]  nvme_process_cq+0xe0/0x1b0
[   50.998494]  nvme_irq+0x28/0x50
[   50.999572]  __handle_irq_event_percpu+0xa2/0x1c0
[   51.000986]  handle_irq_event_percpu+0x32/0x80
[   51.002356]  handle_irq_event+0x3c/0x60
[   51.003463]  handle_edge_irq+0x1c9/0x200
[   51.004473]  handle_irq+0x23/0x30
[   51.005363]  do_IRQ+0x46/0xd0
[   51.006182]  common_interrupt+0xf/0xf
[   51.007129]  </IRQ>

2. Work must always be offloaded to tasklet for rxe_post_send_kernel()
when using NVMEoF in order to solve lock ordering between neigh->ha_lock
seqlock and the nvme queue lock:

[   77.833783]  Possible interrupt unsafe locking scenario:
[   77.833783]
[   77.835831]        CPU0                    CPU1
[   77.837129]        ----                    ----
[   77.838313]   lock(&(&n->ha_lock)->seqcount);
[   77.839550]                                local_irq_disable();
[   77.841377]                                lock(&(&nvmeq->q_lock)->rlock);
[   77.843222]                                lock(&(&n->ha_lock)->seqcount);
[   77.845178]   <Interrupt>
[   77.846298]     lock(&(&nvmeq->q_lock)->rlock);
[   77.847986]
[   77.847986]  *** DEADLOCK ***

3. Same goes for the lock ordering between sch->q.lock and nvme queue lock:

[   47.634271]  Possible interrupt unsafe locking scenario:
[   47.634271]
[   47.636452]        CPU0                    CPU1
[   47.637861]        ----                    ----
[   47.639285]   lock(&(&sch->q.lock)->rlock);
[   47.640654]                                local_irq_disable();
[   47.642451]                                lock(&(&nvmeq->q_lock)->rlock);
[   47.644521]                                lock(&(&sch->q.lock)->rlock);
[   47.646480]   <Interrupt>
[   47.647263]     lock(&(&nvmeq->q_lock)->rlock);
[   47.648492]
[   47.648492]  *** DEADLOCK ***

Using NVMEoF after this patch seems to finally be stable, without it,
rxe eventually deadlocks the whole system and causes RCU stalls.

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 10:39:51 -04:00
..
Kconfig IB/rxe: Change RDMA_RXE kconfig to use select 2018-01-29 12:58:00 -07:00
Makefile License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
rxe_av.c rxe: Do not use 'struct sockaddr' in a uapi header 2018-02-14 16:31:35 -07:00
rxe_comp.c IB/rxe: Convert timers to use timer_setup() 2017-10-25 15:24:49 -04:00
rxe_cq.c RDMA/rxe: Use structs to describe the uABI instead of opencoding 2018-03-15 15:58:02 -06:00
rxe_hdr.h IB/rxe: Enable type checking on SKB_TO_PKT() and PKT_TO_SKB() arguments 2017-01-10 16:52:47 -05:00
rxe_hw_counters.c IB/rxe: Make rxe_counter_name static 2017-08-24 16:44:48 -04:00
rxe_hw_counters.h IB/rxe: Add port protocol stats 2017-04-21 10:43:28 -04:00
rxe_icrc.c IB/rxe: Offload CRC calculation when possible 2017-04-21 10:45:02 -04:00
rxe_loc.h RDMA/rxe: Use structs to describe the uABI instead of opencoding 2018-03-15 15:58:02 -06:00
rxe_mcast.c IB/rxe: Remove a pointless indirection layer 2017-01-10 16:52:47 -05:00
rxe_mmap.c IB/rxe: Constify static rxe_vm_ops 2017-07-24 08:43:12 -04:00
rxe_mr.c IB/rxe: Avoid ICRC errors by copying into the skb first 2017-08-28 19:12:36 -04:00
rxe_net.c rdma_rxe: make rxe work over 802.1q VLAN devices 2018-03-14 16:33:25 -04:00
rxe_net.h IB/rxe: add the static type to the variable 2018-01-08 17:43:06 -05:00
rxe_opcode.c IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV 2018-04-27 14:20:47 -04:00
rxe_opcode.h
rxe_param.h rxe: expose num_possible_cpus() cnum_comp_vectors 2017-05-04 19:33:02 -04:00
rxe_pool.c IB/rxe: put the pool on allocation failure 2017-10-09 12:10:41 -04:00
rxe_pool.h IB/rxe: Let the compiler check the type of the cleanup functions 2017-01-10 16:52:47 -05:00
rxe_qp.c RDMA/rxe: Use structs to describe the uABI instead of opencoding 2018-03-15 15:58:02 -06:00
rxe_queue.c RDMA/rxe: Use structs to describe the uABI instead of opencoding 2018-03-15 15:58:02 -06:00
rxe_queue.h RDMA/rxe: Use structs to describe the uABI instead of opencoding 2018-03-15 15:58:02 -06:00
rxe_recv.c IB/rxe: optimize mcast recv process 2018-03-29 13:25:22 -06:00
rxe_req.c IB/rxe: avoid double kfree_skb 2018-04-27 14:20:47 -04:00
rxe_resp.c IB/rxe: avoid double kfree_skb 2018-04-27 14:20:47 -04:00
rxe_srq.c RDMA/rxe: Use structs to describe the uABI instead of opencoding 2018-03-15 15:58:02 -06:00
rxe_sysfs.c IB/rxe: improved debug prints & code cleanup 2016-10-06 13:50:04 -04:00
rxe_task.c RDMA/rxe: Suppress gcc 7 fall-through complaints 2017-10-14 20:47:07 -04:00
rxe_task.h IB/rxe: Wait for tasklets to finish before tearing down QP 2016-12-12 16:31:45 -05:00
rxe_verbs.c nvmet,rxe: defer ip datagram sending to tasklet 2018-05-09 10:39:51 -04:00
rxe_verbs.h IB/rxe: Remove unused variable (char *rxe_qp_state_name[]) 2018-02-28 13:57:40 -07:00
rxe.c IB/rxe: change the function rxe_init_device_param type 2018-03-07 15:56:15 -07:00
rxe.h RDMA/rxe: Fix uABI structure layouts for 32/64 compat 2018-03-27 14:25:09 -06:00