linux/net/sunrpc
Chuck Lever 671c450b6f xprtrdma: Fix oops in Receive handler after device removal
Since v5.4, a device removal occasionally triggered this oops:

Dec  2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219
Dec  2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode
Dec  2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page
Dec  2 17:13:53 manet kernel: PGD 0 P4D 0
Dec  2 17:13:53 manet kernel: Oops: 0000 [#1] SMP
Dec  2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G        W         5.4.0-00050-g53717e43af61 #883
Dec  2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Dec  2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
Dec  2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma]
Dec  2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 <48> 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0
Dec  2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246
Dec  2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048
Dec  2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001
Dec  2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000
Dec  2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428
Dec  2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010
Dec  2 17:13:53 manet kernel: FS:  0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
Dec  2 17:13:53 manet kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec  2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0
Dec  2 17:13:53 manet kernel: Call Trace:
Dec  2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core]
Dec  2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core]
Dec  2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd
Dec  2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec  2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a
Dec  2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec  2 17:13:53 manet kernel: kthread+0xf4/0xf9
Dec  2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74
Dec  2 17:13:53 manet kernel: ret_from_fork+0x24/0x30

The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that
is still pointing to the old ib_device, which has been freed. The
only way that is possible is if this rpcrdma_rep was not destroyed
by rpcrdma_ia_remove.

Debugging showed that was indeed the case: this rpcrdma_rep was
still in use by a completing RPC at the time of the device removal,
and thus wasn't on the rep free list. So, it was not found by
rpcrdma_reps_destroy().

The fix is to introduce a list of all rpcrdma_reps so that they all
can be found when a device is removed. That list is used to perform
only regbuf DMA unmapping, replacing that call to
rpcrdma_reps_destroy().

Meanwhile, to prevent corruption of this list, I've moved the
destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs().
rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is
not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus
protecting the rb_all_reps list.

Fixes: b0b227f071 ("xprtrdma: Use an llist to manage free rpcrdma_reps")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14 13:30:24 -05:00
..
auth_gss SUNRPC: Fix svcauth_gss_proxy_init() 2019-10-30 16:32:37 -04:00
xprtrdma xprtrdma: Fix oops in Receive handler after device removal 2020-01-14 13:30:24 -05:00
addr.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
auth_null.c SUNRPC: Add rpc_auth::au_ralign field 2019-02-14 11:48:36 -05:00
auth_unix.c SUNRPC: Use the client user namespace when encoding creds 2019-04-26 16:24:32 -04:00
auth.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
backchannel_rqst.c SUNRPC: Destroy the back channel when we destroy the host transport 2019-10-30 12:04:35 -04:00
cache.c sunrpc: fix crash when cache_head become valid before update 2019-10-11 13:13:49 -04:00
clnt.c NFSoRDMA Client Updates for Linux 5.5 2019-11-18 10:55:55 +01:00
debugfs.c NFS client updates for Linux 5.3 2019-07-18 14:32:33 -07:00
Kconfig SUNRPC: Drop redundant CONFIG_ from CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES 2019-07-06 14:54:53 -04:00
Makefile SUNRPC: remove generic cred code. 2018-12-19 13:52:46 -05:00
netns.h
rpc_pipe.c kernel/notifier.c: remove blocking_notifier_chain_cond_register() 2019-12-04 19:44:12 -08:00
rpcb_clnt.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
sched.c SUNRPC: Capture completion of all RPC tasks 2019-11-22 19:09:38 +01:00
socklib.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
stats.c Merge branch 'multipath_tcp' 2019-07-06 14:54:52 -04:00
sunrpc_syms.c treewide: Add SPDX license identifier for more missed files 2019-05-21 10:50:45 +02:00
sunrpc.h sunrpc: whitespace fixes 2018-07-31 12:53:40 -04:00
svc_xprt.c nfs: fix out-of-date connectathon talk URL 2019-07-03 17:52:09 -04:00
svc.c SUNRPC: Trace gssproxy upcall results 2019-10-30 16:32:07 -04:00
svcauth_unix.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
svcauth.c SUNRPC: Trace gssproxy upcall results 2019-10-30 16:32:07 -04:00
svcsock.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
sysctl.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
timer.c treewide: Add SPDX license identifier for missed files 2019-05-21 10:50:45 +02:00
xdr.c SUNRPC: Fix another issue with MIC buffer space 2019-11-18 11:05:42 +01:00
xprt.c NFSoRDMA Client Updates for Linux 5.5 2019-11-18 10:55:55 +01:00
xprtmultipath.c SUNRPC: Optimise transport balancing code 2019-07-18 14:43:52 -04:00
xprtsock.c This is a relatively quiet cycle for nfsd, mainly various bugfixes. 2019-12-07 16:56:00 -08:00