linux

Author	SHA1	Message	Date
Sagi Grimberg	aaeadd7075	nvmet: fix false keep-alive timeout when a controller is torn down Controller teardown flow may take some time in case it has many I/O queues, and the host may not send us keep-alive during this period. Hence reset the traffic based keep-alive timer so we don't trigger a controller teardown as a result of a keep-alive expiration. Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-26 16:18:22 +02:00
Hou Pu	25df1acd2d	nvmet-tcp: fix inline data size comparison in nvmet_tcp_queue_response Using "<=" instead "<" to compare inline data size. Fixes: `bdaf132791` ("nvmet-tcp: fix a segmentation fault during io parsing error") Signed-off-by: Hou Pu <houpu.main@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-26 16:18:22 +02:00
Sagi Grimberg	042a3eaad6	nvme-tcp: remove incorrect Kconfig dep in BLK_DEV_NVME We need to select NVME_CORE. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-26 16:18:22 +02:00
Hannes Reinecke	4d9442bf26	nvme-fabrics: decode host pathing error for connect Add an additional decoding for 'host pathing error' during connect. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-25 09:21:16 +02:00
Hannes Reinecke	f25f8ef70c	nvme-fc: short-circuit reconnect retries Returning an nvme status from nvme_fc_create_association() indicates that the association is established, and we should honour the DNR bit. If it's set a reconnect attempt will just return the same error, so we can short-circuit the reconnect attempts and fail the connection directly. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-25 09:21:15 +02:00
Guoqing Jiang	3596a06583	nvme: fix potential memory leaks in nvme_cdev_add We need to call put_device if cdev_device_add failed, otherwise kmemleak has below report. [<0000000024c71758>] kmem_cache_alloc_trace+0x233/0x480 [<00000000ad2813ed>] device_add+0x7ff/0xe10 [<0000000035bc54c4>] cdev_device_add+0x72/0xa0 [<000000006c9aa1e8>] nvme_cdev_add+0xa9/0xf0 [nvme_core] [<000000003c4d492d>] nvme_mpath_set_live+0x251/0x290 [nvme_core] [<00000000889a58da>] nvme_mpath_add_disk+0x268/0x320 [nvme_core] [<00000000192e7161>] nvme_alloc_ns+0x669/0xac0 [nvme_core] [<000000007a1a6041>] nvme_validate_or_alloc_ns+0x156/0x280 [nvme_core] [<000000003a763c35>] nvme_scan_work+0x221/0x3c0 [nvme_core] [<000000009ff10706>] process_one_work+0x5cf/0xb10 [<000000000644ee25>] worker_thread+0x7a/0x680 [<00000000285ebd2f>] kthread+0x1c6/0x210 [<00000000e297c6ea>] ret_from_fork+0x22/0x30 Fixes: `2637baed78` ("nvme: introduce generic per-namespace chardev") Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-25 09:21:15 +02:00
James Smart	a7d139145a	nvme-fc: clear q_live at beginning of association teardown The __nvmf_check_ready() routine used to bounce all filesystem io if the controller state isn't LIVE. However, a later patch changed the logic so that it rejection ends up being based on the Q live check. The FC transport has a slightly different sequence from rdma and tcp for shutting down queues/marking them non-live. FC marks its queue non-live after aborting all ios and waiting for their termination, leaving a rather large window for filesystem io to continue to hit the transport. Unfortunately this resulted in filesystem I/O or applications seeing I/O errors. Change the FC transport to mark the queues non-live at the first sign of teardown for the association (when I/O is initially terminated). Fixes: `73a5379937` ("nvme-fabrics: allow to queue requests for live queues") Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:40:24 +02:00
Keith Busch	a0fdd14180	nvme-tcp: rerun io_work if req_list is not empty A possible race condition exists where the request to send data is enqueued from nvme_tcp_handle_r2t()'s will not be observed by nvme_tcp_send_all() if it happens to be running. The driver relies on io_work to send the enqueued request when it is runs again, but the concurrently running nvme_tcp_send_all() may not have released the send_mutex at that time. If no future commands are enqueued to re-kick the io_work, the request will timeout in the SEND_H2C state, resulting in a timeout error like: nvme nvme0: queue 1: timeout request 0x3 type 6 Ensure the io_work continues to run as long as the req_list is not empty. Fixes: `db5ad6b7f8` ("nvme-tcp: try to send request in queue_rq context") Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:33:42 +02:00
Sagi Grimberg	825619b09a	nvme-tcp: fix possible use-after-completion Commit `db5ad6b7f8` ("nvme-tcp: try to send request in queue_rq context") added a second context that may perform a network send. This means that now RX and TX are not serialized in nvme_tcp_io_work and can run concurrently. While there is correct mutual exclusion in the TX path (where the send_mutex protect the queue socket send activity) RX activity, and more specifically request completion may run concurrently. This means we must guarantee that any mutation of the request state related to its lifetime, bytes sent must not be accessed when a completion may have possibly arrived back (and processed). The race may trigger when a request completion arrives, processed _and_ reused as a fresh new request, exactly in the (relatively short) window between the last data payload sent and before the request iov_iter is advanced. Consider the following race: 1. 16K write request is queued 2. The nvme command and the data is sent to the controller (in-capsule or solicited by r2t) 3. After the last payload is sent but before the req.iter is advanced, the controller sends back a completion. 4. The completion is processed, the request is completed, and reused to transfer a new request (write or read) 5. The new request is queued, and the driver reset the request parameters (nvme_tcp_setup_cmd_pdu). 6. Now context in (2) resumes execution and advances the req.iter ==> use-after-completion as this is already a new request. Fix this by making sure the request is not advanced after the last data payload send, knowing that a completion may have arrived already. An alternative solution would have been to delay the request completion or state change waiting for reference counting on the TX path, but besides adding atomic operations to the hot-path, it may present challenges in multi-stage R2T scenarios where a r2t handler needs to be deferred to an async execution. Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com> Tested-by: Anil Mishra <anil.mishra@wdc.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Cc: stable@vger.kernel.org # v5.8+ Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:33:42 +02:00
Wu Bo	03504e3b54	nvme-loop: fix memory leak in nvme_loop_create_ctrl() When creating loop ctrl in nvme_loop_create_ctrl(), if nvme_init_ctrl() fails, the loop ctrl should be freed before jumping to the "out" label. Fixes: `3a85a5de29` ("nvme-loop: add a NVMe loopback host driver") Signed-off-by: Wu Bo <wubo40@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:33:42 +02:00
Wu Bo	fec356a61a	nvmet: fix memory leak in nvmet_alloc_ctrl() When creating ctrl in nvmet_alloc_ctrl(), if the cntlid_min is larger than cntlid_max of the subsystem, and jumps to the "out_free_changed_ns_list" label, but the ctrl->sqs lack of be freed. Fix this by jumping to the "out_free_sqs" label. Fixes: `94a39d61f8` ("nvmet: make ctrl-id configurable") Signed-off-by: Wu Bo <wubo40@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-19 08:33:39 +02:00
Hou Pu	e181811bd0	nvmet: use new ana_log_size instead the old one The new ana_log_size should be used instead of the old one. Or kernel NULL pointer dereference will happen like below: [ 38.957849][ T69] BUG: kernel NULL pointer dereference, address: 000000000000003c [ 38.975550][ T69] #PF: supervisor write access in kernel mode [ 38.975955][ T69] #PF: error_code(0x0002) - not-present page [ 38.976905][ T69] PGD 0 P4D 0 [ 38.979388][ T69] Oops: 0002 [#1] SMP NOPTI [ 38.980488][ T69] CPU: 0 PID: 69 Comm: kworker/0:2 Not tainted 5.12.0+ #54 [ 38.981254][ T69] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 38.982502][ T69] Workqueue: events nvme_loop_execute_work [ 38.985219][ T69] RIP: 0010:memcpy_orig+0x68/0x10f [ 38.986203][ T69] Code: 83 c2 20 eb 44 48 01 d6 48 01 d7 48 83 ea 20 0f 1f 00 48 83 ea 20 4c 8b 46 f8 4c 8b 4e f0 4c 8b 56 e8 4c 8b 5e e0 48 8d 76 e0 <4c> 89 47 f8 4c 89 4f f0 4c 89 57 e8 4c 89 5f e0 48 8d 7f e0 73 d2 [ 38.987677][ T69] RSP: 0018:ffffc900001b7d48 EFLAGS: 00000287 [ 38.987996][ T69] RAX: 0000000000000020 RBX: 0000000000000024 RCX: 0000000000000010 [ 38.988327][ T69] RDX: ffffffffffffffe4 RSI: ffff8881084bc004 RDI: 0000000000000044 [ 38.988620][ T69] RBP: 0000000000000024 R08: 0000000100000000 R09: 0000000000000000 [ 38.988991][ T69] R10: 0000000100000000 R11: 0000000000000001 R12: 0000000000000024 [ 38.989289][ T69] R13: ffff8881084bc000 R14: 0000000000000000 R15: 0000000000000024 [ 38.989845][ T69] FS: 0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 [ 38.990234][ T69] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 38.990490][ T69] CR2: 000000000000003c CR3: 00000001085b2000 CR4: 00000000000006f0 [ 38.991105][ T69] Call Trace: [ 38.994157][ T69] sg_copy_buffer+0xb8/0xf0 [ 38.995357][ T69] nvmet_copy_to_sgl+0x48/0x6d [ 38.995565][ T69] nvmet_execute_get_log_page_ana+0xd4/0x1cb [ 38.995792][ T69] nvmet_execute_get_log_page+0xc9/0x146 [ 38.995992][ T69] nvme_loop_execute_work+0x3e/0x44 [ 38.996181][ T69] process_one_work+0x1c3/0x3c0 [ 38.996393][ T69] worker_thread+0x44/0x3d0 [ 38.996600][ T69] ? cancel_delayed_work+0x90/0x90 [ 38.996804][ T69] kthread+0xf7/0x130 [ 38.996961][ T69] ? kthread_create_worker_on_cpu+0x70/0x70 [ 38.997171][ T69] ret_from_fork+0x22/0x30 [ 38.997705][ T69] Modules linked in: [ 38.998741][ T69] CR2: 000000000000003c [ 39.000104][ T69] ---[ end trace e719927b609d0fa0 ]--- Fixes: `5e1f689913` ("nvme-multipath: fix double initialization of ANA state") Signed-off-by: Hou Pu <houpu.main@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-13 16:33:32 +02:00
Daniel Wagner	85428beac8	nvmet: seset ns->file when open fails Reset the ns->file value to NULL also in the error case in nvmet_file_ns_enable(). The ns->file variable points either to file object or contains the error code after the filp_open() call. This can lead to following problem: When the user first setups an invalid file backend and tries to enable the ns, it will fail. Then the user switches over to a bdev backend and enables successfully the ns. The first received I/O will crash the system because the IO backend is chosen based on the ns->file value: static u16 nvmet_parse_io_cmd(struct nvmet_req *req) { [...] if (req->ns->file) return nvmet_file_parse_io_cmd(req); return nvmet_bdev_parse_io_cmd(req); } Reported-by: Enzo Matsumiya <ematsumiya@suse.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-12 19:48:48 +02:00
Chaitanya Kulkarni	7a4ffd20ec	nvmet: demote fabrics cmd parse err msg to debug Host can send invalid commands and flood the target with error messages. Demote the error message from pr_err() to pr_debug() in nvmet_parse_fabrics_cmd() and nvmet_parse_connect_cmd(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:46 +02:00
Chaitanya Kulkarni	4c2dab2bf5	nvmet: use helper to remove the duplicate code Use the helper nvmet_report_invalid_opcode() to report invalid opcode so we can remove the duplicate code. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Chaitanya Kulkarni	3651aaacd1	nvmet: demote discovery cmd parse err msg to debug Host can send invalid commands and flood the target with error messages for the discovery controller. Demote the error message from pr_err() to pr_debug( in nvmet_parse_discovery_cmd(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Michal Kalderon	8cc365f955	nvmet-rdma: Fix NULL deref when SEND is completed with error When running some traffic and taking down the link on peer, a retry counter exceeded error is received. This leads to nvmet_rdma_error_comp which tried accessing the cq_context to obtain the queue. The cq_context is no longer valid after the fix to use shared CQ mechanism and should be obtained similar to how it is obtained in other functions from the wc->qp. [ 905.786331] nvmet_rdma: SEND for CQE 0x00000000e3337f90 failed with status transport retry counter exceeded (12). [ 905.832048] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [ 905.839919] PGD 0 P4D 0 [ 905.842464] Oops: 0000 1 SMP NOPTI [ 905.846144] CPU: 13 PID: 1557 Comm: kworker/13:1H Kdump: loaded Tainted: G OE --------- - - 4.18.0-304.el8.x86_64 #1 [ 905.872135] RIP: 0010:nvmet_rdma_error_comp+0x5/0x1b [nvmet_rdma] [ 905.878259] Code: 19 4f c0 e8 89 b3 a5 f6 e9 5b e0 ff ff 0f b7 75 14 4c 89 ea 48 c7 c7 08 1a 4f c0 e8 71 b3 a5 f6 e9 4b e0 ff ff 0f 1f 44 00 00 <48> 8b 47 48 48 85 c0 74 08 48 89 c7 e9 98 bf 49 00 e9 c3 e3 ff ff [ 905.897135] RSP: 0018:ffffab601c45fe28 EFLAGS: 00010246 [ 905.902387] RAX: 0000000000000065 RBX: ffff9e729ea2f800 RCX: 0000000000000000 [ 905.909558] RDX: 0000000000000000 RSI: ffff9e72df9567c8 RDI: 0000000000000000 [ 905.916731] RBP: ffff9e729ea2b400 R08: 000000000000074d R09: 0000000000000074 [ 905.923903] R10: 0000000000000000 R11: ffffab601c45fcc0 R12: 0000000000000010 [ 905.931074] R13: 0000000000000000 R14: 0000000000000010 R15: ffff9e729ea2f400 [ 905.938247] FS: 0000000000000000(0000) GS:ffff9e72df940000(0000) knlGS:0000000000000000 [ 905.938249] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 905.950067] nvmet_rdma: SEND for CQE 0x00000000c7356cca failed with status transport retry counter exceeded (12). [ 905.961855] CR2: 0000000000000048 CR3: 000000678d010004 CR4: 00000000007706e0 [ 905.961855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 905.961856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 905.961857] PKRU: 55555554 [ 906.010315] Call Trace: [ 906.012778] __ib_process_cq+0x89/0x170 [ib_core] [ 906.017509] ib_cq_poll_work+0x26/0x80 [ib_core] [ 906.022152] process_one_work+0x1a7/0x360 [ 906.026182] ? create_worker+0x1a0/0x1a0 [ 906.030123] worker_thread+0x30/0x390 [ 906.033802] ? create_worker+0x1a0/0x1a0 [ 906.037744] kthread+0x116/0x130 [ 906.040988] ? kthread_flush_work_fn+0x10/0x10 [ 906.045456] ret_from_fork+0x1f/0x40 Fixes: `ca0f1a8055` ("nvmet-rdma: use new shared CQ mechanism") Signed-off-by: Shai Malin <smalin@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Chaitanya Kulkarni	ab96de5def	nvmet: fix inline bio check for passthru When handling passthru commands, for inline bio allocation we only consider the transfer size. This works well when req->sg_cnt fits into the req->inline_bvec, but it will result in the early return from bio_add_hw_page() when req->sg_cnt > NVMET_MAX_INLINE_BVEC. Consider an I/O of size 32768 and first buffer is not aligned to the page boundary, then I/O is split in following manner :- [ 2206.256140] nvmet: sg->length 3440 sg->offset 656 [ 2206.256144] nvmet: sg->length 4096 sg->offset 0 [ 2206.256148] nvmet: sg->length 4096 sg->offset 0 [ 2206.256152] nvmet: sg->length 4096 sg->offset 0 [ 2206.256155] nvmet: sg->length 4096 sg->offset 0 [ 2206.256159] nvmet: sg->length 4096 sg->offset 0 [ 2206.256163] nvmet: sg->length 4096 sg->offset 0 [ 2206.256166] nvmet: sg->length 4096 sg->offset 0 [ 2206.256170] nvmet: sg->length 656 sg->offset 0 Now the req->transfer_size == NVMET_MAX_INLINE_DATA_LEN i.e. 32768, but the req->sg_cnt is (9) > NVMET_MAX_INLINE_BIOVEC which is (8). This will result in early return in the following code path :- nvmet_bdev_execute_rw() bio_add_pc_page() bio_add_hw_page() if (bio_full(bio, len)) return 0; Use previously introduced helper nvmet_use_inline_bvec() to consider req->sg_cnt when using inline bio. This only affects nvme-loop transport. Fixes: `dab3902b19` ("nvmet: use inline bio for passthru fast path") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Chaitanya Kulkarni	608a969046	nvmet: fix inline bio check for bdev-ns When handling rw commands, for inline bio case we only consider transfer size. This works well when req->sg_cnt fits into the req->inline_bvec, but it will result in the warning in __bio_add_page() when req->sg_cnt > NVMET_MAX_INLINE_BVEC. Consider an I/O size 32768 and first page is not aligned to the page boundary, then I/O is split in following manner :- [ 2206.256140] nvmet: sg->length 3440 sg->offset 656 [ 2206.256144] nvmet: sg->length 4096 sg->offset 0 [ 2206.256148] nvmet: sg->length 4096 sg->offset 0 [ 2206.256152] nvmet: sg->length 4096 sg->offset 0 [ 2206.256155] nvmet: sg->length 4096 sg->offset 0 [ 2206.256159] nvmet: sg->length 4096 sg->offset 0 [ 2206.256163] nvmet: sg->length 4096 sg->offset 0 [ 2206.256166] nvmet: sg->length 4096 sg->offset 0 [ 2206.256170] nvmet: sg->length 656 sg->offset 0 Now the req->transfer_size == NVMET_MAX_INLINE_DATA_LEN i.e. 32768, but the req->sg_cnt is (9) > NVMET_MAX_INLINE_BIOVEC which is (8). This will result in the following warning message :- nvmet_bdev_execute_rw() bio_add_page() __bio_add_page() WARN_ON_ONCE(bio_full(bio, len)); This scenario is very hard to reproduce on the nvme-loop transport only with rw commands issued with the passthru IOCTL interface from the host application and the data buffer is allocated with the malloc() and not the posix_memalign(). Fixes: `73383adfad` ("nvmet: don't split large I/Os unconditionally") Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-11 18:30:45 +02:00
Christoph Hellwig	5e1f689913	nvme-multipath: fix double initialization of ANA state nvme_init_identify and thus nvme_mpath_init can be called multiple times and thus must not overwrite potentially initialized or in-use fields. Split out a helper for the basic initialization when the controller is initialized and make sure the init_identify path does not blindly change in-use data structures. Fixes: `0d0b660f21` ("nvme: add ANA support") Reported-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de>	2021-05-11 18:30:45 +02:00
Keith Busch	4a20342572	nvmet: remove unsupported command noise Nothing can stop a host from submitting invalid commands. The target just needs to respond with an appropriate status, but that's not a target error. Demote invalid command messages to the debug level so these events don't spam the kernel logs. Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:39:26 +02:00
Daniel Wagner	ce86dad222	nvme-multipath: reset bdev to ns head when failover When a request finally completes in end_io() after it has failed over, the bdev pointer can be stale and thus the system can crash. Set the bdev back to ns head, so the request is map to an active path when resubmitted. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:39:23 +02:00
Tao Chiu	d4060d2be1	nvme-pci: fix controller reset hang when racing with nvme_timeout reset_work() in nvme-pci may hang forever in the following scenario: 1) A reset caused by a command timeout occurs due to a controller being temporarily irresponsive. 2) nvme_reset_work() restarts admin queue at nvme_alloc_admin_tags(). At the same time, a user-submitted admin command is queued and waiting for completion. Then, reset_work() changes its state to CONNECTING, and submits an identify command. 3) However, the controller does still not respond to any command, causing a timeout being fired at the user-submitted command. Unfortunately, nvme_timeout() does not see the completion on cq, and any timeout that takes place under CONNECTING state causes a controller shutdown. 4) Normally, the identify command in reset_work() would be canceled with SC_HOST_ABORTED by nvme_dev_disable(), then reset_work can tear down the controller accordingly. But the controller happens to return online and respond the identify command before nvme_dev_disable() should have been reaped it off. 5) reset_work() continues to setup_io_queues() as it observes no error in init_identify(). However, the admin queue has already been quiesced in dev_disable(). Thus, any following commands would be blocked forever in blk_execute_rq(). This can be fixed by restricting usercmd commands when controller is not in a LIVE state in nvme_queue_rq(), as what has been done previously in fabrics. ``` nvme_reset_work(): \| nvme_alloc_admin_tags() \| \| nvme_submit_user_cmd(): nvme_init_identify(): \| ... __nvme_submit_sync_cmd(): \| ... \| ... ---------------------------------------> nvme_timeout(): (Controller starts reponding commands) \| nvme_dev_disable(, true): nvme_setup_io_queues(): \| __nvme_submit_sync_cmd(): \| (hung in blk_execute_rq \| since run_hw_queue sees \| queue quiesced) \| ``` Signed-off-by: Tao Chiu <taochiu@synology.com> Signed-off-by: Cody Wong <codywong@synology.com> Reviewed-by: Leon Chien <leonchien@synology.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:50 +02:00
Tao Chiu	a97157440e	nvme: move the fabrics queue ready check routines to core queue_rq() in pci only checks if the dispatched queue (nvmeq) is ready, e.g. not being suspended. Since nvme_alloc_admin_tags() in reset flow restarts the admin queue, users are able to submit admin commands to a controller before reset_work() completes. Commands submitted under this condition may interfere with commands that performs identify, IO queue setup in reset_work(), and may result in a hang described in the following patch. As seen in the fabrics, user commands are prevented from being executed under inproper controller states. We may reuse this logic to maintain a clear admin queue during reset_work(). Signed-off-by: Tao Chiu <taochiu@synology.com> Signed-off-by: Cody Wong <codywong@synology.com> Reviewed-by: Leon Chien <leonchien@synology.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:49 +02:00
Kanchan Joshi	51ad06cd69	nvme: avoid memset for passthrough requests nvme_clear_nvme_request() clears the nvme_command, which is unncessary for passthrough requests as nvme_command is overwritten immediately. Move clearing part from this helper to the caller, so that double memset for passthrough requests is avoided. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:49 +02:00
Kanchan Joshi	4c74d1f803	nvme: add nvme_get_ns helper Add a helper to avoid opencoding ns->kref increment. Decrement is already done via nvme_put_ns helper. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:48 +02:00
Minwoo Im	48145b6256	nvme: fix controller ioctl through ns_head In multipath case, we should consider namespace attachment with controllers in a subsystem when we find out the live controller for the namespace. This patch manually reverted the commit `3557a44097` ("nvme: don't bother to look up a namespace for controller ioctls") with few more updates to nvme_ns_head_chr_ioctl which has been newly updated. Fixes: `3557a44097` ("nvme: don't bother to look up a namespace for controller ioctls") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-05-04 09:35:47 +02:00
Linus Torvalds	fc05860628	for-5.13/drivers-2021-04-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJYcQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpieWD/92qbtWl/z+9oCY212xV+YMoMqj/vGROX+U 9i/FQJ3AIC/AUoNjZeW3NIbiaNqde5mrLlUSCHgn6RLsHK7p0GQJ4ohpbIGFG5+i 2+Efm+vjlCxLVGrkeZEwMtsht7w/NbOYDr1Rgv9b4lQ6iWI11Mg8E337Whl1me1k h6bEXaioK9yqxYtsLgcn9I1qQ2p7gok0HX7zFU/XxEUZylqH6E4vQhj2+NL8UUqE 7siFHADZE99Z7LXtOkl8YyOlGU52RCUzqDHWydvkipKjgYBi95HLXGT64Z+WCEvz HI54oVDRWr+uWdqDFfy+ncHm8pNeP0GV9JPhDz4ELRTSndoxB2il7wRLvp6wxV9d 8Y4j7vb30i+8GGbM0c79dnlG76D9r5ivbTKixcXFKB128NusQR6JymIv1pKlSKhk H871/iOarrepAAUwVR5CtldDDJCy/q1Hks+7UXbaM3F9iNitxsJNZryQq9xdTu/N ThFOTz+VECG4RJLxIwmsWGiLgwr52/ybAl2MBcn+s7uC4jM/TFKpdQBfQnOAiINb MLlfuYRRSMg1Osb2fYZneR2ifmSNOMRdDJb+tsZGz4xWmZcj0uL4QgqcsOvuiOEQ veF/Ky50qw57hWtiEhvqa7/WIxzNF3G3wejqqA8hpT9Qifu0QawYTnXGUttYNBB1 mO9R3/ccaw== =c0x4 -----END PGP SIGNATURE----- Merge tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: - MD changes via Song: - raid5 POWER fix - raid1 failure fix - UAF fix for md cluster - mddev_find_or_alloc() clean up - Fix NULL pointer deref with external bitmap - Performance improvement for raid10 discard requests - Fix missing information of /proc/mdstat - rsxx const qualifier removal (Arnd) - Expose allocated brd pages (Calvin) - rnbd via Gioh Kim: - Change maintainer - Change domain address of maintainers' email - Add polling IO mode and document update - Fix memory leak and some bug detected by static code analysis tools - Code refactoring - Series of floppy cleanups/fixes (Denis) - s390 dasd fixes (Julian) - kerneldoc fixes (Lee) - null_blk double free (Lv) - null_blk virtual boundary addition (Max) - Remove xsysace driver (Michal) - umem driver removal (Davidlohr) - ataflop fixes (Dan) - Revalidate disk removal (Christoph) - Bounce buffer cleanups (Christoph) - Mark lightnvm as deprecated (Christoph) - mtip32xx init cleanups (Shixin) - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang) * tag 'for-5.13/drivers-2021-04-27' of git://git.kernel.dk/linux-block: (143 commits) async_xor: increase src_offs when dropping destination page drivers/block/null_blk/main: Fix a double free in null_init. md/raid1: properly indicate failure when ending a failed write request md-cluster: fix use-after-free issue when removing rdev nvme: introduce generic per-namespace chardev nvme: cleanup nvme_configure_apst nvme: do not try to reconfigure APST when the controller is not live nvme: add 'kato' sysfs attribute nvme: sanitize KATO setting nvmet: avoid queuing keep-alive timer if it is disabled brd: expose number of allocated pages in debugfs ataflop: fix off by one in ataflop_probe() ataflop: potential out of bounds in do_format() drbd: Fix fall-through warnings for Clang block/rnbd: Use strscpy instead of strlcpy block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name block/rnbd-clt: Remove max_segment_size block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev Documentation/ABI/rnbd-clt: Add description for nr_poll_queues ...	2021-04-28 14:39:37 -07:00
Linus Torvalds	6c00292113	for-5.13/block-2021-04-27 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmCIJW0QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpr8sD/4qP+MsFTB1IFUu8fW7BjBPdduoK8Vq9o3S HB8iF/yhJZ73nLecMMdn/jTO8SCW0Iw+okywW3BugGnNPbwXo0UQ4jLhzbTts76P JvZaguZFhBsF3ceFOt3CRCQDOeoDfMp3sitLUVivkN+2vwMs9vJpVNaEeUjcCC1Z 8QjlpqYSMuakTwEn7QhlnKxVWn1V2B6PDjZMcf48ONRZGsCkoOXH1SE4Ge8nxjqa KHKO5bvwgRzGhKpvdHEIl8dmFL9WEWElBVoY3vE2EHL0SPE32zHlxtYLS0NAhY2M aprkJ0QP0Rgl8HpYiCstwAnJGKDg4a0ArWhf/CJTuLAWmTNFR7v5n7vw2SilJHTG 0FtiFiOnpvvBmUC0B1PUEQX8AiFcdXueLb6xboExcp2WtxIAe8wPoGFl6T1tobBY qsfWggGs/vD1RVrJISPC+20cJemcRyeakMV48w+n3Lt/ES3IEv/LXx6PO/PbXvOo B7HJXTofkoaX52A/1+NxraGapwzhYouhi6Sb6Fc++X59/a/oBuOUGuur0eZ+/oWA 9787mUUDmW/sahfZUgZh5AxqKo2jJULjeggANCICW9/RN6duV8TBQVOLW1/0Wddp 9lndiA9ZMveWF+J19+sjBoiYMYawLmURaOlDK77ctTCcR/ji3l4GZ+2KvBEMeIT8 O1OYEnwaIQ== =oza6 -----END PGP SIGNATURE----- Merge tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "Pretty quiet round this time, which is nice. In detail: - Series revamping bounce buffer support (Christoph) - Dead code removal (Christoph, Bart) - Partition iteration revamp, now using xarray (Christoph) - Passthrough request scheduler improvements (Lin) - Series of BFQ improvements (Paolo) - Fix ioprio task iteration (Peter) - Various little tweaks and fixes (Tejun, Saravanan, Bhaskar, Max, Nikolay)" * tag 'for-5.13/block-2021-04-27' of git://git.kernel.dk/linux-block: (41 commits) blk-iocost: don't ignore vrate_min on QD contention blk-mq: Fix spurious debugfs directory creation during initialization bfq/mq-deadline: remove redundant check for passthrough request blk-mq: bypass IO scheduler's limit_depth for passthrough request block: Remove an obsolete comment from sg_io() block: move bio_list_copy_data to pktcdvd block: remove zero_fill_bio_iter block: add queue_to_disk() to get gendisk from request_queue block: remove an incorrect check from blk_rq_append_bio block: initialize ret in bdev_disk_changed block: Fix sys_ioprio_set(.which=IOPRIO_WHO_PGRP) task iteration block: remove disk_part_iter block: simplify diskstats_show block: simplify show_partition block: simplify printk_all_partitions block: simplify partition_overlaps block: simplify partition removal block: take bd_mutex around delete_partitions in del_gendisk block: refactor blk_drop_partitions block: move more syncing and invalidation to delete_partition ...	2021-04-28 14:27:12 -07:00
Minwoo Im	2637baed78	nvme: introduce generic per-namespace chardev Userspace has not been allowed to I/O to device that's failed to be initialized. This patch introduces generic per-namespace character device to allow userspace to I/O regardless the block device is there or not. The chardev naming convention will similar to the existing blkdev naming, using a ng prefix instead of nvme, i.e. - /dev/ngXnY It also supports multipath which means it will not expose chardev for the hidden namespace blkdevs (e.g., nvmeXcYnZ). If /dev/ngXnY is created for a ns_head, then I/O request will be routed to a specific controller selected by the iopolicy of the subsystem. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-22 07:25:17 +02:00
Christoph Hellwig	60df5de9b0	nvme: cleanup nvme_configure_apst Remove a level of indentation from the main code implementating the table search by using a goto for the APST not supported case. Also move the main comment above the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>	2021-04-21 19:13:16 +02:00
Christoph Hellwig	53fe2a30bc	nvme: do not try to reconfigure APST when the controller is not live Do not call nvme_configure_apst when the controller is not live, given that nvme_configure_apst will fail due the lack of an admin queue when the controller is being torn down and nvme_set_latency_tolerance is called from dev_pm_qos_hide_latency_tolerance. Fixes: 510a405d945b("nvme: fix memory leak for power latency tolerance") Reported-by: Peng Liu <liupeng17@lenovo.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>	2021-04-21 19:13:16 +02:00
Hannes Reinecke	74c22990f0	nvme: add 'kato' sysfs attribute Add a 'kato' controller sysfs attribute to display the current keep-alive timeout value (if any). This allows userspace to identify persistent discovery controllers, as these will have a non-zero KATO value. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-21 19:13:15 +02:00
Hannes Reinecke	a70b81bd4d	nvme: sanitize KATO setting According to the NVMe base spec the KATO commands should be sent at half of the KATO interval, to properly account for round-trip times. As we now will only ever send one KATO command per connection we can easily use the recommended values. This also fixes a potential issue where the request timeout for the KATO command does not match the value in the connect command, which might be causing spurious connection drops from the target. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-21 19:13:15 +02:00
Hou Pu	8f864c595b	nvmet: avoid queuing keep-alive timer if it is disabled Issue following command: nvme set-feature -f 0xf -v 0 /dev/nvme1n1 # disable keep-alive timer nvme admin-passthru -o 0x18 /dev/nvme1n1 # send keep-alive command will make keep-alive timer fired and thus delete the controller like below: [247459.907635] nvmet: ctrl 1 keep-alive timer (0 seconds) expired! [247459.930294] nvmet: ctrl 1 fatal error occurred! Avoid this by not queuing delayed keep-alive if it is disabled when keep-alive command is received from the admin queue. Signed-off-by: Hou Pu <houpu.main@gmail.com> Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-21 19:13:15 +02:00
Gopal Tiwari	d6609084b0	nvme: fix NULL derefence in nvme_ctrl_fast_io_fail_tmo_show/store Adding entry for dev_attr_fast_io_fail_tmo to avoid the kernel crash while reading and writing the fast_io_fail_tmo. Fixes: `09fbed6363` (nvme: export fast_io_fail_tmo to sysfs) Signed-off-by: Gopal Tiwari <gtiwari@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:56 +02:00
Christoph Hellwig	a9e0e6bc72	nvme: let namespace probing continue for unsupported features Instead of failing to scan the namespace entirely when unsupported features are detected, just mark the gendisk hidden but allow other access like the upcoming per-namespace character device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:56 +02:00
Christoph Hellwig	f5b9a51db2	nvme: factor out nvme_ns_open and nvme_ns_release helpers These will be reused for the per-namespace character devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	1496bd4936	nvme: move nvme_ns_head_ops to multipath.c Move the multipath block_device_operations to multipath.c, where they belong. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	871ca3ef13	nvme: factor out a nvme_tryget_ns_head helper Add a helper to avoid opencoding ns_head->ref manipulations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	2405252a68	nvme: move the ioctl code to a separate file Split out the ioctl code from core.c into a new file. Also update copyrights while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	3557a44097	nvme: don't bother to look up a namespace for controller ioctls Don't bother to look up a namespace just to drop if after retreiving the controller for the multipath case. Just look up a live controller for the subsystem directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	2f907f7f96	nvme: simplify block device ioctl handling for the !multipath case Only use the existing ioctl handler for the multipath case, and add a simpler one that reverts to the pre-multipath case for not shared use case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	89b3d6e605	nvme: simplify the compat ioctl handling Don't bother defining a separate compat_ioctl handler, and just handle the NVME_IOCTL_SUBMIT_IO32 case inline. Also only defined it for those ABIs (currently just i386 vs x86_64) that are affected. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:55 +02:00
Christoph Hellwig	a5d737f100	nvme: factor out a nvme_ns_ioctl helper Factor out a helper for the namespace based ioctls. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:54 +02:00
Christoph Hellwig	d7790d3739	nvme: pass a user pointer to nvme_nvm_ioctl Pass the proper user pointer instead of the not all that useful integer representation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:54 +02:00
Christoph Hellwig	9953ab0c5a	nvme: cleanup setting the disk name Return false from nvme_set_disk_name and let the caller set the non-multipath name instead of duplicating the naming information in two places. Also remove the pointless local variables for the disk name and flags and the not needed ctrl argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com>	2021-04-15 08:12:54 +02:00
Minwoo Im	3089738868	nvme: add a nvme_ns_head_multipath helper Move the multipath gendisk out of #ifdef CONFIG_NVME_MULTIPATH and add a new nvme_ns_head_multipath that uses it to check if a ns_head has a multipath device associated with it. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> [hch: added the IS_ENABLED, converted a few existing users] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Javier González <javier.gonz@samsung.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>	2021-04-15 08:12:54 +02:00
Niklas Cassel	95d54bd1a4	nvme: remove single trailing whitespace There is a single trailing whitespace in core.c. Since this is just a single whitespace, the chances of this affecting backports to stable should be quite low, so let's just remove it. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:54 +02:00
Niklas Cassel	e234f1f8bb	nvme-multipath: remove single trailing whitespace There is a single trailing whitespace in multipath.c. Since this is just a single whitespace, the chances of this affecting backports to stable should be quite low, so let's just remove it. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2021-04-15 08:12:54 +02:00

1 2 3 4 5 ...

2363 Commits