From 76dd298094f484c6250ebd076fa53287477b2328 Mon Sep 17 00:00:00 2001
From: Yu Kuai
Date: Tue, 11 Oct 2022 22:22:53 +0800
Subject: [PATCH 01/17] blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()

Our syzkaller report a null pointer dereference, root cause is
following:

__blk_mq_alloc_map_and_rqs
 set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs
  blk_mq_alloc_map_and_rqs
   blk_mq_alloc_rqs
    // failed due to oom
    alloc_pages_node
    // set->tags[hctx_idx] is still NULL
    blk_mq_free_rqs
     drv_tags = set->tags[hctx_idx];
     // null pointer dereference is triggered
     blk_mq_clear_rq_mapping(drv_tags, ...)

This is because commit 63064be150e4 ("blk-mq:
Add blk_mq_alloc_map_and_rqs()") merged the two steps:

1) set->tags[hctx_idx] = blk_mq_alloc_rq_map()
2) blk_mq_alloc_rqs(..., set->tags[hctx_idx])

into one step:

set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs()

Since tags is not initialized yet in this case, fix the problem by
checking if tags is NULL pointer in blk_mq_clear_rq_mapping().

Fixes: 63064be150e4 ("blk-mq: Add blk_mq_alloc_map_and_rqs()")
Signed-off-by: Yu Kuai
Reviewed-by: John Garry
Link: https://lore.kernel.org/r/20221011142253.4015966-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe
---
 block/blk-mq.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8070b6c10e8d..33292c01875d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3112,8 +3112,11 @@ static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
 	struct page *page;
 	unsigned long flags;

-	/* There is no need to clear a driver tags own mapping */
-	if (drv_tags == tags)
+	/*
+	 * There is no need to clear mapping if driver tags is not initialized
+	 * or the mapping belongs to the driver tags.
+	 */
+	if (!drv_tags || drv_tags == tags)
 		return;

 	list_for_each_entry(page, &tags->page_list, lru) {

From e0539ae012ba5d618eb19665ff990b87b960c643 Mon Sep 17 00:00:00 2001
From: ZiyangZhang
Date: Tue, 18 Oct 2022 12:53:46 +0800
Subject: [PATCH 02/17] Documentation: document ublk user recovery feature

Add documentation for user recovery feature of ublk subsystem.

Signed-off-by: ZiyangZhang
Reviewed-by: Ming Lei
Link: https://lore.kernel.org/r/20221018045346.99706-2-ZiyangZhang@linux.alibaba.com
Signed-off-by: Jens Axboe
---
 Documentation/block/ublk.rst | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 2122d1a4a541..ba45c46cc0da 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -144,6 +144,42 @@ managing and controlling ublk devices with help of several control commands:
   For retrieving device info via ``ublksrv_ctrl_dev_info``. It is the server's
   responsibility to save IO target specific info in userspace.

+- ``UBLK_CMD_START_USER_RECOVERY``
+
+  This command is valid if ``UBLK_F_USER_RECOVERY`` feature is enabled. This
+  command is accepted after the old process has exited, ublk device is quiesced
+  and ``/dev/ublkc*`` is released. User should send this command before he starts
+  a new process which re-opens ``/dev/ublkc*``. When this command returns, the
+  ublk device is ready for the new process.
+
+- ``UBLK_CMD_END_USER_RECOVERY``
+
+  This command is valid if ``UBLK_F_USER_RECOVERY`` feature is enabled. This
+  command is accepted after ublk device is quiesced and a new process has
+  opened ``/dev/ublkc*`` and get all ublk queues be ready. When this command
+  returns, ublk device is unquiesced and new I/O requests are passed to the
+  new process.
+
+- user recovery feature description
+
+  Two new features are added for user recovery: ``UBLK_F_USER_RECOVERY`` and
+  ``UBLK_F_USER_RECOVERY_REISSUE``.
+
+  With ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublk server's io
+  handler) is dying, ublk does not delete ``/dev/ublkb*`` during the whole
+  recovery stage and ublk device ID is kept. It is ublk server's
+  responsibility to recover the device context by its own knowledge.
+  Requests which have not been issued to userspace are requeued. Requests
+  which have been issued to userspace are aborted.
+
+  With ``UBLK_F_USER_RECOVERY_REISSUE`` set, after one ubq_daemon(ublk
+  server's io handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``,
+  requests which have been issued to userspace are requeued and will be
+  re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``.
+  ``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends who tolerate
+  double-write since the driver may issue the same I/O request twice. It
+  might be useful to a read-only FS or a VM backend.
+
 Data plane
 ----------


From 4739824e2d7878dcea88397a6758e31e3c5c124e Mon Sep 17 00:00:00 2001
From: Dan Carpenter
Date: Sat, 15 Oct 2022 11:25:56 +0300
Subject: [PATCH 03/17] nvme: fix error pointer dereference in error handling

There is typo here so it releases the wrong variable. "ctrl->admin_q"
was intended instead of "ctrl->fabrics_q".

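The error path touched by this patch follows the usual goto-unwind
pattern, where each label releases exactly the resource acquired just
before the failing step. A minimal user-space sketch of that pattern is
shown below. All names in it (`struct ctrl_sketch`, `alloc_admin_sketch`,
the `fail_at` knob) are hypothetical illustrations and are not part of
the NVMe driver.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a controller that owns two resources. */
struct ctrl_sketch {
	void *tagset;   /* plays the role of ctrl->admin_tagset */
	void *admin_q;  /* plays the role of ctrl->admin_q */
};

/*
 * Acquire the resources in order; on failure, unwind in reverse order.
 * fail_at: 0 = succeed, 1 = fail creating the queue, 2 = fail after it.
 * Returns 0 on success, -1 on failure with nothing left allocated.
 */
static int alloc_admin_sketch(struct ctrl_sketch *c, int fail_at)
{
	assert(c != NULL);

	c->tagset = malloc(16);
	if (!c->tagset)
		return -1;

	c->admin_q = (fail_at == 1) ? NULL : malloc(16);
	if (!c->admin_q)
		goto out_free_tagset;

	if (fail_at == 2)
		goto out_cleanup_admin_q;

	return 0;

out_cleanup_admin_q:
	/* Release the queue created above, not some unrelated queue. */
	free(c->admin_q);
	c->admin_q = NULL;
out_free_tagset:
	free(c->tagset);
	c->tagset = NULL;
	return -1;
}
```

Calling `alloc_admin_sketch(&c, 2)` exercises the label that the typo
affected: the object released there has to be the one the preceding step
created.
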
Fixes: fe60e8c53411 ("nvme: add common helpers to allocate and free tagsets")
Signed-off-by: Dan Carpenter
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 059737c1a2c1..9cbe7854d488 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4846,7 +4846,7 @@ int nvme_alloc_admin_tag_set(struct nvme_ctrl *ctrl, struct blk_mq_tag_set *set,
 	return 0;

 out_cleanup_admin_q:
-	blk_mq_destroy_queue(ctrl->fabrics_q);
+	blk_mq_destroy_queue(ctrl->admin_q);
 out_free_tagset:
 	blk_mq_free_tag_set(ctrl->admin_tagset);
 	return ret;

From ac9b57d4e1e3ecf0122e915bbba1bd4c90ec3031 Mon Sep 17 00:00:00 2001
From: Xander Li
Date: Tue, 11 Oct 2022 04:06:42 -0700
Subject: [PATCH 04/17] nvme-pci: disable write zeroes on various Kingston SSD

Kingston SSDs do support NVMe Write_Zeroes cmd but take long time to
process. The firmware version is locked by these SSDs, we can not expect
firmware improvement, so disable Write_Zeroes cmd.

Signed-off-by: Xander Li
Signed-off-by: Christoph Hellwig
---
 drivers/nvme/host/pci.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index bcbef6bc5672..31e577b01257 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3511,6 +3511,16 @@ static const struct pci_device_id nvme_id_table[] = {
 		.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
 	{ PCI_DEVICE(0x2646, 0x2263), /* KINGSTON A2000 NVMe SSD */
 		.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },
+	{ PCI_DEVICE(0x2646, 0x5018), /* KINGSTON OM8SFP4xxxxP OS21012 NVMe SSD */
+		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+	{ PCI_DEVICE(0x2646, 0x5016), /* KINGSTON OM3PGP4xxxxP OS21011 NVMe SSD */
+		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+	{ PCI_DEVICE(0x2646, 0x501A), /* KINGSTON OM8PGP4xxxxP OS21005 NVMe SSD */
+		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+	{ PCI_DEVICE(0x2646, 0x501B), /* KINGSTON OM8PGP4xxxxQ OS21005 NVMe SSD */
+		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+	{ PCI_DEVICE(0x2646, 0x501E), /* KINGSTON OM3PGP4xxxxQ OS21011 NVMe SSD */
+		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
 	{ PCI_DEVICE(0x1e4B, 0x1001), /* MAXIO MAP1001 */
 		.driver_data = NVME_QUIRK_BOGUS_NID, },
 	{ PCI_DEVICE(0x1e4B, 0x1002), /* MAXIO MAP1002 */

From d622f8477a8018974f8df961440dca58224f9c6b Mon Sep 17 00:00:00 2001
From: "Russell King (Oracle)"
Date: Wed, 12 Oct 2022 12:46:06 +0100
Subject: [PATCH 05/17] nvme-apple: don't limit DMA segement size

NVMe uses PRPs for data transfers and has no specific limit for a single
DMA segement. Limiting the size will cause problems because the block
layer assumes PRP-ish devices using a virt boundary mask don't have a
segment limit. And while this is true, we also really need to tell the
DMA mapping layer about it, otherwise dma-debug will trip over it.

Fixes: 5bd2927aceba ("nvme-apple: Add initial Apple SoC NVMe driver")
Suggested-by: Sven Peter
Signed-off-by: Russell King (Oracle)
[hch: rewrote the commit message based on the PCIe commit]
Signed-off-by: Christoph Hellwig
Reviewed-by: Eric Curtin
Reviewed-by: Sven Peter
---
 drivers/nvme/host/apple.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
index 5fc5ea196b40..ff8b083dc5c6 100644
--- a/drivers/nvme/host/apple.c
+++ b/drivers/nvme/host/apple.c
@@ -1039,6 +1039,8 @@ static void apple_nvme_reset_work(struct work_struct *work)
 					     dma_max_mapping_size(anv->dev) >> 9);
 	anv->ctrl.max_segments = NVME_MAX_SEGS;

+	dma_set_max_seg_size(anv->dev, 0xffffffff);
+
 	/*
 	 * Enable NVMMU and linear submission queues.
 	 * While we could keep those disabled and pretend this is slightly

From 6ff5ba97960821fb872ad981eb30374f5cee1fd9 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Tue, 18 Oct 2022 16:59:16 +0200
Subject: [PATCH 06/17] nvme: add Guenther as nvme-hwmon maintainer

Given that non of the overall NVMe maintainers knows this code very
deeply it probably makes sense to add Guenther as an additional
MAINTAINER for it.

Signed-off-by: Christoph Hellwig
Reviewed-by: Sagi Grimberg
Acked-by: Guenter Roeck
---
 MAINTAINERS | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 12984711f2fe..fde92782fbbd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14640,6 +14640,12 @@ F:	drivers/nvme/target/auth.c
 F:	drivers/nvme/target/fabrics-cmd-auth.c
 F:	include/linux/nvme-auth.h

+NVM EXPRESS HARDWARE MONITORING SUPPORT
+M:	Guenter Roeck
+L:	linux-nvme@lists.infradead.org
+S:	Supported
+F:	drivers/nvme/host/hwmon.c
+
 NVM EXPRESS FC TRANSPORT DRIVERS
 M:	James Smart
 L:	linux-nvme@lists.infradead.org

From 6b8cf94005187952f794c0c4ed3920a1e8accfa3 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Tue, 18 Oct 2022 16:55:55 +0200
Subject: [PATCH 07/17] nvme-hwmon: consistently ignore errors from nvme_hwmon_init

An NVMe controller works perfectly fine even when the hwmon
initialization fails. Stop returning errors that do not come from a
controller reset from nvme_hwmon_init to handle this case consistently.

Signed-off-by: Christoph Hellwig
Reviewed-by: Guenter Roeck
Reviewed-by: Serge Semin
---
 drivers/nvme/host/core.c  |  6 +++++-
 drivers/nvme/host/hwmon.c | 13 ++++++++-----
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9cbe7854d488..dc4220600585 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3262,8 +3262,12 @@ int nvme_init_ctrl_finish(struct nvme_ctrl *ctrl)
 		return ret;

 	if (!ctrl->identified && !nvme_discovery_ctrl(ctrl)) {
+		/*
+		 * Do not return errors unless we are in a controller reset,
+		 * the controller works perfectly fine without hwmon.
+		 */
 		ret = nvme_hwmon_init(ctrl);
-		if (ret < 0)
+		if (ret == -EINTR)
 			return ret;
 	}

diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c
index 0a586d712920..23918bb7bdca 100644
--- a/drivers/nvme/host/hwmon.c
+++ b/drivers/nvme/host/hwmon.c
@@ -230,7 +230,7 @@ int nvme_hwmon_init(struct nvme_ctrl *ctrl)

 	data = kzalloc(sizeof(*data), GFP_KERNEL);
 	if (!data)
-		return 0;
+		return -ENOMEM;

 	data->ctrl = ctrl;
 	mutex_init(&data->read_lock);
@@ -238,8 +238,7 @@ int nvme_hwmon_init(struct nvme_ctrl *ctrl)
 	err = nvme_hwmon_get_smart_log(data);
 	if (err) {
 		dev_warn(dev, "Failed to read smart log (error %d)\n", err);
-		kfree(data);
-		return err;
+		goto err_free_data;
 	}

 	hwmon = hwmon_device_register_with_info(dev, "nvme",
@@ -247,11 +246,15 @@ int nvme_hwmon_init(struct nvme_ctrl *ctrl)
 						NULL);
 	if (IS_ERR(hwmon)) {
 		dev_warn(dev, "Failed to instantiate hwmon device\n");
-		kfree(data);
-		return PTR_ERR(hwmon);
+		err = PTR_ERR(hwmon);
+		goto err_free_data;
 	}
 	ctrl->hwmon_device = hwmon;
 	return 0;
+
+err_free_data:
+	kfree(data);
+	return err;
 }

 void nvme_hwmon_exit(struct nvme_ctrl *ctrl)

From c94b7f9bab22ac504f9153767676e659988575ad Mon Sep 17 00:00:00 2001
From: Serge Semin
Date: Tue, 18 Oct 2022 17:33:52 +0200
Subject: [PATCH 08/17] nvme-hwmon: kmalloc the NVME SMART log buffer

Recent commit 52fde2c07da6 ("nvme: set dma alignment to dword") has
caused a regression on our platform.

It turned out that the nvme_get_log() method invocation caused the
nvme_hwmon_data structure instance corruption. In particular the
nvme_hwmon_data.ctrl pointer was overwritten either with zeros or with
garbage. After some research we discovered that the problem happened
even before the actual NVME DMA execution, but during the buffer mapping.
Since our platform is DMA-noncoherent, the mapping implied the cache-line
invalidations or write-backs depending on the DMA-direction parameter.
In case of the NVME SMART log getting the DMA was performed
from-device-to-memory, thus the cache-invalidation was activated during
the buffer mapping. Since the log-buffer isn't cache-line aligned, the
cache-invalidation caused the neighbour data to be discarded. The
neighbouring data turned to be the data surrounding the buffer in the
framework of the nvme_hwmon_data structure.

In order to fix that we need to make sure that the whole log-buffer is
defined within the cache-line-aligned memory region so the
cache-invalidation procedure wouldn't involve the adjacent data. One of
the option to guarantee that is to kmalloc the DMA-buffer [1]. Seeing the
rest of the NVME core driver prefer that method it has been chosen to fix
this problem too.

Note after a deeper researches we found out that the denoted commit
wasn't a root cause of the problem. It just revealed the invalidity by
activating the DMA-based NVME SMART log getting performed in the framework
of the NVME hwmon driver. The problem was here since the initial commit of
the driver.

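The cache-line sharing described above can be pictured with plain offset
arithmetic. The user-space sketch below assumes a 64-byte cache line
(the real size is platform-specific) and a struct that itself starts on
a line boundary. The types and helpers (`struct embedded_sketch`,
`line_of`, `shares_line`) are hypothetical stand-ins, not the driver's
structures; the sketch only checks whether a buffer embedded in a struct
overlaps the cache line of a neighbouring field.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SKETCH 64u  /* assumed line size, platform-specific */

/* Hypothetical layout: a pointer followed by an embedded DMA buffer. */
struct embedded_sketch {
	void *ctrl;              /* neighbour that could get clobbered */
	unsigned char log[512];  /* DMA target embedded in the struct */
};

/* Index of the cache line that contains byte offset off. */
static size_t line_of(size_t off)
{
	return off / CACHE_LINE_SKETCH;
}

/*
 * Returns 1 if the lines covering [buf_off, buf_off + len) overlap the
 * lines covering [fld_off, fld_off + fld_len), 0 otherwise.
 */
static int shares_line(size_t buf_off, size_t len,
		       size_t fld_off, size_t fld_len)
{
	assert(len > 0 && fld_len > 0);

	size_t b0 = line_of(buf_off), b1 = line_of(buf_off + len - 1);
	size_t f0 = line_of(fld_off), f1 = line_of(fld_off + fld_len - 1);

	return b0 <= f1 && f0 <= b1;
}

/* The embedded buffer begins in the same line as the ctrl pointer. */
static int embedded_log_shares_line_with_ctrl(void)
{
	return shares_line(offsetof(struct embedded_sketch, log),
			   sizeof(((struct embedded_sketch *)0)->log),
			   offsetof(struct embedded_sketch, ctrl),
			   sizeof(void *));
}
```

In this model the embedded layout reports an overlap, while a buffer that
starts on its own line boundary (for example offset 64) does not, which
is the property the separately allocated buffer is meant to provide.
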
[1] Documentation/core-api/dma-api-howto.rst

Fixes: 400b6a7b13a3 ("nvme: Add hardware monitoring support")
Signed-off-by: Serge Semin
Signed-off-by: Christoph Hellwig
---
 drivers/nvme/host/hwmon.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c
index 23918bb7bdca..9e6e56c20ec9 100644
--- a/drivers/nvme/host/hwmon.c
+++ b/drivers/nvme/host/hwmon.c
@@ -12,7 +12,7 @@

 struct nvme_hwmon_data {
 	struct nvme_ctrl *ctrl;
-	struct nvme_smart_log log;
+	struct nvme_smart_log *log;
 	struct mutex read_lock;
 };

@@ -60,14 +60,14 @@ static int nvme_set_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under,
 static int nvme_hwmon_get_smart_log(struct nvme_hwmon_data *data)
 {
 	return nvme_get_log(data->ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
-			   NVME_CSI_NVM, &data->log, sizeof(data->log), 0);
+			   NVME_CSI_NVM, data->log, sizeof(*data->log), 0);
 }

 static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
 			   u32 attr, int channel, long *val)
 {
 	struct nvme_hwmon_data *data = dev_get_drvdata(dev);
-	struct nvme_smart_log *log = &data->log;
+	struct nvme_smart_log *log = data->log;
 	int temp;
 	int err;

@@ -163,7 +163,7 @@ static umode_t nvme_hwmon_is_visible(const void *_data,
 	case hwmon_temp_max:
 	case hwmon_temp_min:
 		if ((!channel && data->ctrl->wctemp) ||
-		    (channel && data->log.temp_sensor[channel - 1])) {
+		    (channel && data->log->temp_sensor[channel - 1])) {
 			if (data->ctrl->quirks &
 			    NVME_QUIRK_NO_TEMP_THRESH_CHANGE)
 				return 0444;
@@ -176,7 +176,7 @@ static umode_t nvme_hwmon_is_visible(const void *_data,
 		break;
 	case hwmon_temp_input:
 	case hwmon_temp_label:
-		if (!channel || data->log.temp_sensor[channel - 1])
+		if (!channel || data->log->temp_sensor[channel - 1])
 			return 0444;
 		break;
 	default:
@@ -232,13 +232,19 @@ int nvme_hwmon_init(struct nvme_ctrl *ctrl)
 	if (!data)
 		return -ENOMEM;

+	data->log = kzalloc(sizeof(*data->log), GFP_KERNEL);
+	if (!data->log) {
+		err = -ENOMEM;
+		goto err_free_data;
+	}
+
 	data->ctrl = ctrl;
 	mutex_init(&data->read_lock);

 	err = nvme_hwmon_get_smart_log(data);
 	if (err) {
 		dev_warn(dev, "Failed to read smart log (error %d)\n", err);
-		goto err_free_data;
+		goto err_free_log;
 	}

 	hwmon = hwmon_device_register_with_info(dev, "nvme",
@@ -247,11 +253,13 @@ int nvme_hwmon_init(struct nvme_ctrl *ctrl)
 	if (IS_ERR(hwmon)) {
 		dev_warn(dev, "Failed to instantiate hwmon device\n");
 		err = PTR_ERR(hwmon);
-		goto err_free_data;
+		goto err_free_log;
 	}
 	ctrl->hwmon_device = hwmon;
 	return 0;

+err_free_log:
+	kfree(data->log);
 err_free_data:
 	kfree(data);
 	return err;
@@ -265,6 +273,7 @@ void nvme_hwmon_exit(struct nvme_ctrl *ctrl)

 		hwmon_device_unregister(ctrl->hwmon_device);
 		ctrl->hwmon_device = NULL;
+		kfree(data->log);
 		kfree(data);
 	}
 }

From ddd2b8de9f85b388925e7dc46b3890fc1a0d8d24 Mon Sep 17 00:00:00 2001
From: Sagi Grimberg
Date: Wed, 28 Sep 2022 09:39:10 +0300
Subject: [PATCH 09/17] nvmet: fix workqueue MEM_RECLAIM flushing dependency

The keep alive timer needs to stay on nvmet_wq, and not
modified to reschedule on the system_wq.

This fixes a warning:
------------[ cut here ]------------
workqueue: WQ_MEM_RECLAIM nvmet-wq:nvmet_rdma_release_queue_work [nvmet_rdma] is flushing !WQ_MEM_RECLAIM events:nvmet_keep_alive_timer [nvmet]
WARNING: CPU: 3 PID: 1086 at kernel/workqueue.c:2628 check_flush_dependency+0x16c/0x1e0

Reported-by: Yi Zhang
Fixes: 8832cf922151 ("nvmet: use a private workqueue instead of the system workqueue")
Signed-off-by: Sagi Grimberg
Reviewed-by: Chaitanya Kulkarni
Signed-off-by: Christoph Hellwig
---
 drivers/nvme/target/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 14677145bbba..aecb5853f8da 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -1176,7 +1176,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
 	 * reset the keep alive timer when the controller is enabled.
 	 */
 	if (ctrl->kato)
-		mod_delayed_work(system_wq, &ctrl->ka_work, ctrl->kato * HZ);
+		mod_delayed_work(nvmet_wq, &ctrl->ka_work, ctrl->kato * HZ);
 }

 static void nvmet_clear_ctrl(struct nvmet_ctrl *ctrl)

From 94f5a06884074dcd99606d7b329e133ee65ea6ad Mon Sep 17 00:00:00 2001
From: Daniel Wagner
Date: Fri, 7 Oct 2022 09:29:34 +0200
Subject: [PATCH 10/17] nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show

The item passed into nvmet_subsys_attr_qid_max_show is not a member of
struct nvmet_port, it is part of nvmet_subsys. Hence, don't try to
dereference it as struct nvme_ctrl pointer.

Fixes: 3e980f5995e0 ("nvmet: Expose max queues to configfs")
Reported-by: Shinichiro Kawasaki
Link: https://lore.kernel.org/r/20220913064203.133536-1-dwagner@suse.de
Signed-off-by: Daniel Wagner
Reviewed-by: Hannes Reinecke
Acked-by: Sagi Grimberg
Signed-off-by: Christoph Hellwig
---
 drivers/nvme/target/configfs.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e34a2896fedb..9443ee1d4ae3 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -1290,12 +1290,8 @@ static ssize_t nvmet_subsys_attr_qid_max_show(struct config_item *item,
 static ssize_t nvmet_subsys_attr_qid_max_store(struct config_item *item,
 					       const char *page, size_t cnt)
 {
-	struct nvmet_port *port = to_nvmet_port(item);
 	u16 qid_max;

-	if (nvmet_is_port_enabled(port, __func__))
-		return -EACCES;
-
 	if (sscanf(page, "%hu\n", &qid_max) != 1)
 		return -EINVAL;


From 72495b5ab456ec9f05d587238d1e2fa8e9ea63ec Mon Sep 17 00:00:00 2001
From: Yushan Zhou
Date: Tue, 18 Oct 2022 18:01:32 +0800
Subject: [PATCH 11/17] ublk_drv: use flexible-array member instead of zero-length array

Eliminate the following coccicheck warning:
./drivers/block/ublk_drv.c:127:16-19: WARNING use flexible-array member instead

Signed-off-by: Yushan Zhou
Link: https://lore.kernel.org/r/20221018100132.355393-1-zys.zljxml@gmail.com
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
---
 drivers/block/ublk_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2651bf41dde3..5afce6ffaadf 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -124,7 +124,7 @@ struct ublk_queue {
 	bool force_abort;
 	unsigned short nr_io_ready;	/* how many ios setup */
 	struct ublk_device *dev;
-	struct ublk_io ios[0];
+	struct ublk_io ios[];
 };

 #define UBLK_DAEMON_MONITOR_PERIOD	(5 * HZ)

From 6d42ddf7f27b6723549ee6d4c8b1b418b59bf6b5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christoph=20B=C3=B6hmwalder?=
Date: Thu, 20 Oct 2022 10:52:05 +0200
Subject: [PATCH 12/17] drbd: only clone bio if we have a backing device
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit c347a787e34cb (drbd: set ->bi_bdev in drbd_req_new) moved a
bio_set_dev call (which has since been removed) to "earlier", from
drbd_request_prepare to drbd_req_new.

The problem is that this accesses device->ldev->backing_bdev, which is
not NULL-checked at this point. When we don't have an ldev (i.e. when
the DRBD device is diskless), this leads to a null pointer deref.

So, only allocate the private_bio if we actually have a disk. This is
also a small optimization, since we don't clone the bio to only to
immediately free it again in the diskless case.

Fixes: c347a787e34cb ("drbd: set ->bi_bdev in drbd_req_new")
Co-developed-by: Christoph Böhmwalder
Signed-off-by: Christoph Böhmwalder
Co-developed-by: Joel Colledge
Signed-off-by: Joel Colledge
Reviewed-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20221020085205.129090-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe
---
 drivers/block/drbd/drbd_req.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 8f7f144e54f3..7f9bcc82fc9c 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -30,11 +30,6 @@ static struct drbd_request *drbd_req_new(struct drbd_device *device, struct bio
 		return NULL;
 	memset(req, 0, sizeof(*req));

-	req->private_bio = bio_alloc_clone(device->ldev->backing_bdev, bio_src,
-					   GFP_NOIO, &drbd_io_bio_set);
-	req->private_bio->bi_private = req;
-	req->private_bio->bi_end_io = drbd_request_endio;
-
 	req->rq_state = (bio_data_dir(bio_src) == WRITE ? RQ_WRITE : 0)
 		      | (bio_op(bio_src) == REQ_OP_WRITE_ZEROES ? RQ_ZEROES : 0)
 		      | (bio_op(bio_src) == REQ_OP_DISCARD ? RQ_UNMAP : 0);
@@ -1219,9 +1214,12 @@ drbd_request_prepare(struct drbd_device *device, struct bio *bio)
 	/* Update disk stats */
 	req->start_jif = bio_start_io_acct(req->master_bio);

-	if (!get_ldev(device)) {
-		bio_put(req->private_bio);
-		req->private_bio = NULL;
+	if (get_ldev(device)) {
+		req->private_bio = bio_alloc_clone(device->ldev->backing_bdev,
+						   bio, GFP_NOIO,
+						   &drbd_io_bio_set);
+		req->private_bio->bi_private = req;
+		req->private_bio->bi_end_io = drbd_request_endio;
 	}

 	/* process discards always from our submitter thread */

From 33566f92cd5f1c1d462920978f6dc102c744270d Mon Sep 17 00:00:00 2001
From: Yuwei Guan
Date: Tue, 18 Oct 2022 11:01:39 +0800
Subject: [PATCH 13/17] block, bfq: remove unused variable for bfq_queue

it defined in d0edc2473be9d, but there's nowhere to use it,
so remove it.

Signed-off-by: Yuwei Guan
Acked-by: Paolo Valente
Link: https://lore.kernel.org/r/20221018030139.159-1-Yuwei.Guan@zeekrlife.com
Signed-off-by: Jens Axboe
---
 block/bfq-iosched.h | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 64ee618064ba..71f721670ab6 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -369,12 +369,8 @@ struct bfq_queue {
 	unsigned long split_time; /* time of last split */

 	unsigned long first_IO_time; /* time of first I/O for this queue */
-
 	unsigned long creation_time; /* when this queue is created */

-	/* max service rate measured so far */
-	u32 max_service_rate;
-
 	/*
 	 * Pointer to the waker queue for this queue, i.e., to the
 	 * queue Q such that this queue happens to get new I/O right

From d4347d50407daea6237872281ece64c4bdf1ec99 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov
Date: Tue, 18 Oct 2022 20:50:55 +0100
Subject: [PATCH 14/17] bio: safeguard REQ_ALLOC_CACHE bio put

bio_put() with REQ_ALLOC_CACHE assumes that it's executed not from
an irq context. Let's add a warning if the invariant is not respected,
especially since there is a couple of places removing REQ_POLLED by hand
without also clearing REQ_ALLOC_CACHE.

Signed-off-by: Pavel Begunkov
Reviewed-by: Christoph Hellwig
Link: https://lore.kernel.org/r/558d78313476c4e9c233902efa0092644c3d420a.1666122465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe
---
 block/bio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 6c470a50a36d..0a14af923738 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -741,7 +741,7 @@ void bio_put(struct bio *bio)
 			return;
 	}

-	if (bio->bi_opf & REQ_ALLOC_CACHE) {
+	if ((bio->bi_opf & REQ_ALLOC_CACHE) && !WARN_ON_ONCE(in_interrupt())) {
 		struct bio_alloc_cache *cache;

 		bio_uninit(bio);

From 60a9bb9048f9e95029df10a9bc346f6b066c593c Mon Sep 17 00:00:00 2001
From: Ye Bin
Date: Wed, 19 Oct 2022 11:36:00 +0800
Subject: [PATCH 15/17] blktrace: introduce 'blk_trace_{start,stop}' helper

Introduce 'blk_trace_{start,stop}' helper. No functional changed.

Signed-off-by: Ye Bin
Reviewed-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20221019033602.752383-2-yebin@huaweicloud.com
Signed-off-by: Jens Axboe
---
 kernel/trace/blktrace.c | 74 ++++++++++++++++++++---------------------
 1 file changed, 36 insertions(+), 38 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 7f5eb295fe19..50b6f241b5f7 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -346,6 +346,37 @@ static void put_probe_ref(void)
 	mutex_unlock(&blk_probe_mutex);
 }

+static int blk_trace_start(struct blk_trace *bt)
+{
+	if (bt->trace_state != Blktrace_setup &&
+	    bt->trace_state != Blktrace_stopped)
+		return -EINVAL;
+
+	blktrace_seq++;
+	smp_mb();
+	bt->trace_state = Blktrace_running;
+	raw_spin_lock_irq(&running_trace_lock);
+	list_add(&bt->running_list, &running_trace_list);
+	raw_spin_unlock_irq(&running_trace_lock);
+	trace_note_time(bt);
+
+	return 0;
+}
+
+static int blk_trace_stop(struct blk_trace *bt)
+{
+	if (bt->trace_state != Blktrace_running)
+		return -EINVAL;
+
+	bt->trace_state = Blktrace_stopped;
+	raw_spin_lock_irq(&running_trace_lock);
+	list_del_init(&bt->running_list);
+	raw_spin_unlock_irq(&running_trace_lock);
+	relay_flush(bt->rchan);
+
+	return 0;
+}
+
 static void blk_trace_cleanup(struct request_queue *q, struct blk_trace *bt)
 {
 	synchronize_rcu();
@@ -658,7 +689,6 @@ static int compat_blk_trace_setup(struct request_queue *q, char *name,

 static int __blk_trace_startstop(struct request_queue *q, int start)
 {
-	int ret;
 	struct blk_trace *bt;

 	bt = rcu_dereference_protected(q->blk_trace,
@@ -666,36 +696,10 @@ static int __blk_trace_startstop(struct request_queue *q, int start)
 	if (bt == NULL)
 		return -EINVAL;

-	/*
-	 * For starting a trace, we can transition from a setup or stopped
-	 * trace. For stopping a trace, the state must be running
-	 */
-	ret = -EINVAL;
-	if (start) {
-		if (bt->trace_state == Blktrace_setup ||
-		    bt->trace_state == Blktrace_stopped) {
-			blktrace_seq++;
-			smp_mb();
-			bt->trace_state = Blktrace_running;
-			raw_spin_lock_irq(&running_trace_lock);
-			list_add(&bt->running_list, &running_trace_list);
-			raw_spin_unlock_irq(&running_trace_lock);
-
-			trace_note_time(bt);
-			ret = 0;
-		}
-	} else {
-		if (bt->trace_state == Blktrace_running) {
-			bt->trace_state = Blktrace_stopped;
-			raw_spin_lock_irq(&running_trace_lock);
-			list_del_init(&bt->running_list);
-			raw_spin_unlock_irq(&running_trace_lock);
-			relay_flush(bt->rchan);
-			ret = 0;
-		}
-	}
-
-	return ret;
+	if (start)
+		return blk_trace_start(bt);
+	else
+		return blk_trace_stop(bt);
 }

 int blk_trace_startstop(struct request_queue *q, int start)
@@ -1614,13 +1618,7 @@ static int blk_trace_remove_queue(struct request_queue *q)
 	if (bt == NULL)
 		return -EINVAL;

-	if (bt->trace_state == Blktrace_running) {
-		bt->trace_state = Blktrace_stopped;
-		raw_spin_lock_irq(&running_trace_lock);
-		list_del_init(&bt->running_list);
-		raw_spin_unlock_irq(&running_trace_lock);
-		relay_flush(bt->rchan);
-	}
+	blk_trace_stop(bt);

 	put_probe_ref();
 	synchronize_rcu();

From dcd1a59c62dc49da75539213611156d6db50ab5d Mon Sep 17 00:00:00 2001
From: Ye Bin
Date: Wed, 19 Oct 2022 11:36:01 +0800
Subject: [PATCH 16/17] blktrace: fix possible memleak in '__blk_trace_remove'

When test as follows:
step1: ioctl(sda, BLKTRACESETUP, &arg)
step2: ioctl(sda, BLKTRACESTART, NULL)
step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
step4: ioctl(sda, BLKTRACESETUP, &arg)
Got issue as follows:
debugfs: File 'dropped' in directory 'sda' already present!
debugfs: File 'msg' in directory 'sda' already present!
debugfs: File 'trace0' in directory 'sda' already present!

And also find syzkaller report issue like "KASAN: use-after-free Read in relay_switch_subbuf"
"https://syzkaller.appspot.com/bug?id=13849f0d9b1b818b087341691be6cc3ac6a6bfb7"

If remove block trace without stop(BLKTRACESTOP) block trace, '__blk_trace_remove'
will just set 'q->blk_trace' with NULL. However, debugfs file isn't removed, so
will report file already present when call BLKTRACESETUP.
static int __blk_trace_remove(struct request_queue *q)
{
	struct blk_trace *bt;

	bt = rcu_replace_pointer(q->blk_trace, NULL,
				 lockdep_is_held(&q->debugfs_mutex));
	if (!bt)
		return -EINVAL;

	if (bt->trace_state != Blktrace_running)
		blk_trace_cleanup(q, bt);

	return 0;
}

If do test as follows:
step1: ioctl(sda, BLKTRACESETUP, &arg)
step2: ioctl(sda, BLKTRACESTART, NULL)
step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
step4: remove sda

There will remove debugfs directory which will remove recursively all file
under directory.
>> blk_release_queue
>>	debugfs_remove_recursive(q->debugfs_dir)
So all files which created in 'do_blk_trace_setup' are removed, and
'dentry->d_inode' is NULL. But 'q->blk_trace' is still in 'running_trace_lock',
'trace_note_tsk' will traverse 'running_trace_lock' all nodes.
>>trace_note_tsk
>>  trace_note
>>    relay_reserve
>>       relay_switch_subbuf
>>        d_inode(buf->dentry)->i_size

To solve above issues, reference commit '5afedf670caf', call 'blk_trace_cleanup'
unconditionally in '__blk_trace_remove' and first stop block trace in
'blk_trace_cleanup'.

Signed-off-by: Ye Bin
Reviewed-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20221019033602.752383-3-yebin@huaweicloud.com
Signed-off-by: Jens Axboe
---
 kernel/trace/blktrace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 50b6f241b5f7..e17bba027a2c 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -379,6 +379,7 @@ static int blk_trace_stop(struct blk_trace *bt)

 static void blk_trace_cleanup(struct request_queue *q, struct blk_trace *bt)
 {
+	blk_trace_stop(bt);
 	synchronize_rcu();
 	blk_trace_free(q, bt);
 	put_probe_ref();
@@ -393,8 +394,7 @@ static int __blk_trace_remove(struct request_queue *q)
 	if (!bt)
 		return -EINVAL;

-	if (bt->trace_state != Blktrace_running)
-		blk_trace_cleanup(q, bt);
+	blk_trace_cleanup(q, bt);

 	return 0;
 }

From 2db96217e7e515071726ca4ec791742c4202a1b2 Mon Sep 17 00:00:00 2001
From: Ye Bin
Date: Wed, 19 Oct 2022 11:36:02 +0800
Subject: [PATCH 17/17] blktrace: remove unnessary stop block trace in 'blk_trace_shutdown'

As previous commit, 'blk_trace_cleanup' will stop block trace if
block trace's state is 'Blktrace_running'.
So remove unnessary stop block trace in 'blk_trace_shutdown'.

Signed-off-by: Ye Bin
Reviewed-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20221019033602.752383-4-yebin@huaweicloud.com
Signed-off-by: Jens Axboe
---
 kernel/trace/blktrace.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index e17bba027a2c..a995ea1ef849 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -776,10 +776,8 @@ int blk_trace_ioctl(struct block_device *bdev, unsigned cmd, char __user *arg)
 void blk_trace_shutdown(struct request_queue *q)
 {
 	if (rcu_dereference_protected(q->blk_trace,
-				      lockdep_is_held(&q->debugfs_mutex))) {
-		__blk_trace_startstop(q, 0);
+				      lockdep_is_held(&q->debugfs_mutex)))
 		__blk_trace_remove(q);
-	}
 }

 #ifdef CONFIG_BLK_CGROUP