From 8184035805dc87dd826101b930d3dce97758f7b1 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Thu, 13 May 2021 15:37:30 -0500 Subject: [PATCH 01/78] rsxx: Use struct_size() in vmalloc() Make use of the struct_size() helper instead of an open-coded version, in order to avoid any potential type mistakes or integer overflows that, in the worst scenario, could lead to heap overflows. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva Link: https://lore.kernel.org/r/20210513203730.GA212128@embeddedor Signed-off-by: Jens Axboe --- drivers/block/rsxx/dma.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/block/rsxx/dma.c b/drivers/block/rsxx/dma.c index 0574f4495755..ed182f3dd054 100644 --- a/drivers/block/rsxx/dma.c +++ b/drivers/block/rsxx/dma.c @@ -74,9 +74,6 @@ struct dma_tracker { struct rsxx_dma *dma; }; -#define DMA_TRACKER_LIST_SIZE8 (sizeof(struct dma_tracker_list) + \ - (sizeof(struct dma_tracker) * RSXX_MAX_OUTSTANDING_CMDS)) - struct dma_tracker_list { spinlock_t lock; int head; @@ -808,7 +805,8 @@ static int rsxx_dma_ctrl_init(struct pci_dev *dev, memset(&ctrl->stats, 0, sizeof(ctrl->stats)); - ctrl->trackers = vmalloc(DMA_TRACKER_LIST_SIZE8); + ctrl->trackers = vmalloc(struct_size(ctrl->trackers, list, + RSXX_MAX_OUTSTANDING_CMDS)); if (!ctrl->trackers) return -ENOMEM; From 13ce7e625a3383004181217985a70d16c3cbe8be Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Thu, 13 May 2021 12:59:52 +0100 Subject: [PATCH 02/78] nvme: remove redundant initialization of variable ret The variable ret is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Signed-off-by: Colin Ian King Signed-off-by: Christoph Hellwig --- drivers/nvme/host/rdma.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 37943dc4c2c1..74bf2c7f2b80 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1088,7 +1088,7 @@ static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl) static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new) { - int ret = -EINVAL; + int ret; bool changed; ret = nvme_rdma_configure_admin_queue(ctrl, new); From ebd8a93aa4f50e9e013e6aa7fe601b4ce7565c28 Mon Sep 17 00:00:00 2001 From: Alexey Bogoslavsky Date: Wed, 28 Apr 2021 09:27:36 +0000 Subject: [PATCH 03/78] nvme: extend and modify the APST configuration algorithm The algorithm that was used until now for building the APST configuration table has been found to produce entries with excessively long ITPT (idle time prior to transition) for devices declaring relatively long entry and exit latencies for non-operational power states. This leads to unnecessary waste of power and, as a result, failure to pass mandatory power consumption tests on Chromebook platforms. The new algorithm is based on two predefined ITPT values and two predefined latency tolerances. Based on these values, as well as on exit and entry latencies reported by the device, the algorithm looks for up to 2 suitable non-operational power states to use as primary and secondary APST transition targets. The predefined values are supplied to the nvme driver as module parameters: - apst_primary_timeout_ms (default: 100) - apst_secondary_timeout_ms (default: 2000) - apst_primary_latency_tol_us (default: 15000) - apst_secondary_latency_tol_us (default: 100000) The algorithm echoes the approach used by Intel's and Microsoft's drivers on Windows. The specific default parameter values are also based on those drivers. Yet, this patch doesn't introduce the ability to dynamically regenerate the APST table in the event of switching the power source from AC to battery and back. Adding this functionality may be considered in the future. In the meantime, the timeouts and tolerances reflect a compromise between values used by Microsoft for AC and battery scenarios. In most NVMe devices the new algorithm causes them to implement a more aggressive power saving policy. While beneficial in most cases, this sometimes comes at the price of a higher IO processing latency in certain scenarios as well as at the price of a potential impact on the drive's endurance (due to more frequent context saving when entering deep non- operational states). So in order to provide a fallback for systems where these regressions cannot be tolerated, the patch allows to revert to the legacy behavior by setting either apst_primary_timeout_ms or apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after fine tuning the default values of the module parameters) the legacy behavior can be removed. TESTING. The new algorithm has been extensively tested. Initially, simulations were used to compare APST tables generated by old and new algorithms for a wide range of devices. After that, power consumption, performance and latencies were measured under different workloads on devices from multiple vendors (WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests and the findings. General observations. The effect the patch has on the APST table varies depending on the entry and exit latencies advertised by the devices. For some devices, the effect is negligible (e.g. Kioxia KBG40ZNS), for some significant, making the transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g. SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed: the initial transition happens after a longer idle time, but takes the device to a lower power state. Workflows. In order to evaluate the patch's effect on the power consumption and latency, 7 workflows were used for each device. The workflows were designed to test the scenarios where significant differences between the old and new behaviors are most likely. Each workflow was tested twice: with the new and with the old APST table generation implementation. Power consumption, performance and latency were measured in the process. The following workflows were used: 1) Consecutive write at the maximum rate with IO depth of 2, with no pauses 2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms idle time 3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms idle time 4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms idle time 5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s idle time 6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s idle time 7) Repeated pattern of a single random read of a 4K packet followed by 150ms idle time Power consumption Actual power consumption measurements produced predictable results in accordance with the APST mechanism's theory of operation. Devices with long entry and exit latencies such as WD SN530 showed huge improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia KBG40ZNS where the resulting APST table looks virtually identical with both legacy and new algorithms, showed little or no change in the average power consumption on all workflows. Devices with extra short latencies such as Samsung PM991 showed moderate increase in power consumption of up to 18% in worst case scenarios. In addition, on Intel and Samsung devices a more complex impact was observed on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep non-operational states between the writes the devices start performing background operations leading to an increase of power consumption. With the old APST tables part of these operations are delayed until the scenario is over and a longer idle period begins, but eventually this extra power is consumed anyway. Performance. In terms of performance measured on sustained write or read scenarios, the effect of the patch is minimal as in this case the device doesn't enter low power states. Latency As expected, in devices where the patch causes a more aggressive power saving policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in certain scenarios. Workflow number 7, specifically designed to simulate the worst case scenario as far as latency is concerned, indeed shows a sharp increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on WD SN530). The latency increase on other workloads and other devices is much milder or non-existent. Signed-off-by: Alexey Bogoslavsky Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 89 +++++++++++++++++++++++++++++++++++----- 1 file changed, 78 insertions(+), 11 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 762125f2905f..e7441ccaa8db 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -57,6 +57,26 @@ static bool force_apst; module_param(force_apst, bool, 0644); MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off"); +static unsigned long apst_primary_timeout_ms = 100; +module_param(apst_primary_timeout_ms, ulong, 0644); +MODULE_PARM_DESC(apst_primary_timeout_ms, + "primary APST timeout in ms"); + +static unsigned long apst_secondary_timeout_ms = 2000; +module_param(apst_secondary_timeout_ms, ulong, 0644); +MODULE_PARM_DESC(apst_secondary_timeout_ms, + "secondary APST timeout in ms"); + +static unsigned long apst_primary_latency_tol_us = 15000; +module_param(apst_primary_latency_tol_us, ulong, 0644); +MODULE_PARM_DESC(apst_primary_latency_tol_us, + "primary APST latency tolerance in us"); + +static unsigned long apst_secondary_latency_tol_us = 100000; +module_param(apst_secondary_latency_tol_us, ulong, 0644); +MODULE_PARM_DESC(apst_secondary_latency_tol_us, + "secondary APST latency tolerance in us"); + static bool streams; module_param(streams, bool, 0644); MODULE_PARM_DESC(streams, "turn on support for Streams write directives"); @@ -2217,14 +2237,54 @@ static int nvme_configure_acre(struct nvme_ctrl *ctrl) return ret; } +/* + * The function checks whether the given total (exlat + enlat) latency of + * a power state allows the latter to be used as an APST transition target. + * It does so by comparing the latency to the primary and secondary latency + * tolerances defined by module params. If there's a match, the corresponding + * timeout value is returned and the matching tolerance index (1 or 2) is + * reported. + */ +static bool nvme_apst_get_transition_time(u64 total_latency, + u64 *transition_time, unsigned *last_index) +{ + if (total_latency <= apst_primary_latency_tol_us) { + if (*last_index == 1) + return false; + *last_index = 1; + *transition_time = apst_primary_timeout_ms; + return true; + } + if (apst_secondary_timeout_ms && + total_latency <= apst_secondary_latency_tol_us) { + if (*last_index <= 2) + return false; + *last_index = 2; + *transition_time = apst_secondary_timeout_ms; + return true; + } + return false; +} + /* * APST (Autonomous Power State Transition) lets us program a table of power * state transitions that the controller will perform automatically. - * We configure it with a simple heuristic: we are willing to spend at most 2% - * of the time transitioning between power states. Therefore, when running in - * any given state, we will enter the next lower-power non-operational state - * after waiting 50 * (enlat + exlat) microseconds, as long as that state's exit - * latency is under the requested maximum latency. + * + * Depending on module params, one of the two supported techniques will be used: + * + * - If the parameters provide explicit timeouts and tolerances, they will be + * used to build a table with up to 2 non-operational states to transition to. + * The default parameter values were selected based on the values used by + * Microsoft's and Intel's NVMe drivers. Yet, since we don't implement dynamic + * regeneration of the APST table in the event of switching between external + * and battery power, the timeouts and tolerances reflect a compromise + * between values used by Microsoft for AC and battery scenarios. + * - If not, we'll configure the table with a simple heuristic: we are willing + * to spend at most 2% of the time transitioning between power states. + * Therefore, when running in any given state, we will enter the next + * lower-power non-operational state after waiting 50 * (enlat + exlat) + * microseconds, as long as that state's exit latency is under the requested + * maximum latency. * * We will not autonomously enter any non-operational state for which the total * latency exceeds ps_max_latency_us. @@ -2240,6 +2300,7 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl) int max_ps = -1; int state; int ret; + unsigned last_lt_index = UINT_MAX; /* * If APST isn't supported or if we haven't been initialized yet, @@ -2298,13 +2359,19 @@ static int nvme_configure_apst(struct nvme_ctrl *ctrl) le32_to_cpu(ctrl->psd[state].entry_lat); /* - * This state is good. Use it as the APST idle target for - * higher power states. + * This state is good. It can be used as the APST idle target + * for higher power states. */ - transition_ms = total_latency_us + 19; - do_div(transition_ms, 20); - if (transition_ms > (1 << 24) - 1) - transition_ms = (1 << 24) - 1; + if (apst_primary_timeout_ms && apst_primary_latency_tol_us) { + if (!nvme_apst_get_transition_time(total_latency_us, + &transition_ms, &last_lt_index)) + continue; + } else { + transition_ms = total_latency_us + 19; + do_div(transition_ms, 20); + if (transition_ms > (1 << 24) - 1) + transition_ms = (1 << 24) - 1; + } target = cpu_to_le64((state << 3) | (transition_ms << 8)); if (max_ps == -1) From e21e0243e7b0f1c2a21d21f4d115f7b37175772a Mon Sep 17 00:00:00 2001 From: Mario Limonciello Date: Fri, 28 May 2021 11:02:34 -0500 Subject: [PATCH 04/78] nvme-pci: look for StorageD3Enable on companion ACPI device instead The documentation around the StorageD3Enable property hints that it should be made on the PCI device. This is where newer AMD systems set the property and it's required for S0i3 support. So rather than look for nodes of the root port only present on Intel systems, switch to the companion ACPI device for all systems. David Box from Intel indicated this should work on Intel as well. Link: https://lore.kernel.org/linux-nvme/YK6gmAWqaRmvpJXb@google.com/T/#m900552229fa455867ee29c33b854845fce80ba70 Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Fixes: df4f9bc4fb9c ("nvme-pci: add support for ACPI StorageD3Enable property") Suggested-by: Liang Prike Acked-by: Raul E Rangel Signed-off-by: Mario Limonciello Reviewed-by: David E. Box Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 24 +----------------------- 1 file changed, 1 insertion(+), 23 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index a29b170701fc..3aa7245a505f 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2831,10 +2831,7 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev) #ifdef CONFIG_ACPI static bool nvme_acpi_storage_d3(struct pci_dev *dev) { - struct acpi_device *adev; - struct pci_dev *root; - acpi_handle handle; - acpi_status status; + struct acpi_device *adev = ACPI_COMPANION(&dev->dev); u8 val; /* @@ -2842,28 +2839,9 @@ static bool nvme_acpi_storage_d3(struct pci_dev *dev) * must use D3 to support deep platform power savings during * suspend-to-idle. */ - root = pcie_find_root_port(dev); - if (!root) - return false; - adev = ACPI_COMPANION(&root->dev); if (!adev) return false; - - /* - * The property is defined in the PXSX device for South complex ports - * and in the PEGP device for North complex ports. - */ - status = acpi_get_handle(adev->handle, "PXSX", &handle); - if (ACPI_FAILURE(status)) { - status = acpi_get_handle(adev->handle, "PEGP", &handle); - if (ACPI_FAILURE(status)) - return false; - } - - if (acpi_bus_get_device(handle, &adev)) - return false; - if (fwnode_property_read_u8(acpi_fwnode_handle(adev), "StorageD3Enable", &val)) return false; From 3ede8f72a9a2825efca23a3552e80a1202ea88fd Mon Sep 17 00:00:00 2001 From: Martin Belanger Date: Thu, 20 May 2021 15:09:34 -0400 Subject: [PATCH 05/78] nvme-tcp: allow selecting the network interface for connections In our application, we need a way to force TCP connections to go out a specific IP interface instead of letting Linux select the interface based on the routing tables. Add the 'host-iface' option to allow specifying the interface to use. When the option host-iface is specified, the driver uses the specified interface to set the option SO_BINDTODEVICE on the TCP socket before connecting. This new option is needed in addtion to the existing host-traddr for the following reasons: Specifying an IP interface by its associated IP address is less intuitive than specifying the actual interface name and, in some cases, simply doesn't work. That's because the association between interfaces and IP addresses is not predictable. IP addresses can be changed or can change by themselves over time (e.g. DHCP). Interface names are predictable [1] and will persist over time. Consider the following configuration. 1: lo: mtu 65536 qdisc noqueue state ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 100.0.0.100/24 scope global lo valid_lft forever preferred_lft forever 2: enp0s3: mtu 1500 qdisc ... link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff inet 100.0.0.100/24 scope global enp0s3 valid_lft forever preferred_lft forever 3: enp0s8: mtu 1500 qdisc ... link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff inet 100.0.0.100/24 scope global enp0s8 valid_lft forever preferred_lft forever The above is a VM that I configured with the same IP address (100.0.0.100) on all interfaces. Doing a reverse lookup to identify the unique interface associated with 100.0.0.100 does not work here. And this is why the option host_iface is required. I understand that the above config does not represent a standard host system, but I'm using this to prove a point: "We can never know how users will configure their systems". By te way, The above configuration is perfectly fine by Linux. The current TCP implementation for host_traddr performs a bind()-before-connect(). This is a common construct to set the source IP address on a TCP socket before connecting. This has no effect on how Linux selects the interface for the connection. That's because Linux uses the Weak End System model as described in RFC1122 [2]. On the other hand, setting the Source IP Address has benefits and should be supported by linux-nvme. In fact, setting the Source IP Address is a mandatory FedGov requirement (e.g. connection to a RADIUS/TACACS+ server). Consider the following configuration. $ ip addr list dev enp0s8 3: enp0s8: mtu 1500 qdisc ... link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8 valid_lft 426sec preferred_lft 426sec inet 192.168.56.102/24 scope global secondary enp0s8 valid_lft forever preferred_lft forever inet 192.168.56.103/24 scope global secondary enp0s8 valid_lft forever preferred_lft forever inet 192.168.56.104/24 scope global secondary enp0s8 valid_lft forever preferred_lft forever Here we can see that several addresses are associated with interface enp0s8. By default, Linux always selects the default IP address, 192.168.56.101, as the source address when connecting over interface enp0s8. Some users, however, want the ability to specify a different source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option host_traddr can be used as-is to perform this function. In conclusion, I believe that we need 2 options for TCP connections. One that can be used to specify an interface (host-iface). And one that can be used to set the source address (host-traddr). Users should be allowed to use one or the other, or both, or none. Of course, the documentation for host_traddr will need some clarification. It should state that when used for TCP connection, this option only sets the source address. And the documentation for host_iface should say that this option is only available for TCP connections. References: [1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/ [2] https://tools.ietf.org/html/rfc1122 Tested both IPv4 and IPv6 connections. Signed-off-by: Martin Belanger Reviewed-by: Sagi Grimberg Reviewed-by: Hannes Reinecke Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 5 +++++ drivers/nvme/host/fabrics.c | 14 ++++++++++++++ drivers/nvme/host/fabrics.h | 6 +++++- drivers/nvme/host/tcp.c | 27 ++++++++++++++++++++++++++- 4 files changed, 50 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index e7441ccaa8db..bb8b242594f9 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -4134,6 +4134,11 @@ static int nvme_class_uevent(struct device *dev, struct kobj_uevent_env *env) ret = add_uevent_var(env, "NVME_HOST_TRADDR=%s", opts->host_traddr ?: "none"); + if (ret) + return ret; + + ret = add_uevent_var(env, "NVME_HOST_IFACE=%s", + opts->host_iface ?: "none"); } return ret; } diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index a2bb7fc63a73..76dc3eaf46f3 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -112,6 +112,9 @@ int nvmf_get_address(struct nvme_ctrl *ctrl, char *buf, int size) if (ctrl->opts->mask & NVMF_OPT_HOST_TRADDR) len += scnprintf(buf + len, size - len, "%shost_traddr=%s", (len) ? "," : "", ctrl->opts->host_traddr); + if (ctrl->opts->mask & NVMF_OPT_HOST_IFACE) + len += scnprintf(buf + len, size - len, "%shost_iface=%s", + (len) ? "," : "", ctrl->opts->host_iface); len += scnprintf(buf + len, size - len, "\n"); return len; @@ -545,6 +548,7 @@ static const match_table_t opt_tokens = { { NVMF_OPT_KATO, "keep_alive_tmo=%d" }, { NVMF_OPT_HOSTNQN, "hostnqn=%s" }, { NVMF_OPT_HOST_TRADDR, "host_traddr=%s" }, + { NVMF_OPT_HOST_IFACE, "host_iface=%s" }, { NVMF_OPT_HOST_ID, "hostid=%s" }, { NVMF_OPT_DUP_CONNECT, "duplicate_connect" }, { NVMF_OPT_DISABLE_SQFLOW, "disable_sqflow" }, @@ -754,6 +758,15 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, kfree(opts->host_traddr); opts->host_traddr = p; break; + case NVMF_OPT_HOST_IFACE: + p = match_strdup(args); + if (!p) { + ret = -ENOMEM; + goto out; + } + kfree(opts->host_iface); + opts->host_iface = p; + break; case NVMF_OPT_HOST_ID: p = match_strdup(args); if (!p) { @@ -938,6 +951,7 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts) kfree(opts->trsvcid); kfree(opts->subsysnqn); kfree(opts->host_traddr); + kfree(opts->host_iface); kfree(opts); } EXPORT_SYMBOL_GPL(nvmf_free_options); diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index d7f7974dc208..c31dad69a773 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -66,6 +66,7 @@ enum { NVMF_OPT_NR_POLL_QUEUES = 1 << 18, NVMF_OPT_TOS = 1 << 19, NVMF_OPT_FAIL_FAST_TMO = 1 << 20, + NVMF_OPT_HOST_IFACE = 1 << 21, }; /** @@ -83,7 +84,9 @@ enum { * @trsvcid: The transport-specific TRSVCID field for a port on the * subsystem which is adding a controller. * @host_traddr: A transport-specific field identifying the NVME host port - * to use for the connection to the controller. + * to use for the connection to the controller. + * @host_iface: A transport-specific field identifying the NVME host + * interface to use for the connection to the controller. * @queue_size: Number of IO queue elements. * @nr_io_queues: Number of controller IO queues that will be established. * @reconnect_delay: Time between two consecutive reconnect attempts. @@ -108,6 +111,7 @@ struct nvmf_ctrl_options { char *traddr; char *trsvcid; char *host_traddr; + char *host_iface; size_t queue_size; unsigned int nr_io_queues; unsigned int reconnect_delay; diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 34f4b3402f7c..5fc6c568c626 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -123,6 +123,7 @@ struct nvme_tcp_ctrl { struct blk_mq_tag_set admin_tag_set; struct sockaddr_storage addr; struct sockaddr_storage src_addr; + struct net_device *ndev; struct nvme_ctrl ctrl; struct work_struct err_work; @@ -1455,6 +1456,20 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, } } + if (nctrl->opts->mask & NVMF_OPT_HOST_IFACE) { + char *iface = nctrl->opts->host_iface; + sockptr_t optval = KERNEL_SOCKPTR(iface); + + ret = sock_setsockopt(queue->sock, SOL_SOCKET, SO_BINDTODEVICE, + optval, strlen(iface)); + if (ret) { + dev_err(nctrl->device, + "failed to bind to interface %s queue %d err %d\n", + iface, qid, ret); + goto err_sock; + } + } + queue->hdr_digest = nctrl->opts->hdr_digest; queue->data_digest = nctrl->opts->data_digest; if (queue->hdr_digest || queue->data_digest) { @@ -2515,6 +2530,16 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev, } } + if (opts->mask & NVMF_OPT_HOST_IFACE) { + ctrl->ndev = dev_get_by_name(&init_net, opts->host_iface); + if (!ctrl->ndev) { + pr_err("invalid interface passed: %s\n", + opts->host_iface); + ret = -ENODEV; + goto out_free_ctrl; + } + } + if (!opts->duplicate_connect && nvme_tcp_existing_controller(opts)) { ret = -EALREADY; goto out_free_ctrl; @@ -2571,7 +2596,7 @@ static struct nvmf_transport_ops nvme_tcp_transport = { NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO | NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST | NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES | - NVMF_OPT_TOS, + NVMF_OPT_TOS | NVMF_OPT_HOST_IFACE, .create_ctrl = nvme_tcp_create_ctrl, }; From 25e1de8c40c57bb6be4ecd601641691cfd8a7923 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Fri, 21 May 2021 15:41:57 -0700 Subject: [PATCH 06/78] nvme-fabrics: fix the kerneldco comment for nvmf_log_connect_error() Fix the comment style that matches existing code. No functionality change in this patch. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 76dc3eaf46f3..1d20105bb283 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -257,19 +257,15 @@ int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val) EXPORT_SYMBOL_GPL(nvmf_reg_write32); /** - * nvmf_log_connect_error() - Error-parsing-diagnostic print - * out function for connect() errors. - * - * @ctrl: the specific /dev/nvmeX device that had the error. - * - * @errval: Error code to be decoded in a more human-friendly - * printout. - * - * @offset: For use with the NVMe error code NVME_SC_CONNECT_INVALID_PARAM. - * - * @cmd: This is the SQE portion of a submission capsule. - * - * @data: This is the "Data" portion of a submission capsule. + * nvmf_log_connect_error() - Error-parsing-diagnostic print out function for + * connect() errors. + * @ctrl: The specific /dev/nvmeX device that had the error. + * @errval: Error code to be decoded in a more human-friendly + * printout. + * @offset: For use with the NVMe error code + * NVME_SC_CONNECT_INVALID_PARAM. + * @cmd: This is the SQE portion of a submission capsule. + * @data: This is the "Data" portion of a submission capsule. */ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl, int errval, int offset, struct nvme_command *cmd, From 63d20f54a3d0cff17145716caff03a0d161abf44 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Thu, 3 Jun 2021 10:28:03 +0300 Subject: [PATCH 07/78] nvme-fabrics: remove extra new lines in the switch Remove the extra lines in the switch block that is not common practice in the kernel code. No functionality change in this patch. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 1d20105bb283..d71ffcbc3296 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -274,7 +274,6 @@ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl, int err_sctype = errval & (~NVME_SC_DNR); switch (err_sctype) { - case (NVME_SC_CONNECT_INVALID_PARAM): if (offset >> 16) { char *inv_data = "Connect Invalid Data Parameter"; @@ -317,24 +316,24 @@ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl, } } break; - case NVME_SC_CONNECT_INVALID_HOST: dev_err(ctrl->device, "Connect for subsystem %s is not allowed, hostnqn: %s\n", data->subsysnqn, data->hostnqn); break; - case NVME_SC_CONNECT_CTRL_BUSY: dev_err(ctrl->device, "Connect command failed: controller is busy or not available\n"); break; - case NVME_SC_CONNECT_FORMAT: dev_err(ctrl->device, "Connect incompatible format: %d", cmd->connect.recfmt); break; - + case NVME_SC_HOST_PATH_ERROR: + dev_err(ctrl->device, + "Connect command failed: host path error\n"); + break; default: dev_err(ctrl->device, "Connect command failed, error wo/DNR bit: %d\n", From 6f860c922532afaae33a968b0d1df3ddf9a8d8a7 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Fri, 21 May 2021 15:41:59 -0700 Subject: [PATCH 08/78] nvme-fabrics: remove an extra comment Remove the comment at the end of the switch that is not needed as function is small enough. No functionality change in this patch. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index d71ffcbc3296..78527690c947 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -339,7 +339,7 @@ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl, "Connect command failed, error wo/DNR bit: %d\n", err_sctype); break; - } /* switch (err_sctype) */ + } } /** From 97ba6931ba881ea23f3758bbbde7a07a98bff4f9 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Fri, 21 May 2021 15:42:00 -0700 Subject: [PATCH 09/78] nvme-fabrics: remove extra braces No need to use the braces around ~ operator. No functionality change in this patch. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 78527690c947..1239a63e3ac2 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -271,7 +271,7 @@ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl, int errval, int offset, struct nvme_command *cmd, struct nvmf_connect_data *data) { - int err_sctype = errval & (~NVME_SC_DNR); + int err_sctype = errval & ~NVME_SC_DNR; switch (err_sctype) { case (NVME_SC_CONNECT_INVALID_PARAM): From f423c85cd392241f1521887b1396038cd1e4c68e Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 19 May 2021 09:02:59 +0200 Subject: [PATCH 10/78] nvme: open code nvme_put_ns_from_disk in nvme_ns_head_chr_ioctl nvme_ns_head_chr_ioctl is always used on multipath nodes, so just call srcu_read_unlock and consolidate the two unlock paths. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg --- drivers/nvme/host/ioctl.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 9557ead02de1..0341767ff2e7 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -419,21 +419,19 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd, container_of(cdev, struct nvme_ns_head, cdev); void __user *argp = (void __user *)arg; struct nvme_ns *ns; - int srcu_idx, ret; + int srcu_idx, ret = -EWOULDBLOCK; srcu_idx = srcu_read_lock(&head->srcu); ns = nvme_find_path(head); - if (!ns) { - srcu_read_unlock(&head->srcu, srcu_idx); - return -EWOULDBLOCK; - } + if (!ns) + goto out_unlock; if (is_ctrl_ioctl(cmd)) return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx); ret = nvme_ns_ioctl(ns, cmd, argp); - nvme_put_ns_from_disk(head, srcu_idx); - +out_unlock: + srcu_read_unlock(&head->srcu, srcu_idx); return ret; } #endif /* CONFIG_NVME_MULTIPATH */ From 86b4284d98d6a47033b7bfc5b029a4fc45e4d370 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 19 May 2021 09:04:26 +0200 Subject: [PATCH 11/78] nvme: open code nvme_{get,put}_ns_from_disk in nvme_ns_head_ioctl nvme_ns_head_ioctl is always used on multipath nodes, no need to deal with the de-multiplexers. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg --- drivers/nvme/host/ioctl.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 0341767ff2e7..3f84bd3b9259 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -387,14 +387,15 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd, int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode, unsigned int cmd, unsigned long arg) { - struct nvme_ns_head *head = NULL; + struct nvme_ns_head *head = bdev->bd_disk->private_data; void __user *argp = (void __user *)arg; struct nvme_ns *ns; - int srcu_idx, ret; + int srcu_idx, ret = -EWOULDBLOCK; - ns = nvme_get_ns_from_disk(bdev->bd_disk, &head, &srcu_idx); - if (unlikely(!ns)) - return -EWOULDBLOCK; + srcu_idx = srcu_read_lock(&head->srcu); + ns = nvme_find_path(head); + if (!ns) + goto out_unlock; /* * Handle ioctls that apply to the controller instead of the namespace @@ -402,12 +403,11 @@ int nvme_ns_head_ioctl(struct block_device *bdev, fmode_t mode, * deadlock when deleting namespaces using the passthrough interface. */ if (is_ctrl_ioctl(cmd)) - ret = nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx); - else { - ret = nvme_ns_ioctl(ns, cmd, argp); - nvme_put_ns_from_disk(head, srcu_idx); - } + return nvme_ns_head_ctrl_ioctl(ns, cmd, argp, head, srcu_idx); + ret = nvme_ns_ioctl(ns, cmd, argp); +out_unlock: + srcu_read_unlock(&head->srcu, srcu_idx); return ret; } From 3e7d1a55165bdce2aaf1139ee8889e68eb29c263 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 19 May 2021 09:08:41 +0200 Subject: [PATCH 12/78] nvme: open code nvme_put_ns_from_disk in nvme_ns_head_ctrl_ioctl nvme_ns_head_ctrl_ioctl is always used on multipath nodes, so just call srcu_read_unlock directly. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg --- drivers/nvme/host/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 3f84bd3b9259..2c6969ffe85c 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -377,7 +377,7 @@ static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd, int ret; nvme_get_ctrl(ns->ctrl); - nvme_put_ns_from_disk(head, srcu_idx); + srcu_read_unlock(&head->srcu, srcu_idx); ret = nvme_ctrl_ioctl(ns->ctrl, cmd, argp); nvme_put_ctrl(ctrl); From 85b790a7ae0518dd745bbb97d532b83840d2db04 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 19 May 2021 09:09:56 +0200 Subject: [PATCH 13/78] nvme: add a sparse annotation to nvme_ns_head_ctrl_ioctl Add the __releases annotation to tell sparse that nvme_ns_head_ctrl_ioctl is expected to unlock head->srcu. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg --- drivers/nvme/host/ioctl.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 2c6969ffe85c..2e7780ea0354 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -372,6 +372,7 @@ long nvme_ns_chr_ioctl(struct file *file, unsigned int cmd, unsigned long arg) #ifdef CONFIG_NVME_MULTIPATH static int nvme_ns_head_ctrl_ioctl(struct nvme_ns *ns, unsigned int cmd, void __user *argp, struct nvme_ns_head *head, int srcu_idx) + __releases(&head->srcu) { struct nvme_ctrl *ctrl = ns->ctrl; int ret; From d8ca66e82191a9a95926f7f129028bd362202d5d Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 19 May 2021 09:11:54 +0200 Subject: [PATCH 14/78] nvme: move the CSI sanity check into nvme_ns_report_zones Move the CSI check into nvme_ns_report_zones to clean up the code a little bit and prepare for further refactoring. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg --- drivers/nvme/host/zns.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c index 475dd45c3db4..31e789ecd940 100644 --- a/drivers/nvme/host/zns.c +++ b/drivers/nvme/host/zns.c @@ -180,6 +180,9 @@ static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector, unsigned int nz, i; size_t buflen; + if (ns->head->ids.csi != NVME_CSI_ZNS) + return -EINVAL; + report = nvme_zns_alloc_report_buffer(ns, nr_zones, &buflen); if (!report) return -ENOMEM; @@ -237,11 +240,7 @@ int nvme_report_zones(struct gendisk *disk, sector_t sector, ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx); if (unlikely(!ns)) return -EWOULDBLOCK; - - if (ns->head->ids.csi == NVME_CSI_ZNS) - ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data); - else - ret = -EINVAL; + ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data); nvme_put_ns_from_disk(head, srcu_idx); return ret; From 8b4fb0f968ffe73f619c06cb4040ecaa60327098 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 19 May 2021 09:17:06 +0200 Subject: [PATCH 15/78] nvme: split nvme_report_zones Split multipath support out of nvme_report_zones into a separate helper and simplify the non-multipath version as a result. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg --- drivers/nvme/host/core.c | 11 +++++++++++ drivers/nvme/host/multipath.c | 21 ++++++++++++++++++++- drivers/nvme/host/nvme.h | 7 ++----- drivers/nvme/host/zns.c | 20 ++------------------ 4 files changed, 35 insertions(+), 24 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index bb8b242594f9..47cfc8a28e45 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -2073,6 +2073,17 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len, EXPORT_SYMBOL_GPL(nvme_sec_submit); #endif /* CONFIG_BLK_SED_OPAL */ +#ifdef CONFIG_BLK_DEV_ZONED +static int nvme_report_zones(struct gendisk *disk, sector_t sector, + unsigned int nr_zones, report_zones_cb cb, void *data) +{ + return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb, + data); +} +#else +#define nvme_report_zones NULL +#endif /* CONFIG_BLK_DEV_ZONED */ + static const struct block_device_operations nvme_bdev_ops = { .owner = THIS_MODULE, .ioctl = nvme_ioctl, diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index f81871c7128a..127a17b4c13d 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -349,6 +349,25 @@ static void nvme_ns_head_release(struct gendisk *disk, fmode_t mode) nvme_put_ns_head(disk->private_data); } +#ifdef CONFIG_BLK_DEV_ZONED +static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector, + unsigned int nr_zones, report_zones_cb cb, void *data) +{ + struct nvme_ns_head *head = disk->private_data; + struct nvme_ns *ns; + int srcu_idx, ret = -EWOULDBLOCK; + + srcu_idx = srcu_read_lock(&head->srcu); + ns = nvme_find_path(head); + if (ns) + ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data); + srcu_read_unlock(&head->srcu, srcu_idx); + return ret; +} +#else +#define nvme_ns_head_report_zones NULL +#endif /* CONFIG_BLK_DEV_ZONED */ + const struct block_device_operations nvme_ns_head_ops = { .owner = THIS_MODULE, .submit_bio = nvme_ns_head_submit_bio, @@ -356,7 +375,7 @@ const struct block_device_operations nvme_ns_head_ops = { .release = nvme_ns_head_release, .ioctl = nvme_ns_head_ioctl, .getgeo = nvme_getgeo, - .report_zones = nvme_report_zones, + .report_zones = nvme_ns_head_report_zones, .pr_ops = &nvme_pr_ops, }; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 0015860ec12b..01f41b2bf915 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -810,17 +810,14 @@ static inline void nvme_mpath_start_freeze(struct nvme_subsystem *subsys) #endif /* CONFIG_NVME_MULTIPATH */ int nvme_revalidate_zones(struct nvme_ns *ns); +int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector, + unsigned int nr_zones, report_zones_cb cb, void *data); #ifdef CONFIG_BLK_DEV_ZONED int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf); -int nvme_report_zones(struct gendisk *disk, sector_t sector, - unsigned int nr_zones, report_zones_cb cb, void *data); - blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req, struct nvme_command *cmnd, enum nvme_zone_mgmt_action action); #else -#define nvme_report_zones NULL - static inline blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req, struct nvme_command *cmnd, enum nvme_zone_mgmt_action action) diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c index 31e789ecd940..d95010481fce 100644 --- a/drivers/nvme/host/zns.c +++ b/drivers/nvme/host/zns.c @@ -171,8 +171,8 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns, return cb(&zone, idx, data); } -static int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector, - unsigned int nr_zones, report_zones_cb cb, void *data) +int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector, + unsigned int nr_zones, report_zones_cb cb, void *data) { struct nvme_zone_report *report; struct nvme_command c = { }; @@ -230,22 +230,6 @@ out_free: return ret; } -int nvme_report_zones(struct gendisk *disk, sector_t sector, - unsigned int nr_zones, report_zones_cb cb, void *data) -{ - struct nvme_ns_head *head = NULL; - struct nvme_ns *ns; - int srcu_idx, ret; - - ns = nvme_get_ns_from_disk(disk, &head, &srcu_idx); - if (unlikely(!ns)) - return -EWOULDBLOCK; - ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data); - nvme_put_ns_from_disk(head, srcu_idx); - - return ret; -} - blk_status_t nvme_setup_zone_mgmt_send(struct nvme_ns *ns, struct request *req, struct nvme_command *c, enum nvme_zone_mgmt_action action) { From f1cf35e17ec308c0e76f55c6bccf84fff1a2d71a Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Wed, 19 May 2021 09:22:35 +0200 Subject: [PATCH 16/78] nvme: remove nvme_{get,put}_ns_from_disk Now that only one caller is left remove the helpers by restructuring nvme_pr_command so that it has two helpers for sending a command of to a given nsid using either the ns_head for multipath, or the namespace stored in the gendisk. Signed-off-by: Christoph Hellwig Reviewed-by: Keith Busch Reviewed-by: Sagi Grimberg --- drivers/nvme/host/core.c | 68 ++++++++++++++++------------------------ drivers/nvme/host/nvme.h | 5 +-- 2 files changed, 28 insertions(+), 45 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 47cfc8a28e45..177cae44b612 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -1542,36 +1542,6 @@ static void nvme_enable_aen(struct nvme_ctrl *ctrl) queue_work(nvme_wq, &ctrl->async_event_work); } -/* - * Issue ioctl requests on the first available path. Note that unlike normal - * block layer requests we will not retry failed request on another controller. - */ -struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk, - struct nvme_ns_head **head, int *srcu_idx) -{ -#ifdef CONFIG_NVME_MULTIPATH - if (disk->fops == &nvme_ns_head_ops) { - struct nvme_ns *ns; - - *head = disk->private_data; - *srcu_idx = srcu_read_lock(&(*head)->srcu); - ns = nvme_find_path(*head); - if (!ns) - srcu_read_unlock(&(*head)->srcu, *srcu_idx); - return ns; - } -#endif - *head = NULL; - *srcu_idx = -1; - return disk->private_data; -} - -void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx) -{ - if (head) - srcu_read_unlock(&head->srcu, idx); -} - static int nvme_ns_open(struct nvme_ns *ns) { @@ -1968,30 +1938,46 @@ static char nvme_pr_type(enum pr_type type) } }; +static int nvme_send_ns_head_pr_command(struct block_device *bdev, + struct nvme_command *c, u8 data[16]) +{ + struct nvme_ns_head *head = bdev->bd_disk->private_data; + int srcu_idx = srcu_read_lock(&head->srcu); + struct nvme_ns *ns = nvme_find_path(head); + int ret = -EWOULDBLOCK; + + if (ns) { + c->common.nsid = cpu_to_le32(ns->head->ns_id); + ret = nvme_submit_sync_cmd(ns->queue, c, data, 16); + } + srcu_read_unlock(&head->srcu, srcu_idx); + return ret; +} + +static int nvme_send_ns_pr_command(struct nvme_ns *ns, struct nvme_command *c, + u8 data[16]) +{ + c->common.nsid = cpu_to_le32(ns->head->ns_id); + return nvme_submit_sync_cmd(ns->queue, c, data, 16); +} + static int nvme_pr_command(struct block_device *bdev, u32 cdw10, u64 key, u64 sa_key, u8 op) { - struct nvme_ns_head *head = NULL; - struct nvme_ns *ns; struct nvme_command c; - int srcu_idx, ret; u8 data[16] = { 0, }; - ns = nvme_get_ns_from_disk(bdev->bd_disk, &head, &srcu_idx); - if (unlikely(!ns)) - return -EWOULDBLOCK; - put_unaligned_le64(key, &data[0]); put_unaligned_le64(sa_key, &data[8]); memset(&c, 0, sizeof(c)); c.common.opcode = op; - c.common.nsid = cpu_to_le32(ns->head->ns_id); c.common.cdw10 = cpu_to_le32(cdw10); - ret = nvme_submit_sync_cmd(ns->queue, &c, data, 16); - nvme_put_ns_from_disk(head, srcu_idx); - return ret; + if (IS_ENABLED(CONFIG_NVME_MULTIPATH) && + bdev->bd_disk->fops == &nvme_ns_head_ops) + return nvme_send_ns_head_pr_command(bdev, &c, data); + return nvme_send_ns_pr_command(bdev->bd_disk->private_data, &c, data); } static int nvme_pr_register(struct block_device *bdev, u64 old, diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 01f41b2bf915..1f397ecba16c 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -674,9 +674,6 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl); void nvme_queue_scan(struct nvme_ctrl *ctrl); int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp, u8 csi, void *log, size_t size, u64 offset); -struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk, - struct nvme_ns_head **head, int *srcu_idx); -void nvme_put_ns_from_disk(struct nvme_ns_head *head, int idx); bool nvme_tryget_ns_head(struct nvme_ns_head *head); void nvme_put_ns_head(struct nvme_ns_head *head); int nvme_cdev_add(struct cdev *cdev, struct device *cdev_device, @@ -697,6 +694,7 @@ extern const struct attribute_group *nvme_ns_id_attr_groups[]; extern const struct pr_ops nvme_pr_ops; extern const struct block_device_operations nvme_ns_head_ops; +struct nvme_ns *nvme_find_path(struct nvme_ns_head *head); #ifdef CONFIG_NVME_MULTIPATH static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl) { @@ -718,7 +716,6 @@ void nvme_mpath_uninit(struct nvme_ctrl *ctrl); void nvme_mpath_stop(struct nvme_ctrl *ctrl); bool nvme_mpath_clear_current_path(struct nvme_ns *ns); void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl); -struct nvme_ns *nvme_find_path(struct nvme_ns_head *head); static inline void nvme_mpath_check_last_path(struct nvme_ns *ns) { From f6e8bd59c4e84820fc5f6c404730ef872439548a Mon Sep 17 00:00:00 2001 From: Amit Engel Date: Thu, 22 Apr 2021 15:33:16 +0300 Subject: [PATCH 17/78] nvmet: move ka_work initialization to nvmet_alloc_ctrl Initialize keep-alive work only once, as part of alloc_ctrl and not each time that nvmet_start_keep_alive_timer is being called Signed-off-by: Amit Engel Reviewed-by: Hou Pu --- drivers/nvme/target/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 1853db38b682..4ae4bea6625d 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -412,7 +412,6 @@ void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl) pr_debug("ctrl %d start keep-alive timer for %d secs\n", ctrl->cntlid, ctrl->kato); - INIT_DELAYED_WORK(&ctrl->ka_work, nvmet_keep_alive_timer); schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ); } @@ -1352,6 +1351,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, INIT_LIST_HEAD(&ctrl->async_events); INIT_RADIX_TREE(&ctrl->p2p_ns_map, GFP_KERNEL); INIT_WORK(&ctrl->fatal_err_work, nvmet_fatal_error_handler); + INIT_DELAYED_WORK(&ctrl->ka_work, nvmet_keep_alive_timer); memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE); memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE); From 346ac785badf66120d8b4c7b48f87b0a536f691e Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 2 Jun 2021 17:37:58 -0700 Subject: [PATCH 18/78] nvmet: remove a superfluous variable Remove the superfluous variable "bdev" that is only used once in the nvmet_bdev_alloc_bip() and use req->ns->bdev that is used everywhere in the code to access the nvmet request's bdev. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/io-cmd-bdev.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index 429263ca9b97..f673679d258a 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -174,11 +174,10 @@ static int nvmet_bdev_alloc_bip(struct nvmet_req *req, struct bio *bio, { struct blk_integrity *bi; struct bio_integrity_payload *bip; - struct block_device *bdev = req->ns->bdev; int rc; size_t resid, len; - bi = bdev_get_integrity(bdev); + bi = bdev_get_integrity(req->ns->bdev); if (unlikely(!bi)) { pr_err("Unable to locate bio_integrity\n"); return -ENODEV; From 76cdb09b38afb4ffb031b56ebc41cb33ddcd85fb Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 9 Jun 2021 20:11:25 +0800 Subject: [PATCH 19/78] aoe: remove unnecessary oom message Fixes scripts/checkpatch.pl warning: WARNING: Possible unnecessary 'out of memory' message Remove it can help us save a bit of memory. Signed-off-by: Zhen Lei Signed-off-by: Jens Axboe --- drivers/block/aoe/aoechr.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/block/aoe/aoechr.c b/drivers/block/aoe/aoechr.c index ab41be625a53..8eea2529da20 100644 --- a/drivers/block/aoe/aoechr.c +++ b/drivers/block/aoe/aoechr.c @@ -140,10 +140,8 @@ bail: spin_unlock_irqrestore(&emsgs_lock, flags); } mp = kmemdup(msg, n, GFP_ATOMIC); - if (mp == NULL) { - printk(KERN_ERR "aoe: allocation failure, len=%ld\n", n); + if (!mp) goto bail; - } em->msg = mp; em->flags |= EMFL_VALID; From 8404e19194813d850e89fb3504223c09aa9776f3 Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 9 Jun 2021 20:14:26 +0800 Subject: [PATCH 20/78] drbd: remove unnecessary oom message Fixes scripts/checkpatch.pl warning: WARNING: Possible unnecessary 'out of memory' message Remove it can help us save a bit of memory. Signed-off-by: Zhen Lei Signed-off-by: Jens Axboe --- drivers/block/drbd/drbd_receiver.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c index 69284ebba786..1f740e42e457 100644 --- a/drivers/block/drbd/drbd_receiver.c +++ b/drivers/block/drbd/drbd_receiver.c @@ -3770,10 +3770,8 @@ static int receive_protocol(struct drbd_connection *connection, struct packet_in } new_net_conf = kmalloc(sizeof(struct net_conf), GFP_KERNEL); - if (!new_net_conf) { - drbd_err(connection, "Allocation of new net_conf failed\n"); + if (!new_net_conf) goto disconnect; - } mutex_lock(&connection->data.mutex); mutex_lock(&connection->resource->conf_update); @@ -4020,10 +4018,8 @@ static int receive_SyncParam(struct drbd_connection *connection, struct packet_i if (verify_tfm || csums_tfm) { new_net_conf = kzalloc(sizeof(struct net_conf), GFP_KERNEL); - if (!new_net_conf) { - drbd_err(device, "Allocation of new net_conf failed\n"); + if (!new_net_conf) goto disconnect; - } *new_net_conf = *old_net_conf; @@ -4161,7 +4157,6 @@ static int receive_sizes(struct drbd_connection *connection, struct packet_info new_disk_conf = kzalloc(sizeof(struct disk_conf), GFP_KERNEL); if (!new_disk_conf) { - drbd_err(device, "Allocation of new disk_conf failed\n"); put_ldev(device); return -ENOMEM; } @@ -4288,10 +4283,8 @@ static int receive_uuids(struct drbd_connection *connection, struct packet_info device = peer_device->device; p_uuid = kmalloc_array(UI_EXTENDED_SIZE, sizeof(*p_uuid), GFP_NOIO); - if (!p_uuid) { - drbd_err(device, "kmalloc of p_uuid failed\n"); + if (!p_uuid) return false; - } for (i = UI_CURRENT; i < UI_EXTENDED_SIZE; i++) p_uuid[i] = be64_to_cpu(p->uuid[i]); @@ -5484,8 +5477,7 @@ static int drbd_do_auth(struct drbd_connection *connection) } peers_ch = kmalloc(pi.size, GFP_NOIO); - if (peers_ch == NULL) { - drbd_err(connection, "kmalloc of peers_ch failed\n"); + if (!peers_ch) { rv = -1; goto fail; } @@ -5504,8 +5496,7 @@ static int drbd_do_auth(struct drbd_connection *connection) resp_size = crypto_shash_digestsize(connection->cram_hmac_tfm); response = kmalloc(resp_size, GFP_NOIO); - if (response == NULL) { - drbd_err(connection, "kmalloc of response failed\n"); + if (!response) { rv = -1; goto fail; } @@ -5552,8 +5543,7 @@ static int drbd_do_auth(struct drbd_connection *connection) } right_response = kmalloc(resp_size, GFP_NOIO); - if (right_response == NULL) { - drbd_err(connection, "kmalloc of right_response failed\n"); + if (!right_response) { rv = -1; goto fail; } From ce9a8ca68aec3fe2b817e38d169b792214e5fda0 Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 9 Jun 2021 20:19:58 +0800 Subject: [PATCH 21/78] mtip32xx: remove unnecessary oom message Fixes scripts/checkpatch.pl warning: WARNING: Possible unnecessary 'out of memory' message Remove it can help us save a bit of memory. Signed-off-by: Zhen Lei Signed-off-by: Jens Axboe --- drivers/block/mtip32xx/mtip32xx.c | 26 +++++--------------------- 1 file changed, 5 insertions(+), 21 deletions(-) diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c index 589cb0f1e030..ff3e7b3f5ad8 100644 --- a/drivers/block/mtip32xx/mtip32xx.c +++ b/drivers/block/mtip32xx/mtip32xx.c @@ -2238,7 +2238,6 @@ static ssize_t show_device_status(struct device_driver *drv, char *buf) static ssize_t mtip_hw_read_device_status(struct file *f, char __user *ubuf, size_t len, loff_t *offset) { - struct driver_data *dd = (struct driver_data *)f->private_data; int size = *offset; char *buf; int rv = 0; @@ -2247,11 +2246,8 @@ static ssize_t mtip_hw_read_device_status(struct file *f, char __user *ubuf, return 0; buf = kzalloc(MTIP_DFS_MAX_BUF_SIZE, GFP_KERNEL); - if (!buf) { - dev_err(&dd->pdev->dev, - "Memory allocation: status buffer\n"); + if (!buf) return -ENOMEM; - } size += show_device_status(NULL, buf); @@ -2277,11 +2273,8 @@ static ssize_t mtip_hw_read_registers(struct file *f, char __user *ubuf, return 0; buf = kzalloc(MTIP_DFS_MAX_BUF_SIZE, GFP_KERNEL); - if (!buf) { - dev_err(&dd->pdev->dev, - "Memory allocation: register buffer\n"); + if (!buf) return -ENOMEM; - } size += sprintf(&buf[size], "H/ S ACTive : [ 0x"); @@ -2343,11 +2336,8 @@ static ssize_t mtip_hw_read_flags(struct file *f, char __user *ubuf, return 0; buf = kzalloc(MTIP_DFS_MAX_BUF_SIZE, GFP_KERNEL); - if (!buf) { - dev_err(&dd->pdev->dev, - "Memory allocation: flag buffer\n"); + if (!buf) return -ENOMEM; - } size += sprintf(&buf[size], "Flag-port : [ %08lX ]\n", dd->port->flags); @@ -2884,11 +2874,8 @@ static int mtip_hw_init(struct driver_data *dd) dd->port = kzalloc_node(sizeof(struct mtip_port), GFP_KERNEL, dd->numa_node); - if (!dd->port) { - dev_err(&dd->pdev->dev, - "Memory allocation: port structure\n"); + if (!dd->port) return -ENOMEM; - } /* Continue workqueue setup */ for (i = 0; i < MTIP_MAX_SLOT_GROUPS; i++) @@ -4002,11 +3989,8 @@ static int mtip_pci_probe(struct pci_dev *pdev, cpu_to_node(raw_smp_processor_id()), raw_smp_processor_id()); dd = kzalloc_node(sizeof(struct driver_data), GFP_KERNEL, my_node); - if (dd == NULL) { - dev_err(&pdev->dev, - "Unable to allocate memory for driver data\n"); + if (!dd) return -ENOMEM; - } /* Attach the private data to this PCI device. */ pci_set_drvdata(pdev, dd); From 6597efa6c58fa9f02f624e3e99bb00e73c32bcb2 Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 9 Jun 2021 20:23:27 +0800 Subject: [PATCH 22/78] sunvdc: remove unnecessary oom message Fixes scripts/checkpatch.pl warning: WARNING: Possible unnecessary 'out of memory' message Remove it can help us save a bit of memory. Signed-off-by: Zhen Lei Signed-off-by: Jens Axboe --- drivers/block/sunvdc.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/block/sunvdc.c b/drivers/block/sunvdc.c index 39aeebc6837d..448970491bf8 100644 --- a/drivers/block/sunvdc.c +++ b/drivers/block/sunvdc.c @@ -1001,9 +1001,8 @@ static int vdc_port_probe(struct vio_dev *vdev, const struct vio_device_id *id) } port = kzalloc(sizeof(*port), GFP_KERNEL); - err = -ENOMEM; if (!port) { - printk(KERN_ERR PFX "Cannot allocate vdc_port.\n"); + err = -ENOMEM; goto err_out_release_mdesc; } From c744b06254a3a163c6bcf70bb21f0241107271fc Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 9 Jun 2021 20:24:50 +0800 Subject: [PATCH 23/78] sx8: remove unnecessary oom message Fixes scripts/checkpatch.pl warning: WARNING: Possible unnecessary 'out of memory' message Remove it can help us save a bit of memory. Signed-off-by: Zhen Lei Signed-off-by: Jens Axboe --- drivers/block/sx8.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/block/sx8.c b/drivers/block/sx8.c index 2cdf2771f8e8..71dcfde042ab 100644 --- a/drivers/block/sx8.c +++ b/drivers/block/sx8.c @@ -1429,8 +1429,6 @@ static int carm_init_one (struct pci_dev *pdev, const struct pci_device_id *ent) host = kzalloc(sizeof(*host), GFP_KERNEL); if (!host) { - printk(KERN_ERR DRV_NAME "(%s): memory alloc failure\n", - pci_name(pdev)); rc = -ENOMEM; goto err_out_regions; } From ec1e7e8853b62cb971828d66a1f298a280182831 Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 9 Jun 2021 20:27:39 +0800 Subject: [PATCH 24/78] z2ram: remove unnecessary oom message Fixes scripts/checkpatch.pl warning: WARNING: Possible unnecessary 'out of memory' message Remove it can help us save a bit of memory. Signed-off-by: Zhen Lei Signed-off-by: Jens Axboe --- drivers/block/z2ram.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/drivers/block/z2ram.c b/drivers/block/z2ram.c index c1d20818e649..dce119f697a7 100644 --- a/drivers/block/z2ram.c +++ b/drivers/block/z2ram.c @@ -236,11 +236,8 @@ static int z2_open(struct block_device *bdev, fmode_t mode) case Z2MINOR_Z2ONLY: z2ram_map = kmalloc(max_z2_map, GFP_KERNEL); - if (z2ram_map == NULL) { - printk(KERN_ERR DEVICE_NAME - ": cannot get mem for z2ram_map\n"); + if (!z2ram_map) goto err_out; - } get_z2ram(); @@ -253,11 +250,8 @@ static int z2_open(struct block_device *bdev, fmode_t mode) case Z2MINOR_CHIPONLY: z2ram_map = kmalloc(max_chip_map, GFP_KERNEL); - if (z2ram_map == NULL) { - printk(KERN_ERR DEVICE_NAME - ": cannot get mem for z2ram_map\n"); + if (!z2ram_map) goto err_out; - } get_chipram(); From d07f3b081ee632268786601f55e1334d1f68b997 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Tue, 8 Jun 2021 18:13:27 +0200 Subject: [PATCH 25/78] mark pstore-blk as broken pstore-blk just pokes directly into the pagecache for the block device without going through the file operations for that by faking up it's own file operations that do not match the block device ones. As this breaks the control of the block layer of it's page cache, and even now just works by accident only the best thing is to just disable this driver. Fixes: 17639f67c1d6 ("pstore/blk: Introduce backend for block devices") Signed-off-by: Christoph Hellwig Link: https://lore.kernel.org/r/20210608161327.1537919-1-hch@lst.de Signed-off-by: Jens Axboe --- fs/pstore/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig index 8adabde685f1..328da35da390 100644 --- a/fs/pstore/Kconfig +++ b/fs/pstore/Kconfig @@ -173,6 +173,7 @@ config PSTORE_BLK tristate "Log panic/oops to a block device" depends on PSTORE depends on BLOCK + depends on BROKEN select PSTORE_ZONE default n help From ad3fc798800fb7ca04c1dfc439dba946818048d8 Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 25 May 2021 17:46:16 +0800 Subject: [PATCH 26/78] md: revert io stats accounting The commit 41d2d848e5c0 ("md: improve io stats accounting") could cause double fault problem per the report [1], and also it is not correct to change ->bi_end_io if md don't own it, so let's revert it. And io stats accounting will be replemented in later commits. [1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t Fixes: 41d2d848e5c0 ("md: improve io stats accounting") Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/md.c | 45 --------------------------------------------- drivers/md/md.h | 1 - 2 files changed, 46 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 49f897fbb89b..7ba00e4c862d 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -441,30 +441,6 @@ check_suspended: } EXPORT_SYMBOL(md_handle_request); -struct md_io { - struct mddev *mddev; - bio_end_io_t *orig_bi_end_io; - void *orig_bi_private; - struct block_device *orig_bi_bdev; - unsigned long start_time; -}; - -static void md_end_io(struct bio *bio) -{ - struct md_io *md_io = bio->bi_private; - struct mddev *mddev = md_io->mddev; - - bio_end_io_acct_remapped(bio, md_io->start_time, md_io->orig_bi_bdev); - - bio->bi_end_io = md_io->orig_bi_end_io; - bio->bi_private = md_io->orig_bi_private; - - mempool_free(md_io, &mddev->md_io_pool); - - if (bio->bi_end_io) - bio->bi_end_io(bio); -} - static blk_qc_t md_submit_bio(struct bio *bio) { const int rw = bio_data_dir(bio); @@ -489,21 +465,6 @@ static blk_qc_t md_submit_bio(struct bio *bio) return BLK_QC_T_NONE; } - if (bio->bi_end_io != md_end_io) { - struct md_io *md_io; - - md_io = mempool_alloc(&mddev->md_io_pool, GFP_NOIO); - md_io->mddev = mddev; - md_io->orig_bi_end_io = bio->bi_end_io; - md_io->orig_bi_private = bio->bi_private; - md_io->orig_bi_bdev = bio->bi_bdev; - - bio->bi_end_io = md_end_io; - bio->bi_private = md_io; - - md_io->start_time = bio_start_io_acct(bio); - } - /* bio could be mergeable after passing to underlayer */ bio->bi_opf &= ~REQ_NOMERGE; @@ -5608,7 +5569,6 @@ static void md_free(struct kobject *ko) bioset_exit(&mddev->bio_set); bioset_exit(&mddev->sync_set); - mempool_exit(&mddev->md_io_pool); kfree(mddev); } @@ -5705,11 +5665,6 @@ static int md_alloc(dev_t dev, char *name) */ mddev->hold_active = UNTIL_STOP; - error = mempool_init_kmalloc_pool(&mddev->md_io_pool, BIO_POOL_SIZE, - sizeof(struct md_io)); - if (error) - goto abort; - error = -ENOMEM; mddev->queue = blk_alloc_queue(NUMA_NO_NODE); if (!mddev->queue) diff --git a/drivers/md/md.h b/drivers/md/md.h index fb7eab58cfd5..4da240ffe2c5 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -487,7 +487,6 @@ struct mddev { struct bio_set sync_set; /* for sync operations like * metadata and bitmap writes */ - mempool_t md_io_pool; /* Generic flush handling. * The last to finish preflush schedules a worker to submit From 10764815ff4728d2c57da677cd5d3dd6f446cf5f Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 25 May 2021 17:46:17 +0800 Subject: [PATCH 27/78] md: add io accounting for raid0 and raid5 We introduce a new bioset (io_acct_set) for raid0 and raid5 since they don't own clone infrastructure to accounting io. And the bioset is added to mddev instead of to raid0 and raid5 layer, because with this way, we can put common functions to md.h and reuse them in raid0 and raid5. Also struct md_io_acct is added accordingly which includes io start_time, the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct to get related io status. Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/md.c | 51 +++++++++++++++++++++++++++++++++++++++++++--- drivers/md/md.h | 8 ++++++++ drivers/md/raid0.c | 3 +++ drivers/md/raid5.c | 9 ++++++++ 4 files changed, 68 insertions(+), 3 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 7ba00e4c862d..843e13666e3f 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2340,7 +2340,8 @@ int md_integrity_register(struct mddev *mddev) bdev_get_integrity(reference->bdev)); pr_debug("md: data integrity enabled on %s\n", mdname(mddev)); - if (bioset_integrity_create(&mddev->bio_set, BIO_POOL_SIZE)) { + if (bioset_integrity_create(&mddev->bio_set, BIO_POOL_SIZE) || + bioset_integrity_create(&mddev->io_acct_set, BIO_POOL_SIZE)) { pr_err("md: failed to create integrity pool for %s\n", mdname(mddev)); return -EINVAL; @@ -5569,6 +5570,7 @@ static void md_free(struct kobject *ko) bioset_exit(&mddev->bio_set); bioset_exit(&mddev->sync_set); + bioset_exit(&mddev->io_acct_set); kfree(mddev); } @@ -5862,7 +5864,13 @@ int md_run(struct mddev *mddev) if (!bioset_initialized(&mddev->sync_set)) { err = bioset_init(&mddev->sync_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS); if (err) - return err; + goto exit_bio_set; + } + if (!bioset_initialized(&mddev->io_acct_set)) { + err = bioset_init(&mddev->io_acct_set, BIO_POOL_SIZE, + offsetof(struct md_io_acct, bio_clone), 0); + if (err) + goto exit_sync_set; } spin_lock(&pers_lock); @@ -5990,6 +5998,7 @@ int md_run(struct mddev *mddev) blk_queue_flag_set(QUEUE_FLAG_NONROT, mddev->queue); else blk_queue_flag_clear(QUEUE_FLAG_NONROT, mddev->queue); + blk_queue_flag_set(QUEUE_FLAG_IO_STAT, mddev->queue); } if (pers->sync_request) { if (mddev->kobj.sd && @@ -6039,8 +6048,11 @@ bitmap_abort: module_put(pers->owner); md_bitmap_destroy(mddev); abort: - bioset_exit(&mddev->bio_set); + bioset_exit(&mddev->io_acct_set); +exit_sync_set: bioset_exit(&mddev->sync_set); +exit_bio_set: + bioset_exit(&mddev->bio_set); return err; } EXPORT_SYMBOL_GPL(md_run); @@ -6264,6 +6276,7 @@ void md_stop(struct mddev *mddev) __md_stop(mddev); bioset_exit(&mddev->bio_set); bioset_exit(&mddev->sync_set); + bioset_exit(&mddev->io_acct_set); } EXPORT_SYMBOL_GPL(md_stop); @@ -8568,6 +8581,38 @@ void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev, } EXPORT_SYMBOL_GPL(md_submit_discard_bio); +static void md_end_io_acct(struct bio *bio) +{ + struct md_io_acct *md_io_acct = bio->bi_private; + struct bio *orig_bio = md_io_acct->orig_bio; + + orig_bio->bi_status = bio->bi_status; + + bio_end_io_acct(orig_bio, md_io_acct->start_time); + bio_put(bio); + bio_endio(orig_bio); +} + +/* used by personalities (raid0 and raid5) to account io stats */ +void md_account_bio(struct mddev *mddev, struct bio **bio) +{ + struct md_io_acct *md_io_acct; + struct bio *clone; + + if (!blk_queue_io_stat((*bio)->bi_bdev->bd_disk->queue)) + return; + + clone = bio_clone_fast(*bio, GFP_NOIO, &mddev->io_acct_set); + md_io_acct = container_of(clone, struct md_io_acct, bio_clone); + md_io_acct->orig_bio = *bio; + md_io_acct->start_time = bio_start_io_acct(*bio); + + clone->bi_end_io = md_end_io_acct; + clone->bi_private = md_io_acct; + *bio = clone; +} +EXPORT_SYMBOL_GPL(md_account_bio); + /* md_allow_write(mddev) * Calling this ensures that the array is marked 'active' so that writes * may proceed without blocking. It is important to call this before diff --git a/drivers/md/md.h b/drivers/md/md.h index 4da240ffe2c5..4191f22acce4 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -487,6 +487,7 @@ struct mddev { struct bio_set sync_set; /* for sync operations like * metadata and bitmap writes */ + struct bio_set io_acct_set; /* for raid0 and raid5 io accounting */ /* Generic flush handling. * The last to finish preflush schedules a worker to submit @@ -683,6 +684,12 @@ struct md_thread { void *private; }; +struct md_io_acct { + struct bio *orig_bio; + unsigned long start_time; + struct bio bio_clone; +}; + #define THREAD_WAKEUP 0 static inline void safe_put_page(struct page *p) @@ -714,6 +721,7 @@ extern void md_error(struct mddev *mddev, struct md_rdev *rdev); extern void md_finish_reshape(struct mddev *mddev); void md_submit_discard_bio(struct mddev *mddev, struct md_rdev *rdev, struct bio *bio, sector_t start, sector_t size); +void md_account_bio(struct mddev *mddev, struct bio **bio); extern bool __must_check md_flush_request(struct mddev *mddev, struct bio *bio); extern void md_super_write(struct mddev *mddev, struct md_rdev *rdev, diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index e5d7411cba9b..62c8b6adac70 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -546,6 +546,9 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio) bio = split; } + if (bio->bi_pool != &mddev->bio_set) + md_account_bio(mddev, &bio); + orig_sector = sector; zone = find_zone(mddev->private, §or); switch (conf->layout) { diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 841e1c1aa5e6..58e9dbc0f683 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5468,6 +5468,7 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio) sector_t sector = raid_bio->bi_iter.bi_sector; unsigned chunk_sects = mddev->chunk_sectors; unsigned sectors = chunk_sects - (sector & (chunk_sects-1)); + struct r5conf *conf = mddev->private; if (sectors < bio_sectors(raid_bio)) { struct r5conf *conf = mddev->private; @@ -5477,6 +5478,9 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio) raid_bio = split; } + if (raid_bio->bi_pool != &conf->bio_split) + md_account_bio(mddev, &raid_bio); + if (!raid5_read_one_chunk(mddev, raid_bio)) return raid_bio; @@ -5756,6 +5760,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) DEFINE_WAIT(w); bool do_prepare; bool do_flush = false; + bool do_clone = false; if (unlikely(bi->bi_opf & REQ_PREFLUSH)) { int ret = log_handle_flush_request(conf, bi); @@ -5784,6 +5789,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) if (rw == READ && mddev->degraded == 0 && mddev->reshape_position == MaxSector) { bi = chunk_aligned_read(mddev, bi); + do_clone = true; if (!bi) return true; } @@ -5798,6 +5804,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) last_sector = bio_end_sector(bi); bi->bi_next = NULL; + if (!do_clone) + md_account_bio(mddev, &bi); + prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE); for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) { int previous; From c82aa1b76787c34fd02374e519b6f52cdeb2f54b Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 25 May 2021 17:46:18 +0800 Subject: [PATCH 28/78] md/raid5: move checking badblock before clone bio in raid5_read_one_chunk We don't need to clone bio if the relevant region has badblock. Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/raid5.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 58e9dbc0f683..5a05277f4be7 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5427,6 +5427,13 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) atomic_inc(&rdev->nr_pending); rcu_read_unlock(); + if (is_badblock(rdev, sector, bio_sectors(raid_bio), &first_bad, + &bad_sectors)) { + bio_put(raid_bio); + rdev_dec_pending(rdev, mddev); + return 0; + } + align_bio = bio_clone_fast(raid_bio, GFP_NOIO, &mddev->bio_set); bio_set_dev(align_bio, rdev->bdev); align_bio->bi_end_io = raid5_align_endio; @@ -5435,13 +5442,6 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) raid_bio->bi_next = (void *)rdev; - if (is_badblock(rdev, sector, bio_sectors(align_bio), &first_bad, - &bad_sectors)) { - bio_put(align_bio); - rdev_dec_pending(rdev, mddev); - return 0; - } - /* No reshape active, so we can trust rdev->data_offset */ align_bio->bi_iter.bi_sector += rdev->data_offset; From 1147f58e1010b8688bac1fd3bbab753b1379291d Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 25 May 2021 17:46:19 +0800 Subject: [PATCH 29/78] md/raid5: avoid redundant bio clone in raid5_read_one_chunk After enable io accounting, chunk read bio could be cloned twice which is not good. To avoid such inefficiency, let's clone align_bio from io_acct_set too, then we need only call md_account_bio in make_request unconditionally. Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/raid5.c | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 5a05277f4be7..f83623ac8c34 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5364,11 +5364,13 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf, */ static void raid5_align_endio(struct bio *bi) { - struct bio* raid_bi = bi->bi_private; + struct md_io_acct *md_io_acct = bi->bi_private; + struct bio *raid_bi = md_io_acct->orig_bio; struct mddev *mddev; struct r5conf *conf; struct md_rdev *rdev; blk_status_t error = bi->bi_status; + unsigned long start_time = md_io_acct->start_time; bio_put(bi); @@ -5380,6 +5382,8 @@ static void raid5_align_endio(struct bio *bi) rdev_dec_pending(rdev, conf->mddev); if (!error) { + if (blk_queue_io_stat(raid_bi->bi_bdev->bd_disk->queue)) + bio_end_io_acct(raid_bi, start_time); bio_endio(raid_bi); if (atomic_dec_and_test(&conf->active_aligned_reads)) wake_up(&conf->wait_for_quiescent); @@ -5398,6 +5402,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) struct md_rdev *rdev; sector_t sector, end_sector, first_bad; int bad_sectors, dd_idx; + struct md_io_acct *md_io_acct; if (!in_chunk_boundary(mddev, raid_bio)) { pr_debug("%s: non aligned\n", __func__); @@ -5434,14 +5439,18 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) return 0; } - align_bio = bio_clone_fast(raid_bio, GFP_NOIO, &mddev->bio_set); + align_bio = bio_clone_fast(raid_bio, GFP_NOIO, &mddev->io_acct_set); + md_io_acct = container_of(align_bio, struct md_io_acct, bio_clone); + raid_bio->bi_next = (void *)rdev; + if (blk_queue_io_stat(raid_bio->bi_bdev->bd_disk->queue)) + md_io_acct->start_time = bio_start_io_acct(raid_bio); + md_io_acct->orig_bio = raid_bio; + bio_set_dev(align_bio, rdev->bdev); align_bio->bi_end_io = raid5_align_endio; - align_bio->bi_private = raid_bio; + align_bio->bi_private = md_io_acct; align_bio->bi_iter.bi_sector = sector; - raid_bio->bi_next = (void *)rdev; - /* No reshape active, so we can trust rdev->data_offset */ align_bio->bi_iter.bi_sector += rdev->data_offset; @@ -5468,7 +5477,6 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio) sector_t sector = raid_bio->bi_iter.bi_sector; unsigned chunk_sects = mddev->chunk_sectors; unsigned sectors = chunk_sects - (sector & (chunk_sects-1)); - struct r5conf *conf = mddev->private; if (sectors < bio_sectors(raid_bio)) { struct r5conf *conf = mddev->private; @@ -5478,9 +5486,6 @@ static struct bio *chunk_aligned_read(struct mddev *mddev, struct bio *raid_bio) raid_bio = split; } - if (raid_bio->bi_pool != &conf->bio_split) - md_account_bio(mddev, &raid_bio); - if (!raid5_read_one_chunk(mddev, raid_bio)) return raid_bio; @@ -5760,7 +5765,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) DEFINE_WAIT(w); bool do_prepare; bool do_flush = false; - bool do_clone = false; if (unlikely(bi->bi_opf & REQ_PREFLUSH)) { int ret = log_handle_flush_request(conf, bi); @@ -5789,7 +5793,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) if (rw == READ && mddev->degraded == 0 && mddev->reshape_position == MaxSector) { bi = chunk_aligned_read(mddev, bi); - do_clone = true; if (!bi) return true; } @@ -5804,9 +5807,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi) last_sector = bio_end_sector(bi); bi->bi_next = NULL; - if (!do_clone) - md_account_bio(mddev, &bi); - + md_account_bio(mddev, &bi); prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE); for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) { int previous; From 9b8ae7b938235229ccb112c4e887ff1bcc232836 Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 25 May 2021 17:46:20 +0800 Subject: [PATCH 30/78] md/raid1: rename print_msg with r1bio_existed The caller of raid1_read_request could pass NULL or a valid pointer for "struct r1bio *r1_bio", so it actually means whether r1_bio is existed or not. Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/raid1.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index ced076ba560e..696da6b8b7ed 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -1210,7 +1210,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio, const unsigned long do_sync = (bio->bi_opf & REQ_SYNC); int max_sectors; int rdisk; - bool print_msg = !!r1_bio; + bool r1bio_existed = !!r1_bio; char b[BDEVNAME_SIZE]; /* @@ -1220,7 +1220,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio, */ gfp_t gfp = r1_bio ? (GFP_NOIO | __GFP_HIGH) : GFP_NOIO; - if (print_msg) { + if (r1bio_existed) { /* Need to get the block device name carefully */ struct md_rdev *rdev; rcu_read_lock(); @@ -1252,7 +1252,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio, if (rdisk < 0) { /* couldn't find anywhere to read from */ - if (print_msg) { + if (r1bio_existed) { pr_crit_ratelimited("md/raid1:%s: %s: unrecoverable I/O read error for block %llu\n", mdname(mddev), b, @@ -1263,7 +1263,7 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio, } mirror = conf->mirrors + rdisk; - if (print_msg) + if (r1bio_existed) pr_info_ratelimited("md/raid1:%s: redirecting sector %llu to other mirror: %s\n", mdname(mddev), (unsigned long long)r1_bio->sector, From a0159832e51e3af03b89ecc5d6b9db451e529b5f Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 25 May 2021 17:46:21 +0800 Subject: [PATCH 31/78] md/raid1: enable io accounting For raid1, we record the start time between split bio and clone bio, and finish the accounting in the final endio. Also introduce start_time in r1bio accordingly. Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/raid1.c | 7 +++++++ drivers/md/raid1.h | 1 + 2 files changed, 8 insertions(+) diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 696da6b8b7ed..51f2547c2007 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -300,6 +300,8 @@ static void call_bio_endio(struct r1bio *r1_bio) if (!test_bit(R1BIO_Uptodate, &r1_bio->state)) bio->bi_status = BLK_STS_IOERR; + if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue)) + bio_end_io_acct(bio, r1_bio->start_time); bio_endio(bio); } @@ -1292,6 +1294,9 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio, r1_bio->read_disk = rdisk; + if (!r1bio_existed && blk_queue_io_stat(bio->bi_bdev->bd_disk->queue)) + r1_bio->start_time = bio_start_io_acct(bio); + read_bio = bio_clone_fast(bio, gfp, &mddev->bio_set); r1_bio->bios[rdisk] = read_bio; @@ -1461,6 +1466,8 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio, r1_bio->sectors = max_sectors; } + if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue)) + r1_bio->start_time = bio_start_io_acct(bio); atomic_set(&r1_bio->remaining, 1); atomic_set(&r1_bio->behind_remaining, 0); diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h index b7eb09e8c025..ccf10e59b116 100644 --- a/drivers/md/raid1.h +++ b/drivers/md/raid1.h @@ -158,6 +158,7 @@ struct r1bio { sector_t sector; int sectors; unsigned long state; + unsigned long start_time; struct mddev *mddev; /* * original bio going to /dev/mdx From 528bc2cf2fccef2c2c17263f9932094bf81fee5a Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 25 May 2021 17:46:22 +0800 Subject: [PATCH 32/78] md/raid10: enable io accounting For raid10, we record the start time between split bio and clone bio, and finish the accounting in the final endio. Also introduce start_time in r10bio accordingly. Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/raid10.c | 6 ++++++ drivers/md/raid10.h | 1 + 2 files changed, 7 insertions(+) diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index 13f5e6b2a73d..16977e8e075d 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -297,6 +297,8 @@ static void raid_end_bio_io(struct r10bio *r10_bio) if (!test_bit(R10BIO_Uptodate, &r10_bio->state)) bio->bi_status = BLK_STS_IOERR; + if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue)) + bio_end_io_acct(bio, r10_bio->start_time); bio_endio(bio); /* * Wake up any possible resync thread that waits for the device @@ -1184,6 +1186,8 @@ static void raid10_read_request(struct mddev *mddev, struct bio *bio, } slot = r10_bio->read_slot; + if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue)) + r10_bio->start_time = bio_start_io_acct(bio); read_bio = bio_clone_fast(bio, gfp, &mddev->bio_set); r10_bio->devs[slot].bio = read_bio; @@ -1483,6 +1487,8 @@ static void raid10_write_request(struct mddev *mddev, struct bio *bio, r10_bio->master_bio = bio; } + if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue)) + r10_bio->start_time = bio_start_io_acct(bio); atomic_set(&r10_bio->remaining, 1); md_bitmap_startwrite(mddev->bitmap, r10_bio->sector, r10_bio->sectors, 0); diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h index 1461fd55311b..c34bb196790e 100644 --- a/drivers/md/raid10.h +++ b/drivers/md/raid10.h @@ -124,6 +124,7 @@ struct r10bio { sector_t sector; /* virtual sector number */ int sectors; unsigned long state; + unsigned long start_time; struct mddev *mddev; /* * original bio going to /dev/mdx From 608f52e30aae7dc8da836e5b7b112d50a2d00e43 Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Tue, 25 May 2021 17:46:23 +0800 Subject: [PATCH 33/78] md: mark some personalities as deprecated Mark the three personalities (linear, fault and multipath) as deprecated because: 1. people can use dm multipath or nvme multipath. 2. linear is already deprecated in MODULE_ALIAS. 3. no one actively using fault. Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/Kconfig | 6 +++--- drivers/md/md-faulty.c | 2 +- drivers/md/md-linear.c | 2 +- drivers/md/md-multipath.c | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index f2014385d48b..0602e82a9516 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -47,7 +47,7 @@ config MD_AUTODETECT If unsure, say Y. config MD_LINEAR - tristate "Linear (append) mode" + tristate "Linear (append) mode (deprecated)" depends on BLK_DEV_MD help If you say Y here, then your multiple devices driver will be able to @@ -158,7 +158,7 @@ config MD_RAID456 If unsure, say Y. config MD_MULTIPATH - tristate "Multipath I/O support" + tristate "Multipath I/O support (deprecated)" depends on BLK_DEV_MD help MD_MULTIPATH provides a simple multi-path personality for use @@ -169,7 +169,7 @@ config MD_MULTIPATH If unsure, say N. config MD_FAULTY - tristate "Faulty test module for MD" + tristate "Faulty test module for MD (deprecated)" depends on BLK_DEV_MD help The "faulty" module allows for a block device that occasionally returns diff --git a/drivers/md/md-faulty.c b/drivers/md/md-faulty.c index fda4cb3f936f..c0dc6f2ef4a3 100644 --- a/drivers/md/md-faulty.c +++ b/drivers/md/md-faulty.c @@ -357,7 +357,7 @@ static void raid_exit(void) module_init(raid_init); module_exit(raid_exit); MODULE_LICENSE("GPL"); -MODULE_DESCRIPTION("Fault injection personality for MD"); +MODULE_DESCRIPTION("Fault injection personality for MD (deprecated)"); MODULE_ALIAS("md-personality-10"); /* faulty */ MODULE_ALIAS("md-faulty"); MODULE_ALIAS("md-level--5"); diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c index 63ed8329a98d..1ff51647a682 100644 --- a/drivers/md/md-linear.c +++ b/drivers/md/md-linear.c @@ -312,7 +312,7 @@ static void linear_exit (void) module_init(linear_init); module_exit(linear_exit); MODULE_LICENSE("GPL"); -MODULE_DESCRIPTION("Linear device concatenation personality for MD"); +MODULE_DESCRIPTION("Linear device concatenation personality for MD (deprecated)"); MODULE_ALIAS("md-personality-1"); /* LINEAR - deprecated*/ MODULE_ALIAS("md-linear"); MODULE_ALIAS("md-level--1"); diff --git a/drivers/md/md-multipath.c b/drivers/md/md-multipath.c index 776bbe542db5..e7d6486f090f 100644 --- a/drivers/md/md-multipath.c +++ b/drivers/md/md-multipath.c @@ -471,7 +471,7 @@ static void __exit multipath_exit (void) module_init(multipath_init); module_exit(multipath_exit); MODULE_LICENSE("GPL"); -MODULE_DESCRIPTION("simple multi-path personality for MD"); +MODULE_DESCRIPTION("simple multi-path personality for MD (deprecated)"); MODULE_ALIAS("md-personality-7"); /* MULTIPATH */ MODULE_ALIAS("md-multipath"); MODULE_ALIAS("md-level--4"); From c32dc04059c79ddb4f7cff94ad5de6e92ea2218d Mon Sep 17 00:00:00 2001 From: Rikard Falkeborn Date: Sat, 29 May 2021 12:30:49 +0200 Subject: [PATCH 34/78] md: Constify attribute_group structs The attribute_group structs are never modified, they're only passed to sysfs_create_group() and sysfs_remove_group(). Make them const to allow the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn Signed-off-by: Song Liu --- drivers/md/md-bitmap.c | 2 +- drivers/md/md.c | 6 +++--- drivers/md/md.h | 4 ++-- drivers/md/raid5.c | 2 +- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index ea3130e11680..e29c6298ef5c 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -2616,7 +2616,7 @@ static struct attribute *md_bitmap_attrs[] = { &max_backlog_used.attr, NULL }; -struct attribute_group md_bitmap_group = { +const struct attribute_group md_bitmap_group = { .name = "bitmap", .attrs = md_bitmap_attrs, }; diff --git a/drivers/md/md.c b/drivers/md/md.c index 843e13666e3f..32abcfb8bcad 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -785,7 +785,7 @@ out_free_new: return ERR_PTR(error); } -static struct attribute_group md_redundancy_group; +static const struct attribute_group md_redundancy_group; void mddev_unlock(struct mddev *mddev) { @@ -802,7 +802,7 @@ void mddev_unlock(struct mddev *mddev) * test it under the same mutex to ensure its correct value * is seen. */ - struct attribute_group *to_remove = mddev->to_remove; + const struct attribute_group *to_remove = mddev->to_remove; mddev->to_remove = NULL; mddev->sysfs_active = 1; mutex_unlock(&mddev->reconfig_mutex); @@ -5500,7 +5500,7 @@ static struct attribute *md_redundancy_attrs[] = { &md_degraded.attr, NULL, }; -static struct attribute_group md_redundancy_group = { +static const struct attribute_group md_redundancy_group = { .name = NULL, .attrs = md_redundancy_attrs, }; diff --git a/drivers/md/md.h b/drivers/md/md.h index 4191f22acce4..b9b7d2f992f3 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -481,7 +481,7 @@ struct mddev { atomic_t max_corr_read_errors; /* max read retries */ struct list_head all_mddevs; - struct attribute_group *to_remove; + const struct attribute_group *to_remove; struct bio_set bio_set; struct bio_set sync_set; /* for sync operations like @@ -613,7 +613,7 @@ struct md_sysfs_entry { ssize_t (*show)(struct mddev *, char *); ssize_t (*store)(struct mddev *, const char *, size_t); }; -extern struct attribute_group md_bitmap_group; +extern const struct attribute_group md_bitmap_group; static inline struct kernfs_node *sysfs_get_dirent_safe(struct kernfs_node *sd, char *name) { diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index f83623ac8c34..0ee9aa0113f3 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -6940,7 +6940,7 @@ static struct attribute *raid5_attrs[] = { &ppl_write_hint.attr, NULL, }; -static struct attribute_group raid5_attrs_group = { +static const struct attribute_group raid5_attrs_group = { .name = NULL, .attrs = raid5_attrs, }; From daee2024715ddf430a069c0c4eab8417146934cf Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Thu, 3 Jun 2021 17:21:06 +0800 Subject: [PATCH 35/78] md: check level before create and exit io_acct_set The bio_set (io_acct_set) is used by personalities to clone bio and trace the timestamp of bio. Some personalities such as raid1/10 don't need the bio_set, so add check to not create it unconditionally. Also update the comment for md_account_bio to make it more clear. Suggested-by: Christoph Hellwig Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/md.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/drivers/md/md.c b/drivers/md/md.c index 32abcfb8bcad..56b606184c87 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2341,7 +2341,8 @@ int md_integrity_register(struct mddev *mddev) pr_debug("md: data integrity enabled on %s\n", mdname(mddev)); if (bioset_integrity_create(&mddev->bio_set, BIO_POOL_SIZE) || - bioset_integrity_create(&mddev->io_acct_set, BIO_POOL_SIZE)) { + (mddev->level != 1 && mddev->level != 10 && + bioset_integrity_create(&mddev->io_acct_set, BIO_POOL_SIZE))) { pr_err("md: failed to create integrity pool for %s\n", mdname(mddev)); return -EINVAL; @@ -5570,7 +5571,8 @@ static void md_free(struct kobject *ko) bioset_exit(&mddev->bio_set); bioset_exit(&mddev->sync_set); - bioset_exit(&mddev->io_acct_set); + if (mddev->level != 1 && mddev->level != 10) + bioset_exit(&mddev->io_acct_set); kfree(mddev); } @@ -5866,7 +5868,8 @@ int md_run(struct mddev *mddev) if (err) goto exit_bio_set; } - if (!bioset_initialized(&mddev->io_acct_set)) { + if (mddev->level != 1 && mddev->level != 10 && + !bioset_initialized(&mddev->io_acct_set)) { err = bioset_init(&mddev->io_acct_set, BIO_POOL_SIZE, offsetof(struct md_io_acct, bio_clone), 0); if (err) @@ -6048,7 +6051,8 @@ bitmap_abort: module_put(pers->owner); md_bitmap_destroy(mddev); abort: - bioset_exit(&mddev->io_acct_set); + if (mddev->level != 1 && mddev->level != 10) + bioset_exit(&mddev->io_acct_set); exit_sync_set: bioset_exit(&mddev->sync_set); exit_bio_set: @@ -6276,7 +6280,8 @@ void md_stop(struct mddev *mddev) __md_stop(mddev); bioset_exit(&mddev->bio_set); bioset_exit(&mddev->sync_set); - bioset_exit(&mddev->io_acct_set); + if (mddev->level != 1 && mddev->level != 10) + bioset_exit(&mddev->io_acct_set); } EXPORT_SYMBOL_GPL(md_stop); @@ -8593,7 +8598,10 @@ static void md_end_io_acct(struct bio *bio) bio_endio(orig_bio); } -/* used by personalities (raid0 and raid5) to account io stats */ +/* + * Used by personalities that don't already clone the bio and thus can't + * easily add the timestamp to their extended bio structure. + */ void md_account_bio(struct mddev *mddev, struct bio **bio) { struct md_io_acct *md_io_acct; From de3ea66e9d23a34eef5e17f960d6473f78a1c54b Mon Sep 17 00:00:00 2001 From: Guoqing Jiang Date: Thu, 3 Jun 2021 17:21:07 +0800 Subject: [PATCH 36/78] md: add comments in md_integrity_register Given it is not obvious for the error handling, let's try to add some comments here to make it clear. Signed-off-by: Guoqing Jiang Signed-off-by: Song Liu --- drivers/md/md.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/md/md.c b/drivers/md/md.c index 56b606184c87..2c69905dd5c0 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -2343,6 +2343,12 @@ int md_integrity_register(struct mddev *mddev) if (bioset_integrity_create(&mddev->bio_set, BIO_POOL_SIZE) || (mddev->level != 1 && mddev->level != 10 && bioset_integrity_create(&mddev->io_acct_set, BIO_POOL_SIZE))) { + /* + * No need to handle the failure of bioset_integrity_create, + * because the function is called by md_run() -> pers->run(), + * md_run calls bioset_exit -> bioset_integrity_free in case + * of failure case. + */ pr_err("md: failed to create integrity pool for %s\n", mdname(mddev)); return -EINVAL; From 97ae27252f4962d0fcc38ee1d9f913d817a2024e Mon Sep 17 00:00:00 2001 From: Gal Ofri Date: Mon, 7 Jun 2021 14:07:03 +0300 Subject: [PATCH 37/78] md/raid5: avoid device_lock in read_one_chunk() There is a lock contention on device_lock in read_one_chunk(). device_lock is taken to sync conf->active_aligned_reads and conf->quiesce. read_one_chunk() takes the lock, then waits for quiesce=0 (resumed) before incrementing active_aligned_reads. raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits for active_aligned_reads to be zero before setting quiesce=1 (suspended). Introduce a fast (lockless) path in read_one_chunk(): activate aligned read without taking device_lock. In case quiesce starts while activating the aligned-read in fast path, deactivate it and revert to old behavior (take device_lock and wait for quiesce to finish). Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to gaurantee that read_one_chunk() does not miss an ongoing quiesce. My setups: 1. 8 local nvme drives (each up to 250k iops). 2. 8 ram disks (brd). Each setup with raid6 (6+2), 1024 io threads on a 96 cpu-cores (48 per socket) system. Record both iops and cpu spent on this contention with rand-read-4k. Record bw with sequential-read-128k. Note: in most cases cpu is still busy but due to "new" bottlenecks. nvme: | iops | cpu | bw ----------------------------------------------- without patch | 1.6M | ~50% | 5.5GB/s with patch | 2M (throttled) | 0% | 16GB/s (throttled) ram (brd): | iops | cpu | bw ----------------------------------------------- without patch | 2M | ~80% | 24GB/s with patch | 4M | 0% | 55GB/s CC: Song Liu CC: Neil Brown Reviewed-by: NeilBrown Signed-off-by: Gal Ofri Signed-off-by: Song Liu --- drivers/md/raid5.c | 29 +++++++++++++++++++++++------ 1 file changed, 23 insertions(+), 6 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 0ee9aa0113f3..e248532bb70a 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -5403,6 +5403,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) sector_t sector, end_sector, first_bad; int bad_sectors, dd_idx; struct md_io_acct *md_io_acct; + bool did_inc; if (!in_chunk_boundary(mddev, raid_bio)) { pr_debug("%s: non aligned\n", __func__); @@ -5454,11 +5455,24 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio) /* No reshape active, so we can trust rdev->data_offset */ align_bio->bi_iter.bi_sector += rdev->data_offset; - spin_lock_irq(&conf->device_lock); - wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0, - conf->device_lock); - atomic_inc(&conf->active_aligned_reads); - spin_unlock_irq(&conf->device_lock); + did_inc = false; + if (conf->quiesce == 0) { + atomic_inc(&conf->active_aligned_reads); + did_inc = true; + } + /* need a memory barrier to detect the race with raid5_quiesce() */ + if (!did_inc || smp_load_acquire(&conf->quiesce) != 0) { + /* quiesce is in progress, so we need to undo io activation and wait + * for it to finish + */ + if (did_inc && atomic_dec_and_test(&conf->active_aligned_reads)) + wake_up(&conf->wait_for_quiescent); + spin_lock_irq(&conf->device_lock); + wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0, + conf->device_lock); + atomic_inc(&conf->active_aligned_reads); + spin_unlock_irq(&conf->device_lock); + } if (mddev->gendisk) trace_block_bio_remap(align_bio, disk_devt(mddev->gendisk), @@ -8346,7 +8360,10 @@ static void raid5_quiesce(struct mddev *mddev, int quiesce) * active stripes can drain */ r5c_flush_cache(conf, INT_MAX); - conf->quiesce = 2; + /* need a memory barrier to make sure read_one_chunk() sees + * quiesce started and reverts to slow (locked) path. + */ + smp_store_release(&conf->quiesce, 2); wait_event_cmd(conf->wait_for_quiescent, atomic_read(&conf->active_stripes) == 0 && atomic_read(&conf->active_aligned_reads) == 0, From 30ab5db7ee787c88236376ce6c88b53d613fcae2 Mon Sep 17 00:00:00 2001 From: Jiapeng Chong Date: Fri, 30 Apr 2021 17:26:45 +0800 Subject: [PATCH 38/78] floppy: cleanup: remove redundant assignment to nr_sectors Variable nr_sectors is set to zero but this value is never read as it is overwritten later on, hence it is a redundant assignment and can be removed. Clean up the following clang-analyzer warning: drivers/block/floppy.c:2333:2: warning: Value stored to 'nr_sectors' is never read [clang-analyzer-deadcode.DeadStores]. Link: https://lore.kernel.org/r/1619774805-121562-1-git-send-email-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot Signed-off-by: Jiapeng Chong Signed-off-by: Denis Efremov --- drivers/block/floppy.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index 8a9d22207c59..e96ad5b2c35e 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -2330,7 +2330,6 @@ static void rw_interrupt(void) if (!drive_state[current_drive].first_read_date) drive_state[current_drive].first_read_date = jiffies; - nr_sectors = 0; ssize = DIV_ROUND_UP(1 << raw_cmd->cmd[SIZECODE], 4); if (reply_buffer[ST1] & ST1_EOC) From 2c9bdf6e4771a5966a4f0d6bea45a1c7f38312d7 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Fri, 28 May 2021 15:03:35 -0500 Subject: [PATCH 39/78] floppy: Fix fall-through warning for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a break statement instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Link: https://lore.kernel.org/linux-hardening/47bcd36a-6524-348b-e802-0691d1b3c429@kernel.dk/ Suggested-by: Jens Axboe Signed-off-by: Gustavo A. R. Silva Signed-off-by: Denis Efremov --- drivers/block/floppy.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index e96ad5b2c35e..cde70b0a55dd 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -2123,6 +2123,7 @@ static void format_interrupt(void) switch (interpret_errors()) { case 1: cont->error(); + break; case 2: break; case 0: From 2744d7a0733503931b71c00d156119ced002f22c Mon Sep 17 00:00:00 2001 From: Mario Limonciello Date: Wed, 9 Jun 2021 13:40:17 -0500 Subject: [PATCH 40/78] ACPI: Check StorageD3Enable _DSD property in ACPI code Although first implemented for NVME, this check may be usable by other drivers as well. Microsoft's specification explicitly mentions that is may be usable by SATA and AHCI devices. Google also indicates that they have used this with SDHCI in a downstream kernel tree that a user can plug a storage device into. Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Suggested-by: Keith Busch CC: Shyam-sundar S-k CC: Alexander Deucher CC: Rafael J. Wysocki CC: Prike Liang Signed-off-by: Mario Limonciello Reviewed-by: Rafael J. Wysocki Signed-off-by: Christoph Hellwig --- drivers/acpi/device_pm.c | 29 +++++++++++++++++++++++++++++ drivers/nvme/host/pci.c | 28 +--------------------------- include/linux/acpi.h | 5 +++++ 3 files changed, 35 insertions(+), 27 deletions(-) diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index d260bc1f3e6e..d76ab50c71dc 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -1340,4 +1340,33 @@ int acpi_dev_pm_attach(struct device *dev, bool power_on) return 1; } EXPORT_SYMBOL_GPL(acpi_dev_pm_attach); + +/** + * acpi_storage_d3 - Check if D3 should be used in the suspend path + * @dev: Device to check + * + * Return %true if the platform firmware wants @dev to be programmed + * into D3hot or D3cold (if supported) in the suspend path, or %false + * when there is no specific preference. On some platforms, if this + * hint is ignored, @dev may remain unresponsive after suspending the + * platform as a whole. + * + * Although the property has storage in the name it actually is + * applied to the PCIe slot and plugging in a non-storage device the + * same platform restrictions will likely apply. + */ +bool acpi_storage_d3(struct device *dev) +{ + struct acpi_device *adev = ACPI_COMPANION(dev); + u8 val; + + if (!adev) + return false; + if (fwnode_property_read_u8(acpi_fwnode_handle(adev), "StorageD3Enable", + &val)) + return false; + return val == 1; +} +EXPORT_SYMBOL_GPL(acpi_storage_d3); + #endif /* CONFIG_PM */ diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 3aa7245a505f..8fbc4c87a0d8 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -2828,32 +2828,6 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev) return 0; } -#ifdef CONFIG_ACPI -static bool nvme_acpi_storage_d3(struct pci_dev *dev) -{ - struct acpi_device *adev = ACPI_COMPANION(&dev->dev); - u8 val; - - /* - * Look for _DSD property specifying that the storage device on the port - * must use D3 to support deep platform power savings during - * suspend-to-idle. - */ - - if (!adev) - return false; - if (fwnode_property_read_u8(acpi_fwnode_handle(adev), "StorageD3Enable", - &val)) - return false; - return val == 1; -} -#else -static inline bool nvme_acpi_storage_d3(struct pci_dev *dev) -{ - return false; -} -#endif /* CONFIG_ACPI */ - static void nvme_async_probe(void *data, async_cookie_t cookie) { struct nvme_dev *dev = data; @@ -2903,7 +2877,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id) quirks |= check_vendor_combination_bug(pdev); - if (!noacpi && nvme_acpi_storage_d3(pdev)) { + if (!noacpi && acpi_storage_d3(&pdev->dev)) { /* * Some systems use a bios work around to ask for D3 on * platforms that support kernel managed suspend. diff --git a/include/linux/acpi.h b/include/linux/acpi.h index c60745f657e9..dd0dafd21e33 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1004,6 +1004,7 @@ int acpi_dev_resume(struct device *dev); int acpi_subsys_runtime_suspend(struct device *dev); int acpi_subsys_runtime_resume(struct device *dev); int acpi_dev_pm_attach(struct device *dev, bool power_on); +bool acpi_storage_d3(struct device *dev); #else static inline int acpi_subsys_runtime_suspend(struct device *dev) { return 0; } static inline int acpi_subsys_runtime_resume(struct device *dev) { return 0; } @@ -1011,6 +1012,10 @@ static inline int acpi_dev_pm_attach(struct device *dev, bool power_on) { return 0; } +static inline bool acpi_storage_d3(struct device *dev) +{ + return false; +} #endif #if defined(CONFIG_ACPI) && defined(CONFIG_PM_SLEEP) From 6485fc18faa01e8845b1e5bb55118e633f84d1f2 Mon Sep 17 00:00:00 2001 From: Mario Limonciello Date: Wed, 9 Jun 2021 13:40:18 -0500 Subject: [PATCH 41/78] ACPI: Add quirks for AMD Renoir/Lucienne CPUs to force the D3 hint AMD systems from Renoir and Lucienne require that the NVME controller is put into D3 over a Modern Standby / suspend-to-idle cycle. This is "typically" accomplished using the `StorageD3Enable` property in the _DSD, but this property was introduced after many of these systems launched and most OEM systems don't have it in their BIOS. On AMD Renoir without these drives going into D3 over suspend-to-idle the resume will fail with the NVME controller being reset and a trace like this in the kernel logs: ``` [ 83.556118] nvme nvme0: I/O 161 QID 2 timeout, aborting [ 83.556178] nvme nvme0: I/O 162 QID 2 timeout, aborting [ 83.556187] nvme nvme0: I/O 163 QID 2 timeout, aborting [ 83.556196] nvme nvme0: I/O 164 QID 2 timeout, aborting [ 95.332114] nvme nvme0: I/O 25 QID 0 timeout, reset controller [ 95.332843] nvme nvme0: Abort status: 0x371 [ 95.332852] nvme nvme0: Abort status: 0x371 [ 95.332856] nvme nvme0: Abort status: 0x371 [ 95.332859] nvme nvme0: Abort status: 0x371 [ 95.332909] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -16 [ 95.332936] nvme 0000:03:00.0: PM: failed to resume async: error -16 ``` The Microsoft documentation for StorageD3Enable mentioned that Windows has a hardcoded allowlist for D3 support, which was used for these platforms. Introduce quirks to hardcode them for Linux as well. As this property is now "standardized", OEM systems using AMD Cezanne and newer APU's have adopted this property, and quirks like this should not be necessary. CC: Shyam-sundar S-k CC: Alexander Deucher CC: Prike Liang Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro Signed-off-by: Mario Limonciello Acked-by: Rafael J. Wysocki Tested-by: Julian Sikorski Signed-off-by: Christoph Hellwig --- drivers/acpi/device_pm.c | 3 +++ drivers/acpi/internal.h | 9 +++++++++ drivers/acpi/x86/utils.c | 25 +++++++++++++++++++++++++ 3 files changed, 37 insertions(+) diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index d76ab50c71dc..6dd9bd64903e 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -1360,6 +1360,9 @@ bool acpi_storage_d3(struct device *dev) struct acpi_device *adev = ACPI_COMPANION(dev); u8 val; + if (force_storage_d3()) + return true; + if (!adev) return false; if (fwnode_property_read_u8(acpi_fwnode_handle(adev), "StorageD3Enable", diff --git a/drivers/acpi/internal.h b/drivers/acpi/internal.h index f973bbe90e5e..e29ec463bb07 100644 --- a/drivers/acpi/internal.h +++ b/drivers/acpi/internal.h @@ -236,6 +236,15 @@ static inline int suspend_nvs_save(void) { return 0; } static inline void suspend_nvs_restore(void) {} #endif +#ifdef CONFIG_X86 +bool force_storage_d3(void); +#else +static inline bool force_storage_d3(void) +{ + return false; +} +#endif + /*-------------------------------------------------------------------------- Device properties -------------------------------------------------------------------------- */ diff --git a/drivers/acpi/x86/utils.c b/drivers/acpi/x86/utils.c index bdc1ba00aee9..f22f23933063 100644 --- a/drivers/acpi/x86/utils.c +++ b/drivers/acpi/x86/utils.c @@ -135,3 +135,28 @@ bool acpi_device_always_present(struct acpi_device *adev) return ret; } + +/* + * AMD systems from Renoir and Lucienne *require* that the NVME controller + * is put into D3 over a Modern Standby / suspend-to-idle cycle. + * + * This is "typically" accomplished using the `StorageD3Enable` + * property in the _DSD that is checked via the `acpi_storage_d3` function + * but this property was introduced after many of these systems launched + * and most OEM systems don't have it in their BIOS. + * + * The Microsoft documentation for StorageD3Enable mentioned that Windows has + * a hardcoded allowlist for D3 support, which was used for these platforms. + * + * This allows quirking on Linux in a similar fashion. + */ +static const struct x86_cpu_id storage_d3_cpu_ids[] = { + X86_MATCH_VENDOR_FAM_MODEL(AMD, 23, 96, NULL), /* Renoir */ + X86_MATCH_VENDOR_FAM_MODEL(AMD, 23, 104, NULL), /* Lucienne */ + {} +}; + +bool force_storage_d3(void) +{ + return x86_match_cpu(storage_d3_cpu_ids); +} From 120bb3624d55d65145f7c1bf12a839fd323cde29 Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Mon, 7 Jun 2021 10:56:56 +0200 Subject: [PATCH 42/78] nvme: verify MNAN value if ANA is enabled The controller is required to have a non-zero MNAN value if it supports ANA: If the controller supports Asymmetric Namespace Access Reporting, then this field shall be set to a non-zero value that is less than or equal to the NN value. Reviewed-by: Hannes Reinecke Reviewed-by: Chaitanya Kulkarni Signed-off-by: Daniel Wagner Signed-off-by: Christoph Hellwig --- drivers/nvme/host/multipath.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 127a17b4c13d..98426234d416 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -818,6 +818,13 @@ int nvme_mpath_init_identify(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id) !(ctrl->subsys->cmic & NVME_CTRL_CMIC_ANA)) return 0; + if (!ctrl->max_namespaces || + ctrl->max_namespaces > le32_to_cpu(id->nn)) { + dev_err(ctrl->device, + "Invalid MNAN value %u\n", ctrl->max_namespaces); + return -EINVAL; + } + ctrl->anacap = id->anacap; ctrl->anatt = id->anatt; ctrl->nanagrpid = le32_to_cpu(id->nanagrpid); From 2411424143bdfad3027e82fe6a66c5aadce271ee Mon Sep 17 00:00:00 2001 From: Daniel Wagner Date: Mon, 7 Jun 2021 10:46:51 +0200 Subject: [PATCH 43/78] nvme: remove superfluous bio_set_dev in nvme_requeue_work Commit ce86dad222e9 ("nvme-multipath: reset bdev to ns head when failover") moved the reset code where the bio is added to the requeue_list for the failover path. But it left the original bio_set_dev in nvme_requeue_work. There is a second path to nvme_requee_work. It is via nvme_ns_head_submit_bio. Though we don't have to set bio->bi_bdev for this path either, as it points to the correct bdev already. Let's remove the bio_set_dev. It's updating the bio->bi_bdev with the same pointer and thus it's unnecessary. Signed-off-by: Daniel Wagner Signed-off-by: Christoph Hellwig --- drivers/nvme/host/multipath.c | 5 ----- 1 file changed, 5 deletions(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 98426234d416..23573fe3fc7d 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -435,11 +435,6 @@ static void nvme_requeue_work(struct work_struct *work) next = bio->bi_next; bio->bi_next = NULL; - /* - * Reset disk to the mpath node and resubmit to select a new - * path. - */ - bio_set_dev(bio, head->disk->part0); submit_bio_noacct(bio); } } From d399742cd02dca6d1ed17ae7db7a366192516591 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Mon, 14 Jun 2021 16:16:07 +0200 Subject: [PATCH 44/78] nvme: fix grammar in the CONFIG_NVME_MULTIPATH kconfig help text Fix a singular/plural mismatch in the CONFIG_NVME_MULTIPATH help text. Signed-off-by: Geert Uytterhoeven Signed-off-by: Christoph Hellwig --- drivers/nvme/host/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index a44d49d63968..102292289cdf 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -21,7 +21,7 @@ config NVME_MULTIPATH help This option enables support for multipath access to NVMe subsystems. If this option is enabled only a single - /dev/nvmeXnY device will show up for each NVMe namespaces, + /dev/nvmeXnY device will show up for each NVMe namespace, even if it is accessible through multiple controllers. config NVME_HWMON From e7d4b5493a2d5a6225fc572e01167e12f89c6a42 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Mon, 7 Jun 2021 12:54:50 -0700 Subject: [PATCH 45/78] nvme: factor out a nvme_validate_passthru_nsid helper Add a helper nvme_validate_passthru_nsid() to validate the nsid that removes the nsid validation and error message print code from nvme_user_cmd() and nvme_user_cmd64(). Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/ioctl.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index 2e7780ea0354..d93928d1e5bd 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -177,6 +177,20 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio) metadata, meta_len, lower_32_bits(io.slba), NULL, 0); } +static bool nvme_validate_passthru_nsid(struct nvme_ctrl *ctrl, + struct nvme_ns *ns, __u32 nsid) +{ + if (ns && nsid != ns->head->ns_id) { + dev_err(ctrl->device, + "%s: nsid (%u) in cmd does not match nsid (%u)" + "of namespace\n", + current->comm, nsid, ns->head->ns_id); + return false; + } + + return true; +} + static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, struct nvme_passthru_cmd __user *ucmd) { @@ -192,12 +206,8 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns, return -EFAULT; if (cmd.flags) return -EINVAL; - if (ns && cmd.nsid != ns->head->ns_id) { - dev_err(ctrl->device, - "%s: nsid (%u) in cmd does not match nsid (%u) of namespace\n", - current->comm, cmd.nsid, ns->head->ns_id); + if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid)) return -EINVAL; - } memset(&c, 0, sizeof(c)); c.common.opcode = cmd.opcode; @@ -242,12 +252,8 @@ static int nvme_user_cmd64(struct nvme_ctrl *ctrl, struct nvme_ns *ns, return -EFAULT; if (cmd.flags) return -EINVAL; - if (ns && cmd.nsid != ns->head->ns_id) { - dev_err(ctrl->device, - "%s: nsid (%u) in cmd does not match nsid (%u) of namespace\n", - current->comm, cmd.nsid, ns->head->ns_id); + if (!nvme_validate_passthru_nsid(ctrl, ns, cmd.nsid)) return -EINVAL; - } memset(&c, 0, sizeof(c)); c.common.opcode = cmd.opcode; From 522af60cb2f8e3658bda1902fb7f200dcf888a5c Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Sat, 5 Jun 2021 15:48:16 +0300 Subject: [PATCH 46/78] nvme-tcp: fix error codes in nvme_tcp_setup_ctrl() These error paths currently return success but they should return -EOPNOTSUPP. Fixes: 73ffcefcfca0 ("nvme-tcp: check sgl supported by target") Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Dan Carpenter Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/tcp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 5fc6c568c626..6a65b0516180 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -1988,11 +1988,13 @@ static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new) return ret; if (ctrl->icdoff) { + ret = -EOPNOTSUPP; dev_err(ctrl->device, "icdoff is not supported!\n"); goto destroy_admin; } if (!(ctrl->sgls & ((1 << 0) | (1 << 1)))) { + ret = -EOPNOTSUPP; dev_err(ctrl->device, "Mandatory sgls are not supported!\n"); goto destroy_admin; } From a0aac973a26d1ac814b9e131e209eb39472a67ce Mon Sep 17 00:00:00 2001 From: JK Kim Date: Thu, 17 Jun 2021 15:02:17 +0900 Subject: [PATCH 47/78] nvme-pci: fix var. type for increasing cq_head nvmeq->cq_head is compared with nvmeq->q_depth and changed the value and cq_phase for handling the next cq db. but, nvmeq->q_depth's type is u32 and max. value is 0x10000 when CQP.MSQE is 0xffff and io_queue_depth is 0x10000. current temp. variable for comparing with nvmeq->q_depth is overflowed when previous nvmeq->cq_head is 0xffff. in this case, nvmeq->cq_phase is not updated. so, fix data type for temp. variable to u32. Signed-off-by: JK Kim Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 8fbc4c87a0d8..5a72bdf5ad03 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1032,7 +1032,7 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx) static inline void nvme_update_cq_head(struct nvme_queue *nvmeq) { - u16 tmp = nvmeq->cq_head + 1; + u32 tmp = nvmeq->cq_head + 1; if (tmp == nvmeq->q_depth) { nvmeq->cq_head = 0; From cb1b10e7ac6c1438247ee3c7e4a2f2332a77ba07 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Mon, 7 Jun 2021 12:54:54 -0700 Subject: [PATCH 48/78] nvme-pci: remove trailing lines for helpers Remove the extra white line at the end of the functions. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 5a72bdf5ad03..138e7e7453dd 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -559,7 +559,6 @@ static void nvme_free_prps(struct nvme_dev *dev, struct request *req) dma_pool_free(dev->prp_page_pool, prp_list, dma_addr); dma_addr = next_dma_addr; } - } static void nvme_free_sgls(struct nvme_dev *dev, struct request *req) @@ -576,7 +575,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct request *req) dma_pool_free(dev->prp_page_pool, sg_list, dma_addr); dma_addr = next_dma_addr; } - } static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req) From 73eefc270afa1f27d82c42fdb34562d07a834b40 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:28:23 -0700 Subject: [PATCH 49/78] nvme: add a helper to check ctrl sgl support For various transports such as fc/tcp/pci it is common to check if NVMe SGLs are supported or not by the controller. In this preparation patch we add a helper to avoid the open coding of such checks in the various transport. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/nvme.h | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 1f397ecba16c..75420ceacc10 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -869,6 +869,11 @@ static inline void nvme_hwmon_exit(struct nvme_ctrl *ctrl) } #endif +static inline bool nvme_ctrl_sgl_supported(struct nvme_ctrl *ctrl) +{ + return ctrl->sgls & ((1 << 0) | (1 << 1)); +} + u32 nvme_command_effects(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u8 opcode); void nvme_execute_passthru_rq(struct request *rq); From b61678bcd43c6686a6d0cf965934a54b4225821d Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:28:24 -0700 Subject: [PATCH 50/78] nvme-fc: use ctrl sgl check helper Use the helper to check NVMe controller's SGL support. Reviewed-by: James Smart Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index 256e87721a01..8a3c4814d21b 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -3111,7 +3111,7 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl) } /* FC-NVME supports normal SGL Data Block Descriptors */ - if (!(ctrl->ctrl.sgls & ((1 << 0) | (1 << 1)))) { + if (!nvme_ctrl_sgl_supported(&ctrl->ctrl)) { dev_err(ctrl->ctrl.device, "Mandatory sgls are not supported!\n"); goto out_disconnect_admin_queue; From 253a0b76a12a4cce14095b3d74004e67a6434d79 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:28:25 -0700 Subject: [PATCH 51/78] nvme-pci: use ctrl sgl check helper Use the helper to check NVMe controller's SGL support. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 138e7e7453dd..12ffd58c27b1 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -536,7 +536,7 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req) avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg); - if (!(dev->ctrl.sgls & ((1 << 0) | (1 << 1)))) + if (!nvme_ctrl_sgl_supported(&dev->ctrl)) return false; if (!iod->nvmeq->qid) return false; @@ -853,7 +853,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, &cmnd->rw, &bv); if (iod->nvmeq->qid && sgl_threshold && - dev->ctrl.sgls & ((1 << 0) | (1 << 1))) + nvme_ctrl_sgl_supported(&dev->ctrl)) return nvme_setup_sgl_simple(dev, req, &cmnd->rw, &bv); } From 3b54064fbce73a4dada6019dd400f0ce28ab5eb9 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:28:26 -0700 Subject: [PATCH 52/78] nvme-tcp: use ctrl sgl check helper Use the helper to check NVMe controller's SGL support. Reviewed-by: Sagi Grimberg Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 6a65b0516180..c7bd37103cf4 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -1993,7 +1993,7 @@ static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new) goto destroy_admin; } - if (!(ctrl->sgls & ((1 << 0) | (1 << 1)))) { + if (!nvme_ctrl_sgl_supported(ctrl)) { ret = -EOPNOTSUPP; dev_err(ctrl->device, "Mandatory sgls are not supported!\n"); goto destroy_admin; From 2796a8e409429a67daeb813ed270eb645f56f817 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Mon, 14 Jun 2021 19:45:51 -0700 Subject: [PATCH 53/78] nvme-fabrics: remove memset in nvmf_reg_read64() Declare and initialize structure variable to the zero values so that we can get rid of the zeroout memset call. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 1239a63e3ac2..4753f1e5505e 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -190,11 +190,10 @@ EXPORT_SYMBOL_GPL(nvmf_reg_read32); */ int nvmf_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val) { - struct nvme_command cmd; + struct nvme_command cmd = { }; union nvme_result res; int ret; - memset(&cmd, 0, sizeof(cmd)); cmd.prop_get.opcode = nvme_fabrics_command; cmd.prop_get.fctype = nvme_fabrics_type_property_get; cmd.prop_get.attrib = 1; From c22c2720133d51d95da608a77cd703f29d29747e Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Mon, 14 Jun 2021 19:45:52 -0700 Subject: [PATCH 54/78] nvme-fabrics: remove memset in nvmf_reg_write32() Declare and initialize structure variable to the zero values so that we can get rid of the zeroout memset call. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 4753f1e5505e..09fe3d97bf44 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -235,10 +235,9 @@ EXPORT_SYMBOL_GPL(nvmf_reg_read64); */ int nvmf_reg_write32(struct nvme_ctrl *ctrl, u32 off, u32 val) { - struct nvme_command cmd; + struct nvme_command cmd = { }; int ret; - memset(&cmd, 0, sizeof(cmd)); cmd.prop_set.opcode = nvme_fabrics_command; cmd.prop_set.fctype = nvme_fabrics_type_property_set; cmd.prop_set.attrib = 0; From bfa9d1222d6185a4aea603ebc7d74d75c747087c Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Mon, 14 Jun 2021 19:45:53 -0700 Subject: [PATCH 55/78] nvme-fabrics: remove memset in connect admin q Declare and initialize structure variable to the zero values so that we can get rid of the zeroout memset call. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 09fe3d97bf44..43c797e3380b 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -362,12 +362,11 @@ static void nvmf_log_connect_error(struct nvme_ctrl *ctrl, */ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl) { - struct nvme_command cmd; + struct nvme_command cmd = { }; union nvme_result res; struct nvmf_connect_data *data; int ret; - memset(&cmd, 0, sizeof(cmd)); cmd.connect.opcode = nvme_fabrics_command; cmd.connect.fctype = nvme_fabrics_type_connect; cmd.connect.qid = 0; From eff4423ec0b03fedb8b7b420549ed8e424d246f1 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Mon, 14 Jun 2021 19:45:54 -0700 Subject: [PATCH 56/78] nvme-fabrics: remove memset in connect io q Declare and initialize structure variable to the zero values so that we can get rid of the zeroout memset call. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/fabrics.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 43c797e3380b..1e6a7cc056ca 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -429,12 +429,11 @@ EXPORT_SYMBOL_GPL(nvmf_connect_admin_queue); */ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid, bool poll) { - struct nvme_command cmd; + struct nvme_command cmd = { }; struct nvmf_connect_data *data; union nvme_result res; int ret; - memset(&cmd, 0, sizeof(cmd)); cmd.connect.opcode = nvme_fabrics_command; cmd.connect.fctype = nvme_fabrics_type_connect; cmd.connect.qid = cpu_to_le16(qid); From 2a4a910aa4f0acc428dc8d10227c42e14ed21d10 Mon Sep 17 00:00:00 2001 From: Hannes Reinecke Date: Tue, 25 May 2021 14:54:14 +0200 Subject: [PATCH 57/78] nvmet-fc: do not check for invalid target port in nvmet_fc_handle_fcp_rqst() When parsing a request in nvmet_fc_handle_fcp_rqst() we should not check for invalid target ports; if we do the command is aborted from the fcp layer, causing the host to assume a transport error. Rather we should still forward this request to the nvmet layer, which will then correctly fail the command with an appropriate error status. Signed-off-by: Hannes Reinecke Reviewed-by: James Smart Signed-off-by: Christoph Hellwig --- drivers/nvme/target/fc.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c index 19e113240fff..22b5108168a6 100644 --- a/drivers/nvme/target/fc.c +++ b/drivers/nvme/target/fc.c @@ -2510,13 +2510,6 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport, u32 xfrlen = be32_to_cpu(cmdiu->data_len); int ret; - /* - * if there is no nvmet mapping to the targetport there - * shouldn't be requests. just terminate them. - */ - if (!tgtport->pe) - goto transport_error; - /* * Fused commands are currently not supported in the linux * implementation. @@ -2544,7 +2537,8 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport *tgtport, fod->req.cmd = &fod->cmdiubuf.sqe; fod->req.cqe = &fod->rspiubuf.cqe; - fod->req.port = tgtport->pe->port; + if (tgtport->pe) + fod->req.port = tgtport->pe->port; /* clear any response payload */ memset(&fod->rspiubuf, 0, sizeof(fod->rspiubuf)); From e13b061589ace0aee18bdbf86f3ddb2b6b5b5ab8 Mon Sep 17 00:00:00 2001 From: Noam Gottlieb Date: Mon, 7 Jun 2021 12:23:21 +0300 Subject: [PATCH 58/78] nvmet: change sn size and check validity According to the NVM specification, the serial_number should be 20 bytes (bytes 23:04 of the Identify Controller data structure), and should contain only ASCII characters. In accordance, the serial_number size is changed to 20 bytes and before any attempt to store a new value in serial_number we check that the input is valid - i.e. contains only ASCII characters, is not empty and does not exceed 20 bytes. Signed-off-by: Max Gurtovoy Signed-off-by: Noam Gottlieb Reviewed-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/admin-cmd.c | 4 +--- drivers/nvme/target/configfs.c | 33 +++++++++++++++++++++++---------- drivers/nvme/target/core.c | 4 +++- drivers/nvme/target/discovery.c | 4 +--- drivers/nvme/target/nvmet.h | 3 ++- 5 files changed, 30 insertions(+), 18 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index dcd49a72f2f3..9c73dbfb8228 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -357,9 +357,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req) id->vid = 0; id->ssvid = 0; - memset(id->sn, ' ', sizeof(id->sn)); - bin2hex(id->sn, &ctrl->subsys->serial, - min(sizeof(ctrl->subsys->serial), sizeof(id->sn) / 2)); + memcpy(id->sn, ctrl->subsys->serial, NVMET_SN_MAX_SIZE); memcpy_and_pad(id->mn, sizeof(id->mn), subsys->model_number, strlen(subsys->model_number), ' '); memcpy_and_pad(id->fr, sizeof(id->fr), diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index 65a0cf99f557..027b28aaf7cd 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -1030,24 +1030,43 @@ static ssize_t nvmet_subsys_attr_version_store(struct config_item *item, } CONFIGFS_ATTR(nvmet_subsys_, attr_version); +/* See Section 1.5 of NVMe 1.4 */ +static bool nvmet_is_ascii(const char c) +{ + return c >= 0x20 && c <= 0x7e; +} + static ssize_t nvmet_subsys_attr_serial_show(struct config_item *item, char *page) { struct nvmet_subsys *subsys = to_subsys(item); - return snprintf(page, PAGE_SIZE, "%llx\n", subsys->serial); + return snprintf(page, PAGE_SIZE, "%s\n", subsys->serial); } static ssize_t nvmet_subsys_attr_serial_store(struct config_item *item, const char *page, size_t count) { - u64 serial; + struct nvmet_subsys *subsys = to_subsys(item); + int pos, len = strcspn(page, "\n"); - if (sscanf(page, "%llx\n", &serial) != 1) + if (!len || len > NVMET_SN_MAX_SIZE) { + pr_err("Serial Number can not be empty or exceed %d Bytes\n", + NVMET_SN_MAX_SIZE); return -EINVAL; + } + + for (pos = 0; pos < len; pos++) { + if (!nvmet_is_ascii(page[pos])) { + pr_err("Serial Number must contain only ASCII strings\n"); + return -EINVAL; + } + } down_write(&nvmet_config_sem); - to_subsys(item)->serial = serial; + mutex_lock(&subsys->lock); + memcpy_and_pad(subsys->serial, NVMET_SN_MAX_SIZE, page, len, ' '); + mutex_unlock(&subsys->lock); up_write(&nvmet_config_sem); return count; @@ -1128,12 +1147,6 @@ static ssize_t nvmet_subsys_attr_model_show(struct config_item *item, return ret; } -/* See Section 1.5 of NVMe 1.4 */ -static bool nvmet_is_ascii(const char c) -{ - return c >= 0x20 && c <= 0x7e; -} - static ssize_t nvmet_subsys_attr_model_store_locked(struct nvmet_subsys *subsys, const char *page, size_t count) { diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 4ae4bea6625d..213a0c2af4f7 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -1493,6 +1493,7 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, enum nvme_subsys_type type) { struct nvmet_subsys *subsys; + char serial[NVMET_SN_MAX_SIZE / 2]; subsys = kzalloc(sizeof(*subsys), GFP_KERNEL); if (!subsys) @@ -1500,7 +1501,8 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, subsys->ver = NVMET_DEFAULT_VS; /* generate a random serial number as our controllers are ephemeral: */ - get_random_bytes(&subsys->serial, sizeof(subsys->serial)); + get_random_bytes(&serial, sizeof(serial)); + bin2hex(subsys->serial, &serial, sizeof(serial)); switch (type) { case NVME_NQN_NVME: diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c index fc3645fc2c24..b7fdad13094a 100644 --- a/drivers/nvme/target/discovery.c +++ b/drivers/nvme/target/discovery.c @@ -262,9 +262,7 @@ static void nvmet_execute_disc_identify(struct nvmet_req *req) goto out; } - memset(id->sn, ' ', sizeof(id->sn)); - bin2hex(id->sn, &ctrl->subsys->serial, - min(sizeof(ctrl->subsys->serial), sizeof(id->sn) / 2)); + memcpy(id->sn, ctrl->subsys->serial, NVMET_SN_MAX_SIZE); memset(id->fr, ' ', sizeof(id->fr)); memcpy_and_pad(id->mn, sizeof(id->mn), model, sizeof(model) - 1, ' '); memcpy_and_pad(id->fr, sizeof(id->fr), diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index d69a409515d6..0ae809ca428c 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -28,6 +28,7 @@ #define NVMET_NO_ERROR_LOC ((u16)-1) #define NVMET_DEFAULT_CTRL_MODEL "Linux" #define NVMET_MN_MAX_SIZE 40 +#define NVMET_SN_MAX_SIZE 20 /* * Supported optional AENs: @@ -229,7 +230,7 @@ struct nvmet_subsys { u16 max_qid; u64 ver; - u64 serial; + char serial[NVMET_SN_MAX_SIZE]; char *subsysnqn; bool pi_support; From 7ae023c5aa644211bde26db11018fe08b8408bd5 Mon Sep 17 00:00:00 2001 From: Noam Gottlieb Date: Mon, 7 Jun 2021 12:23:22 +0300 Subject: [PATCH 59/78] nvmet: make sn stable once connection was established Once some host has connected to the target, make sure that the serial number is stable and cannot be changed. Reviewed-by: Max Gurtovoy Signed-off-by: Noam Gottlieb Signed-off-by: Christoph Hellwig --- drivers/nvme/target/admin-cmd.c | 6 ++++++ drivers/nvme/target/configfs.c | 27 ++++++++++++++++++++++----- drivers/nvme/target/nvmet.h | 1 + 3 files changed, 29 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 9c73dbfb8228..5f15c3285a5b 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -337,6 +337,12 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req) u32 cmd_capsule_size; u16 status = 0; + if (!subsys->subsys_discovered) { + mutex_lock(&subsys->lock); + subsys->subsys_discovered = true; + mutex_unlock(&subsys->lock); + } + /* * If there is no model number yet, set it now. It will then remain * stable for the life time of the subsystem. diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index 027b28aaf7cd..a13da86fb374 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -1044,12 +1044,18 @@ static ssize_t nvmet_subsys_attr_serial_show(struct config_item *item, return snprintf(page, PAGE_SIZE, "%s\n", subsys->serial); } -static ssize_t nvmet_subsys_attr_serial_store(struct config_item *item, - const char *page, size_t count) +static ssize_t +nvmet_subsys_attr_serial_store_locked(struct nvmet_subsys *subsys, + const char *page, size_t count) { - struct nvmet_subsys *subsys = to_subsys(item); int pos, len = strcspn(page, "\n"); + if (subsys->subsys_discovered) { + pr_err("Can't set serial number. %s is already assigned\n", + subsys->serial); + return -EINVAL; + } + if (!len || len > NVMET_SN_MAX_SIZE) { pr_err("Serial Number can not be empty or exceed %d Bytes\n", NVMET_SN_MAX_SIZE); @@ -1063,13 +1069,24 @@ static ssize_t nvmet_subsys_attr_serial_store(struct config_item *item, } } + memcpy_and_pad(subsys->serial, NVMET_SN_MAX_SIZE, page, len, ' '); + + return count; +} + +static ssize_t nvmet_subsys_attr_serial_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_subsys *subsys = to_subsys(item); + ssize_t ret; + down_write(&nvmet_config_sem); mutex_lock(&subsys->lock); - memcpy_and_pad(subsys->serial, NVMET_SN_MAX_SIZE, page, len, ' '); + ret = nvmet_subsys_attr_serial_store_locked(subsys, page, count); mutex_unlock(&subsys->lock); up_write(&nvmet_config_sem); - return count; + return ret; } CONFIGFS_ATTR(nvmet_subsys_, attr_serial); diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 0ae809ca428c..bd0a0b91d843 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -231,6 +231,7 @@ struct nvmet_subsys { u64 ver; char serial[NVMET_SN_MAX_SIZE]; + bool subsys_discovered; char *subsysnqn; bool pi_support; From 0d148efdf0f0414b2ed2dd9c31e71302bb9ce123 Mon Sep 17 00:00:00 2001 From: Noam Gottlieb Date: Mon, 7 Jun 2021 12:23:23 +0300 Subject: [PATCH 60/78] nvmet: allow mn change if subsys not discovered Currently, once the subsystem's model_number is set for the first time there is no way to change it. However, as long as no connection was established to nvmf target, there is no reason for such restriction and we should allow to change the subsystem's model_number as many times as needed. In addition, in order to simplfy the changes and make the model number flow more similar to the rest of the attributes in the Identify Controller data structure, we set a default value for the model number at the initiation of the subsystem. Reviewed-by: Max Gurtovoy Signed-off-by: Noam Gottlieb Signed-off-by: Christoph Hellwig --- drivers/nvme/target/admin-cmd.c | 26 -------------------------- drivers/nvme/target/configfs.c | 10 ++-------- drivers/nvme/target/core.c | 21 +++++++++++++++++---- drivers/nvme/target/discovery.c | 4 ++-- 4 files changed, 21 insertions(+), 40 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 5f15c3285a5b..cd60a8184d04 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -313,22 +313,6 @@ static void nvmet_execute_get_log_page(struct nvmet_req *req) nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_SC_DNR); } -static u16 nvmet_set_model_number(struct nvmet_subsys *subsys) -{ - u16 status = 0; - - mutex_lock(&subsys->lock); - if (!subsys->model_number) { - subsys->model_number = - kstrdup(NVMET_DEFAULT_CTRL_MODEL, GFP_KERNEL); - if (!subsys->model_number) - status = NVME_SC_INTERNAL; - } - mutex_unlock(&subsys->lock); - - return status; -} - static void nvmet_execute_identify_ctrl(struct nvmet_req *req) { struct nvmet_ctrl *ctrl = req->sq->ctrl; @@ -343,16 +327,6 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req) mutex_unlock(&subsys->lock); } - /* - * If there is no model number yet, set it now. It will then remain - * stable for the life time of the subsystem. - */ - if (!subsys->model_number) { - status = nvmet_set_model_number(subsys); - if (status) - goto out; - } - id = kzalloc(sizeof(*id), GFP_KERNEL); if (!id) { status = NVME_SC_INTERNAL; diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index a13da86fb374..9ef8708b92c6 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -1154,14 +1154,8 @@ static ssize_t nvmet_subsys_attr_model_show(struct config_item *item, char *page) { struct nvmet_subsys *subsys = to_subsys(item); - int ret; - mutex_lock(&subsys->lock); - ret = snprintf(page, PAGE_SIZE, "%s\n", subsys->model_number ? - subsys->model_number : NVMET_DEFAULT_CTRL_MODEL); - mutex_unlock(&subsys->lock); - - return ret; + return snprintf(page, PAGE_SIZE, "%s\n", subsys->model_number); } static ssize_t nvmet_subsys_attr_model_store_locked(struct nvmet_subsys *subsys, @@ -1169,7 +1163,7 @@ static ssize_t nvmet_subsys_attr_model_store_locked(struct nvmet_subsys *subsys, { int pos = 0, len; - if (subsys->model_number) { + if (subsys->subsys_discovered) { pr_err("Can't set model number. %s is already assigned\n", subsys->model_number); return -EINVAL; diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 213a0c2af4f7..146909486b8f 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -1494,6 +1494,7 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, { struct nvmet_subsys *subsys; char serial[NVMET_SN_MAX_SIZE / 2]; + int ret; subsys = kzalloc(sizeof(*subsys), GFP_KERNEL); if (!subsys) @@ -1504,6 +1505,12 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, get_random_bytes(&serial, sizeof(serial)); bin2hex(subsys->serial, &serial, sizeof(serial)); + subsys->model_number = kstrdup(NVMET_DEFAULT_CTRL_MODEL, GFP_KERNEL); + if (!subsys->model_number) { + ret = -ENOMEM; + goto free_subsys; + } + switch (type) { case NVME_NQN_NVME: subsys->max_qid = NVMET_NR_QUEUES; @@ -1513,15 +1520,15 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, break; default: pr_err("%s: Unknown Subsystem type - %d\n", __func__, type); - kfree(subsys); - return ERR_PTR(-EINVAL); + ret = -EINVAL; + goto free_mn; } subsys->type = type; subsys->subsysnqn = kstrndup(subsysnqn, NVMF_NQN_SIZE, GFP_KERNEL); if (!subsys->subsysnqn) { - kfree(subsys); - return ERR_PTR(-ENOMEM); + ret = -ENOMEM; + goto free_mn; } subsys->cntlid_min = NVME_CNTLID_MIN; subsys->cntlid_max = NVME_CNTLID_MAX; @@ -1533,6 +1540,12 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn, INIT_LIST_HEAD(&subsys->hosts); return subsys; + +free_mn: + kfree(subsys->model_number); +free_subsys: + kfree(subsys); + return ERR_PTR(ret); } static void nvmet_subsys_free(struct kref *ref) diff --git a/drivers/nvme/target/discovery.c b/drivers/nvme/target/discovery.c index b7fdad13094a..7aa62bc6ae84 100644 --- a/drivers/nvme/target/discovery.c +++ b/drivers/nvme/target/discovery.c @@ -244,7 +244,6 @@ static void nvmet_execute_disc_identify(struct nvmet_req *req) { struct nvmet_ctrl *ctrl = req->sq->ctrl; struct nvme_id_ctrl *id; - const char model[] = "Linux"; u16 status = 0; if (!nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE)) @@ -264,7 +263,8 @@ static void nvmet_execute_disc_identify(struct nvmet_req *req) memcpy(id->sn, ctrl->subsys->serial, NVMET_SN_MAX_SIZE); memset(id->fr, ' ', sizeof(id->fr)); - memcpy_and_pad(id->mn, sizeof(id->mn), model, sizeof(model) - 1, ' '); + memcpy_and_pad(id->mn, sizeof(id->mn), ctrl->subsys->model_number, + strlen(ctrl->subsys->model_number), ' '); memcpy_and_pad(id->fr, sizeof(id->fr), UTS_RELEASE, strlen(UTS_RELEASE), ' '); From 87fd4cc1c0dda038c9a3617c9d07d5159326e80f Mon Sep 17 00:00:00 2001 From: Noam Gottlieb Date: Mon, 7 Jun 2021 12:23:24 +0300 Subject: [PATCH 61/78] nvmet: make ver stable once connection established Once some host has connected to the nvmf target, make sure that the version number is stable and cannot be changed. Signed-off-by: Max Gurtovoy Signed-off-by: Noam Gottlieb Signed-off-by: Christoph Hellwig --- drivers/nvme/target/configfs.c | 36 +++++++++++++++++++++++++++++----- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index 9ef8708b92c6..273555127188 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -1007,13 +1007,26 @@ static ssize_t nvmet_subsys_attr_version_show(struct config_item *item, NVME_MINOR(subsys->ver)); } -static ssize_t nvmet_subsys_attr_version_store(struct config_item *item, - const char *page, size_t count) +static ssize_t +nvmet_subsys_attr_version_store_locked(struct nvmet_subsys *subsys, + const char *page, size_t count) { - struct nvmet_subsys *subsys = to_subsys(item); int major, minor, tertiary = 0; int ret; + if (subsys->subsys_discovered) { + if (NVME_TERTIARY(subsys->ver)) + pr_err("Can't set version number. %llu.%llu.%llu is already assigned\n", + NVME_MAJOR(subsys->ver), + NVME_MINOR(subsys->ver), + NVME_TERTIARY(subsys->ver)); + else + pr_err("Can't set version number. %llu.%llu is already assigned\n", + NVME_MAJOR(subsys->ver), + NVME_MINOR(subsys->ver)); + return -EINVAL; + } + /* passthru subsystems use the underlying controller's version */ if (nvmet_passthru_ctrl(subsys)) return -EINVAL; @@ -1022,12 +1035,25 @@ static ssize_t nvmet_subsys_attr_version_store(struct config_item *item, if (ret != 2 && ret != 3) return -EINVAL; - down_write(&nvmet_config_sem); subsys->ver = NVME_VS(major, minor, tertiary); - up_write(&nvmet_config_sem); return count; } + +static ssize_t nvmet_subsys_attr_version_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_subsys *subsys = to_subsys(item); + ssize_t ret; + + down_write(&nvmet_config_sem); + mutex_lock(&subsys->lock); + ret = nvmet_subsys_attr_version_store_locked(subsys, page, count); + mutex_unlock(&subsys->lock); + up_write(&nvmet_config_sem); + + return ret; +} CONFIGFS_ATTR(nvmet_subsys_, attr_version); /* See Section 1.5 of NVMe 1.4 */ From 46eca4702d93dbb8ac1c8fa84e5838fc8a1e82a0 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Sun, 13 Jun 2021 18:58:46 -0700 Subject: [PATCH 62/78] nvmet: use req->cmd directly in bdev-ns fast path The function nvmet_bdev_parse_io_cmd() is called from the fast path. The local variable to that function cmd is only used once. Remove the local variable and use req->cmd directly. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/io-cmd-bdev.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index f673679d258a..5d998e5873d3 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -429,9 +429,7 @@ static void nvmet_bdev_execute_write_zeroes(struct nvmet_req *req) u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req) { - struct nvme_command *cmd = req->cmd; - - switch (cmd->common.opcode) { + switch (req->cmd->common.opcode) { case nvme_cmd_read: case nvme_cmd_write: req->execute = nvmet_bdev_execute_rw; From f3dce2add3e1a06f4e16616408aa70bf2f7c8431 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Sun, 13 Jun 2021 18:58:47 -0700 Subject: [PATCH 63/78] nvmet: use req->cmd directly in file-ns fast path The function nvmet_file_parse_io_cmd() is called from the fast path. The local variable to that function cmd is only used once. Remove the local variable and use req->cmd directly. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/io-cmd-file.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c index 7fdbdc496597..1dd1a0fe2e81 100644 --- a/drivers/nvme/target/io-cmd-file.c +++ b/drivers/nvme/target/io-cmd-file.c @@ -385,9 +385,7 @@ static void nvmet_file_execute_write_zeroes(struct nvmet_req *req) u16 nvmet_file_parse_io_cmd(struct nvmet_req *req) { - struct nvme_command *cmd = req->cmd; - - switch (cmd->common.opcode) { + switch (req->cmd->common.opcode) { case nvme_cmd_read: case nvme_cmd_write: req->execute = nvmet_file_execute_rw; From 86693c43bb01c2597b55ec2fac37214ed1094a49 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Sun, 13 Jun 2021 18:58:48 -0700 Subject: [PATCH 64/78] nvmet: use u32 for nvmet_subsys max_nsid Use u32 type for the nsid_max member of the nvmet_subsys structure. This avoids the type confusion when updating the subsys->nax_nsid from ns->nsid. This also matches the nvmet_ns->nsid member. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/nvmet.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index bd0a0b91d843..3468f25cb4b7 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -218,7 +218,7 @@ struct nvmet_subsys { struct xarray namespaces; unsigned int nr_namespaces; - unsigned int max_nsid; + u32 max_nsid; u16 cntlid_min; u16 cntlid_max; From 245067e37d52185a741d269e658afee40d467287 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Sun, 13 Jun 2021 18:58:49 -0700 Subject: [PATCH 65/78] nvmet: use u32 type for the local variable nsid In function nvmet_max_nsid() we calculate the max nsid by iterating over the XArray and store it in the variable nsid that has type of unsigned long. Since the value of this function is stored into the subsys->max_nsid which is of type u32, change the local variable nsid type and the return type of the same function to u32. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 146909486b8f..8494a132da35 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -122,11 +122,11 @@ u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len) return 0; } -static unsigned int nvmet_max_nsid(struct nvmet_subsys *subsys) +static u32 nvmet_max_nsid(struct nvmet_subsys *subsys) { - unsigned long nsid = 0; struct nvmet_ns *cur; unsigned long idx; + u32 nsid = 0; xa_for_each(&subsys->namespaces, idx, cur) nsid = cur->nsid; From 8bb6cb9b97ef0b0ae4a492db5a90f8156d2cbe85 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Sun, 13 Jun 2021 18:58:50 -0700 Subject: [PATCH 66/78] nvmet: use nvme status value directly There is no point in keeping the status variable that is used only once in the function nvmet_async_events_failall(). Remove the variable and use the value directly. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 8494a132da35..45a5b273b525 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -141,14 +141,13 @@ static u32 nvmet_async_event_result(struct nvmet_async_event *aen) static void nvmet_async_events_failall(struct nvmet_ctrl *ctrl) { - u16 status = NVME_SC_INTERNAL | NVME_SC_DNR; struct nvmet_req *req; mutex_lock(&ctrl->lock); while (ctrl->nr_async_event_cmds) { req = ctrl->async_event_cmds[--ctrl->nr_async_event_cmds]; mutex_unlock(&ctrl->lock); - nvmet_req_complete(req, status); + nvmet_req_complete(req, NVME_SC_INTERNAL | NVME_SC_DNR); mutex_lock(&ctrl->lock); } mutex_unlock(&ctrl->lock); From 7860569ad47f9fbd7c9f93a5c2b7d2a18e4af831 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Sun, 13 Jun 2021 18:58:51 -0700 Subject: [PATCH 67/78] nvmet: remove local variable In function errno_to_nvme_status() we store the value of the NVMe status into the local variable and don't do anything useful with that but just return. Remove the local variable and return the value directly from switch. This also removed extra break statements. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/core.c | 23 +++++++---------------- 1 file changed, 7 insertions(+), 16 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 45a5b273b525..c8708dcaeaa5 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -43,43 +43,34 @@ DECLARE_RWSEM(nvmet_ana_sem); inline u16 errno_to_nvme_status(struct nvmet_req *req, int errno) { - u16 status; - switch (errno) { case 0: - status = NVME_SC_SUCCESS; - break; + return NVME_SC_SUCCESS; case -ENOSPC: req->error_loc = offsetof(struct nvme_rw_command, length); - status = NVME_SC_CAP_EXCEEDED | NVME_SC_DNR; - break; + return NVME_SC_CAP_EXCEEDED | NVME_SC_DNR; case -EREMOTEIO: req->error_loc = offsetof(struct nvme_rw_command, slba); - status = NVME_SC_LBA_RANGE | NVME_SC_DNR; - break; + return NVME_SC_LBA_RANGE | NVME_SC_DNR; case -EOPNOTSUPP: req->error_loc = offsetof(struct nvme_common_command, opcode); switch (req->cmd->common.opcode) { case nvme_cmd_dsm: case nvme_cmd_write_zeroes: - status = NVME_SC_ONCS_NOT_SUPPORTED | NVME_SC_DNR; - break; + return NVME_SC_ONCS_NOT_SUPPORTED | NVME_SC_DNR; default: - status = NVME_SC_INVALID_OPCODE | NVME_SC_DNR; + return NVME_SC_INVALID_OPCODE | NVME_SC_DNR; } break; case -ENODATA: req->error_loc = offsetof(struct nvme_rw_command, nsid); - status = NVME_SC_ACCESS_DENIED; - break; + return NVME_SC_ACCESS_DENIED; case -EIO: fallthrough; default: req->error_loc = offsetof(struct nvme_common_command, opcode); - status = NVME_SC_INTERNAL | NVME_SC_DNR; + return NVME_SC_INTERNAL | NVME_SC_DNR; } - - return status; } u16 nvmet_report_invalid_opcode(struct nvmet_req *req) From c28a61471c5898e832c6e8634b2659249761b833 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:32:48 -0700 Subject: [PATCH 68/78] block: export blk_next_bio() The block layer provides emulation of zone management operations targeting all zones of a zoned block device only for the zone reset operation (REQ_OP_ZONE_RESET). In order to correctly implement exporting of zoned block devices with NVMeOF, emulating zone management operations targeting all zones of a device is also necessary for the open, close and finish zone operations (REQ_OP_ZONE_OPEN, REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH). Instead of duplicating the code, export the existing helper from block layer so we can use a bio chaining pattern that is present in the block layer for REQ_OP_ZONE RESET all emulation in the NVMeOF zoned block device backend. Reviewed-by: Damien Le Moal Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- block/blk-lib.c | 1 + include/linux/bio.h | 2 ++ 2 files changed, 3 insertions(+) diff --git a/block/blk-lib.c b/block/blk-lib.c index 7b256131b20b..9f09beadcbe3 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -21,6 +21,7 @@ struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp) return new; } +EXPORT_SYMBOL_GPL(blk_next_bio); int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, int flags, diff --git a/include/linux/bio.h b/include/linux/bio.h index a0b4cfdf62a4..b2491ead22a0 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -822,4 +822,6 @@ static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb) bio->bi_opf |= REQ_NOWAIT; } +struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp); + #endif /* __LINUX_BIO_H */ From 6e597263f990a2db99e7380debc4044c38867971 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:32:49 -0700 Subject: [PATCH 69/78] nvmet: add req cns error complete helper We report error and complete the request when identify cns value is not handled in nvmet_execute_identify(). This error reporting is also needed for Zone Block Device backend for NVMeOF target. Add a helper nvmet_req_cns_error_compplete() to report an error and complete the request when idenitfy command cns not handled value. Signed-off-by: Chaitanya Kulkarni Reviewed-by: Damien Le Moal Signed-off-by: Christoph Hellwig --- drivers/nvme/target/admin-cmd.c | 5 +---- drivers/nvme/target/nvmet.h | 8 ++++++++ 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index cd60a8184d04..3de6a6c99b01 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -637,10 +637,7 @@ static void nvmet_execute_identify(struct nvmet_req *req) return nvmet_execute_identify_desclist(req); } - pr_debug("unhandled identify cns %d on qid %d\n", - req->cmd->identify.cns, req->sq->qid); - req->error_loc = offsetof(struct nvme_identify, cns); - nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_SC_DNR); + nvmet_req_cns_error_complete(req); } /* diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 3468f25cb4b7..002651f34c5e 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -624,4 +624,12 @@ static inline bool nvmet_use_inline_bvec(struct nvmet_req *req) req->sg_cnt <= NVMET_MAX_INLINE_BIOVEC; } +static inline void nvmet_req_cns_error_complete(struct nvmet_req *req) +{ + pr_debug("unhandled identify cns %d on qid %d\n", + req->cmd->identify.cns, req->sq->qid); + req->error_loc = offsetof(struct nvme_identify, cns); + nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_SC_DNR); +} + #endif /* _NVMET_H */ From 9a01b58c22ccabd00e990e9dc01c2de5865d6e4d Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:32:50 -0700 Subject: [PATCH 70/78] nvmet: add nvmet_req_bio put helper for backends In current code there exists two backends which are using inline bio optimization, that adds a duplicate code for freeing the bio. For Zoned Block Device backend we also use the same optimzation and it will lead to having duplicate code in the three backends: generic bdev, passsthru, and generic zns. Add a helper function to avoid duplicate code and update the respective backends. Signed-off-by: Chaitanya Kulkarni Reviewed-by: Damien Le Moal Signed-off-by: Christoph Hellwig --- drivers/nvme/target/io-cmd-bdev.c | 3 +-- drivers/nvme/target/nvmet.h | 6 ++++++ drivers/nvme/target/passthru.c | 3 +-- 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index 5d998e5873d3..019cc994efcd 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -164,8 +164,7 @@ static void nvmet_bio_done(struct bio *bio) struct nvmet_req *req = bio->bi_private; nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); - if (bio != &req->b.inline_bio) - bio_put(bio); + nvmet_req_bio_put(req, bio); } #ifdef CONFIG_BLK_DEV_INTEGRITY diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 002651f34c5e..29b386de1d07 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -632,4 +632,10 @@ static inline void nvmet_req_cns_error_complete(struct nvmet_req *req) nvmet_req_complete(req, NVME_SC_INVALID_FIELD | NVME_SC_DNR); } +static inline void nvmet_req_bio_put(struct nvmet_req *req, struct bio *bio) +{ + if (bio != &req->b.inline_bio) + bio_put(bio); +} + #endif /* _NVMET_H */ diff --git a/drivers/nvme/target/passthru.c b/drivers/nvme/target/passthru.c index 39b1473f7204..fced52de33ce 100644 --- a/drivers/nvme/target/passthru.c +++ b/drivers/nvme/target/passthru.c @@ -206,8 +206,7 @@ static int nvmet_passthru_map_sg(struct nvmet_req *req, struct request *rq) for_each_sg(req->sg, sg, req->sg_cnt, i) { if (bio_add_pc_page(rq->q, bio, sg_page(sg), sg->length, sg->offset) < sg->length) { - if (bio != &req->p.inline_bio) - bio_put(bio); + nvmet_req_bio_put(req, bio); return -EINVAL; } } From ab5d0b38c0475d6ff59f1a6ccf7c668b9ec2e0a4 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:32:51 -0700 Subject: [PATCH 71/78] nvmet: add Command Set Identifier support NVMe TP 4056 allows controllers to support different command sets. NVMeoF target currently only supports namespaces that contain traditional logical blocks that may be randomly read and written. In some applications there is a value in exposing namespaces that contain logical blocks that have special access rules (e.g. sequentially write required namespace such as Zoned Namespace (ZNS)). In order to support the Zoned Block Devices (ZBD) backend, controllers need to have support for ZNS Command Set Identifier (CSI). In this preparation patch, we adjust the code such that it can now support the default command set identifier. We update the namespace data structure to store the CSI value which defaults to NVME_CSI_NVM that represents traditional logical blocks namespace type. The CSI support is required to implement the ZBD backend for NVMeOF with host side NVMe ZNS interface, since ZNS commands belong to the different command set than the default one. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/admin-cmd.c | 75 +++++++++++++++++++++++++++------ drivers/nvme/target/core.c | 28 +++++++++--- drivers/nvme/target/nvmet.h | 1 + include/linux/nvme.h | 1 + 4 files changed, 87 insertions(+), 18 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 3de6a6c99b01..93aaa7479e71 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -162,15 +162,8 @@ out: nvmet_req_complete(req, status); } -static void nvmet_execute_get_log_cmd_effects_ns(struct nvmet_req *req) +static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log) { - u16 status = NVME_SC_INTERNAL; - struct nvme_effects_log *log; - - log = kzalloc(sizeof(*log), GFP_KERNEL); - if (!log) - goto out; - log->acs[nvme_admin_get_log_page] = cpu_to_le32(1 << 0); log->acs[nvme_admin_identify] = cpu_to_le32(1 << 0); log->acs[nvme_admin_abort_cmd] = cpu_to_le32(1 << 0); @@ -184,9 +177,30 @@ static void nvmet_execute_get_log_cmd_effects_ns(struct nvmet_req *req) log->iocs[nvme_cmd_flush] = cpu_to_le32(1 << 0); log->iocs[nvme_cmd_dsm] = cpu_to_le32(1 << 0); log->iocs[nvme_cmd_write_zeroes] = cpu_to_le32(1 << 0); +} + +static void nvmet_execute_get_log_cmd_effects_ns(struct nvmet_req *req) +{ + struct nvme_effects_log *log; + u16 status = NVME_SC_SUCCESS; + + log = kzalloc(sizeof(*log), GFP_KERNEL); + if (!log) { + status = NVME_SC_INTERNAL; + goto out; + } + + switch (req->cmd->get_log_page.csi) { + case NVME_CSI_NVM: + nvmet_get_cmd_effects_nvm(log); + break; + default: + status = NVME_SC_INVALID_LOG_PAGE; + goto free; + } status = nvmet_copy_to_sgl(req, 0, log, sizeof(*log)); - +free: kfree(log); out: nvmet_req_complete(req, status); @@ -613,6 +627,12 @@ static void nvmet_execute_identify_desclist(struct nvmet_req *req) goto out; } + status = nvmet_copy_ns_identifier(req, NVME_NIDT_CSI, + NVME_NIDT_CSI_LEN, + &req->ns->csi, &off); + if (status) + goto out; + if (sg_zero_buffer(req->sg, req->sg_cnt, NVME_IDENTIFY_DATA_SIZE - off, off) != NVME_IDENTIFY_DATA_SIZE - off) status = NVME_SC_INTERNAL | NVME_SC_DNR; @@ -621,6 +641,17 @@ out: nvmet_req_complete(req, status); } +static bool nvmet_handle_identify_desclist(struct nvmet_req *req) +{ + switch (req->cmd->identify.csi) { + case NVME_CSI_NVM: + nvmet_execute_identify_desclist(req); + return true; + default: + return false; + } +} + static void nvmet_execute_identify(struct nvmet_req *req) { if (!nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE)) @@ -628,13 +659,31 @@ static void nvmet_execute_identify(struct nvmet_req *req) switch (req->cmd->identify.cns) { case NVME_ID_CNS_NS: - return nvmet_execute_identify_ns(req); + switch (req->cmd->identify.csi) { + case NVME_CSI_NVM: + return nvmet_execute_identify_ns(req); + default: + break; + } + break; case NVME_ID_CNS_CTRL: - return nvmet_execute_identify_ctrl(req); + switch (req->cmd->identify.csi) { + case NVME_CSI_NVM: + return nvmet_execute_identify_ctrl(req); + } + break; case NVME_ID_CNS_NS_ACTIVE_LIST: - return nvmet_execute_identify_nslist(req); + switch (req->cmd->identify.csi) { + case NVME_CSI_NVM: + return nvmet_execute_identify_nslist(req); + default: + break; + } + break; case NVME_ID_CNS_NS_DESC_LIST: - return nvmet_execute_identify_desclist(req); + if (nvmet_handle_identify_desclist(req) == true) + return; + break; } nvmet_req_cns_error_complete(req); diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index c8708dcaeaa5..77873d56cff5 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -682,6 +682,7 @@ struct nvmet_ns *nvmet_ns_alloc(struct nvmet_subsys *subsys, u32 nsid) uuid_gen(&ns->uuid); ns->buffered_io = false; + ns->csi = NVME_CSI_NVM; return ns; } @@ -877,10 +878,14 @@ static u16 nvmet_parse_io_cmd(struct nvmet_req *req) return ret; } - if (req->ns->file) - return nvmet_file_parse_io_cmd(req); - - return nvmet_bdev_parse_io_cmd(req); + switch (req->ns->csi) { + case NVME_CSI_NVM: + if (req->ns->file) + return nvmet_file_parse_io_cmd(req); + return nvmet_bdev_parse_io_cmd(req); + default: + return NVME_SC_INVALID_IO_CMD_SET; + } } bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq, @@ -1102,6 +1107,17 @@ static inline u8 nvmet_cc_iocqes(u32 cc) return (cc >> NVME_CC_IOCQES_SHIFT) & 0xf; } +static inline bool nvmet_css_supported(u8 cc_css) +{ + switch (cc_css <<= NVME_CC_CSS_SHIFT) { + case NVME_CC_CSS_NVM: + case NVME_CC_CSS_CSI: + return true; + default: + return false; + } +} + static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl) { lockdep_assert_held(&ctrl->lock); @@ -1121,7 +1137,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl) if (nvmet_cc_mps(ctrl->cc) != 0 || nvmet_cc_ams(ctrl->cc) != 0 || - nvmet_cc_css(ctrl->cc) != 0) { + !nvmet_css_supported(nvmet_cc_css(ctrl->cc))) { ctrl->csts = NVME_CSTS_CFS; return; } @@ -1172,6 +1188,8 @@ static void nvmet_init_cap(struct nvmet_ctrl *ctrl) { /* command sets supported: NVMe command set: */ ctrl->cap = (1ULL << 37); + /* Controller supports one or more I/O Command Sets */ + ctrl->cap |= (1ULL << 43); /* CC.EN timeout in 500msec units: */ ctrl->cap |= (15ULL << 24); /* maximum queue entries supported: */ diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 29b386de1d07..cc71918ff8fe 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -83,6 +83,7 @@ struct nvmet_ns { struct pci_dev *p2p_dev; int pi_type; int metadata_size; + u8 csi; }; static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item) diff --git a/include/linux/nvme.h b/include/linux/nvme.h index edcbd60b88b9..c7ba83144d52 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -1504,6 +1504,7 @@ enum { NVME_SC_NS_WRITE_PROTECTED = 0x20, NVME_SC_CMD_INTERRUPTED = 0x21, NVME_SC_TRANSIENT_TR_ERR = 0x22, + NVME_SC_INVALID_IO_CMD_SET = 0x2C, NVME_SC_LBA_RANGE = 0x80, NVME_SC_CAP_EXCEEDED = 0x81, From aaf2e048af2704da5869f27b508b288f36d5c7b7 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 9 Jun 2021 18:32:52 -0700 Subject: [PATCH 72/78] nvmet: add ZBD over ZNS backend support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit NVMe TP 4053 – Zoned Namespaces (ZNS) allows host software to communicate with a non-volatile memory subsystem using zones for NVMe protocol-based controllers. NVMeOF already support the ZNS NVMe Protocol compliant devices on the target in the passthru mode. There are generic zoned block devices like Shingled Magnetic Recording (SMR) HDDs that are not based on the NVMe protocol. This patch adds ZNS backend support for non-ZNS zoned block devices as NVMeOF targets. This support includes implementing the new command set NVME_CSI_ZNS, adding different command handlers for ZNS command set such as NVMe Identify Controller, NVMe Identify Namespace, NVMe Zone Append, NVMe Zone Management Send and NVMe Zone Management Receive. With the new command set identifier, we also update the target command effects logs to reflect the ZNS compliant commands. Signed-off-by: Chaitanya Kulkarni Reviewed-by: Damien Le Moal Signed-off-by: Christoph Hellwig --- drivers/nvme/target/Makefile | 1 + drivers/nvme/target/admin-cmd.c | 41 ++ drivers/nvme/target/core.c | 15 +- drivers/nvme/target/io-cmd-bdev.c | 26 +- drivers/nvme/target/nvmet.h | 20 + drivers/nvme/target/zns.c | 615 ++++++++++++++++++++++++++++++ include/linux/nvme.h | 7 + 7 files changed, 714 insertions(+), 11 deletions(-) create mode 100644 drivers/nvme/target/zns.c diff --git a/drivers/nvme/target/Makefile b/drivers/nvme/target/Makefile index ebf91fc4c72e..9837e580fa7e 100644 --- a/drivers/nvme/target/Makefile +++ b/drivers/nvme/target/Makefile @@ -12,6 +12,7 @@ obj-$(CONFIG_NVME_TARGET_TCP) += nvmet-tcp.o nvmet-y += core.o configfs.o admin-cmd.o fabrics-cmd.o \ discovery.o io-cmd-file.o io-cmd-bdev.o nvmet-$(CONFIG_NVME_TARGET_PASSTHRU) += passthru.o +nvmet-$(CONFIG_BLK_DEV_ZONED) += zns.o nvme-loop-y += loop.o nvmet-rdma-y += rdma.o nvmet-fc-y += fc.o diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 93aaa7479e71..363e357d2f20 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -179,6 +179,13 @@ static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log) log->iocs[nvme_cmd_write_zeroes] = cpu_to_le32(1 << 0); } +static void nvmet_get_cmd_effects_zns(struct nvme_effects_log *log) +{ + log->iocs[nvme_cmd_zone_append] = cpu_to_le32(1 << 0); + log->iocs[nvme_cmd_zone_mgmt_send] = cpu_to_le32(1 << 0); + log->iocs[nvme_cmd_zone_mgmt_recv] = cpu_to_le32(1 << 0); +} + static void nvmet_execute_get_log_cmd_effects_ns(struct nvmet_req *req) { struct nvme_effects_log *log; @@ -194,6 +201,14 @@ static void nvmet_execute_get_log_cmd_effects_ns(struct nvmet_req *req) case NVME_CSI_NVM: nvmet_get_cmd_effects_nvm(log); break; + case NVME_CSI_ZNS: + if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED)) { + status = NVME_SC_INVALID_IO_CMD_SET; + goto free; + } + nvmet_get_cmd_effects_nvm(log); + nvmet_get_cmd_effects_zns(log); + break; default: status = NVME_SC_INVALID_LOG_PAGE; goto free; @@ -647,6 +662,12 @@ static bool nvmet_handle_identify_desclist(struct nvmet_req *req) case NVME_CSI_NVM: nvmet_execute_identify_desclist(req); return true; + case NVME_CSI_ZNS: + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) { + nvmet_execute_identify_desclist(req); + return true; + } + return false; default: return false; } @@ -666,12 +687,32 @@ static void nvmet_execute_identify(struct nvmet_req *req) break; } break; + case NVME_ID_CNS_CS_NS: + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) { + switch (req->cmd->identify.csi) { + case NVME_CSI_ZNS: + return nvmet_execute_identify_cns_cs_ns(req); + default: + break; + } + } + break; case NVME_ID_CNS_CTRL: switch (req->cmd->identify.csi) { case NVME_CSI_NVM: return nvmet_execute_identify_ctrl(req); } break; + case NVME_ID_CNS_CS_CTRL: + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) { + switch (req->cmd->identify.csi) { + case NVME_CSI_ZNS: + return nvmet_execute_identify_cns_cs_ctrl(req); + default: + break; + } + } + break; case NVME_ID_CNS_NS_ACTIVE_LIST: switch (req->cmd->identify.csi) { case NVME_CSI_NVM: diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 77873d56cff5..dd16704c9b6b 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -16,6 +16,7 @@ #include "nvmet.h" struct workqueue_struct *buffered_io_wq; +struct workqueue_struct *zbd_wq; static const struct nvmet_fabrics_ops *nvmet_transports[NVMF_TRTYPE_MAX]; static DEFINE_IDA(cntlid_ida); @@ -883,6 +884,10 @@ static u16 nvmet_parse_io_cmd(struct nvmet_req *req) if (req->ns->file) return nvmet_file_parse_io_cmd(req); return nvmet_bdev_parse_io_cmd(req); + case NVME_CSI_ZNS: + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) + return nvmet_bdev_zns_parse_io_cmd(req); + return NVME_SC_INVALID_IO_CMD_SET; default: return NVME_SC_INVALID_IO_CMD_SET; } @@ -1592,11 +1597,15 @@ static int __init nvmet_init(void) nvmet_ana_group_enabled[NVMET_DEFAULT_ANA_GRPID] = 1; + zbd_wq = alloc_workqueue("nvmet-zbd-wq", WQ_MEM_RECLAIM, 0); + if (!zbd_wq) + return -ENOMEM; + buffered_io_wq = alloc_workqueue("nvmet-buffered-io-wq", WQ_MEM_RECLAIM, 0); if (!buffered_io_wq) { error = -ENOMEM; - goto out; + goto out_free_zbd_work_queue; } error = nvmet_init_discovery(); @@ -1612,7 +1621,8 @@ out_exit_discovery: nvmet_exit_discovery(); out_free_work_queue: destroy_workqueue(buffered_io_wq); -out: +out_free_zbd_work_queue: + destroy_workqueue(zbd_wq); return error; } @@ -1622,6 +1632,7 @@ static void __exit nvmet_exit(void) nvmet_exit_discovery(); ida_destroy(&cntlid_ida); destroy_workqueue(buffered_io_wq); + destroy_workqueue(zbd_wq); BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_entry) != 1024); BUILD_BUG_ON(sizeof(struct nvmf_disc_rsp_page_hdr) != 1024); diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index 019cc994efcd..0fc2781ab970 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -47,6 +47,14 @@ void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id) id->nows = to0based(ql->io_opt / ql->logical_block_size); } +void nvmet_bdev_ns_disable(struct nvmet_ns *ns) +{ + if (ns->bdev) { + blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); + ns->bdev = NULL; + } +} + static void nvmet_bdev_ns_enable_integrity(struct nvmet_ns *ns) { struct blk_integrity *bi = bdev_get_integrity(ns->bdev); @@ -86,15 +94,15 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns) if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY_T10)) nvmet_bdev_ns_enable_integrity(ns); - return 0; -} - -void nvmet_bdev_ns_disable(struct nvmet_ns *ns) -{ - if (ns->bdev) { - blkdev_put(ns->bdev, FMODE_WRITE | FMODE_READ); - ns->bdev = NULL; + if (bdev_is_zoned(ns->bdev)) { + if (!nvmet_bdev_zns_enable(ns)) { + nvmet_bdev_ns_disable(ns); + return -EINVAL; + } + ns->csi = NVME_CSI_ZNS; } + + return 0; } void nvmet_bdev_ns_revalidate(struct nvmet_ns *ns) @@ -102,7 +110,7 @@ void nvmet_bdev_ns_revalidate(struct nvmet_ns *ns) ns->size = i_size_read(ns->bdev->bd_inode); } -static u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) +u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) { u16 status = NVME_SC_SUCCESS; diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index cc71918ff8fe..d719a1cd5dda 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -250,6 +250,10 @@ struct nvmet_subsys { unsigned int admin_timeout; unsigned int io_timeout; #endif /* CONFIG_NVME_TARGET_PASSTHRU */ + +#ifdef CONFIG_BLK_DEV_ZONED + u8 zasl; +#endif /* CONFIG_BLK_DEV_ZONED */ }; static inline struct nvmet_subsys *to_subsys(struct config_item *item) @@ -335,6 +339,12 @@ struct nvmet_req { struct work_struct work; bool use_workqueue; } p; +#ifdef CONFIG_BLK_DEV_ZONED + struct { + struct bio inline_bio; + struct work_struct zmgmt_work; + } z; +#endif /* CONFIG_BLK_DEV_ZONED */ }; int sg_cnt; int metadata_sg_cnt; @@ -354,6 +364,7 @@ struct nvmet_req { }; extern struct workqueue_struct *buffered_io_wq; +extern struct workqueue_struct *zbd_wq; static inline void nvmet_set_result(struct nvmet_req *req, u32 result) { @@ -403,6 +414,7 @@ u16 nvmet_parse_connect_cmd(struct nvmet_req *req); void nvmet_bdev_set_limits(struct block_device *bdev, struct nvme_id_ns *id); u16 nvmet_bdev_parse_io_cmd(struct nvmet_req *req); u16 nvmet_file_parse_io_cmd(struct nvmet_req *req); +u16 nvmet_bdev_zns_parse_io_cmd(struct nvmet_req *req); u16 nvmet_parse_admin_cmd(struct nvmet_req *req); u16 nvmet_parse_discovery_cmd(struct nvmet_req *req); u16 nvmet_parse_fabrics_cmd(struct nvmet_req *req); @@ -530,6 +542,14 @@ void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid); void nvmet_bdev_ns_revalidate(struct nvmet_ns *ns); int nvmet_file_ns_revalidate(struct nvmet_ns *ns); void nvmet_ns_revalidate(struct nvmet_ns *ns); +u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts); + +bool nvmet_bdev_zns_enable(struct nvmet_ns *ns); +void nvmet_execute_identify_cns_cs_ctrl(struct nvmet_req *req); +void nvmet_execute_identify_cns_cs_ns(struct nvmet_req *req); +void nvmet_bdev_execute_zone_mgmt_recv(struct nvmet_req *req); +void nvmet_bdev_execute_zone_mgmt_send(struct nvmet_req *req); +void nvmet_bdev_execute_zone_append(struct nvmet_req *req); static inline u32 nvmet_rw_data_len(struct nvmet_req *req) { diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c new file mode 100644 index 000000000000..17f8b7a45f21 --- /dev/null +++ b/drivers/nvme/target/zns.c @@ -0,0 +1,615 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * NVMe ZNS-ZBD command implementation. + * Copyright (C) 2021 Western Digital Corporation or its affiliates. + */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include +#include +#include "nvmet.h" + +/* + * We set the Memory Page Size Minimum (MPSMIN) for target controller to 0 + * which gets added by 12 in the nvme_enable_ctrl() which results in 2^12 = 4k + * as page_shift value. When calculating the ZASL use shift by 12. + */ +#define NVMET_MPSMIN_SHIFT 12 + +static inline u8 nvmet_zasl(unsigned int zone_append_sects) +{ + /* + * Zone Append Size Limit (zasl) is expressed as a power of 2 value + * with the minimum memory page size (i.e. 12) as unit. + */ + return ilog2(zone_append_sects >> (NVMET_MPSMIN_SHIFT - 9)); +} + +static int validate_conv_zones_cb(struct blk_zone *z, + unsigned int i, void *data) +{ + if (z->type == BLK_ZONE_TYPE_CONVENTIONAL) + return -EOPNOTSUPP; + return 0; +} + +bool nvmet_bdev_zns_enable(struct nvmet_ns *ns) +{ + struct request_queue *q = ns->bdev->bd_disk->queue; + u8 zasl = nvmet_zasl(queue_max_zone_append_sectors(q)); + struct gendisk *bd_disk = ns->bdev->bd_disk; + int ret; + + if (ns->subsys->zasl) { + if (ns->subsys->zasl > zasl) + return false; + } + ns->subsys->zasl = zasl; + + /* + * Generic zoned block devices may have a smaller last zone which is + * not supported by ZNS. Exclude zoned drives that have such smaller + * last zone. + */ + if (get_capacity(bd_disk) & (bdev_zone_sectors(ns->bdev) - 1)) + return false; + /* + * ZNS does not define a conventional zone type. If the underlying + * device has a bitmap set indicating the existence of conventional + * zones, reject the device. Otherwise, use report zones to detect if + * the device has conventional zones. + */ + if (ns->bdev->bd_disk->queue->conv_zones_bitmap) + return false; + + ret = blkdev_report_zones(ns->bdev, 0, blkdev_nr_zones(bd_disk), + validate_conv_zones_cb, NULL); + if (ret < 0) + return false; + + ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev)); + + return true; +} + +void nvmet_execute_identify_cns_cs_ctrl(struct nvmet_req *req) +{ + u8 zasl = req->sq->ctrl->subsys->zasl; + struct nvmet_ctrl *ctrl = req->sq->ctrl; + struct nvme_id_ctrl_zns *id; + u16 status; + + id = kzalloc(sizeof(*id), GFP_KERNEL); + if (!id) { + status = NVME_SC_INTERNAL; + goto out; + } + + if (ctrl->ops->get_mdts) + id->zasl = min_t(u8, ctrl->ops->get_mdts(ctrl), zasl); + else + id->zasl = zasl; + + status = nvmet_copy_to_sgl(req, 0, id, sizeof(*id)); + + kfree(id); +out: + nvmet_req_complete(req, status); +} + +void nvmet_execute_identify_cns_cs_ns(struct nvmet_req *req) +{ + struct nvme_id_ns_zns *id_zns; + u64 zsze; + u16 status; + + if (le32_to_cpu(req->cmd->identify.nsid) == NVME_NSID_ALL) { + req->error_loc = offsetof(struct nvme_identify, nsid); + status = NVME_SC_INVALID_NS | NVME_SC_DNR; + goto out; + } + + id_zns = kzalloc(sizeof(*id_zns), GFP_KERNEL); + if (!id_zns) { + status = NVME_SC_INTERNAL; + goto out; + } + + status = nvmet_req_find_ns(req); + if (status) { + status = NVME_SC_INTERNAL; + goto done; + } + + if (!bdev_is_zoned(req->ns->bdev)) { + req->error_loc = offsetof(struct nvme_identify, nsid); + status = NVME_SC_INVALID_NS | NVME_SC_DNR; + goto done; + } + + nvmet_ns_revalidate(req->ns); + zsze = (bdev_zone_sectors(req->ns->bdev) << 9) >> + req->ns->blksize_shift; + id_zns->lbafe[0].zsze = cpu_to_le64(zsze); + id_zns->mor = cpu_to_le32(bdev_max_open_zones(req->ns->bdev)); + id_zns->mar = cpu_to_le32(bdev_max_active_zones(req->ns->bdev)); + +done: + status = nvmet_copy_to_sgl(req, 0, id_zns, sizeof(*id_zns)); + kfree(id_zns); +out: + nvmet_req_complete(req, status); +} + +static u16 nvmet_bdev_validate_zone_mgmt_recv(struct nvmet_req *req) +{ + sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba); + u32 out_bufsize = (le32_to_cpu(req->cmd->zmr.numd) + 1) << 2; + + if (sect >= get_capacity(req->ns->bdev->bd_disk)) { + req->error_loc = offsetof(struct nvme_zone_mgmt_recv_cmd, slba); + return NVME_SC_LBA_RANGE | NVME_SC_DNR; + } + + if (out_bufsize < sizeof(struct nvme_zone_report)) { + req->error_loc = offsetof(struct nvme_zone_mgmt_recv_cmd, numd); + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + } + + if (req->cmd->zmr.zra != NVME_ZRA_ZONE_REPORT) { + req->error_loc = offsetof(struct nvme_zone_mgmt_recv_cmd, zra); + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + } + + switch (req->cmd->zmr.pr) { + case 0: + case 1: + break; + default: + req->error_loc = offsetof(struct nvme_zone_mgmt_recv_cmd, pr); + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + } + + switch (req->cmd->zmr.zrasf) { + case NVME_ZRASF_ZONE_REPORT_ALL: + case NVME_ZRASF_ZONE_STATE_EMPTY: + case NVME_ZRASF_ZONE_STATE_IMP_OPEN: + case NVME_ZRASF_ZONE_STATE_EXP_OPEN: + case NVME_ZRASF_ZONE_STATE_CLOSED: + case NVME_ZRASF_ZONE_STATE_FULL: + case NVME_ZRASF_ZONE_STATE_READONLY: + case NVME_ZRASF_ZONE_STATE_OFFLINE: + break; + default: + req->error_loc = + offsetof(struct nvme_zone_mgmt_recv_cmd, zrasf); + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + } + + return NVME_SC_SUCCESS; +} + +struct nvmet_report_zone_data { + struct nvmet_req *req; + u64 out_buf_offset; + u64 out_nr_zones; + u64 nr_zones; + u8 zrasf; +}; + +static int nvmet_bdev_report_zone_cb(struct blk_zone *z, unsigned i, void *d) +{ + static const unsigned int nvme_zrasf_to_blk_zcond[] = { + [NVME_ZRASF_ZONE_STATE_EMPTY] = BLK_ZONE_COND_EMPTY, + [NVME_ZRASF_ZONE_STATE_IMP_OPEN] = BLK_ZONE_COND_IMP_OPEN, + [NVME_ZRASF_ZONE_STATE_EXP_OPEN] = BLK_ZONE_COND_EXP_OPEN, + [NVME_ZRASF_ZONE_STATE_CLOSED] = BLK_ZONE_COND_CLOSED, + [NVME_ZRASF_ZONE_STATE_READONLY] = BLK_ZONE_COND_READONLY, + [NVME_ZRASF_ZONE_STATE_FULL] = BLK_ZONE_COND_FULL, + [NVME_ZRASF_ZONE_STATE_OFFLINE] = BLK_ZONE_COND_OFFLINE, + }; + struct nvmet_report_zone_data *rz = d; + + if (rz->zrasf != NVME_ZRASF_ZONE_REPORT_ALL && + z->cond != nvme_zrasf_to_blk_zcond[rz->zrasf]) + return 0; + + if (rz->nr_zones < rz->out_nr_zones) { + struct nvme_zone_descriptor zdesc = { }; + u16 status; + + zdesc.zcap = nvmet_sect_to_lba(rz->req->ns, z->capacity); + zdesc.zslba = nvmet_sect_to_lba(rz->req->ns, z->start); + zdesc.wp = nvmet_sect_to_lba(rz->req->ns, z->wp); + zdesc.za = z->reset ? 1 << 2 : 0; + zdesc.zs = z->cond << 4; + zdesc.zt = z->type; + + status = nvmet_copy_to_sgl(rz->req, rz->out_buf_offset, &zdesc, + sizeof(zdesc)); + if (status) + return -EINVAL; + + rz->out_buf_offset += sizeof(zdesc); + } + + rz->nr_zones++; + + return 0; +} + +static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req) +{ + unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba); + + return blkdev_nr_zones(req->ns->bdev->bd_disk) - + (sect >> ilog2(bdev_zone_sectors(req->ns->bdev))); +} + +static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize) +{ + if (bufsize <= sizeof(struct nvme_zone_report)) + return 0; + + return (bufsize - sizeof(struct nvme_zone_report)) / + sizeof(struct nvme_zone_descriptor); +} + +static void nvmet_bdev_zone_zmgmt_recv_work(struct work_struct *w) +{ + struct nvmet_req *req = container_of(w, struct nvmet_req, z.zmgmt_work); + sector_t start_sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba); + unsigned long req_slba_nr_zones = nvmet_req_nr_zones_from_slba(req); + u32 out_bufsize = (le32_to_cpu(req->cmd->zmr.numd) + 1) << 2; + __le64 nr_zones; + u16 status; + int ret; + struct nvmet_report_zone_data rz_data = { + .out_nr_zones = get_nr_zones_from_buf(req, out_bufsize), + /* leave the place for report zone header */ + .out_buf_offset = sizeof(struct nvme_zone_report), + .zrasf = req->cmd->zmr.zrasf, + .nr_zones = 0, + .req = req, + }; + + status = nvmet_bdev_validate_zone_mgmt_recv(req); + if (status) + goto out; + + if (!req_slba_nr_zones) { + status = NVME_SC_SUCCESS; + goto out; + } + + ret = blkdev_report_zones(req->ns->bdev, start_sect, req_slba_nr_zones, + nvmet_bdev_report_zone_cb, &rz_data); + if (ret < 0) { + status = NVME_SC_INTERNAL; + goto out; + } + + /* + * When partial bit is set nr_zones must indicate the number of zone + * descriptors actually transferred. + */ + if (req->cmd->zmr.pr) + rz_data.nr_zones = min(rz_data.nr_zones, rz_data.out_nr_zones); + + nr_zones = cpu_to_le64(rz_data.nr_zones); + status = nvmet_copy_to_sgl(req, 0, &nr_zones, sizeof(nr_zones)); + +out: + nvmet_req_complete(req, status); +} + +void nvmet_bdev_execute_zone_mgmt_recv(struct nvmet_req *req) +{ + INIT_WORK(&req->z.zmgmt_work, nvmet_bdev_zone_zmgmt_recv_work); + queue_work(zbd_wq, &req->z.zmgmt_work); +} + +static inline enum req_opf zsa_req_op(u8 zsa) +{ + switch (zsa) { + case NVME_ZONE_OPEN: + return REQ_OP_ZONE_OPEN; + case NVME_ZONE_CLOSE: + return REQ_OP_ZONE_CLOSE; + case NVME_ZONE_FINISH: + return REQ_OP_ZONE_FINISH; + case NVME_ZONE_RESET: + return REQ_OP_ZONE_RESET; + default: + return REQ_OP_LAST; + } +} + +static u16 blkdev_zone_mgmt_errno_to_nvme_status(int ret) +{ + switch (ret) { + case 0: + return NVME_SC_SUCCESS; + case -EINVAL: + case -EIO: + return NVME_SC_ZONE_INVALID_TRANSITION | NVME_SC_DNR; + default: + return NVME_SC_INTERNAL; + } +} + +struct nvmet_zone_mgmt_send_all_data { + unsigned long *zbitmap; + struct nvmet_req *req; +}; + +static int zmgmt_send_scan_cb(struct blk_zone *z, unsigned i, void *d) +{ + struct nvmet_zone_mgmt_send_all_data *data = d; + + switch (zsa_req_op(data->req->cmd->zms.zsa)) { + case REQ_OP_ZONE_OPEN: + switch (z->cond) { + case BLK_ZONE_COND_CLOSED: + break; + default: + return 0; + } + break; + case REQ_OP_ZONE_CLOSE: + switch (z->cond) { + case BLK_ZONE_COND_IMP_OPEN: + case BLK_ZONE_COND_EXP_OPEN: + break; + default: + return 0; + } + break; + case REQ_OP_ZONE_FINISH: + switch (z->cond) { + case BLK_ZONE_COND_IMP_OPEN: + case BLK_ZONE_COND_EXP_OPEN: + case BLK_ZONE_COND_CLOSED: + break; + default: + return 0; + } + break; + default: + return -EINVAL; + } + + set_bit(i, data->zbitmap); + + return 0; +} + +static u16 nvmet_bdev_zone_mgmt_emulate_all(struct nvmet_req *req) +{ + struct block_device *bdev = req->ns->bdev; + unsigned int nr_zones = blkdev_nr_zones(bdev->bd_disk); + struct request_queue *q = bdev_get_queue(bdev); + struct bio *bio = NULL; + sector_t sector = 0; + int ret; + struct nvmet_zone_mgmt_send_all_data d = { + .req = req, + }; + + d.zbitmap = kcalloc_node(BITS_TO_LONGS(nr_zones), sizeof(*(d.zbitmap)), + GFP_NOIO, q->node); + if (!d.zbitmap) { + ret = -ENOMEM; + goto out; + } + + /* Scan and build bitmap of the eligible zones */ + ret = blkdev_report_zones(bdev, 0, nr_zones, zmgmt_send_scan_cb, &d); + if (ret != nr_zones) { + if (ret > 0) + ret = -EIO; + goto out; + } else { + /* We scanned all the zones */ + ret = 0; + } + + while (sector < get_capacity(bdev->bd_disk)) { + if (test_bit(blk_queue_zone_no(q, sector), d.zbitmap)) { + bio = blk_next_bio(bio, 0, GFP_KERNEL); + bio->bi_opf = zsa_req_op(req->cmd->zms.zsa) | REQ_SYNC; + bio->bi_iter.bi_sector = sector; + bio_set_dev(bio, bdev); + /* This may take a while, so be nice to others */ + cond_resched(); + } + sector += blk_queue_zone_sectors(q); + } + + if (bio) { + ret = submit_bio_wait(bio); + bio_put(bio); + } + +out: + kfree(d.zbitmap); + + return blkdev_zone_mgmt_errno_to_nvme_status(ret); +} + +static u16 nvmet_bdev_execute_zmgmt_send_all(struct nvmet_req *req) +{ + int ret; + + switch (zsa_req_op(req->cmd->zms.zsa)) { + case REQ_OP_ZONE_RESET: + ret = blkdev_zone_mgmt(req->ns->bdev, REQ_OP_ZONE_RESET, 0, + get_capacity(req->ns->bdev->bd_disk), + GFP_KERNEL); + if (ret < 0) + return blkdev_zone_mgmt_errno_to_nvme_status(ret); + break; + case REQ_OP_ZONE_OPEN: + case REQ_OP_ZONE_CLOSE: + case REQ_OP_ZONE_FINISH: + return nvmet_bdev_zone_mgmt_emulate_all(req); + default: + /* this is needed to quiet compiler warning */ + req->error_loc = offsetof(struct nvme_zone_mgmt_send_cmd, zsa); + return NVME_SC_INVALID_FIELD | NVME_SC_DNR; + } + + return NVME_SC_SUCCESS; +} + +static void nvmet_bdev_zmgmt_send_work(struct work_struct *w) +{ + struct nvmet_req *req = container_of(w, struct nvmet_req, z.zmgmt_work); + sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zms.slba); + enum req_opf op = zsa_req_op(req->cmd->zms.zsa); + struct block_device *bdev = req->ns->bdev; + sector_t zone_sectors = bdev_zone_sectors(bdev); + u16 status = NVME_SC_SUCCESS; + int ret; + + if (op == REQ_OP_LAST) { + req->error_loc = offsetof(struct nvme_zone_mgmt_send_cmd, zsa); + status = NVME_SC_ZONE_INVALID_TRANSITION | NVME_SC_DNR; + goto out; + } + + /* when select all bit is set slba field is ignored */ + if (req->cmd->zms.select_all) { + status = nvmet_bdev_execute_zmgmt_send_all(req); + goto out; + } + + if (sect >= get_capacity(bdev->bd_disk)) { + req->error_loc = offsetof(struct nvme_zone_mgmt_send_cmd, slba); + status = NVME_SC_LBA_RANGE | NVME_SC_DNR; + goto out; + } + + if (sect & (zone_sectors - 1)) { + req->error_loc = offsetof(struct nvme_zone_mgmt_send_cmd, slba); + status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; + goto out; + } + + ret = blkdev_zone_mgmt(bdev, op, sect, zone_sectors, GFP_KERNEL); + if (ret < 0) + status = blkdev_zone_mgmt_errno_to_nvme_status(ret); + +out: + nvmet_req_complete(req, status); +} + +void nvmet_bdev_execute_zone_mgmt_send(struct nvmet_req *req) +{ + INIT_WORK(&req->z.zmgmt_work, nvmet_bdev_zmgmt_send_work); + queue_work(zbd_wq, &req->z.zmgmt_work); +} + +static void nvmet_bdev_zone_append_bio_done(struct bio *bio) +{ + struct nvmet_req *req = bio->bi_private; + + if (bio->bi_status == BLK_STS_OK) { + req->cqe->result.u64 = + nvmet_sect_to_lba(req->ns, bio->bi_iter.bi_sector); + } + + nvmet_req_complete(req, blk_to_nvme_status(req, bio->bi_status)); + nvmet_req_bio_put(req, bio); +} + +void nvmet_bdev_execute_zone_append(struct nvmet_req *req) +{ + sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->rw.slba); + u16 status = NVME_SC_SUCCESS; + unsigned int total_len = 0; + struct scatterlist *sg; + struct bio *bio; + int sg_cnt; + + /* Request is completed on len mismatch in nvmet_check_transter_len() */ + if (!nvmet_check_transfer_len(req, nvmet_rw_data_len(req))) + return; + + if (!req->sg_cnt) { + nvmet_req_complete(req, 0); + return; + } + + if (sect >= get_capacity(req->ns->bdev->bd_disk)) { + req->error_loc = offsetof(struct nvme_rw_command, slba); + status = NVME_SC_LBA_RANGE | NVME_SC_DNR; + goto out; + } + + if (sect & (bdev_zone_sectors(req->ns->bdev) - 1)) { + req->error_loc = offsetof(struct nvme_rw_command, slba); + status = NVME_SC_INVALID_FIELD | NVME_SC_DNR; + goto out; + } + + if (nvmet_use_inline_bvec(req)) { + bio = &req->z.inline_bio; + bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec)); + } else { + bio = bio_alloc(GFP_KERNEL, req->sg_cnt); + } + + bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE; + bio->bi_end_io = nvmet_bdev_zone_append_bio_done; + bio_set_dev(bio, req->ns->bdev); + bio->bi_iter.bi_sector = sect; + bio->bi_private = req; + if (req->cmd->rw.control & cpu_to_le16(NVME_RW_FUA)) + bio->bi_opf |= REQ_FUA; + + for_each_sg(req->sg, sg, req->sg_cnt, sg_cnt) { + struct page *p = sg_page(sg); + unsigned int l = sg->length; + unsigned int o = sg->offset; + unsigned int ret; + + ret = bio_add_zone_append_page(bio, p, l, o); + if (ret != sg->length) { + status = NVME_SC_INTERNAL; + goto out_put_bio; + } + total_len += sg->length; + } + + if (total_len != nvmet_rw_data_len(req)) { + status = NVME_SC_INTERNAL | NVME_SC_DNR; + goto out_put_bio; + } + + submit_bio(bio); + return; + +out_put_bio: + nvmet_req_bio_put(req, bio); +out: + nvmet_req_complete(req, status); +} + +u16 nvmet_bdev_zns_parse_io_cmd(struct nvmet_req *req) +{ + struct nvme_command *cmd = req->cmd; + + switch (cmd->common.opcode) { + case nvme_cmd_zone_append: + req->execute = nvmet_bdev_execute_zone_append; + return 0; + case nvme_cmd_zone_mgmt_recv: + req->execute = nvmet_bdev_execute_zone_mgmt_recv; + return 0; + case nvme_cmd_zone_mgmt_send: + req->execute = nvmet_bdev_execute_zone_mgmt_send; + return 0; + default: + return nvmet_bdev_parse_io_cmd(req); + } +} diff --git a/include/linux/nvme.h b/include/linux/nvme.h index c7ba83144d52..cb1197f1cfed 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -944,6 +944,13 @@ struct nvme_zone_mgmt_recv_cmd { enum { NVME_ZRA_ZONE_REPORT = 0, NVME_ZRASF_ZONE_REPORT_ALL = 0, + NVME_ZRASF_ZONE_STATE_EMPTY = 0x01, + NVME_ZRASF_ZONE_STATE_IMP_OPEN = 0x02, + NVME_ZRASF_ZONE_STATE_EXP_OPEN = 0x03, + NVME_ZRASF_ZONE_STATE_CLOSED = 0x04, + NVME_ZRASF_ZONE_STATE_READONLY = 0x05, + NVME_ZRASF_ZONE_STATE_FULL = 0x06, + NVME_ZRASF_ZONE_STATE_OFFLINE = 0x07, NVME_REPORT_ZONE_PARTIAL = 1, }; From 8abd7e2a753ad5ae59c3ca918e71f437c0c4b344 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 16 Jun 2021 15:15:51 -0700 Subject: [PATCH 73/78] nvmet: remove zeroout memset call for struct Declare and initialize structure variables to zero values so that we can remove zeroout memset calls in the target/rdma.c. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/rdma.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c index 7d607f435e36..891174ccd44b 100644 --- a/drivers/nvme/target/rdma.c +++ b/drivers/nvme/target/rdma.c @@ -1257,7 +1257,7 @@ out_err: static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue) { - struct ib_qp_init_attr qp_attr; + struct ib_qp_init_attr qp_attr = { }; struct nvmet_rdma_device *ndev = queue->dev; int nr_cqe, ret, i, factor; @@ -1275,7 +1275,6 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue) goto out; } - memset(&qp_attr, 0, sizeof(qp_attr)); qp_attr.qp_context = queue; qp_attr.event_handler = nvmet_rdma_qp_event; qp_attr.send_cq = queue->cq; From f66e2804d61aef690bb428d8de6a127f844bb240 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 16 Jun 2021 15:15:53 -0700 Subject: [PATCH 74/78] nvme-pci: remove zeroout memset call for struct Declare and initialize structure variables to zero values so that we can remove zeroout memset calls in the host/pci.c. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/pci.c | 24 ++++++++---------------- 1 file changed, 8 insertions(+), 16 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 12ffd58c27b1..d3c5086673bc 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -307,13 +307,12 @@ static void nvme_dbbuf_free(struct nvme_queue *nvmeq) static void nvme_dbbuf_set(struct nvme_dev *dev) { - struct nvme_command c; + struct nvme_command c = { }; unsigned int i; if (!dev->dbbuf_dbs) return; - memset(&c, 0, sizeof(c)); c.dbbuf.opcode = nvme_admin_dbbuf; c.dbbuf.prp1 = cpu_to_le64(dev->dbbuf_dbs_dma_addr); c.dbbuf.prp2 = cpu_to_le64(dev->dbbuf_eis_dma_addr); @@ -1112,9 +1111,8 @@ static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl) { struct nvme_dev *dev = to_nvme_dev(ctrl); struct nvme_queue *nvmeq = &dev->queues[0]; - struct nvme_command c; + struct nvme_command c = { }; - memset(&c, 0, sizeof(c)); c.common.opcode = nvme_admin_async_event; c.common.command_id = NVME_AQ_BLK_MQ_DEPTH; nvme_submit_cmd(nvmeq, &c, true); @@ -1122,9 +1120,8 @@ static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl) static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id) { - struct nvme_command c; + struct nvme_command c = { }; - memset(&c, 0, sizeof(c)); c.delete_queue.opcode = opcode; c.delete_queue.qid = cpu_to_le16(id); @@ -1134,7 +1131,7 @@ static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id) static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid, struct nvme_queue *nvmeq, s16 vector) { - struct nvme_command c; + struct nvme_command c = { }; int flags = NVME_QUEUE_PHYS_CONTIG; if (!test_bit(NVMEQ_POLLED, &nvmeq->flags)) @@ -1144,7 +1141,6 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid, * Note: we (ab)use the fact that the prp fields survive if no data * is attached to the request. */ - memset(&c, 0, sizeof(c)); c.create_cq.opcode = nvme_admin_create_cq; c.create_cq.prp1 = cpu_to_le64(nvmeq->cq_dma_addr); c.create_cq.cqid = cpu_to_le16(qid); @@ -1159,7 +1155,7 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid, struct nvme_queue *nvmeq) { struct nvme_ctrl *ctrl = &dev->ctrl; - struct nvme_command c; + struct nvme_command c = { }; int flags = NVME_QUEUE_PHYS_CONTIG; /* @@ -1174,7 +1170,6 @@ static int adapter_alloc_sq(struct nvme_dev *dev, u16 qid, * Note: we (ab)use the fact that the prp fields survive if no data * is attached to the request. */ - memset(&c, 0, sizeof(c)); c.create_sq.opcode = nvme_admin_create_sq; c.create_sq.prp1 = cpu_to_le64(nvmeq->sq_dma_addr); c.create_sq.sqid = cpu_to_le16(qid); @@ -1255,7 +1250,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) struct nvme_queue *nvmeq = iod->nvmeq; struct nvme_dev *dev = nvmeq->dev; struct request *abort_req; - struct nvme_command cmd; + struct nvme_command cmd = { }; u32 csts = readl(dev->bar + NVME_REG_CSTS); /* If PCI error recovery process is happening, we cannot reset or @@ -1335,7 +1330,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) } iod->aborted = 1; - memset(&cmd, 0, sizeof(cmd)); cmd.abort.opcode = nvme_admin_abort_cmd; cmd.abort.cid = req->tag; cmd.abort.sqid = cpu_to_le16(nvmeq->qid); @@ -1886,10 +1880,9 @@ static int nvme_set_host_mem(struct nvme_dev *dev, u32 bits) { u32 host_mem_size = dev->host_mem_size >> NVME_CTRL_PAGE_SHIFT; u64 dma_addr = dev->host_mem_descs_dma; - struct nvme_command c; + struct nvme_command c = { }; int ret; - memset(&c, 0, sizeof(c)); c.features.opcode = nvme_admin_set_features; c.features.fid = cpu_to_le32(NVME_FEAT_HOST_MEM_BUF); c.features.dword11 = cpu_to_le32(bits); @@ -2263,9 +2256,8 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode) { struct request_queue *q = nvmeq->dev->ctrl.admin_q; struct request *req; - struct nvme_command cmd; + struct nvme_command cmd = { }; - memset(&cmd, 0, sizeof(cmd)); cmd.delete_queue.opcode = opcode; cmd.delete_queue.qid = cpu_to_le16(nvmeq->qid); From cc72c4426764d1716839e9ec591ee8e161ed5cbc Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Wed, 16 Jun 2021 15:15:52 -0700 Subject: [PATCH 75/78] nvme: remove zeroout memset call for struct Declare and initialize structure variables to zero values so that we can remove zeroout memset calls in the host/core.c. Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/host/core.c | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 177cae44b612..c7ef0b6684b5 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -721,9 +721,7 @@ EXPORT_SYMBOL_GPL(__nvme_check_ready); static int nvme_toggle_streams(struct nvme_ctrl *ctrl, bool enable) { - struct nvme_command c; - - memset(&c, 0, sizeof(c)); + struct nvme_command c = { }; c.directive.opcode = nvme_admin_directive_send; c.directive.nsid = cpu_to_le32(NVME_NSID_ALL); @@ -748,9 +746,8 @@ static int nvme_enable_streams(struct nvme_ctrl *ctrl) static int nvme_get_stream_params(struct nvme_ctrl *ctrl, struct streams_directive_params *s, u32 nsid) { - struct nvme_command c; + struct nvme_command c = { }; - memset(&c, 0, sizeof(c)); memset(s, 0, sizeof(*s)); c.directive.opcode = nvme_admin_directive_recv; @@ -1460,10 +1457,9 @@ static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid, unsigned int dword11, void *buffer, size_t buflen, u32 *result) { union nvme_result res = { 0 }; - struct nvme_command c; + struct nvme_command c = { }; int ret; - memset(&c, 0, sizeof(c)); c.features.opcode = op; c.features.fid = cpu_to_le32(fid); c.features.dword11 = cpu_to_le32(dword11); @@ -1591,9 +1587,8 @@ int nvme_getgeo(struct block_device *bdev, struct hd_geometry *geo) static void nvme_init_integrity(struct gendisk *disk, u16 ms, u8 pi_type, u32 max_integrity_segments) { - struct blk_integrity integrity; + struct blk_integrity integrity = { }; - memset(&integrity, 0, sizeof(integrity)); switch (pi_type) { case NVME_NS_DPS_PI_TYPE3: integrity.profile = &t10_pi_type3_crc; @@ -1964,13 +1959,12 @@ static int nvme_send_ns_pr_command(struct nvme_ns *ns, struct nvme_command *c, static int nvme_pr_command(struct block_device *bdev, u32 cdw10, u64 key, u64 sa_key, u8 op) { - struct nvme_command c; + struct nvme_command c = { }; u8 data[16] = { 0, }; put_unaligned_le64(key, &data[0]); put_unaligned_le64(sa_key, &data[8]); - memset(&c, 0, sizeof(c)); c.common.opcode = op; c.common.cdw10 = cpu_to_le32(cdw10); @@ -2042,9 +2036,8 @@ int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len, bool send) { struct nvme_ctrl *ctrl = data; - struct nvme_command cmd; + struct nvme_command cmd = { }; - memset(&cmd, 0, sizeof(cmd)); if (send) cmd.common.opcode = nvme_admin_security_send; else From 8cf486e131b351db4f224078bef8e1efedcf0340 Mon Sep 17 00:00:00 2001 From: Wesley Sheng Date: Wed, 16 Jun 2021 13:25:08 +0800 Subject: [PATCH 76/78] nvme.h: add missing nvme_lba_range_type endianness annotations Signed-off-by: Wesley Sheng Signed-off-by: Christoph Hellwig --- include/linux/nvme.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/nvme.h b/include/linux/nvme.h index cb1197f1cfed..b7c4c4130b65 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -636,8 +636,8 @@ struct nvme_lba_range_type { __u8 type; __u8 attributes; __u8 rsvd2[14]; - __u64 slba; - __u64 nlb; + __le64 slba; + __le64 nlb; __u8 guid[16]; __u8 rsvd48[16]; }; From 2b9ac22b12a266eb4fec246a07b504dd4983b16b Mon Sep 17 00:00:00 2001 From: Kristian Klausen Date: Fri, 18 Jun 2021 13:51:57 +0200 Subject: [PATCH 77/78] loop: Fix missing discard support when using LOOP_CONFIGURE Without calling loop_config_discard() the discard flag and parameters aren't set/updated for the loop device and worst-case they could indicate discard support when it isn't the case (ex: if the LOOP_SET_STATUS ioctl was used with a different file prior to LOOP_CONFIGURE). Cc: # 5.8.x- Fixes: 3448914e8cc5 ("loop: Add LOOP_CONFIGURE ioctl") Signed-off-by: Kristian Klausen Link: https://lore.kernel.org/r/20210618115157.31452-1-kristian@klausen.dk Signed-off-by: Jens Axboe --- drivers/block/loop.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index d58d68f3c7cd..fda071fae541 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -1154,6 +1154,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode, blk_queue_physical_block_size(lo->lo_queue, bsize); blk_queue_io_min(lo->lo_queue, bsize); + loop_config_discard(lo); loop_update_rotational(lo); loop_update_dio(lo); loop_sysfs_init(lo); From 3c3ee16532c1be92350a2a88bd19283b7bdf32e9 Mon Sep 17 00:00:00 2001 From: Chaitanya Kulkarni Date: Sun, 20 Jun 2021 20:01:09 -0700 Subject: [PATCH 78/78] nvmet: use NVMET_MAX_NAMESPACES to set nn value For Spec regarding MNAN value:- If the controller supports Asymmetric Namespace Access Reporting, then this field shall be set to a non-zero value that is less than or equal to the NN value. Instead of using subsys->max_nsid that gets calculated dynamically, use NVMET_MAX_NAMESPACES value to report NN. This way we will maintain the MNAN value spec compliant with NN. Without this patch, code results in the following error :- [337976.409142] nvme nvme1: Invalid MNAN value 1024 Signed-off-by: Chaitanya Kulkarni Signed-off-by: Christoph Hellwig --- drivers/nvme/target/admin-cmd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 363e357d2f20..0cb98f2bbc8c 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -422,7 +422,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req) /* no enforcement soft-limit for maxcmd - pick arbitrary high value */ id->maxcmd = cpu_to_le16(NVMET_MAX_CMD); - id->nn = cpu_to_le32(ctrl->subsys->max_nsid); + id->nn = cpu_to_le32(NVMET_MAX_NAMESPACES); id->mnan = cpu_to_le32(NVMET_MAX_NAMESPACES); id->oncs = cpu_to_le16(NVME_CTRL_ONCS_DSM | NVME_CTRL_ONCS_WRITE_ZEROES);