linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-18 00:53:40 +00:00

Author	SHA1	Message	Date
Jens Axboe	8ad9f57755	nvme updates for Linux 5.19 - set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni) - add support for TP4084 - Time-to-Ready Enhancements (me) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmKGnjgLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYPxSA//XaYH0kncX2znto5S6772TJ64L2Rtfbh7LPQzE40V /Au3OvEESyP/oGh7RQxTMdKnU19Gzc+lYzTZ8VP3h/cORVEtHb5RSzb2UEUWyRbm 3+WEurG6d91PlH0HyfUsRe1y441pwr6WVDb3Zh/O1/PhsM0RihjnLEjdjobQMQJL BKFPpEfOxBPFDaaHEAVM8SB0wkhAD17C8C+YPow8IF4F1iBqmjpgtjixtHBmAovu 0Uq99ySyypyjPsEYPrUHesNFmOP4FJbosBvLVJuHDrOVdl900PmWhbCjsOcfzVRn g7xpYxreStHFUsneA8GkfUM6gEVMhOpTjdeborv9P64O9oziH35ZENHorHf+QCAu Es0HGJ3QziWqchkWxFxzLJWB4Ab5j0hhBNocQB4aM6lb8sV82+Fcyxf7NOMIkJIo aFlHXRdm6d3Ox/hBJiERFmb52Op3KAvrTYyhpPFGhQRdptAf3p1IQCfUykl5mIXz KW5BhonVtY/jS/lqbe5bw08becIqlcaX9+4QbxHGHaQO7tBRewBPMmbWfIs4Nh9i OytRRqiF/jigL66C6+wZR8bvnyk1kKvJvfUJDBv2s0THPqfqG55uYkDfZsX+B3nV n8KF4xRoLwbqIhkDRdnrhYbFaRwsK7S6hEa8nHLsUR+Ue1lc9rzOJmnbSHgXALWg VeA= =6fdL -----END PGP SIGNATURE----- Merge tag 'nvme-5.19-2022-05-19' of git://git.infradead.org/nvme into for-5.19/drivers Pull NVMe updates from Christoph: "nvme updates for Linux 5.19 - set non-mdts limits in nvme_scan_work (Chaitanya Kulkarni) - add support for TP4084 - Time-to-Ready Enhancements (me)" * tag 'nvme-5.19-2022-05-19' of git://git.infradead.org/nvme: nvme: set non-mdts limits in nvme_scan_work nvme: add support for TP4084 - Time-to-Ready Enhancements	2022-05-20 06:56:32 -06:00
Chaitanya Kulkarni	78288665b5	nvme: set non-mdts limits in nvme_scan_work In current implementation we set the non-mdts limits by calling nvme_init_non_mdts_limits() from nvme_init_ctrl_finish(). This also tries to set the limits for the discovery controller which has no I/O queues resulting in the warning message reported by the nvme_log_error() when running blktest nvme/002: - [ 2005.155946] run blktests nvme/002 at 2022-04-09 16:57:47 [ 2005.192223] loop: module loaded [ 2005.196429] nvmet: adding nsid 1 to subsystem blktests-subsystem-0 [ 2005.200334] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 <------------------------------SNIP----------------------------------> [ 2008.958108] nvmet: adding nsid 1 to subsystem blktests-subsystem-997 [ 2008.962082] nvmet: adding nsid 1 to subsystem blktests-subsystem-998 [ 2008.966102] nvmet: adding nsid 1 to subsystem blktests-subsystem-999 [ 2008.973132] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN testhostnqn. [ 2008.973196] nvme1: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE DNR [ 2008.974595] nvme nvme1: new ctrl: "nqn.2014-08.org.nvmexpress.discovery" [ 2009.103248] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery" Move the call of nvme_init_non_mdts_limits() to nvme_scan_work() after we verify that I/O queues are created since that is a converging point for each transport where these limits are actually used. 1. FC : nvme_fc_create_association() ... nvme_fc_create_io_queues(ctrl); ... nvme_start_ctrl() nvme_scan_queue() nvme_scan_work() 2. PCIe:- nvme_reset_work() ... nvme_setup_io_queues() nvme_create_io_queues() nvme_alloc_queue() ... nvme_start_ctrl() nvme_scan_queue() nvme_scan_work() 3. RDMA :- nvme_rdma_setup_ctrl ... nvme_rdma_configure_io_queues ... nvme_start_ctrl() nvme_scan_queue() nvme_scan_work() 4. TCP :- nvme_tcp_setup_ctrl ... nvme_tcp_configure_io_queues ... nvme_start_ctrl() nvme_scan_queue() nvme_scan_work() * nvme_scan_work() ... nvme_validate_or_alloc_ns() nvme_alloc_ns() nvme_update_ns_info() nvme_update_disk_info() nvme_config_discard() <--- blk_queue_max_write_zeroes_sectors() <--- Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-19 21:30:22 +02:00
Christoph Hellwig	354201c53e	nvme: add support for TP4084 - Time-to-Ready Enhancements Add support for using longer timeouts during controller initialization and letting the controller come up with namespaces that are not ready for I/O yet. We skip these not ready namespaces during scanning and only bring them online once anoter scan is kicked off by the AEN that is set when the NRDY bit gets set in the I/O Command Set Independent Identify Namespace Data Structure. This asynchronous probing avoids blocking the kernel boot when controllers take a very long time to recover after unclean shutdowns (up to minutes). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de>	2022-05-18 18:54:17 +02:00
Jens Axboe	da14f237ce	nvme updates for Linux 5.19 - tighten the PCI presence check (Stefan Roese): - fix a potential NULL pointer dereference in an error path (Kyle Miller Smith) - fix interpretation of the DMRSL field (Tom Yan) - relax the data transfer alignment (Keith Busch) - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni, me) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmKEle4LHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYMUfw/8CyIcKdN5uUtMAekDhFEnskivxthtWj1PcRXSyA3M fkfc+Jc5lOALTZ2d4zVHoDEB+8IoojfSwNuCtftCVPTv0dEnPlZ8pQ8qKe9Mj245 I/TswqlV2euiuwkP8RorLcUjPJWRGAOinCgyfgdCTBjjQIO40OsgPFk/COYpVBvJ aWa5PXPk7JT1b+6VR27tr2lmGnqppS8Uz03fMouqvcSCGPjN0IKEGuYrpvjvJVs0 YOdIEPc/GnMUmlg3W6H1YkC8bCiCS2/YtPqXHBLlmQqDswgqgqWT8ycrV1PK3Z7K RWwumuXiX0bgSiv6DWz3j8lVgOp79B59nAFyvnRFAWaziIE1ByCp/iVQQrZsxJcs MjeJrUmtu+jlKN80ZBZJHDBCnT0XePrF2u7YL9j4jJC6ORlQsjKOI3BYt9dIWiYQ PAj58bmsCu7I0w89aqKd3YB3hBRLRT70v0XIK/nbdbAQTFYQ73yi42Kds7oHw2xO Vg7115yh1K3fiawOI7l2DyAQ4yQ0n42NOt2HkMfmRRQyWMeLLuv3GK/WAs5qpEuB dRijsEvLFUU4uPRSR6W6upgA8Oqgx+71wHVRpYxg7k3TDn+HWOQ3YRiRtZJ0xR7W HCtHTrDTn5AuMBrKMtTpG0VqTo5CIOS8jFAz3yvbL/6EBBWmAAj6+Jw7QD12dtLc t50= =w9PF -----END PGP SIGNATURE----- Merge tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme into for-5.19/drivers Pull NVMe updates from Christoph: "nvme updates for Linux 5.19 - tighten the PCI presence check (Stefan Roese): - fix a potential NULL pointer dereference in an error path (Kyle Miller Smith) - fix interpretation of the DMRSL field (Tom Yan) - relax the data transfer alignment (Keith Busch) - verbose error logging improvements (Max Gurtovoy, Chaitanya Kulkarni) - misc cleanups (Chaitanya Kulkarni, me)" * tag 'nvme-5.19-2022-05-18' of git://git.infradead.org/nvme: nvme: split the enum used for various register constants nvme-fabrics: add a request timeout helper nvme-pci: harden drive presence detect in nvme_dev_disable() nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags nvme: mark internal passthru request RQF_QUIET nvme: remove unneeded include from constants file nvme: add missing status values to verbose logging nvme: set dma alignment to dword nvme: fix interpretation of DMRSL	2022-05-18 06:28:04 -06:00
Christoph Hellwig	e626f37e65	nvme: split the enum used for various register constants Instead of having one big enum add one for each register or field. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>	2022-05-17 07:33:27 +02:00
Xie Yongji	491bf8f236	nbd: Fix hung on disconnect request if socket is closed before When userspace closes the socket before sending a disconnect request, the following I/O requests will be blocked in wait_for_reconnect() until dead timeout. This will cause the following disconnect request also hung on blk_mq_quiesce_queue(). That means we have no way to disconnect a nbd device if there are some I/O requests waiting for reconnecting until dead timeout. It's not expected. So let's wake up the thread waiting for reconnecting directly when a disconnect request is sent. Reported-by: Xu Jianhai <zero.xu@bytedance.com> Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20220322080639.142-1-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-16 06:19:35 -06:00
Chaitanya Kulkarni	93ba75c905	nvme-fabrics: add a request timeout helper The RDAMA and TCP transport both complete the timed out request in the same manner and hence code is duplicated. Add and use the helper nvmf_complete_timed_out_request() to remove the duplicate code. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-16 08:07:25 +02:00
Stefan Roese	b98235d3a4	nvme-pci: harden drive presence detect in nvme_dev_disable() On our ZynqMP system we observe, that a NVMe drive that resets itself while doing a firmware update causes a Kernel crash like this: [ 67.720772] pcieport 0000:02:02.0: pciehp: Slot(2): Link Down [ 67.720783] pcieport 0000:02:02.0: pciehp: Slot(2): Card not present [ 67.720795] nvme 0000:04:00.0: PME# disabled [ 67.720849] Internal error: synchronous external abort: 96000010 [#1] PREEMPT SMP [ 67.720853] nwl-pcie fd0e0000.pcie: Slave error Analysis: When nvme_dev_disable() is called because of this PCIe hotplug event, pci_is_enabled() is still true. And accessing the NVMe drive which is currently not available as it's in reboot process causes this "synchronous external abort" on this ARM64 platform. This patch adds the pci_device_is_present() check as well, which returns false in this "Card not present" hot-plug case. With this change, the NVMe driver does not try to access the NVMe registers any more and the FW update finishes without any problems. Signed-off-by: Stefan Roese <sr@denx.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-16 08:07:19 +02:00
Smith, Kyle Miller (Nimble Kernel)	da42761181	nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags In nvme_alloc_admin_tags, the admin_q can be set to an error (typically -ENOMEM) if the blk_mq_init_queue call fails to set up the queue, which is checked immediately after the call. However, when we return the error message up the stack, to nvme_reset_work the error takes us to nvme_remove_dead_ctrl() nvme_dev_disable() nvme_suspend_queue(&dev->queues[0]). Here, we only check that the admin_q is non-NULL, rather than not an error or NULL, and begin quiescing a queue that never existed, leading to bad / NULL pointer dereference. Signed-off-by: Kyle Smith <kyles@hpe.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-16 08:06:59 +02:00
Chaitanya Kulkarni	128126a794	nvme: mark internal passthru request RQF_QUIET Most of the internal passthru commands use __nvme_submit_sync_cmd() interface. There are few places we open code the request submission :- 1. nvme_keep_alive_work(struct work_struct work) 2. nvme_timeout(struct request req, bool reserved) 3. nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode) Mark the internal passthru request quiet so that we can skip the verbose error message from nvme_log_error() in nvme_end_req() completion path, this will be consistent with what we have in __nvme_submit_sync_cmd(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-16 08:06:59 +02:00
Max Gurtovoy	da3340e77e	nvme: remove unneeded include from constants file No usage of blkdev.h elements. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-16 08:06:59 +02:00
Max Gurtovoy	ca2d89925a	nvme: add missing status values to verbose logging Log a few more path related status codes. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-16 08:06:58 +02:00
Keith Busch	52fde2c07d	nvme: set dma alignment to dword The nvme specification only requires qword alignment for segment descriptors, and the driver already guarantees that. The spec has always allowed user data to be dword aligned, which is what the queue's attribute is for, so relax the alignment requirement to that value. While we could allow byte alignment for some controllers when using SGLs, we still need to support PRP, and that only allows dword. Fixes: `3b2a1ebceb` ("nvme: set dma alignment to qword") Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-16 08:06:58 +02:00
Tom Yan	1a86924e4f	nvme: fix interpretation of DMRSL DMRSLl is in the unit of logical blocks, while max_discard_sectors is in the unit of "linux sector". Signed-off-by: Tom Yan <tom.ty89@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2022-05-16 08:06:54 +02:00
Christoph Hellwig	c23d47abee	loop: remove most the top-of-file boilerplate comment from the UAPI header Just leave the SPDX marker and the copyright notice and remove the irrelevant rest. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220419063303.583106-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-10 06:30:05 -06:00
Christoph Hellwig	eb04bb154b	loop: remove most the top-of-file boilerplate comment Remove the irrelevant changelogs and todo notes and just leave the SPDX marker and the copyright notice. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220419063303.583106-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-10 06:30:05 -06:00
Christoph Hellwig	f21e6e185a	loop: add a SPDX header The copyright statement says: "Redistribution of this file is permitted under the GNU General Public License." and was added by Ted in 1993, at which point GPLv2 only was the default Linux license. Replace it with the usual GPLv2 only SPDX header. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220419063303.583106-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-10 06:30:05 -06:00
Christoph Hellwig	754d96798f	loop: remove loop.h Merge loop.h into loop.c as all the content is only used there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220419063303.583106-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-10 06:30:05 -06:00
Damien Le Moal	49c3b9266a	block: null_blk: Improve device creation with configfs Currently, the directory name used to create a nullb device through sysfs is not used as the device name, potentially causing headaches for users if devices are already created through the modprobe operation withe the nr_device module parameter not set to 0. E.g. a user can do "mkdir /sys/kernel/config/nullb/nullb0" to create a nullb device even though /dev/nullb0 was already created by modprobe. In this case, the configfs nullb device will be named nullb1, causing confusion for the user. Simplify this by using the configfs directory name as the nullb device name, always, unless another nullb device is already using the same name. E.g. if modprobe created nullb0, then: $ mkdir /sys/kernel/config/nullb/nullb0 mkdir: cannot create directory '/sys/kernel/config/nullb/nullb0': File exists will be reported to the user. To implement this, the function null_find_dev_by_name() is added to check for the existence of a nullb device with the name used for a new configfs device directory. nullb_group_make_item() uses this new function to check if the directory name can be used as the disk name. Finally, null_add_dev() is modified to use the device config item name as the disk name for a new nullb device created using configfs. The naming of devices created though modprobe remains unchanged. Of note is that it is possible for a user to create through configfs a nullb device with the same name as an existing device. E.g. $ mkdir /sys/kernel/config/nullb/null will successfully create the nullb device named "null" but this block device will however not appear under /dev/ since /dev/null already exists. Suggested-by: Joseph Bacik <josef@toxicpanda.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220420005718.3780004-5-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-04 05:24:58 -06:00
Damien Le Moal	db060f54e0	block: null_blk: Cleanup messages Use the pr_fmt() macro to prefix all null_blk pr_xxx() messages with "null_blk:" to clarify which module is printing the messages. Also add a pr_info() message in null_add_dev() to print the name of a newly created disk. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220420005718.3780004-4-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-04 05:24:58 -06:00
Damien Le Moal	b3a0a73e8a	block: null_blk: Cleanup device creation and deletion Introduce the null_create_dev() and null_destroy_dev() helper functions to respectivel create nullb devices on modprobe and destroy them on rmmod. The null_destroy_dev() helper avoids duplicated code in the null_init() and null_exit() functions for deleting devices. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220420005718.3780004-3-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-04 05:24:58 -06:00
Damien Le Moal	525323d25e	block: null_blk: Fix code style issues Fix message grammar and code style issues (brackets and indentation) in null_init(). Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220420005718.3780004-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-04 05:24:58 -06:00
Christoph Hellwig	0000f2f720	xen-blkback: use bdev_discard_alignment Use bdev_discard_alignment to calculate the correct discard alignment offset even for partitions instead of just looking at the queue limit. Also switch to use bdev_discard_granularity to get rid of the last direct queue reference in xen_blkbk_discard. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-12-hch@lst.de [axboe: fold in 'q' removal as it's now unused] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-04 05:24:40 -06:00
Christoph Hellwig	18292faa89	rnbd-srv: use bdev_discard_alignment Use bdev_discard_alignment to calculate the correct discard alignment offset even for partitions instead of just looking at the queue limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	4e7f0ece41	nvme: remove a spurious clear of discard_alignment The nvme driver never sets a discard_alignment, so it also doens't need to clear it to zero. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	4418bfd8fb	loop: remove a spurious clear of discard_alignment The loop driver never sets a discard_alignment, so it also doens't need to clear it to zero. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	c3f7652996	dasd: don't set the discard_alignment queue limit The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to PAGE_SIZE while the discard granularity is the block size that is smaller or the same as PAGE_SIZE as done by dasd is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	3d50d368c9	raid5: don't set the discard_alignment queue limit The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by raid5 is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	44d583702f	dm-zoned: don't set the discard_alignment queue limit The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by dm-zoned is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	62952cc5bc	virtio_blk: fix the discard_granularity and discard_alignment queue limits The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. On the other hand the discard_sector_alignment from the virtio 1.1 looks similar to what Linux uses as discard granularity (even if not very well described): "discard_sector_alignment can be used by OS when splitting a request based on alignment. " And at least qemu does set it to the discard granularity. So stop setting the discard_alignment and use the virtio discard_sector_alignment to set the discard granularity. Fixes: `1f23816b8e` ("virtio_blk: add discard and write zeroes support") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	fb749a87f4	null_blk: don't set the discard_alignment queue limit The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by null_blk is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	4a04d517c5	nbd: don't set the discard_alignment queue limit The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by nbd is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Christoph Hellwig	07c6e92a84	ubd: don't set the discard_alignment queue limit The discard_alignment queue limit is named a bit misleading means the offset into the block device at which the discard granularity starts. Setting it to the discard granularity as done by ubd is mostly harmless but also useless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20220418045314.360785-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-03 10:38:50 -06:00
Tetsuo Handa	0b8d7622ab	aoe: Avoid flush_scheduled_work() usage Flushing system-wide workqueues is dangerous and will be forbidden. Replace system_wq with local aoe_wq. Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Link: https://lore.kernel.org/r/abb37616-eec9-2794-e21e-7c623085d987@I-love.SAKURA.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-05-01 06:41:41 -06:00
Jens Axboe	f01e49fb17	Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.19/drivers Pull MD updates from Song: "1. Improve annotation in raid5 code, by Logan Gunthorpe. 2. Support MD_BROKEN flag in raid-1/5/10, by Mariusz Tkaczyk. 3. Other small fixes/cleanups." * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: Replace role magic numbers with defined constants md/raid0: Ignore RAID0 layout if the second zone has only one device md/raid5: Annotate functions that hold device_lock with __must_hold md/raid5-ppl: Annotate with rcu_dereference_protected() md/raid5: Annotate rdev/replacement access when mddev_lock is held md/raid5: Annotate rdev/replacement accesses when nr_pending is elevated md/raid5: Add __rcu annotation to struct disk_info md/raid5: Un-nest struct raid5_percpu definition md/raid5: Cleanup setup_conf() error returns md: replace deprecated strlcpy & remove duplicated line md/bitmap: don't set sb values if can't pass sanity check md: fix an incorrect NULL check in md_reload_sb md: fix an incorrect NULL check in does_sb_need_changing raid5: introduce MD_BROKEN md: Set MD_BROKEN for RAID1 and RAID10	2022-04-27 19:36:18 -06:00
Yu Kuai	8ba816b23a	null-blk: save memory footprint for struct nullb_cmd Total 16 bytes can be saved in two ways: 1) The field 'bio' will only be used in bio based mode, and the field 'rq' will only be used in mq mode. Since they won't be used in the same time, declare a union for them. 2) The field 'bool fake_timeout' can be placed in the hole after the field 'error'. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20220426022133.3999006-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-04-25 20:31:20 -06:00
David Sloan	9151ad5d86	md: Replace role magic numbers with defined constants There are several instances where magic numbers are used in md.c instead of the defined constants in md_p.h. This patch set improves code readability by replacing all occurrences of 0xffff, 0xfffe, and 0xfffd when relating to md roles with their equivalent defined constant. Signed-off-by: David Sloan <david.sloan@eideticom.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:22:31 -07:00
Pascal Hambourg	ea23994edc	md/raid0: Ignore RAID0 layout if the second zone has only one device The RAID0 layout is irrelevant if all members have the same size so the array has only one zone. It is also irrelevant if the array has two zones and the second zone has only one device, for example if the array has two members of different sizes. So in that case it makes sense to allow assembly even when the layout is undefined, like what is done when the array has only one zone. Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Pascal Hambourg <pascal@plouf.fr.eu.org> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:37 -07:00
Logan Gunthorpe	4631f39f05	md/raid5: Annotate functions that hold device_lock with __must_hold A handful of functions note the device_lock must be held with a comment but this is not comprehensive. Many other functions hold the lock when taken so add an __must_hold() to each call to annotate when the lock is held. This makes it a bit easier to analyse device_lock. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:37 -07:00
Logan Gunthorpe	4f4ee2bf32	md/raid5-ppl: Annotate with rcu_dereference_protected() To suppress the last remaining sparse warnings about accessing rdev, add rcu_dereference_protected calls to a couple places in raid5-ppl. All of these places are called under raid5_run and therefore are occurring before the array has started and is thus safe. There's no sensible check to do for the second argument of rcu_dereference_protected() so a comment is added instead. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:37 -07:00
Logan Gunthorpe	9aeb7f99a1	md/raid5: Annotate rdev/replacement access when mddev_lock is held The mddev_lock should be held during raid5_remove_disk() which is when the rdev/replacement pointers are modified. So any access to these pointers marked __rcu should be safe whenever the mddev_lock is held. There are numerous such access that currently produce sparse warnings. Add a helper function, rdev_mdlock_deref() that wraps rcu_dereference_protected() in all these instances. This annotation fixes a number of sparse warnings. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:37 -07:00
Logan Gunthorpe	e38b043255	md/raid5: Annotate rdev/replacement accesses when nr_pending is elevated There are a number of accesses to __rcu variables that should be safe because nr_pending in the disk is known to be elevated. Create a wrapper around rcu_dereference_protected() to annotate these accesses and verify that nr_pending is non-zero. This fixes a number of sparse warnings. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Logan Gunthorpe	b0920ede08	md/raid5: Add __rcu annotation to struct disk_info rdev and replacement are protected in some circumstances with rcu_dereference and synchronize_rcu (in raid5_remove_disk()). However, they were not annotated with __rcu so a sparse warning is emitted for every rcu_dereference() call. Add the __rcu annotation and fix up the initialization with RCU_INIT_POINTER, all pointer modifications with rcu_assign_pointer(), a few cases where the pointer value is tested with rcu_access_pointer() and one case where READ_ONCE() is used instead of rcu_dereference(), a case in print_raid5_conf() that should have rcu_dereference() and rcu_read_[un]lock() calls. Additional sparse issues will be fixed up in further commits. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Logan Gunthorpe	3d9a644cf4	md/raid5: Un-nest struct raid5_percpu definition Sparse reports many warnings of the form: drivers/md/raid5.c:1476:16: warning: dereference of noderef expression This is because all struct raid5_percpu definitions get marked as __percpu when really only the pointer in r5conf should have that annotation. Fix this by moving the defnition of raid5_precpu out of the definition of struct r5conf. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Logan Gunthorpe	8fbcba6b99	md/raid5: Cleanup setup_conf() error returns Be more careful about the error returns. Most errors in this function are actually ENOMEM, but it forcibly returns EIO if conf has been allocated. Instead return ret and ensure it is set appropriately before each goto abort. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Heming Zhao	92d9aac92b	md: replace deprecated strlcpy & remove duplicated line This commit includes two topics: 1> replace deprecated strlcpy change strlcpy to strscpy for strlcpy is marked as deprecated in Documentation/process/deprecated.rst 2> remove duplicated strlcpy line in md_bitmap_read_sb@md-bitmap.c there are two duplicated strlcpy(), the history: - commit `cf921cc19c` ("Add node recovery callbacks") introduced the first usage of strlcpy(). - commit `b97e92574c` ("Use separate bitmaps for each nodes in the cluster") introduced the second strlcpy(). this time, the two strlcpy() are same, we can remove anyone safely. - commit `d3b178adb3` ("md: Skip cluster setup for dm-raid") added dm-raid special handling. And the "nodes" value is the key of this patch. but from this patch, strlcpy() which was introduced by `b97e92574c` become necessary. - commit `3c462c880b` ("md: Increment version for clustered bitmaps") used clustered major version to only handle in clustered env. this patch could look a polishment for clustered code logic. So `cf921cc19c` became useless after `d3b178adb3`, we could remove it safely. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:36 -07:00
Heming Zhao	e68cb83a57	md/bitmap: don't set sb values if can't pass sanity check If bitmap area contains invalid data, kernel will crash then mdadm triggers "Segmentation fault". This is cluster-md speical bug. In non-clustered env, mdadm will handle broken metadata case. In clustered array, only kernel space handles bitmap slot info. But even this bug only happened in clustered env, current sanity check is wrong, the code should be changed. How to trigger: (faulty injection) dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sda dd if=/dev/zero bs=1M count=1 oflag=direct of=/dev/sdb mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb mdadm -Ss echo aaa > magic.txt == below modifying slot 2 bitmap data == dd if=magic.txt of=/dev/sda seek=16384 bs=1 count=3 <== destroy magic dd if=/dev/zero of=/dev/sda seek=16436 bs=1 count=4 <== ZERO chunksize mdadm -A /dev/md0 /dev/sda /dev/sdb == kernel crashes. mdadm outputs "Segmentation fault" == Reason of kernel crash: In md_bitmap_read_sb (called by md_bitmap_create), bad bitmap magic didn't block chunksize assignment, and zero value made DIV_ROUND_UP_SECTOR_T() trigger "divide error". Crash log: kernel: md: md0 stopped. kernel: md/raid1:md0: not clean -- starting background reconstruction kernel: md/raid1:md0: active with 2 out of 2 mirrors kernel: dlm: ... ... kernel: md-cluster: Joined cluster 44810aba-38bb-e6b8-daca-bc97a0b254aa slot 1 kernel: md0: invalid bitmap file superblock: bad magic kernel: md_bitmap_copy_from_slot can't get bitmap from slot 2 kernel: md-cluster: Could not gather bitmaps from slot 2 kernel: divide error: 0000 [#1] SMP NOPTI kernel: CPU: 0 PID: 1603 Comm: mdadm Not tainted 5.14.6-1-default kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] kernel: RSP: 0018:ffffc22ac0843ba0 EFLAGS: 00010246 kernel: ... ... kernel: Call Trace: kernel: ? dlm_lock_sync+0xd0/0xd0 [md_cluster 77fe..7a0] kernel: md_bitmap_copy_from_slot+0x2c/0x290 [md_mod 24ea..d3a] kernel: load_bitmaps+0xec/0x210 [md_cluster 77fe..7a0] kernel: md_bitmap_load+0x81/0x1e0 [md_mod 24ea..d3a] kernel: do_md_run+0x30/0x100 [md_mod 24ea..d3a] kernel: md_ioctl+0x1290/0x15a0 [md_mod 24ea....d3a] kernel: ? mddev_unlock+0xaa/0x130 [md_mod 24ea..d3a] kernel: ? blkdev_ioctl+0xb1/0x2b0 kernel: block_ioctl+0x3b/0x40 kernel: __x64_sys_ioctl+0x7f/0xb0 kernel: do_syscall_64+0x59/0x80 kernel: ? exit_to_user_mode_prepare+0x1ab/0x230 kernel: ? syscall_exit_to_user_mode+0x18/0x40 kernel: ? do_syscall_64+0x69/0x80 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae kernel: RIP: 0033:0x7f4a15fa722b kernel: ... ... kernel: ---[ end trace 8afa7612f559c868 ]--- kernel: RIP: 0010:md_bitmap_create+0x1d1/0x850 [md_mod] Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:35 -07:00
Xiaomeng Tong	64c54d9244	md: fix an incorrect NULL check in md_reload_sb The bug is here: if (!rdev \|\| rdev->desc_nr != nr) { The list iterator value 'rdev' will always be set and non-NULL by rdev_for_each_rcu(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element found (In fact, it will be a bogus pointer to an invalid struct object containing the HEAD). Otherwise it will bypass the check and lead to invalid memory access passing the check. To fix the bug, use a new variable 'iter' as the list iterator, while using the original variable 'pdev' as a dedicated pointer to point to the found element. Cc: stable@vger.kernel.org Fixes: `70bcecdb15` ("md-cluster: Improve md_reload_sb to be less error prone") Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:35 -07:00
Xiaomeng Tong	fc8738343e	md: fix an incorrect NULL check in does_sb_need_changing The bug is here: if (!rdev) The list iterator value 'rdev' will always be set and non-NULL by rdev_for_each(), so it is incorrect to assume that the iterator value will be NULL if the list is empty or no element found. Otherwise it will bypass the NULL check and lead to invalid memory access passing the check. To fix the bug, use a new variable 'iter' as the list iterator, while using the original variable 'rdev' as a dedicated pointer to point to the found element. Cc: stable@vger.kernel.org Fixes: `2aa82191ac` ("md-cluster: Perform a lazy update") Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:35 -07:00
Mariusz Tkaczyk	57668f0a4c	raid5: introduce MD_BROKEN Raid456 module had allowed to achieve failed state. It was fixed by `fb73b357fb` ("raid5: block failing device if raid will be failed"). This fix introduces a bug, now if raid5 fails during IO, it may result with a hung task without completion. Faulty flag on the device is necessary to process all requests and is checked many times, mainly in analyze_stripe(). Allow to set faulty on drive again and set MD_BROKEN if raid is failed. As a result, this level is allowed to achieve failed state again, but communication with userspace (via -EBUSY status) will be preserved. This restores possibility to fail array via #mdadm --set-faulty command and will be fixed by additional verification on mdadm side. Reproduction steps: mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1 mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean mkfs.xfs /dev/md126 -f mount /dev/md126 /mnt/root/ fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 --time_based --group_reporting --name=throughput-test-job --eta-newline=1 & echo 1 > /sys/block/nvme2n1/device/device/remove echo 1 > /sys/block/nvme1n1/device/device/remove [ 1475.787779] Call Trace: [ 1475.793111] __schedule+0x2a6/0x700 [ 1475.799460] schedule+0x38/0xa0 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456] [ 1475.813856] ? finish_wait+0x80/0x80 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456] [ 1475.828281] ? finish_wait+0x80/0x80 [ 1475.834727] ? finish_wait+0x80/0x80 [ 1475.841127] ? finish_wait+0x80/0x80 [ 1475.847480] md_handle_request+0x119/0x190 [ 1475.854390] md_make_request+0x8a/0x190 [ 1475.861041] generic_make_request+0xcf/0x310 [ 1475.868145] submit_bio+0x3c/0x160 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390 [ 1475.889149] iomap_apply+0xff/0x310 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.909974] iomap_dio_rw+0x2f2/0x490 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390 [ 1475.923680] ? atime_needs_update+0x77/0xe0 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs] [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs] [ 1475.953403] aio_read+0xd5/0x180 [ 1475.959395] ? _cond_resched+0x15/0x30 [ 1475.965907] io_submit_one+0x20b/0x3c0 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180 [ 1475.979335] ? do_io_getevents+0x7c/0xc0 [ 1475.986009] do_syscall_64+0x5b/0x1a0 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 1476.000255] RIP: 0033:0x7f11fc27978d [ 1476.006631] Code: Bad RIP value. [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds. Cc: stable@vger.kernel.org Fixes: `fb73b357fb` ("raid5: block failing device if raid will be failed") Reviewd-by: Xiao Ni <xni@redhat.com> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Signed-off-by: Song Liu <song@kernel.org>	2022-04-25 14:00:35 -07:00

1 2 3 4 5 ...

1089516 Commits