Instead of returning an inconclusive value of NULL for an error in calling
bio_split(), return an ERR_PTR() always.
Also remove the BUG_ON() calls, and use WARN_ON_ONCE() instead. Indeed,
since almost all callers don't check the return code from bio_split(),
we'll crash anyway (for those failures).
Fix up the only user which checks the bio_split() return code today
(directly or indirectly), blk_crypto_fallback_split_bio_if_needed(). The md/bcache
code does check the return code in cached_dev_cache_miss() ->
bio_next_split() -> bio_split(), but only to see if there was a split, so
there would be no change in behaviour here (when returning an ERR_PTR()).
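For the few callers that do check, the pattern after this change is
IS_ERR() based. A minimal sketch (the variable names and the error
handling shown are illustrative, not the exact fallback code):

struct bio *split;

split = bio_split(bio, num_sectors, GFP_NOIO, &bio_set);
if (IS_ERR(split)) {
	/* bio_split() now returns an ERR_PTR() on any failure */
	bio->bi_status = errno_to_blk_status(PTR_ERR(split));
	bio_endio(bio);
	return;
}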
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241111112150.3756529-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In ublk_ch_mmap(), the queue id is calculated in the following way:
(vma->vm_pgoff << PAGE_SHIFT) / max_cmd_buf_size
'max_cmd_buf_size' is equal to
UBLK_MAX_QUEUE_DEPTH * sizeof(struct ublksrv_io_desc)
and UBLK_MAX_QUEUE_DEPTH is 4096 and part of the UAPI, so 'max_cmd_buf_size'
is always page aligned on a 4K page size kernel. However, that isn't true
on a 64K page size kernel.
Fix the issue by always rounding 'max_cmd_buf_size' up to a PAGE_SIZE
boundary.
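A minimal sketch of the rounding, assuming a helper shaped like this
(illustrative; the in-tree code may differ):

static inline unsigned int ublk_max_cmd_buf_size(void)
{
	/* keep the per-queue command buffer size page aligned */
	return round_up(UBLK_MAX_QUEUE_DEPTH *
			sizeof(struct ublksrv_io_desc), PAGE_SIZE);
}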
Cc: stable@vger.kernel.org
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241111110718.1394001-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of an early failure in dasd_init, dasd_proc_init is never
called and /proc/dasd* files are never created. That can happen, for
example, if an incompatible or incorrect argument is provided to the
dasd_mod.dasd= kernel parameter.
However, the attempted removal of the /proc/dasd* files causes 8 warnings
and backtraces in this case: 4 on the error path within dasd_init and
4 when the dasd module is unloaded. Notice the "removing permanent
/proc entry 'devices'" message that is caused by the dasd_proc_exit
function trying to remove /proc/devices instead of /proc/dasd/devices
since dasd_proc_root_entry is NULL and /proc/devices is indeed
permanent. Example:
------------[ cut here ]------------
removing permanent /proc entry 'devices'
WARNING: CPU: 6 PID: 557 at fs/proc/generic.c:701 remove_proc_entry+0x22e/0x240
CPU: 6 PID: 557 Comm: modprobe Not tainted 6.10.5-1-default #1
openSUSE Tumbleweed f6917bfd6e5a5c7a7e900e0e3b517786fb5c6301
Hardware name: QEMU 8561 QEMU (KVM/Linux)
Krnl PSW : 0704c00180000000 000003fffed0e9f2 (remove_proc_entry+0x232/0x240)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 000003ff00000027 000003ff00000023 0000000000000028 000002f200000000
000002f3f05bec20 0000037ffecfb7d0 000003ffffdabab0 000003ff7ee4ec72
000003ff7ee4ec72 0000000000000007 000002f280e22600 000002f280e22688
000003ffa252cfa0 0000000000010000 000003fffed0e9ee 0000037ffecfba38
Krnl Code: 000003fffed0e9e2: c020004e7017 larl %r2,000003ffff6dca10
000003fffed0e9e8: c0e5ffdfad24 brasl %r14,000003fffe904430
#000003fffed0e9ee: af000000 mc 0,0
>000003fffed0e9f2: a7f4ff4c brc 15,000003fffed0e88a
000003fffed0e9f6: 0707 bcr 0,%r7
000003fffed0e9f8: 0707 bcr 0,%r7
000003fffed0e9fa: 0707 bcr 0,%r7
000003fffed0e9fc: 0707 bcr 0,%r7
Call Trace:
[<000003fffed0e9f2>] remove_proc_entry+0x232/0x240
([<000003fffed0e9ee>] remove_proc_entry+0x22e/0x240)
[<000003ff7ef5a084>] dasd_proc_exit+0x34/0x60 [dasd_mod]
[<000003ff7ef560c2>] dasd_exit+0x22/0xc0 [dasd_mod]
[<000003ff7ee5a26e>] dasd_init+0x26e/0x280 [dasd_mod]
[<000003fffe8ac9d0>] do_one_initcall+0x40/0x220
[<000003fffe9bc758>] do_init_module+0x78/0x260
[<000003fffe9bf3a6>] __do_sys_init_module+0x216/0x250
[<000003ffff37ac9e>] __do_syscall+0x24e/0x2d0
[<000003ffff38cca8>] system_call+0x70/0x98
Last Breaking-Event-Address:
[<000003fffef7ea20>] __s390_indirect_jump_r14+0x0/0x10
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
While the cause is a user failure, the dasd module should handle the
situation more gracefully. One of the simplest solutions is to make
removal of the /proc/dasd* entries idempotent.
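A minimal sketch of an idempotent teardown, assuming the removal is
guarded by the root entry pointer (entry names follow the /proc/dasd/*
layout; the details are illustrative):

void dasd_proc_exit(void)
{
	/* nothing to remove if dasd_proc_init() never ran */
	if (!dasd_proc_root_entry)
		return;

	remove_proc_entry("devices", dasd_proc_root_entry);
	remove_proc_entry("statistics", dasd_proc_root_entry);
	remove_proc_entry("dasd", NULL);
	dasd_proc_root_entry = NULL;	/* make repeated calls harmless */
}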
Signed-off-by: Miroslav Franc <mfranc@suse.cz>
[ sth: shortened if clause ]
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20241108133913.3068782-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull raid5 fix from Song.
* tag 'md-6.13-20241107' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
MAINTAINERS: Make Yu Kuai co-maintainer of md/raid subsystem
md/raid5: Wait sync io to finish before changing group cnt
PAGE_SIZE may be 64K, and the max block size can be PAGE_SIZE, so any
variable for holding block size can't be defined as 'unsigned short'.
Unfortunately commit 473516b361 ("loop: use the atomic queue limits
update API") passes 'bsize' with a type of 'unsigned short' to
loop_reconfigure_limits(), causing an LTP/ioctl_loop06 test failure:
12 ioctl_loop06.c:76: TINFO: Using LOOP_SET_BLOCK_SIZE with arg > PAGE_SIZE
13 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly
...
18 ioctl_loop06.c:76: TINFO: Using LOOP_CONFIGURE with block_size > PAGE_SIZE
19 ioctl_loop06.c:59: TFAIL: Set block size succeed unexpectedly
Fix the issue by defining the block size variable as 'unsigned int', which
is aligned with the block layer's definition.
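The truncation is easy to see on a 64K page kernel; a sketch of the
changed prototype (parameter name as in the driver):

/* 'bsize' as 'unsigned short' cannot represent a 64K block size:
 * 65536 silently truncates to 0 before any validation runs.
 */
static int loop_reconfigure_limits(struct loop_device *lo,
				   unsigned int bsize);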
(improve commit log & add fixes tag)
Fixes: 473516b361 ("loop: use the atomic queue limits update API")
Cc: John Garry <john.g.garry@oracle.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Li Wang <liwang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241109022744.1126003-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the past couple years, Yu Kuai has been making solid contributions
to md/raid subsystem. Make Yu Kuai a co-maintainer.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241108014112.2098079-1-song@kernel.org
Signed-off-by: Song Liu <song@kernel.org>
One customer reported a bug: raid5 hangs when changing the thread cnt
while resync is running. The stripes are all in conf->handle_list and the
new threads can't handle them.
Commit b39f35ebe8 ("md: don't quiesce in mddev_suspend()") removes
pers->quiesce from mddev_suspend/resume. Before that commit, mddev_suspend
waited for all ios, including sync io, to finish. Now it only waits for
normal io.
Fix this by calling raid5_quiesce from raid5_store_group_thread_cnt
directly to wait for all sync requests to finish before changing the
group cnt.
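A minimal sketch of the fix, assuming the store path brackets the group
change with raid5_quiesce() (illustrative):

/* in raid5_store_group_thread_cnt() */
raid5_quiesce(mddev, true);	/* wait for normal and sync IO to finish */

/* ... free the old worker groups and apply the new group cnt ... */

raid5_quiesce(mddev, false);	/* resume IO */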
Fixes: b39f35ebe8 ("md: don't quiesce in mddev_suspend()")
Cc: stable@vger.kernel.org
Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241106095124.74577-1-xni@redhat.com
Signed-off-by: Song Liu <song@kernel.org>
elevator_init_mq() is only called at the entry of add_disk_fwnode() when
disk IO isn't allowed yet.
So don't verify the io lock (q->io_lockdep_map) for freeze & unfreeze in
elevator_init_mq().
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Lai Yi <yi1.lai@linux.intel.com>
Fixes: f1be1788a3 ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit f1be1788a3 ("block: model freeze & enter queue as lock for
supporting lockdep") tries to apply lockdep for verifying freeze &
unfreeze. However, the verification is only done for the outermost freeze
and unfreeze. This is not correct because q->mq_freeze_depth may still
drop to zero on a task other than the freeze owner task.
Fix this issue by always verifying the last unfreeze lock on the owner
task context, and make sure both the outermost freeze & unfreeze are
verified in the current task.
Fixes: f1be1788a3 ("block: model freeze & enter queue as lock for supporting lockdep")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unfreeze queue after returning from blk_mark_disk_dead(), this way at
least allows us to verify queue freeze correctly with lockdep.
Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No one uses blk_freeze_queue(), so remove it and the obsolete comment.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241031133723.303835-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Turn the private disk_zone_is_conv() function in blk-zoned.c into a
public and documented bdev_zone_is_seq() helper with the inverse
polarity of the original function, also adding a check for non-zoned
devices so that all file systems can use the helper, even with a regular
block device.
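A sketch of how a file system might use the helper (both submit helpers
below are hypothetical):

/* Sequential write required zones must honor the write pointer;
 * conventional zones and regular block devices can be written in place.
 */
if (bdev_zone_is_seq(bdev, sector))
	ret = submit_zone_append_io(bdev, sector, bio);
else
	ret = submit_in_place_write(bdev, sector, bio);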
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that a disk revalidation changing the conventional zones bitmap
of a disk does not cause invalid memory references when using the
disk_zone_is_conv() helper by RCU protecting the disk->conv_zones_bitmap
pointer.
disk_zone_is_conv() is modified to operate under the RCU read lock and
the function disk_set_conv_zones_bitmap() is added to update a disk
conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
zone_wplugs_lock spinlock held.
disk_free_zone_resources() is modified to call
disk_update_zone_resources() with a NULL bitmap pointer to free the disk
conv_zones_bitmap. disk_set_conv_zones_bitmap() is also used in
disk_update_zone_resources() to set the new (revalidated) bitmap and
free the old one.
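The read side follows the usual RCU pattern; a condensed sketch of the
scheme described above (field names as in the commit, details may differ):

static bool disk_zone_is_conv(struct gendisk *disk, sector_t sector)
{
	unsigned long *bitmap;
	bool is_conv = false;

	rcu_read_lock();
	bitmap = rcu_dereference(disk->conv_zones_bitmap);
	if (bitmap)
		is_conv = test_bit(disk_zone_no(disk, sector), bitmap);
	rcu_read_unlock();

	return is_conv;
}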
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Per Documentation/filesystems/sysfs.rst, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.
No functional change intended.
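The conversion pattern, as a generic sketch (the attribute and its
backing value are hypothetical):

static int foo_value;	/* hypothetical attribute backing value */

static ssize_t foo_show(struct device *dev, struct device_attribute *attr,
			char *buf)
{
	/* sysfs_emit() bounds the output to PAGE_SIZE, unlike sprintf() */
	return sysfs_emit(buf, "%d\n", READ_ONCE(foo_value));
}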
Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241107104258.29742-1-zhangguopeng@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull MD changes from Song:
"1. Enhance handling of faulty and blocked devices, by Yu Kuai.
2. raid5-ppl atomic improvement, by Uros Bizjak.
3. md-bitmap fix, by Yuan Can."
* tag 'md-6.13-20241105' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/md-bitmap: Add missing destroy_work_on_stack()
md/raid5: don't set Faulty rdev for blocked_rdev
md/raid10: don't wait for Faulty rdev in wait_blocked_rdev()
md/raid1: don't wait for Faulty rdev in wait_blocked_rdev()
md/raid1: factor out helper to handle blocked rdev from raid1_write_request()
md: don't record new badblocks for faulty rdev
md: don't wait faulty rdev in md_wait_for_blocked_rdev()
md: add a new helper rdev_blocked()
md/raid5-ppl: Use atomic64_inc_return() in ppl_new_iounit()
pcim_iomap_table() and pcim_request_regions() have been deprecated in
commit e354bb84a4 ("PCI: Deprecate pcim_iomap_table(),
pcim_iomap_regions_request_all()") and commit d140f80f60 ("PCI:
Deprecate pcim_iomap_regions() in favor of pcim_iomap_region()"),
respectively.
Replace these functions with pcim_iomap_region().
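A sketch of the replacement call, assuming BAR 0 and a DRV_NAME macro
(pcim_iomap_region() returns an ERR_PTR() on failure and its cleanup is
device-managed):

void __iomem *regs;

regs = pcim_iomap_region(pdev, 0, DRV_NAME);
if (IS_ERR(regs))
	return PTR_ERR(regs);
/* the BAR is now both requested and mapped */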
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://lore.kernel.org/r/20241106145249.108996-2-pstanner@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This commit adds the missing destroy_work_on_stack() operation for
unplug_work.work in bitmap_unplug_async().
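On-stack work items must pair the _ONSTACK initializer with
destroy_work_on_stack() for the debug objects machinery; a generic sketch
(the work function and workqueue names are hypothetical):

struct work_struct unplug_work;

INIT_WORK_ONSTACK(&unplug_work, unplug_fn); /* registers the debug object */
queue_work(md_wq, &unplug_work);
flush_work(&unplug_work);
destroy_work_on_stack(&unplug_work);	    /* pairs with _ONSTACK init */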
Fixes: a022325ab9 ("md/md-bitmap: add a new helper to unplug bitmap asynchrously")
Cc: stable@vger.kernel.org
Signed-off-by: Yuan Can <yuancan@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241105130105.127336-1-yuancan@huawei.com
Signed-off-by: Song Liu <song@kernel.org>
Faulty rdev should never be accessed anymore, hence there is no point in
waiting for a bad block to be acknowledged in this case while handling a
write request.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-8-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Faulty rdev should never be accessed anymore, hence there is no point in
waiting for a bad block to be acknowledged in this case while handling a
write request.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-7-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Faulty rdev should never be accessed anymore, hence there is no point in
waiting for a bad block to be acknowledged in this case while handling a
write request.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Currently raid1 prepares IO for the underlying disks while checking if
any disk is blocked; if one is, the allocated resources must be released,
and raid1 then waits for the rdev to be unblocked and tries to prepare
the IO again.
Make the code cleaner by checking for a blocked rdev first; it doesn't
matter if the rdev becomes blocked while the IO is being issued, as the
IO will wait for the rdev to be unblocked if needed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Faulty will be checked before issuing IO to the rdev; however, the rdev
can become faulty at any time, hence it's possible that
rdev_set_badblocks() will be called for a faulty rdev. In this case,
mddev->sb_flags will be set and some other path can be blocked by the
super block update.
Since a faulty rdev will not be accessed anymore, there is no need to
record new badblocks for it and force a super block update.
Note that this is not a bugfix; it just prevents updating the super block
in some corner cases, and it will help to narrow down a bug related to
external metadata [1]. Testing also shows that devices are removed faster
in the case of IO errors.
[1] https://lore.kernel.org/all/f34452df-810b-48b2-a9b4-7f925699a9e7@linux.intel.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-4-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
md_wait_for_blocked_rdev() is called for write IO while the rdev is
blocked; however, the rdev can become faulty after being chosen for the
write, and a faulty rdev should never be accessed anymore, hence there is
no point in waiting for a faulty rdev to be unblocked.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
The helper will be used in later patches for raid1/raid10/raid5; the
difference is that a Faulty rdev with an unacknowledged bad block will
not be considered blocked.
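A condensed sketch of the helper's intended logic (flag checks as
described above; the in-tree version may differ in detail):

static inline bool rdev_blocked(struct md_rdev *rdev)
{
	/* set by the error handler, cleared after the superblock update */
	if (test_bit(Blocked, &rdev->flags))
		return true;

	/* a Faulty rdev is never accessed again, so an unacknowledged
	 * bad block on it does not count as blocked */
	if (test_bit(Faulty, &rdev->flags))
		return false;

	/* blocked by unacknowledged bad blocks */
	if (test_bit(BlockedBadBlocks, &rdev->flags))
		return true;

	return false;
}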
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Link: https://lore.kernel.org/r/20241031033114.3845582-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to
use the optimized implementation and ease register pressure around the
primitive on targets that implement an optimized variant.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Link: https://lore.kernel.org/r/20241007084831.48067-1-ubizjak@gmail.com
Signed-off-by: Song Liu <song@kernel.org>
max_zone_append_sectors differs from all other queue limits in that the
final value used is not stored in the queue_limits but needs to be
obtained using the queue_limits_max_zone_append_sectors() helper. This
not only adds (tiny) extra overhead to the I/O path, but also can easily
be forgotten in file system code.
To fix this, add a new max_hw_zone_append_sectors value to queue_limits
which is set by the driver, and calculate max_zone_append_sectors from
that and the other inputs in blk_validate_zoned_limits, similar to how
max_sectors is calculated.
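The resulting derivation in blk_validate_zoned_limits() looks roughly
like this (a sketch mirroring how max_sectors is derived; the exact
clamping may differ):

/* computed once at limits-validation time, not in the I/O path */
lim->max_zone_append_sectors =
	min_not_zero(lim->max_hw_zone_append_sectors,
		     min(lim->chunk_sectors, lim->max_hw_sectors));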
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the block layer zone append emulation, we are now always setting a
max_zone_append_sectors value for zoned devices, so this check can't
ever trigger.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104073955.112324-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Listing every single feature that needs to be pre-set by stacking
drivers does not scale.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241104054218.45596-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
.bi_size of the bvec iterator should be initialized to the real max size
for walking, and .bi_bvec_done just counts how many bytes need to be
skipped in the 1st bvec, so .bi_size isn't related to .bi_bvec_done.
This patch fixes bvec iterator initialization, and the inner `size`
check isn't needed any more, so revert Eric Dumazet's commit
7bc802acf193 ("iov-iter: do not return more bytes than requested in
iov_iter_extract_bvec_pages()").
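Schematically, per the description above, the initialization keeps the
two fields independent (a sketch; the in-tree code may differ):

struct bvec_iter bi = {
	.bi_size	= maxsize,		/* real max size for walking */
	.bi_bvec_done	= iter->iov_offset,	/* skip within the 1st bvec */
	.bi_idx		= 0,
};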
Cc: Eric Dumazet <edumazet@google.com>
Fixes: e4e535bff2 ("iov_iter: don't require contiguous pages in iov_iter_extract_bvec_pages")
Reported-by: syzbot+71abe7ab2b70bca770fd@syzkaller.appspotmail.com
Tested-by: syzbot+71abe7ab2b70bca770fd@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A bdev discard granularity is always at least SECTOR_SIZE, so don't check
for a zero value.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241101092215.422428-1-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is only used by the nvmet zns passthrough code, which can trivially
just use bio_add_pc_page and do the sanity check for the max zone append
limit itself.
All future zoned file systems should follow the btrfs lead and let the
upper layers fill up bios unlimited by hardware constraints and split
them to the limits in the I/O submission handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This code is unused, and all future zoned file systems should follow the
btrfs lead of splitting the bios themselves to the zoned limits in the
I/O submission handler. If they didn't, they would be hit by commit
ed9832bc08 ("block: introduce folio awareness and add a bigger size from
folio") breaking this code when the zone append limit (that is usually
the max_hw_sectors limit) is smaller than the largest possible folio
size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241030051859.280923-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Initialize bi.bi_idx to 0 before iterating over the bvec; otherwise
garbage data can be used as ->bi_idx.
Cc: Christoph Hellwig <hch@lst.de>
Reported-and-tested-by: Klara Modin <klarasmodin@gmail.com>
Fixes: e4e535bff2 ("iov_iter: don't require contiguous pages in iov_iter_extract_bvec_pages")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The seed is only used for kernel generation and verification. That
doesn't happen for user buffers, so passing the seed around doesn't
accomplish anything.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20241016201309.1090320-1-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
My colleague Wupeng found the following problem during fault injection:
BUG: unable to handle page fault for address: fffffbfff809d073
PGD 6e648067 P4D 123ec8067 PUD 123ec4067 PMD 100e38067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 5 UID: 0 PID: 755 Comm: modprobe Not tainted 6.12.0-rc3+ #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.16.1-2.fc37 04/01/2014
RIP: 0010:__asan_load8+0x4c/0xa0
...
Call Trace:
<TASK>
blkdev_put_whole+0x41/0x70
bdev_release+0x1a3/0x250
blkdev_release+0x11/0x20
__fput+0x1d7/0x4a0
task_work_run+0xfc/0x180
syscall_exit_to_user_mode+0x1de/0x1f0
do_syscall_64+0x6b/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
loop_init() calls loop_add() after __register_blkdev() succeeds and
ignores add_disk() failure from loop_add(), because a loop_add() failure
is not fatal and successfully created disks are already visible to
bdev_open().
brd_init() currently calls brd_alloc() before __register_blkdev()
succeeds and releases successfully created disks when brd_init() returns
an error. This can cause a UAF in the following two cases:
case 1:
T1:
modprobe brd
  brd_init
    brd_alloc(0) // success
      add_disk
        disk_scan_partitions
          bdev_file_open_by_dev // alloc file
          fput // won't free until back to userspace
    brd_alloc(1) // failed since mem alloc error inject
// error path for modprobe will release code segment
// back to userspace
__fput
  blkdev_release
    bdev_release
      blkdev_put_whole
        bdev->bd_disk->fops->release // fops is freed now, UAF!
case 2:
T1:                                     T2:
modprobe brd
  brd_init
    brd_alloc(0) // success
                                        open(/dev/ram0)
    brd_alloc(1) // fail
// error path for modprobe
                                        close(/dev/ram0)
                                        ...
                                        /* UAF! */
                                        bdev->bd_disk->fops->release
Fix this problem by following what loop_init() does. Besides,
reintroduce brd_devices_mutex to help serialize modifications to
brd_list.
Fixes: 7f9b348cb5 ("brd: convert to blk_alloc_disk/blk_cleanup_disk")
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241030034914.907829-1-yangerkun@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper to get the queue_limits from the bdev without having to
poke into the request_queue.
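A sketch of the helper consistent with that description (the exact
signature may differ):

static inline const struct queue_limits *bdev_limits(struct block_device *bdev)
{
	return &bdev_get_queue(bdev)->limits;
}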
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The iov_iter_extract_pages interface allows returning physically
discontiguous pages, as long as all but the first and last pages in the
array are page aligned and page sized. Rewrite
iov_iter_extract_bvec_pages to take advantage of that instead of only
returning ranges of physically contiguous pages.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: minor cleanups, new commit log]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241024050021.627350-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Recently we got several deadlock reports [1][2][3] caused by
blk_mq_freeze_queue() and blk_enter_queue().
It turns out the two are just like acquiring a read/write lock, so model
them as a read/write lock for supporting lockdep:
1) model q->q_usage_counter as two locks (io and queue lock)
- queue lock covers sync with blk_enter_queue()
- io lock covers sync with bio_enter_queue()
2) make the lockdep class/key per-queue:
- different subsystems have very different lock use patterns, a shared
lock class easily causes false positives
- freeze_queue degrades to no lock in case that disk state becomes DEAD
because bio_enter_queue() won't be blocked any more
- freeze_queue degrades to no lock in case that request queue becomes dying
because blk_enter_queue() won't be blocked any more
3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
- it is an exclusive lock, so the dependency with blk_enter_queue() is
covered
- it is a trylock because blk_mq_freeze_queue() is allowed to run
concurrently
4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
- nested blk_enter_queue() are allowed
- dependency with blk_mq_freeze_queue() is covered
- blk_queue_exit() is often called from other contexts (such as irq), and
it can't be annotated as lock_release(), so simply do it in
blk_enter_queue(); this way still covers as many cases as possible
With lockdep support, such reports can be raised as soon as possible and
we needn't wait until the real deadlock is triggered.
For example, a lockdep report can be triggered for report [3] with this
patch applied.
[1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
https://bugzilla.kernel.org/show_bug.cgi?id=219166
[2] del_gendisk() vs blk_queue_enter() race condition
https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
[3] queue_freeze & queue_enter deadlock in scsi
https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u
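Conceptually, the annotations map onto lockdep's generic lock maps; a
simplified sketch (the real code uses per-queue keys and handles the
degraded cases listed above):

#include <linux/lockdep.h>

static struct lockdep_map q_usage_map;	/* per-queue in the real code */

/* freeze: exclusive + trylock, since concurrent freezes are allowed */
static void freeze_acquire(void)
{
	lock_acquire_exclusive(&q_usage_map, 0, 1, NULL, _THIS_IP_);
}

/* enter: recursive read, since nested blk_enter_queue() is allowed */
static void enter_acquire(void)
{
	lock_acquire_shared_recursive(&q_usage_map, 0, 0, NULL, _THIS_IP_);
}

static void usage_release(void)
{
	lock_release(&q_usage_map, _THIS_IP_);
}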
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
nvme_start_freeze() and nvme_unfreeze() may be called from the same
context, so switch them to call the non_owner variants of the
start_freeze/unfreeze queue APIs.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add non_owner variants of the start_freeze/unfreeze queue APIs, so that
the caller knows what they are doing, and we can skip lockdep support
for the non_owner variants at the per-call level.
Prepare for supporting lockdep for freezing/unfreezing queues.
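A sketch of the intended pairing (API names per this commit; the usage
is illustrative):

/* freeze may be started on one task ... */
blk_freeze_queue_start_non_owner(q);
blk_mq_freeze_queue_wait(q);

/* ... and be lifted on another; lockdep's owner check is skipped
 * for the non_owner pair */
blk_mq_unfreeze_queue_non_owner(q);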
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241025003722.3630252-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit a6088845c2 ("block: kyber: make kyber more friendly with merging")
removed the only blk_mq_flush_busy_ctxs() call from outside the block layer
core. Hence unexport blk_mq_flush_busy_ctxs().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20241023202850.3469279-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make sure that the tag_list_lock mutex is not held any longer than
necessary. This change reduces latency if e.g. blk_mq_quiesce_tagset()
is called concurrently from more than one thread. This function is used
by the NVMe core and also by the UFS driver.
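A generic sketch of the hold-time reduction (illustrative; the actual
patch restructures blk_mq_quiesce_tagset() along these lines):

mutex_lock(&set->tag_list_lock);
list_for_each_entry(q, &set->tag_list, tag_set_list)
	blk_mq_quiesce_queue_nowait(q);	/* start, don't wait under lock */
mutex_unlock(&set->tag_list_lock);

blk_mq_wait_quiesce_done(set);		/* wait outside the mutex */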
Reported-by: Peter Wang <peter.wang@mediatek.com>
Cc: Chao Leng <lengchao@huawei.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 414dd48e88 ("blk-mq: add tagset quiesce interface")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20241022181617.2716173-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The paired memory barriers in list_del_init_careful() and
list_empty_careful() already handle the proper ordering between
data.got_token and data.wq.entry, so remove the redundant explicit
barriers. Also change a "break" statement to "return" to avoid a
redundant call of finish_wait().
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/20241021085251.73353-1-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge in block/fs prep patches for the atomic write support.
* for-6.13/block-atomic:
block: Add bdev atomic write limits helpers
fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
block/fs: Pass an iocb to generic_atomic_write_valid()
When a process migrates to another cgroup and the original cgroup is
deleted, the restrictions on throttled bios cannot be removed. If the
restrictions are set too low, it will take a long time to complete these
bios.
Refer to the process of deleting a disk to remove the restrictions and
issue bios when deleting the cgroup.
This changes the behavior of throttled bios:
Before: the limit of the throttled bios couldn't be changed and the bios
would complete under this limit;
Now: the limit is canceled and the throttled bios are flushed
immediately.
References:
[1] https://lore.kernel.org/r/20220318130144.1066064-4-ming.lei@redhat.com
[2] https://lore.kernel.org/all/da861d63-58c6-3ca0-2535-9089993e9e28@huaweicloud.com/
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240817071108.1919729-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Supposing first scenario with a virtio_blk driver.
CPU0                                    CPU1
blk_mq_try_issue_directly()
  __blk_mq_issue_directly()
    q->mq_ops->queue_rq()
      virtio_queue_rq()
        blk_mq_stop_hw_queue()
                                        virtblk_done()
  blk_mq_request_bypass_insert() 1) store
                                        blk_mq_start_stopped_hw_queue()
                                          clear_bit(BLK_MQ_S_STOPPED) 3) store
                                          blk_mq_run_hw_queue()
                                            if (!blk_mq_hctx_has_pending()) 4) load
                                              return
                                            blk_mq_sched_dispatch_requests()
  blk_mq_run_hw_queue()
    if (!blk_mq_hctx_has_pending())
      return
    blk_mq_sched_dispatch_requests()
      if (blk_mq_hctx_stopped()) 2) load
        return
      __blk_mq_sched_dispatch_requests()
Supposing another scenario.
CPU0                                    CPU1
blk_mq_requeue_work()
  blk_mq_insert_request() 1) store
                                        virtblk_done()
                                          blk_mq_start_stopped_hw_queue()
  blk_mq_run_hw_queues()                  clear_bit(BLK_MQ_S_STOPPED) 3) store
                                          blk_mq_run_hw_queue()
                                            if (!blk_mq_hctx_has_pending()) 4) load
                                              return
                                            blk_mq_sched_dispatch_requests()
    if (blk_mq_hctx_stopped()) 2) load
      continue
    blk_mq_run_hw_queue()
blk_mq_run_hw_queue()
Both scenarios are similar: a full memory barrier should be inserted
between 1) and 2), as well as between 3) and 4), to make sure that either
CPU0 sees that BLK_MQ_S_STOPPED is cleared or CPU1 sees the dispatch
list. Otherwise, either CPU will not rerun the hardware queue, causing
starvation of the request.
The easy way to fix it is to add the essential full memory barrier to the
blk_mq_hctx_stopped() helper. In order to not affect the fast path (the
hardware queue is not stopped most of the time), we only insert the
barrier into the slow path. Actually, only the slow path needs to care
about a missed dispatch of the request to the low-level device driver.
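A sketch of the resulting helper (barrier placement per the description
above; the comments paraphrase the pairing):

static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
{
	/* fast path: the hw queue is not stopped most of the time */
	if (likely(!test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
		return false;

	/*
	 * Slow path only: order the store of the dispatch list, 1)/3),
	 * against the loads at 2)/4), pairing with the barrier on the
	 * restart side so that either the cleared bit or the pending
	 * dispatch list is observed.
	 */
	smp_mb();
	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
}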
Fixes: 320ae51fee ("blk-mq: new multi-queue block IO queueing mechanism")
Cc: stable@vger.kernel.org
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>