linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-02 17:11:33 +00:00

Author	SHA1	Message	Date
Kemeng Shi	08470a98a7	sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat code Rewrite sbitmap_find_bit_in_index as follows: 1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word 2. Accept "struct sbitmap_word *" directly instead of accepting "struct sbitmap *" and "int index" to get "struct sbitmap_word *". 3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from caller to support need of both __sbitmap_get_shallow and __sbitmap_get. With helper function sbitmap_find_bit_in_word, we can remove repeat code in __sbitmap_get_shallow to find bit considering deferred clear. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230116205059.3821738-4-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:01 -07:00
Kemeng Shi	903e86f3a6	sbitmap: remove redundant check in __sbitmap_queue_get_batch Commit `fbb564a557` ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") mentioned that "Checking free bits when setting the target bits. Otherwise, it may reuse the busying bits." This commit adds a check to make sure all masked bits in word before cmpxchg are zero. Then the existing check after cmpxchg to check any zero bit is existing in masked bits in word is redundant. Actually, old value of word before cmpxchg is stored in val and we will filter out busy bits in val by "(get_mask & ~val)" after cmpxchg. So we will not reuse busy bits mentioned in commit `fbb564a557` ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert new-added check to remove redundant check. Fixes: `fbb564a557` ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230116205059.3821738-3-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:01 -07:00
Kemeng Shi	f1591a8bb3	sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallow Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly pointless and equivalent to setting alloc_hint to zero (because SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So simplify the logic. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230116205059.3821738-2-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 20:03:01 -07:00
Yu Kuai	f1c006f1c6	blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy() Currently parent pd can be freed before child pd: t1: remove cgroup C1 blkcg_destroy_blkgs blkg_destroy list_del_init(&blkg->q_node) // remove blkg from queue list percpu_ref_kill(&blkg->refcnt) blkg_release call_rcu t2: from t1 __blkg_release blkg_free schedule_work t4: deactivate policy blkcg_deactivate_policy pd_free_fn // parent of C1 is freed first t3: from t2 blkg_free_workfn pd_free_fn If policy(for example, ioc_timer_fn() from iocost) access parent pd from child pd after pd_offline_fn(), then UAF can be triggered. Fix the problem by delaying 'list_del_init(&blkg->q_node)' from blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to synchronize blkg_free_workfn() and blkcg_deactivate_policy(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:19:04 -07:00
Yu Kuai	dfd6200a09	blk-cgroup: support to track if policy is online A new field 'online' is added to blkg_policy_data to fix the following 2 problems: 1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT' failed, 'queue_lock' will be dropped and pd_alloc_fn() will try again without 'GFP_NOWAIT'. In the meantime, remove cgroup can race with it, and pd_offline_fn() will be called without pd_init_fn() and pd_online_fn(). This way null-ptr-dereference can be triggered. 2) In order to synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be delayed to blkg_free_workfn(), hence pd_offline_fn() can be called first in blkg_destroy(), and then blkcg_deactivate_policy() will call it again, we must prevent it. The new field 'online' will be set after pd_online_fn() and will be cleared after pd_offline_fn(), in the meantime pd_offline_fn() will only be called if 'online' is set. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230119110350.2287325-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:19:04 -07:00
Yu Kuai	c7241babf0	blk-cgroup: dropping parent refcount after pd_free_fn() is done Some cgroup policies will access parent pd through child pd even after pd_offline_fn() is done. If pd_free_fn() for parent is called before child, then UAF can be triggered. Hence it's better to guarantee the order of pd_free_fn(). Currently refcount of parent blkg is dropped in __blkg_release(), which is before pd_free_fn() is called in blkg_free_workfn() while blkg_free_workfn() is called asynchronously. This patch makes sure pd_free_fn() called from removing cgroup is ordered by delaying dropping parent refcount after calling pd_free_fn() for child. BTW, pd_free_fn() will also be called from blkcg_deactivate_policy() from deleting device, and following patches will guarantee the order. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:19:04 -07:00
Zhong Jinghua	b36781034c	blk-mq: cleanup unused methods: blk_mq_hw_sysfs_store We found that the blk_mq_hw_sysfs_store interface has no place to use. The object default_hw_ctx_attrs using blk_mq_hw_sysfs_ops only uses the show method and does not use the store method. Since this patch: `4a46f05ebf` ("blk-mq: move hctx and ctx counters from sysfs to debugfs") moved the store method to debugfs, the store method is not used anymore. So let me do some tiny work to clean up unused code. Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230128030419.2780298-1-zhongjinghua@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:35 -07:00
Christoph Hellwig	9607cd36bb	s390/dcssblk:: don't call bio_split_to_limits s390 iterates over the bio using bio_for_each_segment and doesn't need any bio splitting. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/r/20230123075356.60847-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:35 -07:00
Christoph Hellwig	1bf7a749ef	ps3vram: remove bio splitting ps3vram iterates over the bio one segment, that is a page aligned and max page sized chunk, at a time. Because of that there is no point in calling bio_split_to_limits, or explicitly setting the default limits that are only used by bio_split_to_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Geoff Levand <geoff@infradead.org> Link: https://lore.kernel.org/r/20230123074718.57951-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:35 -07:00
Jens Axboe	33391eecd6	block: treat poll queue enter similarly to timeouts We ran into an issue where a production workload would randomly grind to a halt and not continue until the pending IO had timed out. This turned out to be a complicated interaction between queue freezing and polled IO: 1) You have an application that does polled IO. At any point in time, there may be polled IO pending. 2) You have a monitoring application that issues a passthrough command, which is marked with side effects such that it needs to freeze the queue. 3) Passthrough command is started, which calls blk_freeze_queue_start() on the device. At this point the queue is marked frozen, and any attempt to enter the queue will fail (for non-blocking) or block. 4) Now the driver calls blk_mq_freeze_queue_wait(), which will return when the queue is quiesced and pending IO has completed. 5) The pending IO is polled IO, but any attempt to poll IO through the normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to bio_queue_enter() as the queue is frozen. Rather than poll and complete IO, the polling threads will sit in a tight loop attempting to poll, but failing to enter the queue to do so. The end result is that progress for either application will be stalled until all pending polled IO has timed out. This causes obvious huge latency issues for the application doing polled IO, but also long delays for passthrough command. Fix this by treating queue enter for polled IO just like we do for timeouts. This allows quick quiesce of the queue as we still poll and complete this IO, while still disallowing queueing up new IO. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Li Nan	b326032965	blk-iocost: change div64_u64 to DIV64_U64_ROUND_UP in ioc_refresh_params() vrate_min is calculated by DIV64_U64_ROUND_UP, but vrate_max is calculated by div64_u64. Vrate_min may be 1 greater than vrate_max if the input values min and max of cost.qos are equal. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Li Nan	984af1e66b	blk-iocost: fix divide by 0 error in calc_lcoefs() echo max of u64 to cost.model can cause divide by 0 error. # echo 8:0 rbps=18446744073709551615 > /sys/fs/cgroup/io.cost.model divide error: 0000 [#1] PREEMPT SMP RIP: 0010:calc_lcoefs+0x4c/0xc0 Call Trace: <TASK> ioc_refresh_params+0x2b3/0x4f0 ioc_cost_model_write+0x3cb/0x4c0 ? _copy_from_iter+0x6d/0x6c0 ? kernfs_fop_write_iter+0xfc/0x270 cgroup_file_write+0xa0/0x200 kernfs_fop_write_iter+0x17d/0x270 vfs_write+0x414/0x620 ksys_write+0x73/0x160 __x64_sys_write+0x1e/0x30 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL, overflow would happen if bps plus IOC_PAGE_SIZE is greater than ULLONG_MAX, it can cause divide by 0 error. Fix the problem by setting basecost Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Yu Kuai	35198e3230	blk-iocost: read params inside lock in sysfs apis Otherwise, user might get abnormal values if params is updated concurrently. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Yu Kuai	235a5a83f6	blk-iocost: don't allow to configure bio based device iocost is based on rq_qos, which can only work for request based device, thus it doesn't make sense to configure iocost for bio based device. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Yu Kuai	7b7c5ae440	blk-iocost: check return value of match_u64() This patch fixes that the return value of match_u64() from ioc_qos_write() is not checked. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230117070806.3857142-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Arnd Bergmann	5f2779dfa7	blk-iocost: avoid 64-bit division in ioc_timer_fn The behavior of 'enum' types has changed in gcc-13, so now the UNBUSY_THR_PCT constant is interpreted as a 64-bit number because it is defined as part of the same enum definition as some other constants that do not fit within a 32-bit integer. This in turn leads to some inefficient code on 32-bit architectures as well as a link error: arm-linux-gnueabi/bin/arm-linux-gnueabi-ld: block/blk-iocost.o: in function `ioc_timer_fn': blk-iocost.c:(.text+0x68e8): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: blk-iocost.c:(.text+0x6908): undefined reference to `__aeabi_uldivmod' Split the enum definition to keep the 64-bit timing constants in a separate enum type from those constants that can clearly fit within a smaller type. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230118080706.3303186-1-arnd@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Ming Lei	464544fb93	block: ublk: fix doc build warning Fix the following warning: Documentation/block/ublk.rst:157: WARNING: Enumerated list ends without a blank line; unexpected unindent. Documentation/block/ublk.rst:171: WARNING: Enumerated list ends without a blank line; unexpected unindent. Fixes: 56f5160bc1b8 ("ublk_drv: add mechanism for supporting unprivileged ublk device") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230118042318.127900-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Pankaj Raghav	d67ea690ce	block: introduce bdev_zone_no helper Add a generic bdev_zone_no() helper to calculate zone number for a given sector in a block device. This helper internally uses disk_zone_no() to find the zone number. Use the helper bdev_zone_no() to calculate nr of zones. This lets us make modifications to the math if needed in one place. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20230110143635.77300-4-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Pankaj Raghav	e29b210021	block: add a new helper bdev_{is_zone_start, offset_from_zone_start} Instead of open coding to check for zone start, add a helper to improve readability and store the logic in one place. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230110143635.77300-3-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Pankaj Raghav	fea127b36c	block: remove superfluous check for request queue in bdev_is_zoned() Remove the superfluous request queue check in bdev_is_zoned() as bdev_get_queue() can never return NULL. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20230110143635.77300-2-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Anuj Gupta	7e2e355dd9	block: extend bio-cache for non-polled requests This patch modifies the present check, so that bio-cache is not limited to iopoll. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20230117120638.72254-3-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Anuj Gupta	888545cb43	nvme: set REQ_ALLOC_CACHE for uring-passthru request This patch sets REQ_ALLOC_CACHE flag for uring-passthru requests. This is a prep-patch so that normal / IRQ-driven uring-passthru I/Os can also leverage bio-cache. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20230117120638.72254-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Ming Lei	4093cb5a06	ublk_drv: add mechanism for supporting unprivileged ublk device unprivileged ublk device is helpful for container use case, such as: ublk device created in one unprivileged container can be controlled and accessed by this container only. Implement this feature by adding flag of UBLK_F_UNPRIVILEGED_DEV, and if this flag isn't set, any control command has to be run by a privileged user. Otherwise, any control command can be sent from any unprivileged user, but the user has to be permitted to access the ublk char device to be controlled. In case of UBLK_F_UNPRIVILEGED_DEV: 1) for command UBLK_CMD_ADD_DEV, it is always allowed, and user needs to provide owner's uid/gid in this command, so that udev can set correct ownership for the created ublk device, since the device owner uid/gid can be queried via command of UBLK_CMD_GET_DEV_INFO. 2) for other control commands, they can only be run successfully if the current user is allowed to access the specified ublk char device, for running the permission check, path of the ublk char device has to be provided by these commands. Also add one control of command UBLK_CMD_GET_DEV_INFO2 which always includes the char dev path in payload since userspace may not have knowledge if this device is created in unprivileged mode. For applying this mechanism, system administrator needs to take the following policies: 1) chmod 0666 /dev/ublk-control 2) change ownership of ublkcN & ublkbN - chown owner_uid:owner_gid /dev/ublkcN - chown owner_uid:owner_gid /dev/ublkbN Both can be done via one simple udev rule. Userspace: https://github.com/ming1/ubdsrv/tree/unprivileged-ublk 'ublk add -t $TYPE --un_privileged=1' is for creating one un-privileged ublk device if the user is un-privileged. Link: https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/ Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230106041711.914434-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Ming Lei	403ebc8778	ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev Prepare for supporting unprivileged ublk device by limiting max number ublk devices added. Otherwise too many ublk devices could be added by un-trusted user, which can be thought as one DoS. Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230106041711.914434-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Ming Lei	abb864d380	ublk_drv: add device parameter UBLK_PARAM_TYPE_DEVT Userspace side only knows device ID, but the associated path of ublkc* and ublkb* could be changed by udev, and that depends on userspace's policy, so add parameter of UBLK_PARAM_TYPE_DEVT for retrieving major/minor of the ublkc* and ublkb*, then user may figure out major/minor of the ublk disks he/she owns. With major/minor, it is easy to find the device node path. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230106041711.914434-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Ming Lei	bfbcef0363	ublk_drv: move ublk_get_device_from_id into ublk_ctrl_uring_cmd It is annoying for each control command handler to get/put ublk device and deal with failure. Control command handler is simplified a lot by moving ublk_get_device_from_id into ublk_ctrl_uring_cmd(). Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230106041711.914434-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Ming Lei	73a166d974	ublk_drv: don't probe partitions if the ubq daemon isn't trusted If any ubq daemon is unprivileged, the ublk char device is actually accessible to unprivileged users, and we can't trust the current user, so do not probe partitions. Fixes: `71f28f3136` ("ublk_drv: add io_uring based userspace block driver") Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230106041711.914434-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Ming Lei	ed878d1c1c	ublk_drv: remove nr_aborted_queues from ublk_device No one uses 'nr_aborted_queues' any more, so remove it. Reviewed-by: ZiyangZhang <ZiyangZhang@linux.alibaba.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230106041711.914434-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Jens Axboe	67d59247d4	block: don't allow multiple bios for IOCB_NOWAIT issue If we're doing a large IO request which needs to be split into multiple bios for issue, then we can run into the same situation as the below marked commit fixes - parts will complete just fine, one or more parts will fail to allocate a request. This will result in a partially completed read or write request, where the caller gets EAGAIN even though parts of the IO completed just fine. Do the same for large bios as we do for splits - fail a NOWAIT request with EAGAIN. This isn't technically fixing an issue in the below marked patch, but for stable purposes, we should have either none of them or both. This depends on: `613b14884b` ("block: handle bio_split_to_limits() NULL return") Cc: stable@vger.kernel.org # 5.15+ Fixes: `9cea62b2cb` ("block: don't allow splitting of a REQ_NOWAIT bio") Link: https://github.com/axboe/liburing/issues/766 Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:34 -07:00
Andreas Gruenbacher	2bb34fa6ff	drbd: drbd_insert_interval(): Clarify comment Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20230113123538.144276-9-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Lars Ellenberg	2990ca29f3	drbd: interval tree: make removing an "empty" interval a no-op Trying to remove an "empty" (just initialized, or "cleared") interval from the tree results in an endless loop. As we typically protect the tree with a spinlock_irq, the result is a hung system. Be nice to error cleanup code paths, ignore removal of empty intervals. Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Link: https://lore.kernel.org/r/20230113123538.144276-8-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Christoph Böhmwalder	6d9be160df	MAINTAINERS: add drbd headers Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123538.144276-7-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Christoph Böhmwalder	9cf766a457	drbd: remove macros using require_context This require_context attribute originated in a proposed sparse patch by Philipp Reisner back in 2008. Johannes Berg had a different solution to a similar problem, and that patch "won" in the end; so the require_context thing never got merged. The whole history can be read at [0]. DRBD kept using these annotations anyway for a while. Nowadays, on a modern unmodified sparse, they obviously do nothing, and they are hardly used anymore anyway. So, just remove the definitions of these macros. [0] https://www.spinics.net/lists/linux-sparse/msg01150.html Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123538.144276-6-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Christoph Böhmwalder	069182007d	drbd: remove unnecessary assignment in vli_encode_bits Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123538.144276-5-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Christoph Böhmwalder	c10bdcf983	drbd: make limits unsigned These are almost always used as unsigned integers, so mark them as such. Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123538.144276-4-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Robert Altnoeder	2167879655	drbd: fix DRBD_VOLUME_MAX 65535 -> 65534 The protocol uses -1 as a reserved value for 'no specific volume', and since the protocol field is a 16 bit unsigned value, -1 is converted to 65535. Therefore, limit the range of valid volume numbers to [0, 65534]. Signed-off-by: Robert Altnoeder <robert.altnoeder@linbit.com> Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123538.144276-3-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Christoph Böhmwalder	3780006867	drbd: adjust drbd_limits license header See also commit `93c68cc46a` ("drbd: use consistent license"). We only want to license drbd under GPL-2.0, so use the corresponding SPDX header consistently. Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123538.144276-2-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Christoph Böhmwalder	20f2a34a42	drbd: split off drbd_config into separate file To be more similar to what we do in the out-of-tree module and ease the upstreaming process. Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123506.144082-4-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Christoph Böhmwalder	4e2da933b9	drbd: drop API_VERSION define Use the genetlink api version as defined in drbd_genl_api.h. Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123506.144082-3-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Christoph Böhmwalder	887b98c74f	drbd: split off drbd_buildtag into separate file To be more similar to what we do in the out-of-tree module and ease the upstreaming process. Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> Reviewed-by: Joel Colledge <joel.colledge@linbit.com> Link: https://lore.kernel.org/r/20230113123506.144082-2-christoph.boehmwalder@linbit.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Jens Axboe	a3df2e456c	block: add a BUILD_BUG_ON() for adding more bio flags than we have space We have BIO_FLAG_LAST in the enum for bio specific flags, but it's not used to check that we're not exceeding the size of them. Add such a check. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Keith Busch	c9c77418a9	block: save user max_sectors limit The user can set the max_sectors limit to any valid value via sysfs /sys/block/<dev>/queue/max_sectors_kb attribute. If the device limits are ever rescanned, though, the limit reverts back to the potentially artificially low BLK_DEF_MAX_SECTORS value. Preserve the user's setting as the max_sectors limit as long as it's valid. The user can reset back to defaults by writing 0 to the sysfs file. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230105205146.3610282-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Keith Busch	0a26f327e4	block: make BLK_DEF_MAX_SECTORS unsigned This is used as an unsigned value, so define it that way to avoid having to cast it. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230105205146.3610282-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Davide Zini	1bd43e19de	block, bfq: balance I/O injection among underutilized actuators Upon the invocation of its dispatch function, BFQ returns the next I/O request of the in-service bfq_queue, unless some exception holds. One such exception is that there is some underutilized actuator, different from the actuator for which the in-service queue contains I/O, and that some other bfq_queue happens to contain I/O for such an actuator. In this case, the next I/O request of the latter bfq_queue, and not of the in-service bfq_queue, is returned (I/O is injected from that bfq_queue). To find such an actuator, a linear scan, in increasing index order, is performed among actuators. Performing a linear scan entails a prioritization among actuators: an underutilized actuator may be considered for injection only if all actuators with a lower index are currently fully utilized, or if there is no pending I/O for any lower-index actuator that happens to be underutilized. This commit breaks this prioritization and tends to distribute injection uniformly across actuators. This is obtained by adding the following condition to the linear scan: even if an actuator A is underutilized, A is however skipped if its load is higher than that of the next actuator. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-9-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Davide Zini	2d31c684a0	block, bfq: inject I/O to underutilized actuators The main service scheme of BFQ for sync I/O is serving one sync bfq_queue at a time, for a while. In particular, BFQ enforces this scheme when it deems the latter necessary to boost throughput or to preserve service guarantees. Unfortunately, when BFQ enforces this policy, only one actuator at a time gets served for a while, because each bfq_queue contains I/O only for one actuator. The other actuators may remain underutilized. Actually, BFQ may serve (inject) extra I/O, taken from other bfq_queues, in parallel with that of the in-service queue. This injection mechanism may provide the ground for dealing also with the above actuator-underutilization problem. Yet BFQ does not take the actuator load into account when choosing which queue to pick extra I/O from. In addition, BFQ may happen to inject extra I/O only when the in-service queue is temporarily empty. In view of these facts, this commit extends the injection mechanism in such a way that the latter: (1) takes into account also the actuator load; (2) checks such a load on each dispatch, and injects I/O for an underutilized actuator, if there is one and there is I/O for it. To perform the check in (2), this commit introduces a load threshold, currently set to 4. A linear scan of each actuator is performed, until an actuator is found for which the following two conditions hold: the load of the actuator is below the threshold, and there is at least one non-in-service queue that contains I/O for that actuator. If such a pair (actuator, queue) is found, then the head request of that queue is returned for dispatch, instead of the head request of the in-service queue. We have set the threshold, empirically, to the minimum possible value for which an actuator is fully utilized, or close to be fully utilized. By doing so, injected I/O 'steals' as few drive-queue slots as possibile to the in-service queue. 
This reduces as much as possible the probability that the service of I/O from the in-service bfq_queue gets delayed because of slot exhaustion, i.e., because all the slots of the drive queue are filled with I/O injected from other queues (NCQ provides for 32 slots). This new mechanism also counters actuator underutilization in the case of asymmetric configurations of bfq_queues. Namely if there are few bfq_queues containing I/O for some actuators and many bfq_queues containing I/O for other actuators. Or if the bfq_queues containing I/O for some actuators have lower weights than the other bfq_queues. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
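The threshold-based scan described in commit 2d31c684a0 can be sketched as below. This is a minimal user-space illustration under stated assumptions, not the scheduler's implementation: `struct sketch_sched`, its fields, `SKETCH_MAX_ACTUATORS`, and `sketch_pick_underused_actuator` are hypothetical names. Only the threshold value of 4 is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_ACTUATORS  8   /* hypothetical upper bound for this sketch */
#define SKETCH_LOAD_THRESHOLD 4   /* threshold value stated in the commit message */

/* Hypothetical per-device state, reduced to what the scan needs. */
struct sketch_sched {
    size_t num_actuators;
    /* Requests currently in flight for each actuator. */
    unsigned int load[SKETCH_MAX_ACTUATORS];
    /* Non-in-service queues that hold I/O for each actuator. */
    unsigned int pending[SKETCH_MAX_ACTUATORS];
};

/*
 * Linear scan over the actuators. Returns the index of the first actuator
 * whose load is below the threshold and for which at least one
 * non-in-service queue has I/O, or -1 if no such actuator exists
 * (in which case the in-service queue would be served as usual).
 */
int sketch_pick_underused_actuator(const struct sketch_sched *s)
{
    assert(s != NULL && s->num_actuators <= SKETCH_MAX_ACTUATORS);

    for (size_t i = 0; i < s->num_actuators; i++) {
        if (s->load[i] < SKETCH_LOAD_THRESHOLD && s->pending[i] > 0)
            return (int)i;
    }
    return -1;
}
```

Keeping the threshold small means an injection target stops qualifying after only a few requests are in flight for it, which matches the stated goal of leaving most drive-queue slots to the in-service queue.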
Federico Gavioli	4fdb3b9f2a	block, bfq: retrieve independent access ranges from request queue This patch implements the code to gather the content of the independent_access_ranges structure from the request_queue and copy it into the queue's bfq_data. This copy is done at queue initialization. We copy the access ranges into the bfq_data to avoid taking the queue lock each time we access the ranges. This implementation, however, puts a limit to the maximum independent ranges supported by the scheduler. Such a limit is equal to the constant BFQ_MAX_ACTUATORS. This limit was placed to avoid the allocation of dynamic memory. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Co-developed-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Federico Gavioli <f.gavioli97@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
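The fixed-size copy described in commit 4fdb3b9f2a can be illustrated with the sketch below. All names here (`cap_range`, `cap_sched_data`, `cap_copy_ranges`, `CAP_MAX_ACTUATORS` and its value) are hypothetical stand-ins, not kernel identifiers. The commit message does not say what happens when a device reports more ranges than the limit; falling back to a single range covering the whole device is an assumption made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_MAX_ACTUATORS 8   /* hypothetical stand-in for BFQ_MAX_ACTUATORS */

/* Hypothetical description of one independent access range. */
struct cap_range {
    uint64_t sector;      /* first sector of the range */
    uint64_t nr_sectors;  /* number of sectors in the range */
};

/* Hypothetical scheduler-private data with a fixed-size array, so that
 * no dynamic memory is needed to hold the copied ranges. */
struct cap_sched_data {
    size_t num_actuators;
    struct cap_range ranges[CAP_MAX_ACTUATORS];
};

/*
 * Copy the ranges once, at initialization, so later lookups read the
 * private copy instead of the shared source. Returns the number of
 * actuators recorded.
 *
 * Assumption of this sketch: with no ranges, or more ranges than the
 * array can hold, treat the device as having a single actuator that
 * spans the whole capacity.
 */
size_t cap_copy_ranges(struct cap_sched_data *d, const struct cap_range *src,
                       size_t n, uint64_t capacity)
{
    assert(d != NULL);

    if (src == NULL || n == 0 || n > CAP_MAX_ACTUATORS) {
        d->num_actuators = 1;
        d->ranges[0].sector = 0;
        d->ranges[0].nr_sectors = capacity;
        return 1;
    }

    for (size_t i = 0; i < n; i++)
        d->ranges[i] = src[i];
    d->num_actuators = n;
    return n;
}
```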
Davide Zini	8b7fd74111	block, bfq: split also async bfq_queues on a per-actuator basis Similarly to sync bfq_queues, also async bfq_queues need to be split on a per-actuator basis. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:33 -07:00
Paolo Valente	fd571df0ac	block, bfq: turn bfqq_data into an array in bfq_io_cq When a bfq_queue Q is merged with another queue, several pieces of information are saved about Q. These pieces are stored in the bfqq_data field in the bfq_io_cq data structure of the process associated with Q. Yet, with a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. Each of these queues may undergo a merge. So, the bfq_io_cq data structure must be able to accommodate the above information for N queues. This commit solves this problem by turning the bfqq_data scalar field into an array of N elements (and by changing code so as to handle this array). This solution is written under the assumption that bfq_queues associated with different actuators cannot be cross-merged. This assumption holds naturally with basic queue merging: the latter is triggered by spatial locality, and sectors for different actuators are not close to each other (apart from the corner case of the last sectors served by a given actuator and the first sectors served by the next actuator). As for stable cross-merging, the assumption here is that it is disabled. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:32 -07:00
Paolo Valente	a61230470c	block, bfq: move io_cq-persistent bfqq data into a dedicated struct With a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. So, the bfq_io_cq data structure must be able to accommodate its per-queue persistent information for N queues. Currently it stores this information for just one queue, in several scalar fields. This is a preparatory commit for moving to accommodating persistent information for N queues. In particular, this commit packs all the above scalar fields into a single data structure. Then there is now only one field, in bfq_io_cq, that stores all the above information. This scalar field will then be turned into an array by a following commit. Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:32 -07:00
Paolo Valente	b752989897	block, bfq: forbid stable merging of queues associated with different actuators If queues associated with different actuators are merged, then control is lost on each actuator. Therefore some actuator may be underutilized, and throughput may decrease. This problem cannot occur with basic queue merging, because the latter is triggered by spatial locality, and sectors for different actuators are not close to each other. Yet it may happen with stable merging. To address this issue, this commit prevents stable merging from occurring among queues associated with different actuators. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2023-01-29 15:18:32 -07:00

1154960 Commits