linux/block
Damien Le Moal dd291d77cc block: Introduce zone write plugging
Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most only one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.

Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.

This mechanism provides the following benefits:
 - Zone write ordering is untangled from block IO schedulers. This
   removes the restriction of having to use mq-deadline for writing to
   zoned block devices: any block IO scheduler, including "none", can
   be used.
 - Zone write plugging operates on BIOs instead of requests. Plugged
   BIOs waiting for execution do not hold scheduling tags and thus do
   not prevent other BIOs (reads or writes to other zones) from
   executing. Depending on the workload, this can significantly improve
   device utilization (higher queue depth operation) and performance.
 - Both blk-mq (request-based) zoned devices and BIO-based zoned
   devices (e.g. device mapper) can use zone write plugging. It is
   mandatory for the former but optional for the latter: BIO-based
   drivers can use zone write plugging to implement write ordering
   guarantees, or implement their own mechanism if needed.
 - The code is less invasive in the block layer, being mostly limited
   to blk-zoned.c with small changes in blk-mq.c, blk-merge.c and
   bio.c.

Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plugs structures are
managed using a per-disk hash table.
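
For illustration only, a minimal sketch of such a plug structure,
assuming the usual block layer headers (field names are indicative and
may not match the exact layout in blk-zoned.c):

  struct blk_zone_wplug {
          struct hlist_node   node;      /* per-disk hash table linkage */
          spinlock_t          lock;      /* protects the fields below */
          unsigned int        flags;     /* plugged/error state bits */
          unsigned int        zone_no;   /* zone index on the disk */
          unsigned int        wp_offset; /* write pointer offset (sectors) */
          struct bio_list     bio_list;  /* plugged (delayed) write BIOs */
          struct work_struct  bio_work;  /* kblockd work submitting BIOs */
          struct rcu_head     rcu_head;  /* deferred freeing under RCU */
  };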

Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio(), which returns false if the execution of a
BIO does not need to be delayed and true otherwise. This function is
called from blk_mq_submit_bio() after a BIO is split, to avoid large
BIOs spanning multiple zones, which would cause mishandling of zone
write plugs. This change enables zone write plugging by default for
any mq request-based block device. BIO-based device drivers can also
use zone write plugging by explicitly calling blk_zone_write_plug_bio()
in their ->submit_bio method. For such devices, the driver must ensure
that a BIO passed to blk_zone_write_plug_bio() is already split and
does not straddle zone boundaries.
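
As an example of the latter, a BIO-based driver could hook into zone
write plugging from its ->submit_bio method roughly as follows. This is
a sketch only: example_submit_bio() and example_handle_bio() are
hypothetical driver functions, and the (bio, nr_segs) signature of
blk_zone_write_plug_bio() is assumed here.

  static void example_submit_bio(struct bio *bio)
  {
          /*
           * Let zone write plugging delay this BIO if a write to the
           * same zone is already in flight. The BIO must already be
           * split and must not straddle a zone boundary. Passing 0
           * segments lets the block layer recount them later.
           */
          if (blk_zone_write_plug_bio(bio, 0))
                  return;

          /* Normal submission path of the driver. */
          example_handle_bio(bio);
  }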

Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request executing such a BIO is in
turn flagged with the new RQF_ZONE_WRITE_PLUGGING flag. The completion
of a BIO or request carrying one of these flags triggers a call to
blk_zone_write_bio_endio() or blk_zone_write_complete_request()
respectively. The latter function is used to trigger submission of the
next plugged BIO using the zone plug work; blk_zone_write_bio_endio()
does the same for BIO-based devices. This ensures that at any time, at
most one request (blk-mq devices) or one BIO (BIO-based devices) is
being executed for any zone. Handling zone write plugs with a per-zone
plug spinlock maximizes parallelism and device usage by allowing
multiple zones to be written simultaneously without lock contention.
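
A rough sketch of the unplug step performed on completion, using the
plug structure sketched above (the helper and the
EXAMPLE_ZONE_WPLUG_PLUGGED flag are illustrative, not the actual
blk-zoned.c code):

  static void example_zone_wplug_unplug(struct blk_zone_wplug *zwplug)
  {
          unsigned long flags;

          spin_lock_irqsave(&zwplug->lock, flags);
          /*
           * If more write BIOs are waiting for this zone, kick the
           * kblockd work that will submit the next one. Otherwise,
           * the zone is no longer plugged.
           */
          if (!bio_list_empty(&zwplug->bio_list))
                  kblockd_schedule_work(&zwplug->bio_work);
          else
                  zwplug->flags &= ~EXAMPLE_ZONE_WPLUG_PLUGGED;
          spin_unlock_irqrestore(&zwplug->lock, flags);
  }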

Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.
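
Together with the earlier rule that only write and write zeroes BIOs
are plugged, these checks can be expressed with existing bio helpers.
A minimal sketch (the helper name is illustrative):

  static bool example_bio_needs_zone_wplug(struct bio *bio)
  {
          /* Data-less flushes bypass zone write plugging entirely. */
          if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
                  return false;

          /* Only writes and write zeroes are plugged. */
          switch (bio_op(bio)) {
          case REQ_OP_WRITE:
          case REQ_OP_WRITE_ZEROES:
                  return true;
          default:
                  return false;
          }
  }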

Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, an unplugged and
submitted BIO has no chance of merging with plugged requests or with
requests in the scheduler. To overcome this potential performance
degradation, blk_mq_submit_bio() calls
blk_zone_write_plug_attempt_merge() to try to merge other plugged BIOs
with the one just unplugged and submitted. Successful merging is
signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the
number of segments of plugged BIOs when attempting a merge, the segment
count of a plugged BIO is saved in the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field
is added as a union with the bi_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.
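
The resulting space-neutral change to struct bio can be pictured as the
following union (a sketch of the relevant fields only):

  union {
          /* Poll cookie, only used when polling is enabled. */
          blk_qc_t        bi_cookie;
          /*
           * Cached segment count of a plugged BIO. Safe to overlay
           * with the poll cookie since plugged BIOs are never polled.
           */
          unsigned int    __bi_nr_segments;
  };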

When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.
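
A sketch of how the unplug path could account for that reference (the
helper name is illustrative; the actual handling is split between
blk-zoned.c and blk_mq_submit_bio()):

  static void example_submit_unplugged_bio(struct bio *bio)
  {
          struct request_queue *q = bdev_get_queue(bio->bi_bdev);

          /* Resubmit the BIO that was just popped off the zone plug. */
          submit_bio_noacct_nocheck(bio);

          /*
           * blk-mq devices reuse the queue usage reference taken when
           * the BIO was plugged (blk_mq_submit_bio() recognizes the
           * BIO_ZONE_WRITE_PLUGGING flag). For BIO-based devices the
           * reference is not consumed by the submission path, so drop
           * it here.
           */
          if (!queue_is_mq(q))
                  blk_queue_exit(q);
  }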

Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
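
A sketch of this bookkeeping, assuming the wp_offset field sketched
earlier (in sectors) and an illustrative helper name:

  static void example_advance_wp(struct blk_zone_wplug *zwplug,
                                 struct bio *bio, sector_t zone_sectors)
  {
          switch (bio_op(bio)) {
          case REQ_OP_WRITE:
          case REQ_OP_WRITE_ZEROES:
                  /* Writes advance the write pointer offset. */
                  zwplug->wp_offset += bio_sectors(bio);
                  break;
          case REQ_OP_ZONE_RESET:
                  zwplug->wp_offset = 0;
                  break;
          case REQ_OP_ZONE_FINISH:
                  zwplug->wp_offset = zone_sectors;
                  break;
          default:
                  break;
          }
  }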

If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
zone report to update the zone write pointer offset to the current
value indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plug work keeps running
until all zone errors are handled.
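
Refreshing the write pointer offset from the device relies on the
existing zone report machinery. A hedged sketch, with illustrative
callback and helper names:

  static int example_report_cb(struct blk_zone *zone, unsigned int idx,
                               void *data)
  {
          unsigned int *wp_offset = data;

          /* Recompute the write pointer offset relative to zone start. */
          if (zone->cond == BLK_ZONE_COND_FULL)
                  *wp_offset = zone->len;
          else
                  *wp_offset = zone->wp - zone->start;
          return 0;
  }

  static int example_refresh_wp_offset(struct gendisk *disk,
                                       sector_t zone_start,
                                       unsigned int *wp_offset)
  {
          /*
           * Report the single zone starting at zone_start. Returns the
           * number of zones reported or a negative error code.
           */
          return blkdev_report_zones(disk->part0, zone_start, 1,
                                     example_report_cb, wp_offset);
  }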

To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plugs resources when a gendisk
is allocated.

In order to guarantee that the user can simultaneously write up to a
number of zones equal to the device's max active zone limit or max open
zone limit, zone write plugs are allocated using a mempool sized to the
maximum of these two limits. For a device that has neither an active
nor an open zone limit, 128 is used as the default mempool size.

If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.
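
A sketch of the sizing rule using the existing queue limit helpers (the
pool field name and allocation helper shown here are assumptions for
illustration):

  static int example_alloc_zone_wplugs_pool(struct gendisk *disk)
  {
          struct request_queue *q = disk->queue;
          unsigned int pool_size;

          /*
           * Allow as many plugs as zones that can be simultaneously
           * open or active. Without such limits, default to 128.
           */
          pool_size = max(queue_max_open_zones(q),
                          queue_max_active_zones(q));
          if (!pool_size)
                  pool_size = 128;

          return mempool_init_kmalloc_pool(&disk->zone_wplugs_pool,
                                           pool_size,
                                           sizeof(struct blk_zone_wplug));
  }

When blk_revalidate_disk_zones() detects that these limits changed, the
pool can be adjusted with mempool_resize().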

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17 08:44:03 -06:00
partitions block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC 2024-03-09 07:31:42 -07:00
badblocks.c badblocks: avoid checking invalid range in badblocks_check() 2023-12-23 18:38:08 -07:00
bdev.c fs,block: get holder during claim 2024-03-18 10:32:44 +01:00
bfq-cgroup.c block: add blk_time_get_ns() and blk_time_get() helpers 2024-02-05 10:07:22 -07:00
bfq-iosched.c block: add blk_time_get_ns() and blk_time_get() helpers 2024-02-05 10:07:22 -07:00
bfq-iosched.h block, bfq: remove BFQ_WEIGHT_LEGACY_DFL 2023-04-06 16:17:32 -06:00
bfq-wf2q.c block, bfq: inject I/O to underutilized actuators 2023-01-29 15:18:33 -07:00
bio-integrity.c block: support PI at non-zero offset within metadata 2024-02-12 08:49:31 -07:00
bio.c block: Introduce zone write plugging 2024-04-17 08:44:03 -06:00
blk-cgroup-fc-appid.c block: Replace all non-returning strlcpy with strscpy 2023-06-01 09:13:31 -06:00
blk-cgroup-rwstat.c blk-cgroup: use group allocation/free of per-cpu counters API 2024-04-03 09:10:17 -06:00
blk-cgroup-rwstat.h block: Use the new blk_opf_t type 2022-07-14 12:14:30 -06:00
blk-cgroup.c blk-cgroup: use bio_list_merge_init 2024-04-01 11:53:37 -06:00
blk-cgroup.h block: move cgroup time handling code into blk.h 2024-02-05 10:07:17 -07:00
blk-core.c block: pass a queue_limits argument to blk_alloc_queue 2024-02-13 08:56:59 -07:00
blk-crypto-fallback.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
blk-crypto-internal.h blk-crypto: remove blk_crypto_insert_cloned_request() 2023-03-16 09:35:09 -06:00
blk-crypto-profile.c blk-crypto: use dynamic lock class for blk_crypto_profile::lock 2023-07-05 16:36:12 -06:00
blk-crypto-sysfs.c block: make kobj_type structures constant 2023-02-09 09:38:16 -07:00
blk-crypto.c blk-crypto: make blk_crypto_evict_key() more robust 2023-03-16 09:35:09 -06:00
blk-flush.c block: Restore sector of flush requests 2024-04-17 08:44:02 -06:00
blk-ia-ranges.c block: make kobj_type structures constant 2023-02-09 09:38:16 -07:00
blk-integrity.c block: support PI at non-zero offset within metadata 2024-02-12 08:49:31 -07:00
blk-ioc.c blk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue() 2023-06-07 07:51:00 -06:00
blk-iocost.c for-6.9/block-20240310 2024-03-11 11:43:44 -07:00
blk-iolatency.c block: add blk_time_get_ns() and blk_time_get() helpers 2024-02-05 10:07:22 -07:00
blk-ioprio.c blk-ioprio: Introduce promote-to-rt policy 2023-06-06 22:26:26 -06:00
blk-ioprio.h blk-ioprio: pass a gendisk to blk_ioprio_init and blk_ioprio_exit 2022-09-26 19:09:31 -06:00
blk-lib.c Revert "blk-lib: check for kill signal" 2024-03-13 20:35:48 -06:00
blk-map.c block: Fix WARNING in _copy_from_iter 2024-01-23 08:56:55 -07:00
blk-merge.c block: Introduce zone write plugging 2024-04-17 08:44:03 -06:00
blk-mq-cpumap.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-debugfs-zoned.c block: move zone related fields to struct gendisk 2022-07-06 06:46:26 -06:00
blk-mq-debugfs.c blk-mq: Remove the hctx 'run' debugfs attribute 2024-01-17 14:16:34 -07:00
blk-mq-debugfs.h block: remove per-disk debugfs files in blk_unregister_queue 2022-06-17 07:31:05 -06:00
blk-mq-pci.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-sched.c blk-mq: Remove the hctx 'run' debugfs attribute 2024-01-17 14:16:34 -07:00
blk-mq-sched.h blk-mq: make sure elevator callbacks aren't called for passthrough request 2023-05-18 19:42:54 -06:00
blk-mq-sysfs.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-tag.c for-6.5/block-2023-06-23 2023-06-26 12:47:20 -07:00
blk-mq-virtio.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq.c block: Introduce zone write plugging 2024-04-17 08:44:03 -06:00
blk-mq.h blk-mq: update driver tags request table when start request 2023-09-22 08:52:13 -06:00
blk-pm.c block: Remove blk_set_runtime_active() 2023-11-20 10:22:40 -07:00
blk-pm.h
blk-rq-qos.c block: correct stale comment in rq_qos_wait 2023-09-18 14:15:28 -06:00
blk-rq-qos.h block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io 2023-12-01 18:29:18 -07:00
blk-settings.c block: don't reject too large max_user_sectors in blk_validate_limits 2024-03-26 11:28:52 -06:00
blk-stat.c block: prevent division by zero in blk_rq_stat_sum() 2024-03-06 08:31:54 -07:00
blk-stat.h
blk-sysfs.c block: use queue_limits_commit_update in queue_discard_max_store 2024-02-13 08:56:59 -07:00
blk-throttle.c blk-throttle: Only use seq_printf() in tg_prfill_limit() 2024-04-01 11:53:37 -06:00
blk-throttle.h blk-throttle: print signed value 'carryover_bytes/ios' for user 2023-08-30 10:15:01 -06:00
blk-timeout.c
blk-wbt.c for-6.9/block-20240310 2024-03-11 11:43:44 -07:00
blk-wbt.h blk-wbt: remove the separate write cache tracking 2023-12-26 09:28:10 -07:00
blk-zoned.c block: Introduce zone write plugging 2024-04-17 08:44:03 -06:00
blk.h block: Introduce zone write plugging 2024-04-17 08:44:03 -06:00
bounce.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
bsg-lib.c block: pass a queue_limits argument to blk_mq_init_queue 2024-02-13 08:56:59 -07:00
bsg.c SCSI misc on 20230629 2023-06-30 11:57:07 -07:00
disk-events.c block: move bdev_mark_dead out of disk_check_media_change 2023-10-28 13:29:23 +02:00
early-lookup.c block: don't return -EINVAL for not found names in devt_from_devname 2023-06-22 09:09:33 -06:00
elevator.c blk-mq: release scheduler resource when request completes 2023-08-19 07:47:17 -06:00
elevator.h blk-mq: pass a flags argument to elevator_type->insert_requests 2023-04-13 06:52:30 -06:00
fops.c block: Call blkdev_dio_unaligned() from blkdev_direct_IO() 2024-04-15 08:12:22 -06:00
genhd.c block: Introduce zone write plugging 2024-04-17 08:44:03 -06:00
holder.c block: fix deadlock between bd_link_disk_holder and partition scan 2024-02-23 07:44:19 -07:00
ioctl.c for-6.9/block-20240310 2024-03-11 11:43:44 -07:00
ioprio.c block: move __get_task_ioprio() into header file 2024-01-08 12:27:39 -07:00
Kconfig block: Add config option to not allow writing to mounted devices 2023-11-18 14:59:25 +01:00
Kconfig.iosched block: Default to use cgroup support for BFQ 2023-01-30 09:42:42 -07:00
kyber-iosched.c blk-mq: pass a flags argument to elevator_type->insert_requests 2023-04-13 06:52:30 -06:00
Makefile block: move the code to do early boot lookup of block devices to block/ 2023-06-05 10:57:40 -06:00
mq-deadline.c Revert "block/mq-deadline: use correct way to throttling write requests" 2024-03-13 15:56:14 -06:00
opal_proto.h block: sed-opal: handle empty atoms when parsing response 2024-02-16 15:52:45 -07:00
sed-opal.c for-6.9/block-20240310 2024-03-11 11:43:44 -07:00
t10-pi.c block: support PI at non-zero offset within metadata 2024-02-12 08:49:31 -07:00