linux/block
Damien Le Moal 7b29518728 block: Do not remove zone write plugs still in use
Large write BIOs that span a zone boundary are split in
blk_mq_submit_bio() before being passed to blk_zone_plug_bio() for zone
write plugging. Such split BIO will be chained with one fragment
targeting one zone and the remainder of the BIO targeting the next
zone. The two BIOs can be executed in parallel, without a predetermine
order relative to eachother and their completion may be reversed: the
remainder first completing and the first fragment then completing. In
such case, bio_endio() will not immediately execute
blk_zone_write_plug_bio_endio() for the parent BIO (the remainder of the
split BIO) as the BIOs are chained. blk_zone_write_plug_bio_endio() for
the parent BIO will be executed only once the first fragment completes.

In the case of a device with small zones and very large BIOs, uch
completion pattern can lead to disk_should_remove_zone_wplug() to return
true for the zone of the parent BIO when the parent BIO request
completes and blk_zone_write_plug_complete_request() is executed. This
triggers the removal of the zone write plug from the hash table using
disk_remove_zone_wplug(). With the zone write plug of the parent BIO
missing, the call to disk_get_zone_wplug() in
blk_zone_write_plug_bio_endio() returns NULL and triggers a warning.

This patterns can be recreated fairly easily using a scsi_debug device
with small zone and btrfs. E.g.

modprobe scsi_debug delay=0 dev_size_mb=1024 sector_size=4096 \
	zbc=host-managed zone_cap_mb=3 zone_nr_conv=0 zone_size_mb=4
mkfs.btrfs -f -O zoned /dev/sda
mount -t btrfs /dev/sda /mnt
fio --name=wrtest --rw=randwrite --direct=1 --ioengine=libaio \
	--bs=4k --iodepth=16 --size=1M --directory=/mnt --time_based \
	--runtime=10
umount /dev/sda

Will result in the warning:

[   29.035538] WARNING: CPU: 3 PID: 37 at block/blk-zoned.c:1207 blk_zone_write_plug_bio_endio+0xee/0x1e0
...
[   29.058682] Call Trace:
[   29.059095]  <TASK>
[   29.059473]  ? __warn+0x80/0x120
[   29.059983]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
[   29.060728]  ? report_bug+0x160/0x190
[   29.061283]  ? handle_bug+0x36/0x70
[   29.061830]  ? exc_invalid_op+0x17/0x60
[   29.062399]  ? asm_exc_invalid_op+0x1a/0x20
[   29.063025]  ? blk_zone_write_plug_bio_endio+0xee/0x1e0
[   29.063760]  bio_endio+0xb7/0x150
[   29.064280]  btrfs_clone_write_end_io+0x2b/0x60 [btrfs]
[   29.065049]  blk_update_request+0x17c/0x500
[   29.065666]  scsi_end_request+0x27/0x1a0 [scsi_mod]
[   29.066356]  scsi_io_completion+0x5b/0x690 [scsi_mod]
[   29.067077]  blk_complete_reqs+0x3a/0x50
[   29.067692]  __do_softirq+0xcf/0x2b3
[   29.068248]  ? sort_range+0x20/0x20
[   29.068791]  run_ksoftirqd+0x1c/0x30
[   29.069339]  smpboot_thread_fn+0xcc/0x1b0
[   29.069936]  kthread+0xcf/0x100
[   29.070438]  ? kthread_complete_and_exit+0x20/0x20
[   29.071314]  ret_from_fork+0x31/0x50
[   29.071873]  ? kthread_complete_and_exit+0x20/0x20
[   29.072563]  ret_from_fork_asm+0x11/0x20
[   29.073146]  </TASK>

either when fio executes or when unmount is executed.

Fix this by modifying disk_should_remove_zone_wplug() to check that the
reference count to a zone write plug is not larger than 2, that is, that
the only references left on the zone are the caller held reference
(blk_zone_write_plug_complete_request()) and the initial extra reference
for the zone write plug taken when it was initialized (and that is
dropped when the zone write plug is removed from the hash table).

To be consistent with this change, make sure to drop the request or BIO
held reference to the zone write plug before calling
disk_zone_wplug_unplug_bio(). All references are also dropped using
disk_put_zone_wplug() instead of atomic_dec() to ensure that the zone
write plug is freed if it needs to be.

Comments are also improved to clarify zone write plugs reference
handling.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240501110907.96950-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-01 08:08:43 -06:00
..
partitions block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC 2024-03-09 07:31:42 -07:00
badblocks.c badblocks: avoid checking invalid range in badblocks_check() 2023-12-23 18:38:08 -07:00
bdev.c fs,block: get holder during claim 2024-03-18 10:32:44 +01:00
bfq-cgroup.c block: add blk_time_get_ns() and blk_time_get() helpers 2024-02-05 10:07:22 -07:00
bfq-iosched.c block: add blk_time_get_ns() and blk_time_get() helpers 2024-02-05 10:07:22 -07:00
bfq-iosched.h block, bfq: remove BFQ_WEIGHT_LEGACY_DFL 2023-04-06 16:17:32 -06:00
bfq-wf2q.c block, bfq: inject I/O to underutilized actuators 2023-01-29 15:18:33 -07:00
bio-integrity.c block: support PI at non-zero offset within metadata 2024-02-12 08:49:31 -07:00
bio.c block: Introduce zone write plugging 2024-04-17 08:44:03 -06:00
blk-cgroup-fc-appid.c block: Replace all non-returning strlcpy with strscpy 2023-06-01 09:13:31 -06:00
blk-cgroup-rwstat.c blk-cgroup: use group allocation/free of per-cpu counters API 2024-04-03 09:10:17 -06:00
blk-cgroup-rwstat.h block: Use the new blk_opf_t type 2022-07-14 12:14:30 -06:00
blk-cgroup.c blk-cgroup: use bio_list_merge_init 2024-04-01 11:53:37 -06:00
blk-cgroup.h block: move cgroup time handling code into blk.h 2024-02-05 10:07:17 -07:00
blk-core.c block: Do not special-case plugging of zone write operations 2024-04-17 08:44:03 -06:00
blk-crypto-fallback.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
blk-crypto-internal.h blk-crypto: remove blk_crypto_insert_cloned_request() 2023-03-16 09:35:09 -06:00
blk-crypto-profile.c blk-crypto: use dynamic lock class for blk_crypto_profile::lock 2023-07-05 16:36:12 -06:00
blk-crypto-sysfs.c block: make kobj_type structures constant 2023-02-09 09:38:16 -07:00
blk-crypto.c blk-crypto: make blk_crypto_evict_key() more robust 2023-03-16 09:35:09 -06:00
blk-flush.c block: Restore sector of flush requests 2024-04-17 08:44:02 -06:00
blk-ia-ranges.c block: make kobj_type structures constant 2023-02-09 09:38:16 -07:00
blk-integrity.c block: support PI at non-zero offset within metadata 2024-02-12 08:49:31 -07:00
blk-ioc.c blk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue() 2023-06-07 07:51:00 -06:00
blk-iocost.c for-6.9/block-20240310 2024-03-11 11:43:44 -07:00
blk-iolatency.c block: add blk_time_get_ns() and blk_time_get() helpers 2024-02-05 10:07:22 -07:00
blk-ioprio.c blk-ioprio: Introduce promote-to-rt policy 2023-06-06 22:26:26 -06:00
blk-ioprio.h blk-ioprio: pass a gendisk to blk_ioprio_init and blk_ioprio_exit 2022-09-26 19:09:31 -06:00
blk-lib.c Revert "blk-lib: check for kill signal" 2024-03-13 20:35:48 -06:00
blk-map.c block: Fix WARNING in _copy_from_iter 2024-01-23 08:56:55 -07:00
blk-merge.c block: Do not special-case plugging of zone write operations 2024-04-17 08:44:03 -06:00
blk-mq-cpumap.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-debugfs.c block: Remove zone write locking 2024-04-17 08:44:03 -06:00
blk-mq-debugfs.h block: Replace zone_wlock debugfs entry with zone_wplugs entry 2024-04-17 08:44:03 -06:00
blk-mq-pci.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-sched.c blk-mq: Remove the hctx 'run' debugfs attribute 2024-01-17 14:16:34 -07:00
blk-mq-sched.h blk-mq: make sure elevator callbacks aren't called for passthrough request 2023-05-18 19:42:54 -06:00
blk-mq-sysfs.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-tag.c for-6.5/block-2023-06-23 2023-06-26 12:47:20 -07:00
blk-mq-virtio.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq.c block: Do not special-case plugging of zone write operations 2024-04-17 08:44:03 -06:00
blk-mq.h block: Do not special-case plugging of zone write operations 2024-04-17 08:44:03 -06:00
blk-pm.c block: Remove blk_set_runtime_active() 2023-11-20 10:22:40 -07:00
blk-pm.h
blk-rq-qos.c block: correct stale comment in rq_qos_wait 2023-09-18 14:15:28 -06:00
blk-rq-qos.h block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io 2023-12-01 18:29:18 -07:00
blk-settings.c block: Remove elevator required features 2024-04-17 08:44:03 -06:00
blk-stat.c block: prevent division by zero in blk_rq_stat_sum() 2024-03-06 08:31:54 -07:00
blk-stat.h block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
blk-sysfs.c block: Allow zero value of max_zone_append_sectors queue limit 2024-04-17 08:44:03 -06:00
blk-throttle.c blk-throttle: Only use seq_printf() in tg_prfill_limit() 2024-04-01 11:53:37 -06:00
blk-throttle.h blk-throttle: print signed value 'carryover_bytes/ios' for user 2023-08-30 10:15:01 -06:00
blk-timeout.c
blk-wbt.c for-6.9/block-20240310 2024-03-11 11:43:44 -07:00
blk-wbt.h blk-wbt: remove the separate write cache tracking 2023-12-26 09:28:10 -07:00
blk-zoned.c block: Do not remove zone write plugs still in use 2024-05-01 08:08:43 -06:00
blk.h block: Implement zone append emulation 2024-04-17 08:44:03 -06:00
bounce.c block, fs: Restore the per-bio/request data lifetime fields 2024-02-06 14:31:05 +01:00
bsg-lib.c block: pass a queue_limits argument to blk_mq_init_queue 2024-02-13 08:56:59 -07:00
bsg.c SCSI misc on 20230629 2023-06-30 11:57:07 -07:00
disk-events.c block: move bdev_mark_dead out of disk_check_media_change 2023-10-28 13:29:23 +02:00
early-lookup.c block: don't return -EINVAL for not found names in devt_from_devname 2023-06-22 09:09:33 -06:00
elevator.c block: Remove elevator required features 2024-04-17 08:44:03 -06:00
elevator.h block: Remove elevator required features 2024-04-17 08:44:03 -06:00
fops.c block: Call blkdev_dio_unaligned() from blkdev_direct_IO() 2024-04-15 08:12:22 -06:00
genhd.c block: Introduce zone write plugging 2024-04-17 08:44:03 -06:00
holder.c block: fix deadlock between bd_link_disk_holder and partition scan 2024-02-23 07:44:19 -07:00
ioctl.c for-6.9/block-20240310 2024-03-11 11:43:44 -07:00
ioprio.c block: move __get_task_ioprio() into header file 2024-01-08 12:27:39 -07:00
Kconfig block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED 2024-04-17 08:44:03 -06:00
Kconfig.iosched block: Default to use cgroup support for BFQ 2023-01-30 09:42:42 -07:00
kyber-iosched.c blk-mq: pass a flags argument to elevator_type->insert_requests 2023-04-13 06:52:30 -06:00
Makefile block: Move zone related debugfs attribute to blk-zoned.c 2024-04-17 08:44:03 -06:00
mq-deadline.c block/mq-deadline: Remove some unused functions 2024-04-19 08:10:36 -06:00
opal_proto.h block: sed-opal: handle empty atoms when parsing response 2024-02-16 15:52:45 -07:00
sed-opal.c for-6.9/block-20240310 2024-03-11 11:43:44 -07:00
t10-pi.c block: support PI at non-zero offset within metadata 2024-02-12 08:49:31 -07:00