After blk_cleanup_queue() returns, disk may not be released yet, so
probably bio may still be submitted and ->q_usage_counter may be
touched, so far this way seems safe, but not good from API's viewpoint.
Move the release q_usage_counter into blk_queue_release().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308055200.735835-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The queue's top debugfs dir is removed from blk_release_queue(), so all
hctx's debugfs dirs are removed from there. Given blk_mq_exit_queue()
is only called from blk_cleanup_queue(), it isn't necessary to remove
hctx debugfs from blk_mq_exit_queue().
So remove it from blk_mq_exit_queue().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308055200.735835-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg works on FS bio level, so it is reasonable to make both blkcg and
gendisk sharing same lifetime. Meantime there won't be any FS IO when
releasing disk, so safe to move blkcg initialization/destroy into disk
allocation/release handler
Long term, we can move blkcg into gendisk completely.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220308055200.735835-10-hch@lst.de
[axboe: fixup missing blk-cgroup.h include]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To simplify further changes allow for double calling blk_mq_free_rqs on
a queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: split out from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220308055200.735835-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I/O accounting buckets I/O into the read/write/discard categories into
which passthrough I/O does not fit at all. It also accounts to the
block_device, which may not even exist for passthrough I/O.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220308055200.735835-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First code becomes more clean by switching to xarray from plain array.
Second use-after-free on q->queue_hw_ctx can be fixed because
queue_for_each_hw_ctx() may be run when updating nr_hw_queues is
in-progress. With this patch, q->hctx_table is defined as xarray, and
this structure will share same lifetime with request queue, so
queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably.
Reported-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com
[axboe: fix blk_mq_hw_ctx forward declaration]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is inevitable to cause use-after-free on q->queue_hw_ctx between
queue_for_each_hw_ctx() and blk_mq_update_nr_hw_queues(). And converting
to xarray can fix the uaf, meantime code gets cleaner.
Prepare for converting q->queue_hctx_ctx into xarray, one thing is that
xa_for_each() can only accept 'unsigned long' as index, so changes type
of hctx index of queue_for_each_hw_ctx() into 'unsigned long'.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
queue map can be changed when updating nr_hw_queues, so we need to
reconfigure queue's poll capability. Add one helper for doing this job.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_alloc_and_init_hctx() has already taken reuse into account, so
no need to do it outside, then we can simplify blk_mq_realloc_hw_ctxs().
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The current code always uses default queue map and hw queue index
for figuring out the numa node for hw queue, this way isn't correct
because blk-mq supports three queue maps, and the correct queue map
should be used for the specified hw queue.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220308073219.91173-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A crash [1] happened to be triggered in conjunction with commit
2d52c58b9c ("block, bfq: honor already-setup queue merges"). The
latter was then reverted by commit ebc69e897e ("Revert "block, bfq:
honor already-setup queue merges""). Yet, the reverted commit was not
the one introducing the bug. In fact, it actually triggered a UAF
introduced by a different commit, and now fixed by commit d29bd41428
("block, bfq: reset last_bfqq_created on group change").
So, there is no point in keeping commit 2d52c58b9c ("block, bfq:
honor already-setup queue merges") out. This commit restores it.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211125181510.15004-1-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All callers are gone, so remove this wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220304180105.409765-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the %pg format specifier instead of the stack hungry bdevname
function, and remove handle_bad_sector given that it is not pointless.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220304180105.409765-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't use a WARN_ON when printing a potentially user triggered
condition. Also don't print the partno when the block device name
already includes it, and use the %pg specifier to simplify printing
the block device name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220304180105.409765-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
KASAN reports a use-after-free report when doing normal scsi-mq test
[69832.239032] ==================================================================
[69832.241810] BUG: KASAN: use-after-free in bfq_dispatch_request+0x1045/0x44b0
[69832.243267] Read of size 8 at addr ffff88802622ba88 by task kworker/3:1H/155
[69832.244656]
[69832.245007] CPU: 3 PID: 155 Comm: kworker/3:1H Not tainted 5.10.0-10295-g576c6382529e #8
[69832.246626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[69832.249069] Workqueue: kblockd blk_mq_run_work_fn
[69832.250022] Call Trace:
[69832.250541] dump_stack+0x9b/0xce
[69832.251232] ? bfq_dispatch_request+0x1045/0x44b0
[69832.252243] print_address_description.constprop.6+0x3e/0x60
[69832.253381] ? __cpuidle_text_end+0x5/0x5
[69832.254211] ? vprintk_func+0x6b/0x120
[69832.254994] ? bfq_dispatch_request+0x1045/0x44b0
[69832.255952] ? bfq_dispatch_request+0x1045/0x44b0
[69832.256914] kasan_report.cold.9+0x22/0x3a
[69832.257753] ? bfq_dispatch_request+0x1045/0x44b0
[69832.258755] check_memory_region+0x1c1/0x1e0
[69832.260248] bfq_dispatch_request+0x1045/0x44b0
[69832.261181] ? bfq_bfqq_expire+0x2440/0x2440
[69832.262032] ? blk_mq_delay_run_hw_queues+0xf9/0x170
[69832.263022] __blk_mq_do_dispatch_sched+0x52f/0x830
[69832.264011] ? blk_mq_sched_request_inserted+0x100/0x100
[69832.265101] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[69832.266206] ? blk_mq_do_dispatch_ctx+0x570/0x570
[69832.267147] ? __switch_to+0x5f4/0xee0
[69832.267898] blk_mq_sched_dispatch_requests+0xdf/0x140
[69832.268946] __blk_mq_run_hw_queue+0xc0/0x270
[69832.269840] blk_mq_run_work_fn+0x51/0x60
[69832.278170] process_one_work+0x6d4/0xfe0
[69832.278984] worker_thread+0x91/0xc80
[69832.279726] ? __kthread_parkme+0xb0/0x110
[69832.280554] ? process_one_work+0xfe0/0xfe0
[69832.281414] kthread+0x32d/0x3f0
[69832.282082] ? kthread_park+0x170/0x170
[69832.282849] ret_from_fork+0x1f/0x30
[69832.283573]
[69832.283886] Allocated by task 7725:
[69832.284599] kasan_save_stack+0x19/0x40
[69832.285385] __kasan_kmalloc.constprop.2+0xc1/0xd0
[69832.286350] kmem_cache_alloc_node+0x13f/0x460
[69832.287237] bfq_get_queue+0x3d4/0x1140
[69832.287993] bfq_get_bfqq_handle_split+0x103/0x510
[69832.289015] bfq_init_rq+0x337/0x2d50
[69832.289749] bfq_insert_requests+0x304/0x4e10
[69832.290634] blk_mq_sched_insert_requests+0x13e/0x390
[69832.291629] blk_mq_flush_plug_list+0x4b4/0x760
[69832.292538] blk_flush_plug_list+0x2c5/0x480
[69832.293392] io_schedule_prepare+0xb2/0xd0
[69832.294209] io_schedule_timeout+0x13/0x80
[69832.295014] wait_for_common_io.constprop.1+0x13c/0x270
[69832.296137] submit_bio_wait+0x103/0x1a0
[69832.296932] blkdev_issue_discard+0xe6/0x160
[69832.297794] blk_ioctl_discard+0x219/0x290
[69832.298614] blkdev_common_ioctl+0x50a/0x1750
[69832.304715] blkdev_ioctl+0x470/0x600
[69832.305474] block_ioctl+0xde/0x120
[69832.306232] vfs_ioctl+0x6c/0xc0
[69832.306877] __se_sys_ioctl+0x90/0xa0
[69832.307629] do_syscall_64+0x2d/0x40
[69832.308362] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[69832.309382]
[69832.309701] Freed by task 155:
[69832.310328] kasan_save_stack+0x19/0x40
[69832.311121] kasan_set_track+0x1c/0x30
[69832.311868] kasan_set_free_info+0x1b/0x30
[69832.312699] __kasan_slab_free+0x111/0x160
[69832.313524] kmem_cache_free+0x94/0x460
[69832.314367] bfq_put_queue+0x582/0x940
[69832.315112] __bfq_bfqd_reset_in_service+0x166/0x1d0
[69832.317275] bfq_bfqq_expire+0xb27/0x2440
[69832.318084] bfq_dispatch_request+0x697/0x44b0
[69832.318991] __blk_mq_do_dispatch_sched+0x52f/0x830
[69832.319984] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[69832.321087] blk_mq_sched_dispatch_requests+0xdf/0x140
[69832.322225] __blk_mq_run_hw_queue+0xc0/0x270
[69832.323114] blk_mq_run_work_fn+0x51/0x60
[69832.323942] process_one_work+0x6d4/0xfe0
[69832.324772] worker_thread+0x91/0xc80
[69832.325518] kthread+0x32d/0x3f0
[69832.326205] ret_from_fork+0x1f/0x30
[69832.326932]
[69832.338297] The buggy address belongs to the object at ffff88802622b968
[69832.338297] which belongs to the cache bfq_queue of size 512
[69832.340766] The buggy address is located 288 bytes inside of
[69832.340766] 512-byte region [ffff88802622b968, ffff88802622bb68)
[69832.343091] The buggy address belongs to the page:
[69832.344097] page:ffffea0000988a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802622a528 pfn:0x26228
[69832.346214] head:ffffea0000988a00 order:2 compound_mapcount:0 compound_pincount:0
[69832.347719] flags: 0x1fffff80010200(slab|head)
[69832.348625] raw: 001fffff80010200 ffffea0000dbac08 ffff888017a57650 ffff8880179fe840
[69832.354972] raw: ffff88802622a528 0000000000120008 00000001ffffffff 0000000000000000
[69832.356547] page dumped because: kasan: bad access detected
[69832.357652]
[69832.357970] Memory state around the buggy address:
[69832.358926] ffff88802622b980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.360358] ffff88802622ba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.361810] >ffff88802622ba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.363273] ^
[69832.363975] ffff88802622bb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
[69832.375960] ffff88802622bb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[69832.377405] ==================================================================
In bfq_dispatch_requestfunction, it may have function call:
bfq_dispatch_request
__bfq_dispatch_request
bfq_select_queue
bfq_bfqq_expire
__bfq_bfqd_reset_in_service
bfq_put_queue
kmem_cache_free
In this function call, in_serv_queue has beed expired and meet the
conditions to free. In the function bfq_dispatch_request, the address
of in_serv_queue pointing to has been released. For getting the value
of idle_timer_disabled, it will get flags value from the address which
in_serv_queue pointing to, then the problem of use-after-free happens;
Fix the problem by check in_serv_queue == bfqd->in_service_queue, to
get the value of idle_timer_disabled if in_serve_queue is equel to
bfqd->in_service_queue. If the space of in_serv_queue pointing has
been released, this judge will aviod use-after-free problem.
And if in_serv_queue may be expired or finished, the idle_timer_disabled
will be false which would not give effects to bfq_update_dispatch_stats.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Link: https://lore.kernel.org/r/20220303070334.3020168-1-zhangwensheng5@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add sysfs files that expose the inline encryption capabilities of
request queues:
/sys/block/$disk/queue/crypto/max_dun_bits
/sys/block/$disk/queue/crypto/modes/$mode
/sys/block/$disk/queue/crypto/num_keyslots
Userspace can use these new files to decide what encryption settings to
use, or whether to use inline encryption at all. This also brings the
crypto capabilities in line with the other queue properties, which are
already discoverable via the queue directory in sysfs.
Design notes:
- Place the new files in a new subdirectory "crypto" to group them
together and to avoid complicating the main "queue" directory. This
also makes it possible to replace "crypto" with a symlink later if
we ever make the blk_crypto_profiles into real kobjects (see below).
- It was necessary to define a new kobject that corresponds to the
crypto subdirectory. For now, this kobject just contains a pointer
to the blk_crypto_profile. Note that multiple queues (and hence
multiple such kobjects) may refer to the same blk_crypto_profile.
An alternative design would more closely match the current kernel
data structures: the blk_crypto_profile could be a kobject itself,
located directly under the host controller device's kobject, while
/sys/block/$disk/queue/crypto would be a symlink to it.
I decided not to do that for now because it would require a lot more
changes, such as no longer embedding blk_crypto_profile in other
structures, and also because I'm not sure we can rule out moving the
crypto capabilities into 'struct queue_limits' in the future. (Even
if multiple queues share the same crypto engine, maybe the supported
data unit sizes could differ due to other queue properties.) It
would also still be possible to switch to that design later without
breaking userspace, by replacing the directory with a symlink.
- Use "max_dun_bits" instead of "max_dun_bytes". Currently, the
kernel internally stores this value in bytes, but that's an
implementation detail. It probably makes more sense to talk about
this value in bits, and choosing bits is more future-proof.
- "modes" is a sub-subdirectory, since there may be multiple supported
crypto modes, sysfs is supposed to have one value per file, and it
makes sense to group all the mode files together.
- Each mode had to be named. The crypto API names like "xts(aes)" are
not appropriate because they don't specify the key size. Therefore,
I assigned new names. The exact names chosen are arbitrary, but
they happen to match the names used in log messages in fs/crypto/.
- The "num_keyslots" file is a bit different from the others in that
it is only useful to know for performance reasons. However, it's
included as it can still be useful. For example, a user might not
want to use inline encryption if there aren't very many keyslots.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220124215938.2769-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kobjects aren't supposed to be deleted before their child kobjects are
deleted. Apparently this is usually benign; however, a WARN will be
triggered if one of the child kobjects has a named attribute group:
sysfs group 'modes' not found for kobject 'crypto'
WARNING: CPU: 0 PID: 1 at fs/sysfs/group.c:278 sysfs_remove_group+0x72/0x80
...
Call Trace:
sysfs_remove_groups+0x29/0x40 fs/sysfs/group.c:312
__kobject_del+0x20/0x80 lib/kobject.c:611
kobject_cleanup+0xa4/0x140 lib/kobject.c:696
kobject_release lib/kobject.c:736 [inline]
kref_put include/linux/kref.h:65 [inline]
kobject_put+0x53/0x70 lib/kobject.c:753
blk_crypto_sysfs_unregister+0x10/0x20 block/blk-crypto-sysfs.c:159
blk_unregister_queue+0xb0/0x110 block/blk-sysfs.c:962
del_gendisk+0x117/0x250 block/genhd.c:610
Fix this by moving the kobject_del() and the corresponding
kobject_uevent() to the correct place.
Fixes: 2c2086afc2 ("block: Protect less code with sysfs_lock in blk_{un,}register_queue()")
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220124215938.2769-3-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make elv_unregister_queue() a no-op if q->elevator is NULL or is not
registered.
This simplifies the existing callers, as well as the future caller in
the error path of blk_register_queue().
Also don't bother checking whether q is NULL, since it never is.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220124215938.2769-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As Luis reported, losetup currently doesn't properly create the loop
device without this if the device node already exists because old
scripts created it manually. So default to y for now and remove the
aggressive removal schedule.
Reported-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220225181440.1351591-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the inflight IOs are slow and no new IOs are issued, we expect
iostat could manifest the IO hang problem. However after
commit 5b18b5a737 ("block: delete part_round_stats and switch to less
precise counting"), io_tick and time_in_queue will not be updated until
the end of IO, and the avgqu-sz and %util columns of iostat will be zero.
Because it has using stat.nsecs accumulation to express time_in_queue
which is not suitable to change, and may %util will express the status
better when io hang occur. To fix io_ticks, we use update_io_ticks and
inflight to update io_ticks when diskstats_show and part_stat_show
been called.
Fixes: 5b18b5a737 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220217064247.4041435-1-zhangwensheng5@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use bfq_group() instead, which do the same thing.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220129015924.3958918-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0.
What we want is ioprio or 0.
Correct this by changing the calculation.
Signed-off-by: Yahu Gao <gaoyahu19@gmail.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220107065859.25689-1-gaoyahu19@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can
reset the delay length for an already pending delayed work run_work. This
creates a scenario where multiple hctx may have their queues set to run,
but if one runs first and finds nothing to do, it can reset the delay of
another hctx and stall the other hctx's ability to run requests.
To avoid this I/O stall when an hctx's run_work is already pending,
leave it untouched to run at its current designated time rather than
extending its delay. The work will still run which keeps closed the race
calling blk_mq_delay_run_hw_queues is needed for while also avoiding the
I/O stall.
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220131203337.GA17666@redhat
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a method to notify the driver that the gendisk is about to be freed.
This allows drivers to tie the lifetime of their private data to that of
the gendisk and thus deal with device removal races without expensive
synchronization and boilerplate code.
A new flag is added so that ->free_disk is only called after a successful
call to add_disk, which significantly simplifies the error handling path
during probing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220215094514.3828912-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Revert commit 4f1e9630af ("blk-throtl: optimize IOPS throttle for large
IO scenarios") since we have another easier way to address this issue and
get better iops throttling result.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to throttle split bio in case of IOPS limit even though the
split bio has been marked as BIO_THROTTLED since block layer
accounts split bio actually.
If only throughput throttle is setup, no need to throttle any more
if BIO_THROTTLED is set since we have accounted & considered the
whole bio bytes already.
Add one flag of THROTL_TG_HAS_IOPS_LIMIT for serving this purpose.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 111be88398 ("block-throttle: avoid double charge") marks bio as
BIO_THROTTLED unconditionally if __blk_throtl_bio() is called on this bio,
then this bio won't be called into __blk_throtl_bio() any more. This way
is to avoid double charge in case of bio splitting. It is reasonable for
read/write throughput limit, but not reasonable for IOPS limit because
block layer provides io accounting against split bio.
Chunguang Xu has already observed this issue and fixed it in commit
4f1e9630af ("blk-throtl: optimize IOPS throttle for large IO scenarios").
However, that patch only covers bio splitting in __blk_queue_split(), and
we have other kind of bio splitting, such as bio_split() &
submit_bio_noacct() and other ways.
This patch tries to fix the issue in one generic way by always charging
the bio for iops limit in blk_throtl_bio(). This way is reasonable:
re-submission & fast-cloned bio is charged if it is submitted to same
disk/queue, and BIO_THROTTLED will be cleared if bio->bi_bdev is changed.
This new approach can get much more smooth/stable iops limit compared with
commit 4f1e9630af ("blk-throtl: optimize IOPS throttle for large IO
scenarios") since that commit can't throttle current split bios actually.
Also this way won't cause new double bio iops charge in
blk_throtl_dispatch_work_fn() in which blk_throtl_bio() won't be called
any more.
Reported-by: Ning Li <lining2020x@163.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Chunguang Xu <brookxu@tencent.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now submit_bio_checks() is only called by submit_bio_noacct(), so merge
it into submit_bio_noacct().
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio has been checked already before throttling, so no need to check
it again before dispatching it from throttle queue.
Add a helper of submit_bio_noacct_nocheck() for this purpose.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220216044514.2903784-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
submit_bio_checks() won't be called outside of block/blk-core.c any more
since commit 9d497e2941 ("block: don't protect submit_bio_checks by
q_usage_counter"), so mark it as one local helper.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_crypto_bio_prep() is called for both bio based and blk-mq drivers,
so move it out of blk-mq.c, then we can unify this kind of handling.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is more clean & readable to check bio when starting to submit it,
instead of just before calling ->submit_bio() or blk_mq_submit_bio().
Also it provides us chance to optimize bio submission without checking
bio.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220216044514.2903784-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The request must be submitted to the queue it was allocated for, so
remove the extra request_queue argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fold blk_cloned_rq_check_limits into its only caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code to stack blk-mq drivers is only used by dm-multipath, and
will preferably stay that way. Make it optional and only selected
by device mapper, so that the buildbots more easily catch abuses
like the one that slipped in in the ufs driver in the last merged
window. Another positive side effects is that kernel builds without
device mapper shrink a little bit as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220215100540.3892965-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't need to do blkg_iostat_set for top blkg iostat on each CPU,
so move it after percpu stat aggregation.
Fixes: ef45fe470e ("blk-cgroup: show global disk stats in root cgroup io.stat")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220213085902.88884-1-zhouchengming@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Based on the comment present in the bdev_get_queue()
bdev->bd_queue can never be NULL. Remove the NULL check for the local
variable q that is set from bdev_get_queue() for discard, write_same,
and write_zeroes.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220215115247.11717-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Partition include/linux/blk-cgroup.h into two parts: one is public part,
the other is block layer private part.
Suggested by Christoph Hellwig.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220211101149.2368042-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
q->blkg_list is only used by blkcg code, so move it into
blkcg_init_queue.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220211101149.2368042-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, rasdaemon uses the existing tracepoint block_rq_complete
and filters out non-error cases in order to capture block disk errors.
But there are a few problems with this approach:
1. Even kernel trace filter could do the filtering work, there is
still some overhead after we enable this tracepoint.
2. The filter is merely based on errno, which does not align with kernel
logic to check the errors for print_req_error().
3. block_rq_complete only provides dev major and minor to identify
the block device, it is not convenient to use in user-space.
So introduce a new tracepoint block_rq_error just for the error case.
With this patch, rasdaemon could switch to block_rq_error.
Since the new tracepoint has the similar implementation with
block_rq_complete, so move the existing code from TRACE_EVENT
block_rq_complete() into new event class block_rq_completion(). Then add
event for block_rq_complete and block_rq_err respectively from the newly
created event class per the suggestion from Chaitanya Kulkarni.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220210225222.260069-1-shy828301@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since __sbitmap_queue_get_shallow() was introduced in commit c05e667337
("sbitmap: add sbitmap_get_shallow() operation"), it has not been used.
Delete __sbitmap_queue_get_shallow() and rename public
__sbitmap_queue_get_shallow() -> sbitmap_queue_get_shallow() as it is odd
to have public __foo but no foo at all.
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1644322024-105340-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All callers of __bio_clone_fast initialize the bio first. Move that
initialization into __bio_clone_fast instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__bio_clone_fast should also clone integrity and crypto data, as a clone
without those is incomplete. Right now the only caller that can actually
support crypto and integrity data (dm) does it manually for the one
callchain that supports these, but we better do it properly in the core.
Note that all callers except for the above mentioned one also don't need
to handle failure at all, given that the integrity and crypto clones are
based on mempool allocations that won't fail for sleeping allocations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change the user visible return value for BLK_STS_OFFLINE to -ENODEV, which
is more descriptive than existing -EIO.
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220203192827.1370270-3-song@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, drivers reports BLK_STS_IOERR for devices that are not full
online or being removed. This behavior could cause confusion for users,
as they are not really I/O errors from the device.
Solve this issue with a new state BLK_STS_OFFLINE, which reports "device
offline error" in dmesg instead of "I/O error".
EIO is intentionally kept to not change user visible return value.
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220203192827.1370270-2-song@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>