elevator_get_default() uses the following algorithm to select an I/O
scheduler from inside add_disk():
- In case of a single hardware queue or if sharing hardware queues across
multiple request queues (BLK_MQ_F_TAG_HCTX_SHARED), use mq-deadline.
- Otherwise, use 'none'.
This is a good choice for most but not for all block drivers. Make it
possible to override the selection of mq-deadline with a new flag,
namely BLK_MQ_F_NO_SCHED_BY_DEFAULT.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Martijn Coenen <maco@android.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210805174200.3250718-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If queue is dying while iolatency_set_limit() is in progress,
blk_get_queue() won't increment the refcount of the queue. However,
blk_put_queue() will still decrement the refcount later, which will
cause the refcout to be unbalanced.
Thus error out in such case to fix the problem.
Fixes: 8c772a9bfc ("blk-iolatency: fix IO hang due to negative inflight counter")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210805124645.543797-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to
define any attribute since the "mq" sysfs directory contains only
sub-directories (no attribute files). As a result, blk_mq_sysfs_show(),
blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all
unused and unnecessary. Remove all this unused code.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210713081837.524422-1-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new sysfs handle to export the new diskseq value.
Place it in <sysfs>/block/<disk>/diskseq and document it.
$ grep . /sys/class/block/*/diskseq
/sys/class/block/loop0/diskseq:13
/sys/class/block/loop1/diskseq:14
/sys/class/block/loop2/diskseq:5
/sys/class/block/loop3/diskseq:6
/sys/class/block/ram0/diskseq:1
/sys/class/block/ram1/diskseq:2
/sys/class/block/vda/diskseq:7
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-5-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.
Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow to derive orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.
Associating a unique, monotonically increasing sequential number to the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, allows to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.
Additionally, increment the disk sequence number when the media change,
i.e. on DISK_EVENT_MEDIA_CHANGE event.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
I have compiled the kernel with a cross compiler "hppa-linux-gnu-" v9.3.0
on x86-64 host machine. I got the following warning:
block/genhd.c: In function ‘diskstats_show’:
block/genhd.c:1227:1: warning: the frame size of 1688 bytes is larger
than 1280 bytes [-Wframe-larger-than=]
1227 | }
By Reduced the stack footprint by using the %pg printk specifier instead
of disk_name to remove the need for the on-stack buffer.
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of acquiring an inode reference on open make sure partitions
always hold device model references to the disk while alive, and switch
open to grab only a device model reference to the opened block device.
If that is a partition the disk reference is transitively held by the
partition already.
Link: https://lore.kernel.org/r/20210722075402.983367-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Unhash the whole device inode early in del_gendisk. This allows to
remove the first GENHD_FL_UP check in the open path as we simply
won't find a just removed inode. The second non-racy check after
taking open_mutex is still kept.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210722075402.983367-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the sg_timeout and sg_reserved_size fields into the bsg_device and
scsi_device structures as they have nothing to do with generic block I/O.
Note that these values are now separate for bsg vs. SCSI device node
access, but that just matches how /dev/sg vs the other nodes has always
behaved.
Link: https://lore.kernel.org/r/20210729064845.1044147-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull block fixes from Jens Axboe:
- gendisk freeing fix (Christoph)
- blk-iocost wake ordering fix (Tejun)
- tag allocation error handling fix (John)
- loop locking fix. While this isn't the prettiest fix in the world,
nobody has any good alternatives for 5.14. Something to likely
revisit for 5.15. (Tetsuo)
* tag 'block-5.14-2021-07-30' of git://git.kernel.dk/linux-block:
block: delay freeing the gendisk
blk-iocost: fix operation ordering in iocg_wake_fn()
blk-mq-sched: Fix blk_mq_sched_alloc_tags() error handling
loop: reintroduce global lock for safe loop_validate_file() traversal
CONFIG_BLK_SCSI_REQUEST is rather misnamed as it enables building a small
amount of code shared by the SCSI initiator, target, and consumers of the
scsi_request passthrough API. Rename it and also allow building it as a
module.
[mkp: add module license]
Link: https://lore.kernel.org/r/20210724072033.1284840-20-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
blkdev_get_no_open acquires a reference to the block_device through
the block device inode and then tries to acquire a device model
reference to the gendisk. But at this point the disk migh already
be freed (although the race is free). Fix this by only freeing the
gendisk from the whole device bdevs ->free_inode callback as well.
Fixes: 22ae8ce8b8 ("block: simplify bdev/disk lookup in blkdev_get")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210722075402.983367-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
iocg_wake_fn() open-codes wait_queue_entry removal and wakeup because it
wants the wq_entry to be always removed whether it ended up waking the
task or not. finish_wait() tests whether wq_entry needs removal without
grabbing the wait_queue lock and expects the waker to use
list_del_init_careful() after all waking operations are complete, which
iocg_wake_fn() didn't do. The operation order was wrong and the regular
list_del_init() was used.
The result is that if a waiter wakes up racing the waker, it can free pop
the wq_entry off stack before the waker is still looking at it, which can
lead to a backtrace like the following.
[7312084.588951] general protection fault, probably for non-canonical address 0x586bf4005b2b88: 0000 [#1] SMP
...
[7312084.647079] RIP: 0010:queued_spin_lock_slowpath+0x171/0x1b0
...
[7312084.858314] Call Trace:
[7312084.863548] _raw_spin_lock_irqsave+0x22/0x30
[7312084.872605] try_to_wake_up+0x4c/0x4f0
[7312084.880444] iocg_wake_fn+0x71/0x80
[7312084.887763] __wake_up_common+0x71/0x140
[7312084.895951] iocg_kick_waitq+0xe8/0x2b0
[7312084.903964] ioc_rqos_throttle+0x275/0x650
[7312084.922423] __rq_qos_throttle+0x20/0x30
[7312084.930608] blk_mq_make_request+0x120/0x650
[7312084.939490] generic_make_request+0xca/0x310
[7312084.957600] submit_bio+0x173/0x200
[7312084.981806] swap_readpage+0x15c/0x240
[7312084.989646] read_swap_cache_async+0x58/0x60
[7312084.998527] swap_cluster_readahead+0x201/0x320
[7312085.023432] swapin_readahead+0x2df/0x450
[7312085.040672] do_swap_page+0x52f/0x820
[7312085.058259] handle_mm_fault+0xa16/0x1420
[7312085.066620] do_page_fault+0x2c6/0x5c0
[7312085.074459] page_fault+0x2f/0x40
Fix it by switching to list_del_init_careful() and putting it at the end.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rik van Riel <riel@surriel.com>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
0fa294fb19 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added
cgroup_rstat_flush_irqsafe() allowing flushing to happen from the irq
context. However, rstat paths use u64_stats_sync to synchronize access to
64bit stat counters on 32bit machines. u64_stats_sync is implemented using
seq_lock and trying to read from an irq context can lead to A-A deadlock if
the irq happens to interrupt the stat update.
Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and
u64_stats_update_end_irqrestore() - in the update paths. Note that none of
this matters on 64bit machines. All these are just for 32bit SMP setups.
Note that the interface was introduced way back, its first and currently
only use was recently added by 2d146aa3aa ("mm: memcontrol: switch to
rstat"). Stable tagging targets this commit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rik van Riel <riel@surriel.com>
Fixes: 2d146aa3aa ("mm: memcontrol: switch to rstat")
Cc: stable@vger.kernel.org # v5.13+
Pull more block updates from Jens Axboe:
"A combination of changes that ended up depending on both the driver
and core branch (and/or the IDE removal), and a few late arriving
fixes. In detail:
- Fix io ticks wrap-around issue (Chunguang)
- nvme-tcp sock locking fix (Maurizio)
- s390-dasd fixes (Kees, Christoph)
- blk_execute_rq polling support (Keith)
- blk-cgroup RCU iteration fix (Yu)
- nbd backend ID addition (Prasanna)
- Partition deletion fix (Yufen)
- Use blk_mq_alloc_disk for mmc, mtip32xx, ubd (Christoph)
- Removal of now dead block request types due to IDE removal
(Christoph)
- Loop probing and control device cleanups (Christoph)
- Device uevent fix (Christoph)
- Misc cleanups/fixes (Tetsuo, Christoph)"
* tag 'block-5.14-2021-07-08' of git://git.kernel.dk/linux-block: (34 commits)
blk-cgroup: prevent rcu_sched detected stalls warnings while iterating blkgs
block: fix the problem of io_ticks becoming smaller
nvme-tcp: can't set sk_user_data without write_lock
loop: remove unused variable in loop_set_status()
block: remove the bdgrab in blk_drop_partitions
block: grab a device refcount in disk_uevent
s390/dasd: Avoid field over-reading memcpy()
dasd: unexport dasd_set_target_state
block: check disk exist before trying to add partition
ubd: remove dead code in ubd_setup_common
nvme: use return value from blk_execute_rq()
block: return errors from blk_execute_rq()
nvme: use blk_execute_rq() for passthrough commands
block: support polling through blk_execute_rq
block: remove REQ_OP_SCSI_{IN,OUT}
block: mark blk_mq_init_queue_data static
loop: rewrite loop_exit using idr_for_each_entry
loop: split loop_lookup
loop: don't allow deleting an unspecified loop device
loop: move loop_ctl_mutex locking into loop_add
...
We run a test that create millions of cgroups and blkgs, and then trigger
blkg_destroy_all(). blkg_destroy_all() will hold spin lock for a long
time in such situation. Thus release the lock when a batch of blkgs are
destroyed.
blkcg_activate_policy() and blkcg_deactivate_policy() might have the
same problem, however, as they are basically only called from module
init/exit paths, let's leave them alone for now.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210707015649.1929797-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>