Since blk_mq_map_queues() and the .map_queues() callbacks always return 0,
change their return type into void. Most callers ignore the returned value
anyway.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20220815170043.19489-3-bvanassche@acm.org
[axboe: fold in fix from Bart]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of returning -EINVAL if an internal inconsistency is detected,
fall back to a single submission queue. This patch prepares for changing
the return value of the .map_queues() callbacks into void.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220815170043.19489-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There needs to be some error checking if ida_simple_get() fails.
Also call ida_free() if there are errors later.
Fixes: 94bc02e30f ("nullb: use ida to manage index")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YtEhXsr6vJeoiYhd@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow setting via configfs these two options:
no_sched
shared_tag_bitmap
Previously these could only be activated as module parameters.
Still missing are:
shared_tags
timeout
requeue
init_hctx
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220708174943.87787-3-vincent.fu@samsung.com
[axboe: fold in nullb == NULL fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add as module parameters these options:
memory_backed
discard
mbps
cache_size
Previously these could only be set via configfs.
Still missing is bad_blocks.
The kernel test robot found a documentation formatting issue in v1 of
this patch.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Link: https://lore.kernel.org/r/20220708174943.87787-2-vincent.fu@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The type name enum req_opf is misleading since it suggests that values of
this type include both an operation type and flags. Since values of this
type represent an operation only, change the type name into enum req_op.
Convert the enum req_op documentation into kernel-doc format. Move a few
definitions such that the enum req_op documentation occurs just above
the enum req_op definition.
The name "req_opf" was introduced by commit ef295ecf09 ("block: better op
and flags encoding").
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the zone related fields that are currently stored in
struct request_queue to struct gendisk as these are part of the highlevel
block layer API and are only used for non-passthrough I/O that requires
the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass a block_device instead of a request_queue as that is what most
callers have at hand.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Switch to a gendisk based API in preparation for moving all zone related
fields from the request_queue to the gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prepare for storing the zone related field in struct gendisk instead
of struct request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220706070350.1703384-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With new API blk_mq_is_reserved_rq() we can tell if a request is from
the reserved pool, so stop passing 'reserved' arg. There is actually
only a single user of that arg for all the callback implementations, which
can use blk_mq_is_reserved_rq() instead.
This will also allow us to stop passing the same 'reserved' around the
blk-mq iter functions next.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The bio and rq fields of struct nullb_cmd are now overlapping in a
union. So we cannot use a test on ->bio being non-NULL to detect the
NULL_Q_BIO queue mode. null_zone_write() use such broken test to set the
sector position of a zone append write in the command bio or request.
When the null_blk device uses the NULL_Q_MQ queue mode,
null_zone_write() wrongly end up setting the bio sector position,
resulting in the command request to be broken and random crashes
following.
Fix this by testing the device queue mode directly.
Fixes: 8ba816b23a ("null-blk: save memory footprint for struct nullb_cmd")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220602120344.1365329-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, the directory name used to create a nullb device through
sysfs is not used as the device name, potentially causing headaches for
users if devices are already created through the modprobe operation
withe the nr_device module parameter not set to 0. E.g. a user can do
"mkdir /sys/kernel/config/nullb/nullb0" to create a nullb device even
though /dev/nullb0 was already created by modprobe. In this case, the
configfs nullb device will be named nullb1, causing confusion for the
user.
Simplify this by using the configfs directory name as the nullb device
name, always, unless another nullb device is already using the same
name. E.g. if modprobe created nullb0, then:
$ mkdir /sys/kernel/config/nullb/nullb0
mkdir: cannot create directory '/sys/kernel/config/nullb/nullb0': File
exists
will be reported to the user.
To implement this, the function null_find_dev_by_name() is added to
check for the existence of a nullb device with the name used for a new
configfs device directory. nullb_group_make_item() uses this new
function to check if the directory name can be used as the disk name.
Finally, null_add_dev() is modified to use the device config item name
as the disk name for a new nullb device created using configfs.
The naming of devices created though modprobe remains unchanged.
Of note is that it is possible for a user to create through configfs a
nullb device with the same name as an existing device. E.g.
$ mkdir /sys/kernel/config/nullb/null
will successfully create the nullb device named "null" but this block
device will however not appear under /dev/ since /dev/null already
exists.
Suggested-by: Joseph Bacik <josef@toxicpanda.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220420005718.3780004-5-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the pr_fmt() macro to prefix all null_blk pr_xxx() messages with
"null_blk:" to clarify which module is printing the messages. Also add
a pr_info() message in null_add_dev() to print the name of a newly
created disk.
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220420005718.3780004-4-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce the null_create_dev() and null_destroy_dev() helper functions
to respectivel create nullb devices on modprobe and destroy them on
rmmod. The null_destroy_dev() helper avoids duplicated code in the
null_init() and null_exit() functions for deleting devices.
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220420005718.3780004-3-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The discard_alignment queue limit is named a bit misleading means the
offset into the block device at which the discard granularity starts.
Setting it to the discard granularity as done by null_blk is mostly
harmless but also useless.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20220418045314.360785-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Total 16 bytes can be saved in two ways:
1) The field 'bio' will only be used in bio based mode, and the field
'rq' will only be used in mq mode. Since they won't be used in the
same time, declare a union for them.
2) The field 'bool fake_timeout' can be placed in the hole after the
field 'error'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220426022133.3999006-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.
The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When poll request is timed out, it is removed from the poll list,
but not completed, so the request is leaked, and never get chance
to complete.
Fix the issue by ending it in timeout handler.
Fixes: 0a593fbbc2 ("null_blk: poll queue support")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220413084836.1571995-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove goto labels and use direct returns as error unwinding code only
needs to free t_page variable if we alloc_pages() call fails as having
two labels for one kfree() can be avoided easily.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220222152852.26043-3-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only caller of null_alloc_page() is null_insert_page() unconditionally
sets only parameter to GFP_NOIO and that is statically hard-coded in
null_blk. There is no point in having statically hardcoded function
parameter.
Remove the unnecessary parameter gfp_flags and adjust the code, so it
can retain existing behavior null_alloc_page() with GFP_NOIO.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220222152852.26043-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only caller of alloc_cmd() is null_submit_bio() unconditionally sets
second parameter to true and that is statically hard-coded in null_blk.
There is no point in having statically hardcoded function parameter.
Remove the unnecessary parameter can_wait and adjust the code so it
can retain existing behavior of waiting when we don't get valid
nullb_cmd from __alloc_cmd() in alloc_cmd().
The restructured code avoids multiple return statements, multiple
calls to __alloc_cmd() and resulting a fast path call to
prepare_to_wait() due to removal of first alloc_cmd() call.
Follow the pattern that we have in bio_alloc() to set the structure
members in the structure allocation function in alloc_cmd() and pass
bio to initialize newly allocated cmd->bio member.
Follow the pattern in copy_to_nullb() to use result of one function call
(null_cache_active()) to be used as a parameter to another function call
(null_insert_page()), use result of alloc_cmd() as a first parameter to
the null_handle_cmd() in null_submit_bio() function. This allow us to
remove the local variable cmd on stack in null_submit_bio() that is in
fast path.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20220216172945.31124-2-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function nullb_device_power_store() returns -ENOMEM when
null_add_dev() fails. null_add_dev() can fail with return value
other than -ENOMEM such as -EINVAL when Zoned Block Device option
is used, see :
nullb_device_power_store()
null_add_dev()
null_init_zoned_dev()
return -EINVAL;
When trying to load the module having -ENOMEM value returned on the
command line creates confusion when pleanty of memory is free on the
machine.
Instead of hardcoding -ENOMEM return the value of null_add_dev()
function.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220215115951.15945-1-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It isn't correct to set set->nr_maps as 3 if g_poll_queues is > 0 since
we can change it via configfs for null_blk device created there, so only
set it as 3 if active poll_queues is > 0.
Fixes divide zero exception reported by Shinichiro.
Fixes: 2bfdbe8b7e ("null_blk: allow zero poll queues")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20211224010831.1521805-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
kernel test robot reports that sparse now triggers a warning on null_blk:
>> drivers/block/null_blk/main.c:1577:55: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected int ioerror @@ got restricted blk_status_t [usertype] error @@
drivers/block/null_blk/main.c:1577:55: sparse: expected int ioerror
drivers/block/null_blk/main.c:1577:55: sparse: got restricted blk_status_t [usertype] error
because blk_mq_add_to_batch() takes an integer instead of a blk_status_t.
Just cast this to an integer to silence it, null_blk is the odd one out
here since the command status is the "right" type. If we change the
function type, then we'll have do that for other callers too (existing and
future ones).
Fixes: 2385ebf38f ("block: null_blk: batched complete poll requests")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Complete poll requests via blk_mq_add_to_batch() and
blk_mq_end_request_batch(), so that we can cover batched complete
code path by running null_blk test.
Meantime this way shows ~14% IOPS boost on 't/io_uring /dev/nullb0'
in my test.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203081703.3506020-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There isn't any reason to not allow zero poll queues from user
viewpoint.
Also sometimes we need to compare io poll between poll mode and irq
mode, so not allowing poll queues is bad.
Fixes: 15dfc662ef ("null_blk: Fix handling of submit_queues and poll_queues attributes")
Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203023935.3424042-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Just use the disk attached to the request_queue instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All modern drivers can support extra partitions using the extended
dev_t. In fact except for the ioctl method drivers never even see
partitions in normal operation.
So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicit disallow them using
GENHD_FL_NO_PART.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This manually reverts commit 27290b469051 ("null_blk: suppress invalid
partition info"). The message in that commit log can't appearch as
the flag is never checked during probing, and there is no good reason
to treat null_blk special in /proc/partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 0a593fbbc2 ("null_blk: poll queue support") introduced the poll
queue feature to null_blk. After this change, null_blk device has both
submit queues and poll queues, and null_map_queues() callback maps the
both queues for corresponding hardware contexts. The commit also added
the device configuration attribute 'poll_queues' in same manner as the
existing attribute 'submit_queues'. These attributes allow to modify the
numbers of queues. However, when the new values are stored to these
attributes, the values are just handled only for the corresponding
queue. When number of submit_queue is updated, number of poll_queue is
not counted, or vice versa. This caused inconsistent number of queues
and queue mapping and resulted in null-ptr-dereference. This failure was
observed in blktests block/029 and block/030.
To avoid the inconsistency, fix the attribute updates to care both
submit_queues and poll_queues. Introduce the helper function
nullb_update_nr_hw_queues() to handle stores to the both two attributes.
Add poll_queues field to the struct nullb_device to track the number in
same manner as submit_queues. Add two more fields prev_submit_queues and
prev_poll_queues to keep the previous values before change. In case the
block layer failed to update the nr_hw_queues, refer the previous values
in null_map_queues() to map queues in same manner as before change.
Also add poll_queues value checks in nullb_update_nr_hw_queues() and
null_validate_conf(). They ensure the poll_queues value of each device
is within the range from 1 to module parameter value of poll_queues.
Fixes: 0a593fbbc2 ("null_blk: poll queue support")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20211029103926.845635-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's currently no way to experiment with polled IO with null_blk,
which seems like an oversight. This patch adds support for polled IO.
We keep a list of issued IOs on submit, and then process that list
when mq_ops->poll() is invoked.
A new parameter is added, poll_queues. It defaults to 1 like the
submit queues, meaning we'll have 1 poll queue available.
Fixes-by: Bart Van Assche <bvanassche@acm.org>
Fixes-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/baca710d-0f2a-16e2-60bd-b105b854e0ae@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.
Polling for the bio itself leads to a few advantages:
- the cookie construction can made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie
separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows to trivially
support polling BIOs remapping by stacking drivers
- a lot of code to propagate the cookie back up the submission path can
be removed entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling. The actual cleanup in case of error is
already handled by the caller of null_gendisk_register().
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210818144542.19305-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use blk_mq_alloc_disk and blk_cleanup_disk to simplify the gendisk and
request_queue allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210602065345.355274-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The error handling on a nullb->disk allocation currently jumps to
out_cleanup_disk that calls blk_cleanup_disk with a null pointer causing
a null pointer dereference issue. Fix this by jumping to out_cleanup_tags
instead.
Addresses-Coverity: ("Dereference after null check")
Fixes: 132226b301 ("null_blk: convert to blk_alloc_disk/blk_cleanup_disk")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210602100659.11058-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the null_blk driver to use the blk_alloc_disk and blk_cleanup_disk
helpers to simplify gendisk and request_queue allocation. Note that the
blk-mq mode is left with its own allocations scheme, to be handled later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-26-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In null_init, null_add_dev(dev) is called.
In null_add_dev, it calls null_free_zoned_dev(dev) to free dev->zones
via kvfree(dev->zones) in out_cleanup_zone branch and returns err.
Then null_init accept the err code and then calls null_free_dev(dev).
But in null_free_dev(dev), dev->zones is freed again by
null_free_zoned_dev().
My patch set dev->zones to NULL in null_free_zoned_dev() after
kvfree(dev->zones) is called, to avoid the double free.
Fixes: 2984c8684f ("nullb: factor disk parameters")
Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
Link: https://lore.kernel.org/r/20210426143229.7374-1-lyl2019@mail.ustc.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will enable changing the virtual boundary of null blk devices. For
now, null blk devices didn't have any restriction on the scatter/gather
elements received from the block layer. Add a module parameter and a
configfs option that will control the virtual boundary. This will
enable testing the efficiency of the block layer bounce buffer in case
a suitable application will send discontiguous IO to the given device.
Initial testing with patched FIO showed the following results (64 jobs,
128 iodepth, 1 nullb device):
IO size READ (virt=false) READ (virt=true) Write (virt=false) Write (virt=true)
---------- ------------------- ----------------- ------------------- -------------------
1k 10.7M 8482k 10.8M 8471k
2k 10.4M 8266k 10.4M 8271k
4k 10.4M 8274k 10.3M 8226k
8k 10.2M 8131k 9800k 7933k
16k 9567k 7764k 8081k 6828k
32k 8865k 7309k 5570k 5153k
64k 7695k 6586k 2682k 2617k
128k 5346k 5489k 1320k 1296k
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210412095523.278632-1-mgurtovoy@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Memory backed or zoned null block devices may generate actual request
timeout errors due to the submission path being blocked on memory
allocation or zone locking. Unlike fake timeouts or injected timeouts,
the request submission path will call blk_mq_complete_request() or
blk_mq_end_request() for these real timeout errors, causing a double
completion and use after free situation as the block layer timeout
handler executes blk_mq_rq_timed_out() and __blk_mq_free_request() in
blk_mq_check_expired(). This problem often triggers a NULL pointer
dereference such as:
BUG: kernel NULL pointer dereference, address: 0000000000000050
RIP: 0010:blk_mq_sched_mark_restart_hctx+0x5/0x20
...
Call Trace:
dd_finish_request+0x56/0x80
blk_mq_free_request+0x37/0x130
null_handle_cmd+0xbf/0x250 [null_blk]
? null_queue_rq+0x67/0xd0 [null_blk]
blk_mq_dispatch_rq_list+0x122/0x850
__blk_mq_do_dispatch_sched+0xbb/0x2c0
__blk_mq_sched_dispatch_requests+0x13d/0x190
blk_mq_sched_dispatch_requests+0x30/0x60
__blk_mq_run_hw_queue+0x49/0x90
process_one_work+0x26c/0x580
worker_thread+0x55/0x3c0
? process_one_work+0x580/0x580
kthread+0x134/0x150
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x1f/0x30
This problem very often triggers when running the full btrfs xfstests
on a memory-backed zoned null block device in a VM with limited amount
of memory.
Avoid this by executing blk_mq_complete_request() in null_timeout_rq()
only for commands that are marked for a fake timeout completion using
the fake_timeout boolean in struct null_cmd. For timeout errors injected
through debugfs, the timeout handler will execute
blk_mq_complete_request()i as before. This is safe as the submission
path does not execute complete requests in this case.
In null_timeout_rq(), also make sure to set the command error field to
BLK_STS_TIMEOUT and to propagate this error through to the request
completion.
Reported-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210331225244.126426-1-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
WC22RGC2MA==
=os1E
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Another nice round of removing more code than what is added, mostly
due to Christoph's relentless pursuit of tech debt removal/cleanups.
This pull request contains:
- Two series of BFQ improvements (Paolo, Jan, Jia)
- Block iov_iter improvements (Pavel)
- bsg error path fix (Pan)
- blk-mq scheduler improvements (Jan)
- -EBUSY discard fix (Jan)
- bvec allocation improvements (Ming, Christoph)
- bio allocation and init improvements (Christoph)
- Store bdev pointer in bio instead of gendisk + partno (Christoph)
- Block trace point cleanups (Christoph)
- hard read-only vs read-only split (Christoph)
- Block based swap cleanups (Christoph)
- Zoned write granularity support (Damien)
- Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"
* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
mm: simplify swapdev_block
sd_zbc: clear zone resources for non-zoned case
block: introduce blk_queue_clear_zone_settings()
zonefs: use zone write granularity as block size
block: introduce zone_write_granularity limit
block: use blk_queue_set_zoned in add_partition()
nullb: use blk_queue_set_zoned() to setup zoned devices
nvme: cleanup zone information initialization
block: document zone_append_max_bytes attribute
block: use bi_max_vecs to find the bvec pool
md/raid10: remove dead code in reshape_request
block: mark the bio as cloned in bio_iov_bvec_set
block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
block: remove a layer of indentation in bio_iov_iter_get_pages
block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
block: remove the 1 and 4 vec bvec_slabs entries
block: streamline bvec_alloc
block: factor out a bvec_alloc_gfp helper
block: move struct biovec_slab to bio.c
block: reuse BIO_INLINE_VECS for integrity bvecs
...
Use blk_queue_set_zoned() to set a nullb device zone model instead of
directly assigning the device queue zoned limit. This initialization of
the devicve zoned model as well as the setup of the queue flag
QUEUE_FLAG_ZONE_RESETALL and of the device queue elevator feature are
moved from null_init_zoned_dev() to null_register_zoned_dev() so that
the initialization of the queue limits is done when the gendisk of the
nullb device is available.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To avoid potential compilation problems, replaced the badly written
MB_TO_SECTS() macro (missing parenthesis around the argument use) with
the inline function mb_to_sects(). And while at it, simplify the
calculation of the total number of zones of the device using the
round_up() macro.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the gendisk pointer in struct bio with a pointer to the newly
improved struct block device. From that the gendisk can be trivially
accessed with an extra indirection, but it also allows to directly
look up all information related to partition remapping.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>