Commit Graph

7322 Commits

Author SHA1 Message Date
Kunwu Chan
48ff13a618 block: Simplify the allocation of slab caches
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20240131094323.146659-1-chentao@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 11:29:40 -07:00
Pavel Begunkov
e516c3fc6c block: optimise in irq bio put caching
When enlisting a bio into ->free_list_irq we protect the list by
disabling irqs. It's likely they're already disabled and performance of
local_irq_{save,restore}() is decent, but it's not zero cost.

Let's only use the irq cache when when we're serving a hard irq, which
allows to remove local_irq_{save,restore}(), and fall back to bio_free()
in all left cases.

Profiles indicate that the bio_put() cost is reduced by ~3.5 times
(1.76% -> 0.49%), and total throughput of a CPU bound benchmark improve
by around 1% (t/io_uring with high QD and several drives).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/36d207540b7046c653cc16e5ff08fe7234b19f81.1707314970.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 10:18:48 -07:00
Pavel Begunkov
c9f5f3aa19 block: extend bio caching to task context
bio_put_percpu_cache() puts all non-iopoll bios into the irq-safe list,
which entails disabling irqs. The overhead of that is not that bad when
interrupts are already off but getting worse otherwise. We can optimise
it when we're in the task context by using ->free_list directly just as
the IOPOLL path does.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4774e1a0f905f96c63174b0f3e4f79f0d9b63246.1707314970.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 10:18:47 -07:00
Tejun Heo
2a427b49d0 blk-iocost: Fix an UBSAN shift-out-of-bounds warning
When iocg_kick_delay() is called from a CPU different than the one which set
the delay, @now may be in the past of @iocg->delay_at leading to the
following warning:

  UBSAN: shift-out-of-bounds in block/blk-iocost.c:1359:23
  shift exponent 18446744073709 is too large for 64-bit type 'u64' (aka 'unsigned long long')
  ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x79/0xc0
   __ubsan_handle_shift_out_of_bounds+0x2ab/0x300
   iocg_kick_delay+0x222/0x230
   ioc_rqos_merge+0x1d7/0x2c0
   __rq_qos_merge+0x2c/0x80
   bio_attempt_back_merge+0x83/0x190
   blk_attempt_plug_merge+0x101/0x150
   blk_mq_submit_bio+0x2b1/0x720
   submit_bio_noacct_nocheck+0x320/0x3e0
   __swap_writepage+0x2ab/0x9d0

The underflow itself doesn't really affect the behavior in any meaningful
way; however, the past timestamp may exaggerate the delay amount calculated
later in the code, which shouldn't be a material problem given the nature of
the delay mechanism.

If @now is in the past, this CPU is racing another CPU which recently set up
the delay and there's nothing this CPU can contribute w.r.t. the delay.
Let's bail early from iocg_kick_delay() in such cases.

Reported-by: Breno Leitão <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5160a5a53c ("blk-iocost: implement delay adjustment hysteresis")
Link: https://lore.kernel.org/r/ZVvc9L_CYk5LO1fT@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08 10:11:39 -07:00
Jan Kara
f814bdda77 blk-wbt: Fix detection of dirty-throttled tasks
The detection of dirty-throttled tasks in blk-wbt has been subtly broken
since its beginning in 2016. Namely if we are doing cgroup writeback and
the throttled task is not in the root cgroup, balance_dirty_pages() will
set dirty_sleep for the non-root bdi_writeback structure. However
blk-wbt checks dirty_sleep only in the root cgroup bdi_writeback
structure. Thus detection of recently throttled tasks is not working in
this case (we noticed this when we switched to cgroup v2 and suddently
writeback was slow).

Since blk-wbt has no easy way to get to proper bdi_writeback and
furthermore its intention has always been to work on the whole device
rather than on individual cgroups, just move the dirty_sleep timestamp
from bdi_writeback to backing_dev_info. That fixes the checking for
recently throttled task and saves memory for everybody as a bonus.

CC: stable@vger.kernel.org
Fixes: b57d74aff9 ("writeback: track if we're sleeping on progress in balance_dirty_pages()")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240123175826.21452-1-jack@suse.cz
[axboe: fixup indentation errors]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-06 09:44:03 -07:00
Bart Van Assche
449813515d
block, fs: Restore the per-bio/request data lifetime fields
Restore support for passing data lifetime information from filesystems to
block drivers. This patch reverts commit b179c98f76 ("block: Remove
request.write_hint") and commit c75e707fe1 ("block: remove the
per-bio/request write hint").

This patch does not modify the size of struct bio because the new
bi_write_hint member fills a hole in struct bio. pahole reports the
following for struct bio on an x86_64 system with this patch applied:

        /* size: 112, cachelines: 2, members: 20 */
        /* sum members: 110, holes: 1, sum holes: 2 */
        /* last cacheline: 48 bytes */

Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240202203926.2478590-7-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-06 14:31:05 +01:00
Tang Yizhou
3bca7640b4 blk-throttle: Eliminate redundant checks for data direction
After calling throtl_peek_queued(), the data direction can be determined so
there is no need to call bio_data_dir() to check the direction again.

Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240123081248.3752878-1-yizhou.tang@shopee.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:16:12 -07:00
Jens Axboe
06b23f92af block: update cached timestamp post schedule/preemption
Mark the task as having a cached timestamp when set assign it, so we
can efficiently check if it needs updating post being scheduled back in.
This covers both the actual schedule out case, which would've flushed
the plug, and the preemption case which doesn't touch the plugged
requests (for many reasons, one of them being then we'd need to have
preemption disabled around plug state manipulation).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:07:34 -07:00
Jens Axboe
da4c8c3d09 block: cache current nsec time in struct blk_plug
Querying the current time is the most costly thing we do in the block
layer per IO, and depending on kernel config settings, we may do it
many times per IO.

None of the callers actually need nsec granularity. Take advantage of
that by caching the current time in the plug, with the assumption here
being that any time checking will be temporally close enough that the
slight loss of precision doesn't matter.

If the block plug gets flushed, eg on preempt or schedule out, then
we invalidate the cached clock.

On a basic peak IOPS test case with iostats enabled, this changes
the performance from:

IOPS=108.41M, BW=52.93GiB/s, IOS/call=31/31
IOPS=108.43M, BW=52.94GiB/s, IOS/call=32/32
IOPS=108.29M, BW=52.88GiB/s, IOS/call=31/32
IOPS=108.35M, BW=52.91GiB/s, IOS/call=32/32
IOPS=108.42M, BW=52.94GiB/s, IOS/call=31/31
IOPS=108.40M, BW=52.93GiB/s, IOS/call=32/32
IOPS=108.31M, BW=52.89GiB/s, IOS/call=32/31

to

IOPS=118.79M, BW=58.00GiB/s, IOS/call=31/32
IOPS=118.62M, BW=57.92GiB/s, IOS/call=31/31
IOPS=118.80M, BW=58.01GiB/s, IOS/call=32/31
IOPS=118.78M, BW=58.00GiB/s, IOS/call=32/32
IOPS=118.69M, BW=57.95GiB/s, IOS/call=32/31
IOPS=118.62M, BW=57.92GiB/s, IOS/call=32/31
IOPS=118.63M, BW=57.92GiB/s, IOS/call=31/32

which is more than a 9% improvement in performance. Looking at perf diff,
we can see a huge reduction in time overhead:

    10.55%     -9.88%  [kernel.vmlinux]  [k] read_tsc
     1.31%     -1.22%  [kernel.vmlinux]  [k] ktime_get

Note that since this relies on blk_plug for the caching, it's only
applicable to the issue side. But this is where most of the time calls
happen anyway. On the completion side, cached time stamping is done with
struct io_comp patch, as long as the driver supports it.

It's also worth noting that the above testing doesn't enable any of the
higher cost CPU items on the block layer side, like wbt, cgroups,
iocost, etc, which all would add additional time querying and hence
overhead. IOW, results would likely look even better in comparison with
those enabled, as distros would do.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:07:28 -07:00
Jens Axboe
08420cf70c block: add blk_time_get_ns() and blk_time_get() helpers
Convert any user of ktime_get_ns() to use blk_time_get_ns(), and
ktime_get() to blk_time_get(), so we have a unified API for querying the
current time in nanoseconds or as ktime.

No functional changes intended, this patch just wraps ktime_get_ns()
and ktime_get() with a block helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:07:22 -07:00
Jens Axboe
c4e47bbb00 block: move cgroup time handling code into blk.h
In preparation for moving time keeping into blk.h, move the cgroup
related code for timestamps in here too. This will help avoid a circular
dependency, and also moves it into a more appropriate header as this one
is private to the block layer code.

Leave struct bio_issue in blk_types.h as it's a proper time definition.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:07:17 -07:00
Christoph Hellwig
72e84e909e blk-mq: special case cached requests less
Share the main merge / split / integrity preparation code between the
cached request vs newly allocated request cases, and add comments
explaining the cached request handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240124092658.2258309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:06:53 -07:00
Christoph Hellwig
337e89feb7 blk-mq: introduce a blk_mq_peek_cached_request helper
Add a new helper to check if there is suitable cached request in
blk_mq_submit_bio.  This removes open coded logic in blk_mq_submit_bio
and moves some checks that so far are in blk_mq_use_cached_rq to
be performed earlier.  This avoids the case where we first do check
with the cached request but then later end up allocating a new one
anyway and need to grab a queue reference.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240124092658.2258309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:05:16 -07:00
Christoph Hellwig
0f299da55a blk-mq: move blk_mq_attempt_bio_merge out blk_mq_get_new_requests
blk_mq_attempt_bio_merge has nothing to do with allocating a new
request, it avoids allocating a new request.  Move the call out of
blk_mq_get_new_requests and into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20240124092658.2258309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05 10:03:51 -07:00
Hongyu Jin
f3c89983cb block: Fix where bio IO priority gets set
Commit 82b74cac28 ("blk-ioprio: Convert from rqos policy to direct
call") pushed setting bio I/O priority down into blk_mq_submit_bio()
-- which is too low within block core's submit_bio() because it
skips setting I/O priority for block drivers that implement
fops->submit_bio() (e.g. DM, MD, etc).

Fix this by moving bio_set_ioprio() up from blk-mq.c to blk-core.c and
call it from submit_bio().  This ensures all block drivers call
bio_set_ioprio() during initial bio submission.

Fixes: a78418e6a0 ("block: Always initialize bio IO priority on submit")
Co-developed-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Yibin Ding <yibin.ding@unisoc.com>
Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
[snitzer: revised commit header]
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240130202638.62600-2-snitzer@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-01 11:00:06 -07:00
Christoph Hellwig
19871b5c7a iomap: pass the length of the dirty region to ->map_blocks
Let the file system know how much dirty data exists at the passed
in offset.  This allows file systems to allocate the right amount
of space that actually is written back if they can't eagerly
convert (e.g. because they don't support unwritten extents).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231207072710.176093-15-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-01 14:20:13 +01:00
Christian A. Ehrhardt
13f3956eb5 block: Fix WARNING in _copy_from_iter
Syzkaller reports a warning in _copy_from_iter because an
iov_iter is supposedly used in the wrong direction. The reason
is that syzcaller managed to generate a request with
a transfer direction of SG_DXFER_TO_FROM_DEV. This instructs
the kernel to copy user buffers into the kernel, read into
the copied buffers and then copy the data back to user space.

Thus the iovec is used in both directions.

Detect this situation in the block layer and construct a new
iterator with the correct direction for the copy-in.

Reported-by: syzbot+a532b03fdfee2c137666@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/0000000000009b92c10604d7a5e9@google.com/t/
Reported-by: syzbot+63dec323ac56c28e644f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/0000000000003faaa105f6e7c658@google.com/T/
Signed-off-by: Christian A. Ehrhardt <lk@c--e.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240121202634.275068-1-lk@c--e.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-23 08:56:55 -07:00
Li Lingfeng
7777f47f2e block: Move checking GENHD_FL_NO_PART to bdev_add_partition()
Commit 1a721de848 ("block: don't add or resize partition on the disk
with GENHD_FL_NO_PART") prevented all operations about partitions on disks
with GENHD_FL_NO_PART in blkpg_do_ioctl() since they are meaningless.
However, it changed error code in some scenarios. So move checking
GENHD_FL_NO_PART to bdev_add_partition() to eliminate impact.

Fixes: 1a721de848 ("block: don't add or resize partition on the disk with GENHD_FL_NO_PART")
Reported-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Closes: https://lore.kernel.org/all/CAOYeF9VsmqKMcQjo1k6YkGNujwN-nzfxY17N3F-CMikE1tYp+w@mail.gmail.com/
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240118130401.792757-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-22 09:51:29 -07:00
Linus Torvalds
9d1694dc91 for-6.8/block-2024-01-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWpoCgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqIUEADFvJdC2izkPzYzsOrMK5Rt1H7vaHGKhbA+
 zWCuQaa1xQd8bazq+NVnQpbzgclkE/WodTCNfNXcTTjzeQEmcZC888llP3Y9vwyP
 XfEKH7fSaeKvGigJLro1oPe3YV7/t89F5ol3BoZayfzJF8GEU9BXRWzgOkZzijnk
 xdm5wUyn/GknksMuQQraZ+U6bQRFLBOulzoaQeMD6Dosx+uRlM4WvAJawC+uOV6R
 qPT2BVSfYGzmgEKvoaphw0FMkUhFBMDHfXTpQBi5tIzTKOaof8tynYEGz0FHZWeh
 V0JEEp+3jLWFxFXeEcXgBVPJPE8J0DzGm9g17/uwC2Yhmlbw4FKZVRvGG+PpeUso
 D5aqhqm3w0x7HgZ7JKwy/aUctADYvjVcSVzPHTaFK0aCSYCIAXxqv4p7fOoxPqyx
 T32IUHTzGtkCdqzv/xFdtTYhTNM2vyzzbbWj5lXgCBqHsXOVbCh8UM2p+9ec2Umq
 Fo1XF9eoCDe6Sn4s15hJ5G4DEhKGOKkHluvRUdM+0selA5b0sNOeUqlAf2v+0ve3
 Pv3e3X4NPssNIEcsDHf5pc3zGC+LXRS0oFvfIvDESBjwXc3iHIMl+SkjyS57P4Fd
 RKrHEUUiACuCKO/IWqFYLiNBNHnP3RmV5gSxIZr9QJhFSwOzP+/+4++TCdF5vdAV
 amhv+0PdCw==
 =DLW9
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes,
        Christoph)
      - discard fixes and improvements (Christoph)
      - timeout debug improvements (Keith, Max)
      - various cleanups (Daniel, Max, Giuxen)
      - trace event string fixes (Arnd)
      - shadow doorbell setup on reset fix (William)
      - a write zeroes quirk for SK Hynix (Jim)

 - MD pull request via Song:
      - Sparse warning since v6.0 (Bart)
      - /proc/mdstat regression since v6.7 (Yu Kuai)

 - Use symbolic error value (Christian)

 - IO Priority documentation update (Christian)

 - Fix for accessing queue limits without having entered the queue
   (Christoph, me)

 - Fix for loop dio support (Christoph)

 - Move null_blk off deprecated ida interface (Christophe)

 - Ensure nbd initializes full msghdr (Eric)

 - Fix for a regression with the folio conversion, which is now easier
   to hit because of an unrelated change (Matthew)

 - Remove redundant check in virtio-blk (Li)

 - Fix for a potential hang in sbitmap (Ming)

 - Fix for partial zone appending (Damien)

 - Misc changes and fixes (Bart, me, Kemeng, Dmitry)

* tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits)
  Documentation: block: ioprio: Update schedulers
  loop: fix the the direct I/O support check when used on top of block devices
  blk-mq: Remove the hctx 'run' debugfs attribute
  nbd: always initialize struct msghdr completely
  block: Fix iterating over an empty bio with bio_for_each_folio_all
  block: bio-integrity: fix kcalloc() arguments order
  virtio_blk: remove duplicate check if queue is broken in virtblk_done
  sbitmap: remove stale comment in sbq_calc_wake_batch
  block: Correct a documentation comment in blk-cgroup.c
  null_blk: Remove usage of the deprecated ida_simple_xx() API
  block: ensure we hold a queue reference when using queue limits
  blk-mq: rename blk_mq_can_use_cached_rq
  block: print symbolic error name instead of error code
  blk-mq: fix IO hang from sbitmap wakeup race
  nvmet-rdma: avoid circular locking dependency on install_queue()
  nvmet-tcp: avoid circular locking dependency on install_queue()
  nvme-pci: set doorbell config before unquiescing
  block: fix partial zone append completion handling in req_bio_endio()
  block/iocost: silence warning on 'last_period' potentially being unused
  md/raid1: Use blk_opf_t for read and write operations
  ...
2024-01-18 18:22:40 -08:00
Bart Van Assche
49e60333d7 blk-mq: Remove the hctx 'run' debugfs attribute
Nobody uses the debugfs hctx 'run' attribute. Hence remove this
attribute and also the code that updates the corresponding member
variable.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Gabriel Ryan <gabe@cs.columbia.edu>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240117203609.4122520-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17 14:16:34 -07:00
Dmitry Antipov
be50df31c4 block: bio-integrity: fix kcalloc() arguments order
When compiling with gcc version 14.0.1 20240116 (experimental)
and W=1, I've noticed the following warning:

block/bio-integrity.c: In function 'bio_integrity_map_user':
block/bio-integrity.c:339:38: warning: 'kcalloc' sizes specified with 'sizeof'
in the earlier argument and not in the later argument [-Wcalloc-transposed-args]
  339 |                 bvec = kcalloc(sizeof(*bvec), nr_vecs, GFP_KERNEL);
      |                                      ^
block/bio-integrity.c:339:38: note: earlier argument should specify number of
elements, later size of each element

Since 'n' and 'size' arguments of 'kcalloc()' are multiplied to
calculate the final size, their actual order doesn't affect the
result and so this is not a bug. But it's still worth to fix it.

Fixes: 492c5d4559 ("block: bio-integrity: directly map user buffers")
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20240116143437.89060-1-dmantipov@yandex.ru
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-16 09:51:22 -07:00
Nicky Chorley
521277d12b block: Correct a documentation comment in blk-cgroup.c
Commit 99e6038743
("blk-cgroup: pass a gendisk to the blkg allocation helpers") changed
blkg_alloc() to take a struct gendisk instead of a struct request_queue,
but the documentation comment still referred to q.

So, update that comment to refer to disk instead and fix a typo.

Signed-off-by: Nicky Chorley <ndchorley@gmail.com>
Link: https://lore.kernel.org/r/20240114191056.6992-1-ndchorley@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-15 07:23:38 -07:00
Jens Axboe
7b4f36cd22 block: ensure we hold a queue reference when using queue limits
q_usage_counter is the only thing preventing us from the limits changing
under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold
it while calling into it.

Move the splitting inside the region where we know we've got a queue
reference. Ideally this could still remain a shared section of code, but
let's keep the fix simple and defer any refactoring here to later.

Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: 900e080752 ("block: move queue enter logic into blk_mq_submit_bio()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-12 21:09:42 -07:00
Christoph Hellwig
309ce67414 blk-mq: rename blk_mq_can_use_cached_rq
blk_mq_can_use_cached_rq doesn't just check if we can use the request,
but also performs the work to actually use it.  Remove the _can in the
naming, and improve the comment describing the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240111135705.2155518-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-12 09:11:34 -07:00
Christian Heusel
25c1772a04 block: print symbolic error name instead of error code
Utilize the %pe print specifier to get the symbolic error name as a
string (i.e "-ENOMEM") in the log message instead of the error code to
increase its readablility.

This change was suggested in
https://lore.kernel.org/all/92972476-0b1f-4d0a-9951-af3fc8bc6e65@suswa.mountain/

Signed-off-by: Christian Heusel <christian@heusel.eu>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20240111231521.1596838-1-christian@heusel.eu
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-12 09:07:46 -07:00
Ming Lei
5266caaf56 blk-mq: fix IO hang from sbitmap wakeup race
In blk_mq_mark_tag_wait(), __add_wait_queue() may be re-ordered
with the following blk_mq_get_driver_tag() in case of getting driver
tag failure.

Then in __sbitmap_queue_wake_up(), waitqueue_active() may not observe
the added waiter in blk_mq_mark_tag_wait() and wake up nothing, meantime
blk_mq_mark_tag_wait() can't get driver tag successfully.

This issue can be reproduced by running the following test in loop, and
fio hang can be observed in < 30min when running it on my test VM
in laptop.

	modprobe -r scsi_debug
	modprobe scsi_debug delay=0 dev_size_mb=4096 max_queue=1 host_max_queue=1 submit_queues=4
	dev=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
	fio --filename=/dev/"$dev" --direct=1 --rw=randrw --bs=4k --iodepth=1 \
       		--runtime=100 --numjobs=40 --time_based --name=test \
        	--ioengine=libaio

Fix the issue by adding one explicit barrier in blk_mq_mark_tag_wait(), which
is just fine in case of running out of tag.

Cc: Jan Kara <jack@suse.cz>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Reported-by: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20240112122626.4181044-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-12 08:48:46 -07:00
Linus Torvalds
01d550f0fc for-6.8/block-2024-01-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWcIOIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn6hD/9oO7U75PuxUwYYHZ9Uzxpw6gQ0LEmeyJmE
 NQYCkfYHVq3IsgOdF7elI9v3qtr6v8V8CdB7cByrnn3DgwsMuiTKZZ0dK7vH37PO
 DX+/xn349e8oH7RdRo7f3m95g1YbHfpfnj0Rc4mjTDV72Jr/HlLTVgGTQg8DEnCR
 wBIFmeuBHHgeeLh87gsWLAP7ReReiy9V1uqpDFsko2/4BxRAM/8eedkwcAxD8aEy
 rd+dT/SBQj2cOdQMUeExT3gWjwzHh6ZHx3f1WCLK5fdck6BogH2hBUeri6F/H98L
 HoaXjBZYBTH68hB/mnO5I4g1ZlrVM74Vp7JPa3e1SFFtyEi6lsyrk2J3GoNh0E7r
 pXqH5kAcaJwBsBrbRGuvEyGbn9RLTaN5Gvseud0VE4oMruyodTniQaHXuIGackgz
 sMavMho4486EUWPaF7gIBdLNK1hO13w+IDZ4+3oBxhudMqdgZbk4iYpOCqQ7QY5G
 2vkzAE/sZ+aVNXeaIQOI8dE5clBy8gJ+6+t8dm3DY1r1xdbcnU40iZ8/fri3h69r
 vHs9bpQnVWZF0gEyEflY1pkcAPpIkvMmWCR7Ehy5YCkIfa+qfSL05o3dicpWovLP
 N+gCtpkhTK2AvmUWsUMypMLRvoSOImyCIiobrr3qNBaUdgRP8xKfUa72RuRp8cGl
 Vrj5oAiE3w==
 =YAfp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:
 "Pretty quiet round this time around. This contains:

   - NVMe updates via Keith:
        - nvme fabrics spec updates (Guixin, Max)
        - nvme target udpates (Guixin, Evan)
        - nvme attribute refactoring (Daniel)
        - nvme-fc numa fix (Keith)

   - MD updates via Song:
        - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai)
        - Fix raid5 hang issue (Junxiao Bi)
        - Add Yu Kuai as Reviewer of the md subsystem
        - Remove deprecated flavors (Song Liu)
        - raid1 read error check support (Li Nan)
        - Better handle events off-by-1 case (Alex Lyakas)

   - Efficiency improvements for passthrough (Kundan)

   - Support for mapping integrity data directly (Keith)

   - Zoned write fix (Damien)

   - rnbd fixes (Kees, Santosh, Supriti)

   - Default to a sane discard size granularity (Christoph)

   - Make the default max transfer size naming less confusing
     (Christoph)

   - Remove support for deprecated host aware zoned model (Christoph)

   - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel,
     Bart, Christoph)"

* tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits)
  block: Treat sequential write preferred zone type as invalid
  block: remove disk_clear_zoned
  sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics
  drivers/block/xen-blkback/common.h: Fix spelling typo in comment
  blk-cgroup: fix rcu lockdep warning in blkg_lookup()
  blk-cgroup: don't use removal safe list iterators
  block: floor the discard granularity to the physical block size
  mtd_blkdevs: use the default discard granularity
  bcache: use the default discard granularity
  zram: use the default discard granularity
  null_blk: use the default discard granularity
  nbd: use the default discard granularity
  ubd: use the default discard granularity
  block: default the discard granularity to sector size
  bcache: discard_granularity should not be smaller than a sector
  block: remove two comments in bio_split_discard
  block: rename and document BLK_DEF_MAX_SECTORS
  loop: don't abuse BLK_DEF_MAX_SECTORS
  aoe: don't abuse BLK_DEF_MAX_SECTORS
  null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS
  ...
2024-01-11 13:58:04 -08:00
Damien Le Moal
748dc0b65e block: fix partial zone append completion handling in req_bio_endio()
Partial completions of zone append request is not allowed but if a zone
append completion indicates a number of completed bytes different from
the original BIO size, only the BIO status is set to error. This leads
to bio_advance() not setting the BIO size to 0 and thus to not call
bio_endio() at the end of req_bio_endio().

Make sure a partially completed zone append is failed and completed
immediately by forcing the completed number of bytes (nbytes) to be
equal to the BIO size, thus ensuring that bio_endio() is called.

Fixes: 297db73184 ("block: fix req_bio_endio append error handling")
Cc: stable@kernel.vger.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20240110092942.442334-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-10 09:01:16 -07:00
Jens Axboe
742e324a06 block/iocost: silence warning on 'last_period' potentially being unused
If CONFIG_TRACEPOINTS isn't enabled, we assign this variable but then
never use it. This can cause the compiler to complain about that:

block/blk-iocost.c:1264:6: warning: variable 'last_period' set but not used [-Wunused-but-set-variable]
 1264 |         u64 last_period, cur_period;
      |             ^

Rather than add ifdefs to guard this, just mark it __maybe_unused.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202401102335.GiWdeIo9-lkp@intel.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-10 08:43:06 -07:00
Linus Torvalds
fb46e22a9e Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
 
 - Peng Zhang has done some mapletree maintainance work in the
   series
 
 	"maple_tree: add mt_free_one() and mt_attr() helpers"
 	"Some cleanups of maple tree"
 
 - In the series "mm: use memmap_on_memory semantics for dax/kmem"
   Vishal Verma has altered the interworking between memory-hotplug
   and dax/kmem so that newly added 'device memory' can more easily
   have its memmap placed within that newly added memory.
 
 - Matthew Wilcox continues folio-related work (including a few
   fixes) in the patch series
 
 	"Add folio_zero_tail() and folio_fill_tail()"
 	"Make folio_start_writeback return void"
 	"Fix fault handler's handling of poisoned tail pages"
 	"Convert aops->error_remove_page to ->error_remove_folio"
 	"Finish two folio conversions"
 	"More swap folio conversions"
 
 - Kefeng Wang has also contributed folio-related work in the series
 
 	"mm: cleanup and use more folio in page fault"
 
 - Jim Cromie has improved the kmemleak reporting output in the
   series "tweak kmemleak report format".
 
 - In the series "stackdepot: allow evicting stack traces" Andrey
   Konovalov to permits clients (in this case KASAN) to cause
   eviction of no longer needed stack traces.
 
 - Charan Teja Kalla has fixed some accounting issues in the page
   allocator's atomic reserve calculations in the series "mm:
   page_alloc: fixes for high atomic reserve caluculations".
 
 - Dmitry Rokosov has added to the samples/ dorectory some sample
   code for a userspace memcg event listener application.  See the
   series "samples: introduce cgroup events listeners".
 
 - Some mapletree maintanance work from Liam Howlett in the series
   "maple_tree: iterator state changes".
 
 - Nhat Pham has improved zswap's approach to writeback in the
   series "workload-specific and memory pressure-driven zswap
   writeback".
 
 - DAMON/DAMOS feature and maintenance work from SeongJae Park in
   the series
 
 	"mm/damon: let users feed and tame/auto-tune DAMOS"
 	"selftests/damon: add Python-written DAMON functionality tests"
 	"mm/damon: misc updates for 6.8"
 
 - Yosry Ahmed has improved memcg's stats flushing in the series
   "mm: memcg: subtree stats flushing and thresholds".
 
 - In the series "Multi-size THP for anonymous memory" Ryan Roberts
   has added a runtime opt-in feature to transparent hugepages which
   improves performance by allocating larger chunks of memory during
   anonymous page faults.
 
 - Matthew Wilcox has also contributed some cleanup and maintenance
   work against eh buffer_head code int he series "More buffer_head
   cleanups".
 
 - Suren Baghdasaryan has done work on Andrea Arcangeli's series
   "userfaultfd move option".  UFFDIO_MOVE permits userspace heap
   compaction algorithms to move userspace's pages around rather than
   UFFDIO_COPY'a alloc/copy/free.
 
 - Stefan Roesch has developed a "KSM Advisor", in the series
   "mm/ksm: Add ksm advisor".  This is a governor which tunes KSM's
   scanning aggressiveness in response to userspace's current needs.
 
 - Chengming Zhou has optimized zswap's temporary working memory
   use in the series "mm/zswap: dstmem reuse optimizations and
   cleanups".
 
 - Matthew Wilcox has performed some maintenance work on the
   writeback code, both code and within filesystems.  The series is
   "Clean up the writeback paths".
 
 - Andrey Konovalov has optimized KASAN's handling of alloc and
   free stack traces for secondary-level allocators, in the series
   "kasan: save mempool stack traces".
 
 - Andrey also performed some KASAN maintenance work in the series
   "kasan: assorted clean-ups".
 
 - David Hildenbrand has gone to town on the rmap code.  Cleanups,
   more pte batching, folio conversions and more.  See the series
   "mm/rmap: interface overhaul".
 
 - Kinsey Ho has contributed some maintenance work on the MGLRU
   code in the series "mm/mglru: Kconfig cleanup".
 
 - Matthew Wilcox has contributed lruvec page accounting code
   cleanups in the series "Remove some lruvec page accounting
   functions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZZyF2wAKCRDdBJ7gKXxA
 jjWjAP42LHvGSjp5M+Rs2rKFL0daBQsrlvy6/jCHUequSdWjSgEAmOx7bc5fbF27
 Oa8+DxGM9C+fwqZ/7YxU2w/WuUmLPgU=
 =0NHs
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Peng Zhang has done some mapletree maintainance work in the series

	'maple_tree: add mt_free_one() and mt_attr() helpers'
	'Some cleanups of maple tree'

   - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
     Vishal Verma has altered the interworking between memory-hotplug
     and dax/kmem so that newly added 'device memory' can more easily
     have its memmap placed within that newly added memory.

   - Matthew Wilcox continues folio-related work (including a few fixes)
     in the patch series

	'Add folio_zero_tail() and folio_fill_tail()'
	'Make folio_start_writeback return void'
	'Fix fault handler's handling of poisoned tail pages'
	'Convert aops->error_remove_page to ->error_remove_folio'
	'Finish two folio conversions'
	'More swap folio conversions'

   - Kefeng Wang has also contributed folio-related work in the series

	'mm: cleanup and use more folio in page fault'

   - Jim Cromie has improved the kmemleak reporting output in the series
     'tweak kmemleak report format'.

   - In the series 'stackdepot: allow evicting stack traces' Andrey
     Konovalov to permits clients (in this case KASAN) to cause eviction
     of no longer needed stack traces.

   - Charan Teja Kalla has fixed some accounting issues in the page
     allocator's atomic reserve calculations in the series 'mm:
     page_alloc: fixes for high atomic reserve caluculations'.

   - Dmitry Rokosov has added to the samples/ dorectory some sample code
     for a userspace memcg event listener application. See the series
     'samples: introduce cgroup events listeners'.

   - Some mapletree maintanance work from Liam Howlett in the series
     'maple_tree: iterator state changes'.

   - Nhat Pham has improved zswap's approach to writeback in the series
     'workload-specific and memory pressure-driven zswap writeback'.

   - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
     series

	'mm/damon: let users feed and tame/auto-tune DAMOS'
	'selftests/damon: add Python-written DAMON functionality tests'
	'mm/damon: misc updates for 6.8'

   - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
     memcg: subtree stats flushing and thresholds'.

   - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
     has added a runtime opt-in feature to transparent hugepages which
     improves performance by allocating larger chunks of memory during
     anonymous page faults.

   - Matthew Wilcox has also contributed some cleanup and maintenance
     work against eh buffer_head code int he series 'More buffer_head
     cleanups'.

   - Suren Baghdasaryan has done work on Andrea Arcangeli's series
     'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
     compaction algorithms to move userspace's pages around rather than
     UFFDIO_COPY'a alloc/copy/free.

   - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
     Add ksm advisor'. This is a governor which tunes KSM's scanning
     aggressiveness in response to userspace's current needs.

   - Chengming Zhou has optimized zswap's temporary working memory use
     in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

   - Matthew Wilcox has performed some maintenance work on the writeback
     code, both code and within filesystems. The series is 'Clean up the
     writeback paths'.

   - Andrey Konovalov has optimized KASAN's handling of alloc and free
     stack traces for secondary-level allocators, in the series 'kasan:
     save mempool stack traces'.

   - Andrey also performed some KASAN maintenance work in the series
     'kasan: assorted clean-ups'.

   - David Hildenbrand has gone to town on the rmap code. Cleanups, more
     pte batching, folio conversions and more. See the series 'mm/rmap:
     interface overhaul'.

   - Kinsey Ho has contributed some maintenance work on the MGLRU code
     in the series 'mm/mglru: Kconfig cleanup'.

   - Matthew Wilcox has contributed lruvec page accounting code cleanups
     in the series 'Remove some lruvec page accounting functions'"

* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
  mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
  mm, treewide: introduce NR_PAGE_ORDERS
  selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
  selftests/mm: skip test if application doesn't has root privileges
  selftests/mm: conform test to TAP format output
  selftests: mm: hugepage-mmap: conform to TAP format output
  selftests/mm: gup_test: conform test to TAP format output
  mm/selftests: hugepage-mremap: conform test to TAP format output
  mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
  mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
  mm/memcontrol: remove __mod_lruvec_page_state()
  mm/khugepaged: use a folio more in collapse_file()
  slub: use a folio in __kmalloc_large_node
  slub: use folio APIs in free_large_kmalloc()
  slub: use alloc_pages_node() in alloc_slab_page()
  mm: remove inc/dec lruvec page state functions
  mm: ratelimit stat flush from workingset shrinker
  kasan: stop leaking stack trace handles
  mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
  mm/mglru: add dummy pmd_dirty()
  ...
2024-01-09 11:18:47 -08:00
Jens Axboe
3b7cb74547 block: move __get_task_ioprio() into header file
We call this once per IO, which can be millions of times per second.
Since nobody really uses io priorities, or at least it isn't very
common, this is all wasted time and can amount to as much as 3% of
the total kernel time.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-08 12:27:39 -07:00
Linus Torvalds
3f6984e730 vfs-6.8.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZZUx4wAKCRCRxhvAZXjc
 osaNAQC/c+xXVfiq/pFbuK9MQLna4RGZaGcG9k312YniXbHq0AD9HAf4aPcZwPy1
 /wkD4pauj3UZ3f0xBSyazGBvAXyN0Qc=
 =iFAQ
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs super updates from Christian Brauner:
 "This contains the super work for this cycle including the long-awaited
  series by Jan to make it possible to prevent writing to mounted block
  devices:

   - Writing to mounted devices is dangerous and can lead to filesystem
     corruption as well as crashes. Furthermore syzbot comes with more
     and more involved examples how to corrupt block device under a
     mounted filesystem leading to kernel crashes and reports we can do
     nothing about. Add tracking of writers to each block device and a
     kernel cmdline argument which controls whether other writeable
     opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are
     allowed.

     Note that this effectively only prevents modification of the
     particular block device's page cache by other writers. The actual
     device content can still be modified by other means - e.g. by
     issuing direct scsi commands, by doing writes through devices lower
     in the storage stack (e.g. in case loop devices, DM, or MD are
     involved) etc. But blocking direct modifications of the block
     device page cache is enough to give filesystems a chance to perform
     data validation when loading data from the underlying storage and
     thus prevent kernel crashes.

     Syzbot can use this cmdline argument option to avoid uninteresting
     crashes. Also users whose userspace setup does not need writing to
     mounted block devices can set this option for hardening. We expect
     that this will be interesting to quite a few workloads.

     Btrfs is currently opted out of this because they still haven't
     merged patches we require for this to work from three kernel
     releases ago.

   - Reimplement block device freezing and thawing as holder operations
     on the block device.

     This allows us to extend block device freezing to all devices
     associated with a superblock and not just the main device. It also
     allows us to remove get_active_super() and thus another function
     that scans the global list of superblocks.

     Freezing via additional block devices only works if the filesystem
     chooses to use @fs_holder_ops for these additional devices as well.
     That currently only includes ext4 and xfs.

     Earlier releases switched get_tree_bdev() and mount_bdev() to use
     @fs_holder_ops. The remaining nilfs2 open-coded version of
     mount_bdev() has been converted to rely on @fs_holder_ops as well.
     So block device freezing for the main block device will continue to
     work as before.

     There should be no regressions in functionality. The only special
     case is btrfs where block device freezing for the main block device
     never worked because sb->s_bdev isn't set. Block device freezing
     for btrfs can be fixed once they can switch to @fs_holder_ops but
     that can happen whenever they're ready"

* tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
  block: Fix a memory leak in bdev_open_by_dev()
  super: don't bother with WARN_ON_ONCE()
  super: massage wait event mechanism
  ext4: Block writes to journal device
  xfs: Block writes to log device
  fs: Block writes to mounted block devices
  btrfs: Do not restrict writes to btrfs devices
  block: Add config option to not allow writing to mounted devices
  block: Remove blkdev_get_by_*() functions
  bcachefs: Convert to bdev_open_by_path()
  fs: handle freezing from multiple devices
  fs: remove dead check
  nilfs2: simplify device handling
  fs: streamline thaw_super_locked
  ext4: simplify device handling
  xfs: simplify device handling
  fs: simplify setup_bdev_super() calls
  blkdev: comment fs_holder_ops
  porting: document block device freeze and thaw changes
  fs: remove unused helper
  ...
2024-01-08 10:43:51 -08:00
Damien Le Moal
587371ed78 block: Treat sequential write preferred zone type as invalid
With the removal of the support for host-aware zoned devices,
blk_revalidate_zone_cb() should never see the zone type
BLK_ZONE_TYPE_SEQWRITE_PREF (sequential write preffered zones). Treat
this zone type as being invalid.

Fixes: 7437bb73f0 ("block: remove support for the host aware zone model")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240107072212.1071080-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-08 08:34:24 -07:00
Christoph Hellwig
4e33b071bb block: remove disk_clear_zoned
disk_clear_zoned is unused now that the last warts of the host-aware
model support in sd are gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20231228075141.362560-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-08 08:27:22 -07:00
Ming Lei
393cd8ffd8 blk-cgroup: fix rcu lockdep warning in blkg_lookup()
blkg_lookup() is called with either queue_lock or rcu read lock, so
use rcu_dereference_check(lockdep_is_held(&q->queue_lock)) for
retrieving 'blkg', which way models the check exactly for covering
queue lock or rcu read lock.

Fix lockdep warning of "block/blk-cgroup.h:254 suspicious rcu_dereference_check() usage!"
from blkg_lookup().

Tested-by: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Fixes: 83462a6c97 ("blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()")
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231219012833.2129540-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-04 16:08:54 -07:00
Daniel Vacek
fab4c16c52 blk-cgroup: don't use removal safe list iterators
Commit f1c006f1c6 moved deletion of the list blkg->q_node from
blkg_destroy() to blkg_free_workfn(). Switch to using the list
iterators, as we don't need removal protection anymore.

Signed-off-by: Daniel Vacek <neelx@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20240104180031.148148-1-neelx@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-04 16:07:56 -07:00
Christoph Hellwig
458aa1a099 block: floor the discard granularity to the physical block size
Discarding less than a physical block doesn't make sense.  This fixes
the existing behavior for zram before the recent changes to default
the discard granularity to the logical block size, and is also a
generally useful sanity check.

Fixes: 3753039def ("zram: use the default discard granularity")
Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20240103081622.508754-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-04 16:05:49 -07:00
Matthew Wilcox (Oracle)
17bf23a981 fs: convert block_write_full_page to block_write_full_folio
Convert the function to be compatible with writepage_t so that it can be
passed to write_cache_pages() by blkdev.  This removes a call to
compound_head().  We can also remove the function export as both callers
are built-in.

Link: https://lkml.kernel.org/r/20231215200245.748418-14-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29 11:58:35 -08:00
Linus Torvalds
09c57a762e block-6.7-2023-12-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmWO7A0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjqsEAC4IpYMGeEiY2kEBPLKm/vUEFsQtT2YkdN0
 isc6h6xepmqeZO9a+6jLZ5zadCT6IbCdORN0zUrO91lRdfVolIGNDWnkArAye7Yj
 62FbYf7+G7WebW8XAMkbx43dFik7qneZpN852b/kaGSHF+SZPsj2hTG5yRuuahcA
 nVKRk+7U8Y0ZvjCpd6E2Ir2R86KU/JspCEWCYvQ/kz/AYyxQzZUElRO4ptPB/NTb
 EcWexEFaLi770UkNqrKf87YlcnWiWbeNyeTjRrzDsp4nf8MwxB+tgg2ApGf4lMLv
 B7lTEIjww7JbsVouLf/EWylp+PI+Ebl4tA3oiya/skby/eiO5fhXMM6NsnIxKx/a
 Lg3uOg4Xs2Se0GYiw63mDZCpf/+joKTike6smG76GQ5cdsDQoHczvGiAPul5Bp5w
 6s5whj4ODdxbLsUqZlfToeZnLL91TI8+UbVJt/fIp5lF0+tiICvdpa2b3hBn6rYP
 WmenW21OX1Pr4S7X1+lAc8MD/hYpEpbY4U5jWKOejxA9KeQNAK9TJOWQQF3Lu6/s
 EGzJpFJmZd+G6qSMuJHSoHlPHk9WfHXGatQ/T9ecizVfj7FlUPtbFvbyoKzkY4xK
 8Z1pRd2JhDS7GB2bgu/+Eagz0LksHnAe4eK52wgy/uvNdcbTsLGUgsQyAWHHaCKY
 7OUzMi/ekg==
 =wPUa
 -----END PGP SIGNATURE-----

Merge tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Fix for a badly numbered flag, and a regression fix for the badblocks
  updates from this merge window"

* tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux:
  block: renumber QUEUE_FLAG_HW_WC
  badblocks: avoid checking invalid range in badblocks_check()
2023-12-29 11:41:40 -08:00
Christoph Hellwig
3c407dc723 block: default the discard granularity to sector size
Current the discard granularity defaults to 0 and must be initialized by
any driver that wants to support discard.  Default to the sector size
instead, which is the smallest possible value, and a very useful default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29 08:44:12 -07:00
Christoph Hellwig
928a5dd3a8 block: remove two comments in bio_split_discard
A zero discard_granularity is not treated the same as a single-block one,
and not having any segments after taking alignment is perfectly fine
and does not need a warning.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231228075545.362768-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29 08:44:12 -07:00
Christophe JAILLET
8ff363ade3
block: Fix a memory leak in bdev_open_by_dev()
If we early exit here, 'handle' needs to be freed, or some memory leaks.

Fixes: ed5cc702d3 ("block: Add config option to not allow writing to mounted devices")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/8eaec334781e695810aaa383b55de00ca4ab1352.1703439383.git.christophe.jaillet@wanadoo.fr
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-28 11:48:17 +01:00
Christoph Hellwig
d6b9f4e6f7 block: rename and document BLK_DEF_MAX_SECTORS
Give BLK_DEF_MAX_SECTORS a _CAP postfix and document what it is used for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231227092305.279567-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-27 10:46:01 -07:00
Christoph Hellwig
5d13243820 blk-wbt: remove the separate write cache tracking
Use the queue wide write back cache tracking insted of duplicating the
value in strut rq_wb.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231226090747.204969-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-26 09:28:10 -07:00
Christoph Hellwig
1c042f8d4b block: reject invalid operation in submit_bio_noacct
submit_bio_noacct allows completely invalid operations, or operations
that are not supported in the bio path.  Extent the existing switch
statement to rejcect all invalid types.

Move the code point for REQ_OP_ZONE_APPEND so that it's not right in the
middle of the zone management operations and the switch statement can
follow the numerical order of the operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231221070538.1112446-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-26 09:27:14 -07:00
Coly Li
146e843f6b badblocks: avoid checking invalid range in badblocks_check()
If prev_badblocks() returns '-1', it means no valid badblocks record
before the checking range. It doesn't make sense to check whether
the input checking range is overlapped with the non-existed invalid
front range.

This patch checkes whether 'prev >= 0' is true before calling
overlap_front(), to void such invalid operations.

Fixes: 3ea3354cb9 ("badblocks: improve badblocks_check() for multiple ranges handling")
Reported-and-tested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/nvdimm/3035e75a-9be0-4bc3-8d4a-6e52c207f277@leemhuis.info/
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Geliang Tang <geliang.tang@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/r/20231224002820.20234-1-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-23 18:38:08 -07:00
Jens Axboe
5165799f0d block: export disk_clear_zoned()
A previous commit split disk_set_zoned(..., bool) into not taking an
argument for whether to set or clear, and instead added
disk_clear_zoned() as the counterpart. However, that commit neglected
to export the new symbol, causing failures for modular drivers that
used it.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: d73e93b4df ("block: simplify disk_set_zoned")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-20 20:32:12 -07:00
Christoph Hellwig
d73e93b4df block: simplify disk_set_zoned
Only use disk_set_zoned to actually enable zoned device support.
For clearing it, call disk_clear_zoned, which is renamed from
disk_clear_zone_settings and now directly clears the zoned flag as
well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Christoph Hellwig
7437bb73f0 block: remove support for the host aware zone model
When zones were first added the SCSI and ATA specs, two different
models were supported (in addition to the drive managed one that
is invisible to the host):

 - host managed where non-conventional zones there is strict requirement
   to write at the write pointer, or else an error is returned
 - host aware where a write point is maintained if writes always happen
   at it, otherwise it is left in an under-defined state and the
   sequential write preferred zones behave like conventional zones
   (probably very badly performing ones, though)

Not surprisingly this lukewarm model didn't prove to be very useful and
was finally removed from the ZBC and SBC specs (NVMe never implemented
it).  Due to to the easily disappearing write pointer host software
could never rely on the write pointer to actually be useful for say
recovery.

Fortunately only a few HDD prototypes shipped using this model which
never made it to mass production.  Drop the support before it is too
late.  Note that any such host aware prototype HDD can still be used
with Linux as we'll now treat it as a conventional HDD.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 20:17:43 -07:00
Li Nan
4c434392c4 block: add check of 'minors' and 'first_minor' in device_add_disk()
'first_minor' represents the starting minor number of disks, and
'minors' represents the number of partitions in the device. Neither
of them can be greater than MINORMASK + 1.

Commit e338924bd0 ("block: check minor range in device_add_disk()")
only added the check of 'first_minor + minors'. However, their sum might
be less than MINORMASK but their values are wrong. Complete the checks now.

Fixes: e338924bd0 ("block: check minor range in device_add_disk()")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231219075942.840255-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19 08:23:12 -07:00
Kundan Kumar
6c9b97085c block: skip cgroups for passthrough io
Even if BLK_CGROUP is enabled, it does not work for passthrough io.
So skip setting up blkg for passthrough bio.

Reduced processing gives ~5% hike in peak-performance workload.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231218152722.1768-1-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-18 09:46:53 -07:00
Christoph Hellwig
6ef02df154 block: support adding less than len in bio_add_hw_page
bio_add_hw_page currently always fails or succeeds.  This is fine for
the existing callers that always add PAGE_SIZE worth given that the
max_segment_size and max_sectors must always allow at least a page
worth of data.  But when we want to add it for bigger amounts of data
this means it can also fail when adding the data to a bio, and creating
a fallback for that becomes really annoying in the callers.

Make use of the existing API design that allows to return a smaller
length than the one passed in and add up to max_segment_size worth
of data from a larger input.  All the existing callers are fine with
this - not because they handle this return correctly, but because they
never pass more than a page in.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231204173419.782378-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-15 07:34:27 -07:00
Christoph Hellwig
3f034c374a block: prevent an integer overflow in bvec_try_merge_hw_page
Reordered a check to avoid a possible overflow when adding len to bv_len.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231204173419.782378-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-15 07:34:27 -07:00
Bart Van Assche
f19d1e3b17 block: Use pr_info() instead of printk(KERN_INFO ...)
Switch to the modern style of printing kernel messages. Use %u instead
of %d to print unsigned integers.

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231213194702.90381-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-14 10:28:56 -07:00
Min Li
6f64f866aa block: add check that partition length needs to be aligned with block size
Before calling add partition or resize partition, there is no check
on whether the length is aligned with the logical block size.
If the logical block size of the disk is larger than 512 bytes,
then the partition size maybe not the multiple of the logical block size,
and when the last sector is read, bio_truncate() will adjust the bio size,
resulting in an IO error if the size of the read command is smaller than
the logical block size.If integrity data is supported, this will also
result in a null pointer dereference when calling bio_integrity_free.

Cc:  <stable@vger.kernel.org>
Signed-off-by: Min Li <min15.li@samsung.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230629142517.121241-1-min15.li@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-13 08:19:14 -07:00
Li Nan
5fa3d1a00c block: Set memalloc_noio to false on device_add_disk() error path
On the error path of device_add_disk(), device's memalloc_noio flag was
set but not cleared. As the comment of pm_runtime_set_memalloc_noio(),
"The function should be called between device_add() and device_del()".
Clear this flag before device_del() now.

Fixes: 25e823c8c3 ("block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231211075356.1839282-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-13 08:17:02 -07:00
Matthew Wilcox (Oracle)
af7628d6ec fs: convert error_remove_page to error_remove_folio
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.

Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 16:51:42 -08:00
Matthew Wilcox (Oracle)
1b151e2435 block: Remove special-casing of compound pages
The special casing was originally added in pre-git history; reproducing
the commit log here:

> commit a318a92567d77
> Author: Andrew Morton <akpm@osdl.org>
> Date:   Sun Sep 21 01:42:22 2003 -0700
>
>     [PATCH] Speed up direct-io hugetlbpage handling
>
>     This patch short-circuits all the direct-io page dirtying logic for
>     higher-order pages.  Without this, we pointlessly bounce BIOs up to
>     keventd all the time.

In the last twenty years, compound pages have become used for more than
just hugetlb.  Rewrite these functions to operate on folios instead
of pages and remove the special case for hugetlbfs; I don't think
it's needed any more (and if it is, we can put it back in as a call
to folio_test_hugetlb()).

This was found by inspection; as far as I can tell, this bug can lead
to pages used as the destination of a direct I/O read not being marked
as dirty.  If those pages are then reclaimed by the MM without being
dirtied for some other reason, they won't be written out.  Then when
they're faulted back in, they will not contain the data they should.
It'll take a pretty unusual setup to produce this problem with several
races all going the wrong way.

This problem predates the folio work; it could for example have been
triggered by mmaping a THP in tmpfs and using that as the target of an
O_DIRECT read.

Fixes: 800d8c63b2 ("shmem: add huge pages support")
Cc:  <stable@vger.kernel.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-07 14:02:20 -07:00
Kundan Kumar
847c5bcdfb block: skip QUEUE_FLAG_STATS and rq-qos for passthrough io
Write-back throttling (WBT) enables QUEUE_FLAG_STATS on the request
queue. But WBT does not make sense for passthrough io, so skip
QUEUE_FLAG_STATS processing.

Also skip rq_qos_issue/done for passthrough io.

Overall, the change gives ~11% hike in peak performance.

Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20231123190331.7934-1-kundan.kumar@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-01 18:29:18 -07:00
Keith Busch
492c5d4559 block: bio-integrity: directly map user buffers
Passthrough commands that utilize metadata currently need to bounce the
user space buffer through the kernel. Add support for mapping user space
directly so that we can avoid this costly overhead. This is similar to
how the normal bio data payload utilizes user addresses with
bio_map_user_iov().

If the user address can't directly be used for reason, like too many
segments or address unalignement, fallback to a copy of the user vec
while keeping the user address pinned for the IO duration so that it
can safely be copied on completion in any process context.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20231130215309.2923568-2-kbusch@meta.com
[axboe: fold in fix from Kanchan Joshi]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-01 18:29:00 -07:00
Linus Torvalds
ee0c8a9b34 block-6.7-2023-12-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmVqKo4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqL9D/9bPvuA+Oogx+C/kNConjxnuyPBiXcZjb/4
 5gO/6N0FC8yu+HQqgscGTyEjJO2FKfLx+YxxBs1UVIt4Tm+jZwC3nPqw9X4W3RCz
 pK9fxCNlzxey0SZU3ZJQIOtqP3df5Yuas9V/h35GS4m1XaoDE6cPpsIVUrAnoNwg
 W990L8sOy6y4XzMPzyHJCyoDCay1Qp2ly0Vdlz4/ESRmEp564i42nFN+8zpZ/w7h
 V+Ekn6JwP1ssqUeY/k43QcfRzYwSvvnTQJ1y9t3erf6HcHtpbCgnL1jTaGEmr4IS
 1sw3ffqo23xBSsGP+D2OF4+9pwGI9+xwNpYnRdrpDPxKhCn5EEh+g6+f+m7YEnFV
 q1swlMTqHtRLFdYbKe8Tl8hPRwEeSpKy8sXph56hwGZY0T/IyB+Pe3aXrh1DYPA5
 4+GASZHFQPH82P1ibVNdpMRZe4rPPblw38GZauZ1JbI0m0zXqEveB2AgZeCcw1ky
 l7KBdMdGBqSWYVmfKcJd3f30vKPyhMSp4eE9/LFp24vmyIIw+dSp6vup0yrM6jk9
 taUU6PCHzaxmI1YGz1BzNVa8cfYKB6aiWeQ2OGa4Z7ba4TuksMLkbfVvu21jdi+z
 PsL/KlqPSPwFL/3XAZagIb3BXUhoQyfwIU8GnAuw2wTU5RJzWnbwF3wXpNaBIJxI
 8y5OWsFqIg==
 =5kb6
 -----END PGP SIGNATURE-----

Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
     - Invalid namespace identification error handling (Marizio Ewan,
       Keith)
     - Fabrics keep-alive tuning (Mark)

 - Fix for a bad error check regression in bcache (Markus)

 - Fix for a performance regression with O_DIRECT (Ming)

 - Fix for a flush related deadlock (Ming)

 - Make the read-only warn on per-partition (Yu)

* tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux:
  nvme-core: check for too small lba shift
  blk-mq: don't count completed flush data request as inflight in case of quiesce
  block: Document the role of the two attribute groups
  block: warn once for each partition in bio_check_ro()
  block: move .bd_inode into 1st cacheline of block_device
  nvme: check for valid nvme_identify_ns() before using it
  nvme-core: fix a memory leak in nvme_ns_info_from_identify()
  nvme: fine-tune sending of first keep-alive
  bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
2023-12-02 06:39:30 +09:00
Ming Lei
0e4237ae8d blk-mq: don't count completed flush data request as inflight in case of quiesce
Request queue quiesce may interrupt flush sequence, and the original request
may have been marked as COMPLETE, but can't get finished because of
queue quiesce.

This way is fine from driver viewpoint, because flush sequence is block
layer concept, and it isn't related with driver.

However, driver(such as dm-rq) can call blk_mq_queue_inflight() to count &
drain inflight requests, then the wait & drain never gets done because
the completed & not-finished flush request is counted as inflight.

Fix this issue by not counting completed flush data request as inflight in
case of quiesce.

Cc: Mike Snitzer <snitzer@kernel.org>
Cc: David Jeffery <djeffery@redhat.com>
Cc: John Pittman <jpittman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-01 07:34:47 -07:00
Bart Van Assche
3649ff0a0b block: Document the role of the two attribute groups
It is nontrivial to derive the role of the two attribute groups in source
file block/blk-sysfs.c. Hence add a comment that explains their roles. See
also commit 6d85ebf95c ("blk-sysfs: add a new attr_group for blk_mq").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20231128194019.72762-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-29 10:18:38 -07:00
Yu Kuai
67d995e069 block: warn once for each partition in bio_check_ro()
Commit 1b0a151c10 ("blk-core: use pr_warn_ratelimited() in
bio_check_ro()") fix message storm by limit the rate, however, there
will still be lots of message in the long term. Fix it better by warn
once for each partition.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231128123027.971610-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28 12:11:08 -07:00
Linus Torvalds
fa2b906f51 vfs-6.7-rc3.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZWBq0gAKCRCRxhvAZXjc
 ot4EAP48O5ExMtQ3/AIkNDo+/9/Iz4g7bE1HYmdyiMPO3Ou/uwEAySwBXRJrFAsS
 9omvkEdqrfyguW0xgoYwcxBdATVHnAE=
 =ScR3
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:

 - Avoid calling back into LSMs from vfs_getattr_nosec() calls.

   IMA used to query inode properties accessing raw inode fields without
   dedicated helpers. That was finally fixed a few releases ago by
   forcing IMA to use vfs_getattr_nosec() helpers.

   The goal of the vfs_getattr_nosec() helper is to query for attributes
   without calling into the LSM layer which would be quite problematic
   because incredibly IMA is called from __fput()...

     __fput()
       -> ima_file_free()

   What it does is to call back into the filesystem to update the file's
   IMA xattr. Querying the inode without using vfs_getattr_nosec() meant
   that IMA didn't handle stacking filesystems such as overlayfs
   correctly. So the switch to vfs_getattr_nosec() is quite correct. But
   the switch to vfs_getattr_nosec() revealed another bug when used on
   stacking filesystems:

     __fput()
       -> ima_file_free()
          -> vfs_getattr_nosec()
             -> i_op->getattr::ovl_getattr()
                -> vfs_getattr()
                   -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()
                      -> security_inode_getattr() # calls back into LSMs

   Now, if that __fput() happens from task_work_run() of an exiting task
   current->fs and various other pointer could already be NULL. So
   anything in the LSM layer relying on that not being NULL would be
   quite surprised.

   Fix that by passing the information that this is a security request
   through to the stacking filesystem by adding a new internal
   ATT_GETATTR_NOSEC flag. Now the callchain becomes:

     __fput()
       -> ima_file_free()
          -> vfs_getattr_nosec()
             -> i_op->getattr::ovl_getattr()
                -> if (AT_GETATTR_NOSEC)
                          vfs_getattr_nosec()
                   else
                          vfs_getattr()
                   -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()

 - Fix a bug introduced with the iov_iter rework from last cycle.

   This broke /proc/kcore by copying too much and without the correct
   offset.

 - Add a missing NULL check when allocating the root inode in
   autofs_fill_super().

 - Fix stable writes for multi-device filesystems (xfs, btrfs etc) and
   the block device pseudo filesystem.

   Stable writes used to be a superblock flag only, making it a per
   filesystem property. Add an additional AS_STABLE_WRITES mapping flag
   to allow for fine-grained control.

 - Ensure that offset_iterate_dir() returns 0 after reaching the end of
   a directory so it adheres to getdents() convention.

* tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  libfs: getdents() should return 0 after reaching EOD
  xfs: respect the stable writes flag on the RT device
  xfs: clean up FS_XFLAG_REALTIME handling in xfs_ioctl_setattr_xflags
  block: update the stable_writes flag in bdev_add
  filemap: add a per-mapping stable writes flag
  autofs: add: new_inode check in autofs_fill_super()
  iov_iter: fix copy_page_to_iter_nofault()
  fs: Pass AT_GETATTR_NOSEC flag to getattr interface function
2023-11-24 09:45:40 -08:00
Damien Le Moal
c96b817552 block: Remove blk_set_runtime_active()
The function blk_set_runtime_active() is called only from
blk_post_runtime_resume(), so there is no need for that function to be
exported. Open-code this function directly in blk_post_runtime_resume()
and remove it.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20231120070611.33951-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-20 10:22:40 -07:00
Christoph Hellwig
1898efcdbe block: update the stable_writes flag in bdev_add
Propagate the per-queue stable_write flags into each bdev inode in bdev_add.
This makes sure devices that require stable writes have it set for I/O
on the block device node as well.

Note that this doesn't cover the case of a flag changing on a live device
yet.  We should handle that as well, but I plan to cover it as part of a
more general rework of how changing runtime paramters on block devices
works.

Fixes: 1cb039f3dc ("bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag")
Reported-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231025141020.192413-3-hch@lst.de
Tested-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-20 15:05:18 +01:00
Jan Kara
ed5cc702d3
block: Add config option to not allow writing to mounted devices
Writing to mounted devices is dangerous and can lead to filesystem
corruption as well as crashes. Furthermore syzbot comes with more and
more involved examples how to corrupt block device under a mounted
filesystem leading to kernel crashes and reports we can do nothing
about. Add tracking of writers to each block device and a kernel cmdline
argument which controls whether other writeable opens to block devices
open with BLK_OPEN_RESTRICT_WRITES flag are allowed. We will make
filesystems use this flag for used devices.

Note that this effectively only prevents modification of the particular
block device's page cache by other writers. The actual device content
can still be modified by other means - e.g. by issuing direct scsi
commands, by doing writes through devices lower in the storage stack
(e.g. in case loop devices, DM, or MD are involved) etc. But blocking
direct modifications of the block device page cache is enough to give
filesystems a chance to perform data validation when loading data from
the underlying storage and thus prevent kernel crashes.

Syzbot can use this cmdline argument option to avoid uninteresting
crashes. Also users whose userspace setup does not need writing to
mounted block devices can set this option for hardening.

Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231101174325.10596-3-jack@suse.cz
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:59:25 +01:00
Jan Kara
cd34758c52
block: Remove blkdev_get_by_*() functions
blkdev_get_by_*() and blkdev_put() functions are now unused. Remove
them.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231101174325.10596-2-jack@suse.cz
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:59:25 +01:00
Christian Brauner
49ef8832fb
bdev: implement freeze and thaw holder operations
The old method of implementing block device freeze and thaw operations
required us to rely on get_active_super() to walk the list of all
superblocks on the system to find any superblock that might use the
block device. This is wasteful and not very pleasant overall.

Now that we can finally go straight from block device to owning
superblock things become way simpler.

Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-5-599c19f4faac@kernel.org
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:59:23 +01:00
Christian Brauner
fbcb8f39e9
bdev: surface the error from sync_blockdev()
When freeze_super() is called, sync_filesystem() will be called which
calls sync_blockdev() and already surfaces any errors. Do the same for
block devices that aren't owned by a superblock and also for filesystems
that don't call sync_blockdev() internally but implicitly rely on
bdev_freeze() to do it.

Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-3-599c19f4faac@kernel.org
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:59:23 +01:00
Christian Brauner
982c3b3058
bdev: rename freeze and thaw helpers
We have bdev_mark_dead() etc and we're going to move block device
freezing to holder ops in the next patch. Make the naming consistent:

* freeze_bdev() -> bdev_freeze()
* thaw_bdev()   -> bdev_thaw()

Also document the return code.

Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-2-599c19f4faac@kernel.org
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18 14:59:23 +01:00
Ming Lei
e63a573035 blk-cgroup: bypass blkcg_deactivate_policy after destroying
blkcg_deactivate_policy() can be called after blkg_destroy_all()
returns, and it isn't necessary since blkg_destroy_all has covered
policy deactivation.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231117023527.3188627-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-17 10:48:58 -07:00
Ming Lei
35a99d6557 blk-cgroup: avoid to warn !rcu_read_lock_held() in blkg_lookup()
So far, all callers either holds spin lock or rcu read explicitly, and
most of the caller has added WARN_ON_ONCE(!rcu_read_lock_held()) or
lockdep_assert_held(&disk->queue->queue_lock).

Remove WARN_ON_ONCE(!rcu_read_lock_held()) from blkg_lookup() for
killing the false positive warning from blkg_conf_prep().

Reported-by: Changhui Zhong <czhong@redhat.com>
Fixes: 83462a6c97 ("blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231117023527.3188627-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-17 10:48:58 -07:00
Ming Lei
27b13e209d blk-throttle: fix lockdep warning of "cgroup_mutex or RCU read lock required!"
Inside blkg_for_each_descendant_pre(), both
css_for_each_descendant_pre() and blkg_lookup() requires RCU read lock,
and either cgroup_assert_mutex_or_rcu_locked() or rcu_read_lock_held()
is called.

Fix the warning by adding rcu read lock.

Reported-by: Changhui Zhong <czhong@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20231117023527.3188627-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-17 10:48:58 -07:00
Christoph Hellwig
b0077e269f blk-mq: make sure active queue usage is held for bio_integrity_prep()
blk_integrity_unregister() can come if queue usage counter isn't held
for one bio with integrity prepared, so this request may be completed with
calling profile->complete_fn, then kernel panic.

Another constraint is that bio_integrity_prep() needs to be called
before bio merge.

Fix the issue by:

- call bio_integrity_prep() with one queue usage counter grabbed reliably

- call bio_integrity_prep() before bio merge

Fixes: 900e080752 ("block: move queue enter logic into blk_mq_submit_bio()")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/r/20231113035231.2708053-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-13 08:52:52 -07:00
Yu Kuai
1b0a151c10 blk-core: use pr_warn_ratelimited() in bio_check_ro()
If one of the underlying disks of raid or dm is set to read-only, then
each io will generate new log, which will cause message storm. This
environment is indeed problematic, however we can't make sure our
naive custormer won't do this, hence use pr_warn_ratelimited() to
prevent message storm in this case.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Fixes: 57e95e4670 ("block: fix and cleanup bio_check_ro")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20231107111247.2157820-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-07 08:15:23 -07:00
Linus Torvalds
8f6f76a6a2 As usual, lots of singleton and doubleton patches all over the tree and
there's little I can say which isn't in the individual changelogs.
 
 The lengthier patch series are
 
 - "kdump: use generic functions to simplify crashkernel reservation in
   arch", from Baoquan He.  This is mainly cleanups and consolidation of
   the "crashkernel=" kernel parameter handling.
 
 - After much discussion, David Laight's "minmax: Relax type checks in
   min() and max()" is here.  Hopefully reduces some typecasting and the
   use of min_t() and max_t().
 
 - A group of patches from Oleg Nesterov which clean up and slightly fix
   our handling of reads from /proc/PID/task/...  and which remove
   task_struct.therad_group.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
 jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
 63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
 =On4u
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "As usual, lots of singleton and doubleton patches all over the tree
  and there's little I can say which isn't in the individual changelogs.

  The lengthier patch series are

   - 'kdump: use generic functions to simplify crashkernel reservation
     in arch', from Baoquan He. This is mainly cleanups and
     consolidation of the 'crashkernel=' kernel parameter handling

   - After much discussion, David Laight's 'minmax: Relax type checks in
     min() and max()' is here. Hopefully reduces some typecasting and
     the use of min_t() and max_t()

   - A group of patches from Oleg Nesterov which clean up and slightly
     fix our handling of reads from /proc/PID/task/... and which remove
     task_struct.thread_group"

* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
  scripts/gdb/vmalloc: disable on no-MMU
  scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
  .mailmap: add address mapping for Tomeu Vizoso
  mailmap: update email address for Claudiu Beznea
  tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
  .mailmap: map Benjamin Poirier's address
  scripts/gdb: add lx_current support for riscv
  ocfs2: fix a spelling typo in comment
  proc: test ProtectionKey in proc-empty-vm test
  proc: fix proc-empty-vm test with vsyscall
  fs/proc/base.c: remove unneeded semicolon
  do_io_accounting: use sig->stats_lock
  do_io_accounting: use __for_each_thread()
  ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
  ocfs2: fix a typo in a comment
  scripts/show_delta: add __main__ judgement before main code
  treewide: mark stuff as __ro_after_init
  fs: ocfs2: check status values
  proc: test /proc/${pid}/statm
  compiler.h: move __is_constexpr() to compiler.h
  ...
2023-11-02 20:53:31 -10:00
Linus Torvalds
90d624af2e for-6.7/block-2023-10-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmU/vjMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqVcEADaNf6X7LVKKrdQ4sA38dBZYGM3kNz0SCYV
 vkjQAs0Fyylbu6EhYOLO/R+UCtpytLlnbr4NmFDbhaEG4OJcwoDLDxpMQ7Gda58v
 4RBXAiIlhZX3g99/ebvtNtVEvQa9gF4h8k2n/gKsG+PoS+cbkKAI0Na2duI1d/pL
 B5nQ31VAHhsyjUv1nIPLrQS6lsL7ZTFvH8L6FLcEVM03poy8PE2H6kN7WoyXwtfo
 LN3KK0Nu7B0Wx2nDx0ffisxcDhbChGs7G2c9ndPTvxg6/4HW+2XSeNUwTxXYpyi2
 ZCD+AHCzMB/w6GNNWFw4xfau5RrZ4c4HdBnmyR6+fPb1u6nGzjgquzFyLyLu5MkA
 n/NvOHP1Cbd3QIXG1TnBi2kDPkQ5FOIAjFSe9IZAGT4dUkZ63wBoDil1jCgMLuCR
 C+AFPLhiIg3cFvu9+fdZ6BkCuZYESd3YboBtRKeMionEexrPTKt4QWqIoVJgd/Y7
 nwvR8jkIBpVgQZT8ocYqhSycLCYV2lGqEBSq4rlRiEb/W1G9Awmg8UTGuUYFSC1G
 vGPCwhGi+SBsbo84aPCfSdUkKDlruNWP0GwIFxo0hsiTOoHP+7UWeenJ2Jw5lNPt
 p0Y72TEDDaSMlE4cJx6IWdWM/B+OWzCyRyl3uVcy7bToEsVhIbBSSth7+sh2n7Cy
 WgH1lrtMzg==
 =sace
 -----END PGP SIGNATURE-----

Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - Improvements to the queue_rqs() support, and adding null_blk support
   for that as well (Chengming)

 - Series improving badblocks support (Coly)

 - Key store support for sed-opal (Greg)

 - IBM partition string handling improvements (Jan)

 - Make number of ublk devices supported configurable (Mike)

 - Cancelation improvements for ublk (Ming)

 - MD pull requests via Song:
     - Handle timeout in md-cluster, by Denis Plotnikov
     - Cleanup pers->prepare_suspend, by Yu Kuai
     - Rewrite mddev_suspend(), by Yu Kuai
     - Simplify md_seq_ops, by Yu Kuai
     - Reduce unnecessary locking array_state_store(), by Mariusz
       Tkaczyk
     - Make rdev add/remove independent from daemon thread, by Yu Kuai
     - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai

 - NVMe pull request via Keith:
     - nvme-auth updates (Mark)
     - nvme-tcp tls (Hannes)
     - nvme-fc annotaions (Kees)

 - Misc cleanups and improvements (Jiapeng, Joel)

* tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits)
  block: ublk_drv: Remove unused function
  md: cleanup pers->prepare_suspend()
  nvme-auth: allow mixing of secret and hash lengths
  nvme-auth: use transformed key size to create resp
  nvme-auth: alloc nvme_dhchap_key as single buffer
  nvmet-tcp: use 'spin_lock_bh' for state_lock()
  powerpc/pseries: PLPKS SED Opal keystore support
  block: sed-opal: keystore access for SED Opal keys
  block:sed-opal: SED Opal keystore
  ublk: simplify aborting request
  ublk: replace monitor with cancelable uring_cmd
  ublk: quiesce request queue when aborting queue
  ublk: rename mm_lock as lock
  ublk: move ublk_cancel_dev() out of ub->mutex
  ublk: make sure io cmd handled in submitter task context
  ublk: don't get ublk device reference in ublk_abort_queue()
  ublk: Make ublks_max configurable
  ublk: Limit dev_id/ub_number values
  md-cluster: check for timeout while a new disk adding
  nvme: rework NVME_AUTH Kconfig selection
  ...
2023-11-01 12:30:07 -10:00
Linus Torvalds
d4e175f2c4 vfs-6.7.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZT0C2gAKCRCRxhvAZXjc
 otV8AQCK5F9ONoQ7ISpdrKyUJiswySGXx0CYPfXbSg5gHH87zgEAua3vwVKeGXXF
 5iVsdiNzIIQDwGDx7FyxufL4ggcN6gQ=
 =E1kV
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull vfs superblock updates from Christian Brauner:
 "This contains the work to make block device opening functions return a
  struct bdev_handle instead of just a struct block_device. The same
  struct bdev_handle is then also passed to block device closing
  functions.

  This allows us to propagate context from opening to closing a block
  device without having to modify all users everytime.

  Sidenote, in the future we might even want to try and have block
  device opening functions return a struct file directly but that's a
  series on top of this.

  These are further preparatory changes to be able to count writable
  opens and blocking writes to mounted block devices. That's a separate
  piece of work for next cycle and for that we absolutely need the
  changes to btrfs that have been quietly dropped somehow.

  Originally the series contained a patch that removed the old
  blkdev_*() helpers. But since this would've caused needles churn in
  -next for bcachefs we ended up delaying it.

  The second piece of work addresses one of the major annoyances about
  the work last cycle, namely that we required dropping s_umount
  whenever we used the superblock and fs_holder_ops for a block device.

  The reason for that requirement had been that in some codepaths
  s_umount could've been taken under disk->open_mutex (that's always
  been the case, at least theoretically). For example, on surprise block
  device removal or media change. And opening and closing block devices
  required grabbing disk->open_mutex as well.

  So we did the work and went through the block layer and fixed all
  those places so that s_umount is never taken under disk->open_mutex.
  This means no more brittle games where we yield and reacquire s_umount
  during block device opening and closing and no more requirements where
  block devices need to be closed. Filesystems don't need to care about
  this.

  There's a bunch of other follow-up work such as moving block device
  freezing and thawing to holder operations which makes it work for all
  block devices and not just the main block device just as we did for
  surprise removal. But that is for next cycle.

  Tested with fstests for all major fses, blktests, LTP"

* tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
  porting: update locking requirements
  fs: assert that open_mutex isn't held over holder ops
  block: assert that we're not holding open_mutex over blk_report_disk_dead
  block: move bdev_mark_dead out of disk_check_media_change
  block: WARN_ON_ONCE() when we remove active partitions
  block: simplify bdev_del_partition()
  fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
  jfs: fix log->bdev_handle null ptr deref in lbmStartIO
  bcache: Fixup error handling in register_cache()
  xfs: Convert to bdev_open_by_path()
  reiserfs: Convert to bdev_open_by_dev/path()
  ocfs2: Convert to use bdev_open_by_dev()
  nfs/blocklayout: Convert to use bdev_open_by_dev/path()
  jfs: Convert to bdev_open_by_dev()
  f2fs: Convert to bdev_open_by_dev/path()
  ext4: Convert to bdev_open_by_dev()
  erofs: Convert to use bdev_open_by_path()
  btrfs: Convert to bdev_open_by_path()
  fs: Convert to bdev_open_by_dev()
  mm/swap: Convert to use bdev_open_by_dev()
  ...
2023-10-30 08:59:05 -10:00
Christian Brauner
f61033390b
block: assert that we're not holding open_mutex over blk_report_disk_dead
blk_report_disk_dead() has the following major callers:

(1) del_gendisk()
(2) blk_mark_disk_dead()

Since del_gendisk() acquires disk->open_mutex it's clear that all
callers are assumed to be called without disk->open_mutex held.
In turn, blk_report_disk_dead() is called without disk->open_mutex held
in del_gendisk().

All callers of blk_mark_disk_dead() call it without disk->open_mutex as
well.

Ensure that it is clear that blk_report_disk_dead() is called without
disk->open_mutex on purpose by asserting it and a comment in the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231017184823.1383356-5-hch@lst.de
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:23 +02:00
Christoph Hellwig
6e57236ed6
block: move bdev_mark_dead out of disk_check_media_change
disk_check_media_change is mostly called from ->open where it makes
little sense to mark the file system on the device as dead, as we
are just opening it.  So instead of calling bdev_mark_dead from
disk_check_media_change move it into the few callers that are not
in an open instance.  This avoid calling into bdev_mark_dead and
thus taking s_umount with open_mutex held.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231017184823.1383356-4-hch@lst.de
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:23 +02:00
Christian Brauner
51b4cb4f3e
block: WARN_ON_ONCE() when we remove active partitions
The logic for disk->open_partitions is:

blkdev_get_by_*()
-> bdev_is_partition()
   -> blkdev_get_part()
      -> blkdev_get_whole() // bdev_whole->bd_openers++
      -> if (part->bd_openers == 0)
                 disk->open_partitions++
         part->bd_openers

In other words, when we first claim/open a partition we increment
disk->open_partitions and only when all part->bd_openers are closed will
disk->open_partitions be zero. That should mean that
disk->open_partitions is always > 0 as long as there's anyone that
has an open partition.

So the check for disk->open_partitions should mean that we can never
remove an active partition that has a holder and holder ops set. Assert
that in the code. The main disk isn't removed so that check doesn't work
for disk->part0 which is what we want. After all we only care about
partition not about the main disk.

Link: https://lore.kernel.org/r/20231017184823.1383356-3-hch@lst.de
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:22 +02:00
Christian Brauner
c30b9787a4
block: simplify bdev_del_partition()
BLKPG_DEL_PARTITION refuses to delete partitions that still have
openers, i.e., that has an elevated @bdev->bd_openers count. If a device
is claimed by setting @bdev->bd_holder and @bdev->bd_holder_ops
@bdev->bd_openers and @bdev->bd_holders are incremented.
@bdev->bd_openers is effectively guaranteed to be >= @bdev->bd_holders.
So as long as @bdev->bd_openers isn't zero we know that this partition
is still in active use and that there might still be @bdev->bd_holder
and @bdev->bd_holder_ops set.

The only current example is @fs_holder_ops for filesystems. But that
means bdev_mark_dead() which calls into
bdev->bd_holder_ops->mark_dead::fs_bdev_mark_dead() is a nop. As long as
there's an elevated @bdev->bd_openers count we can't delete the
partition and if there isn't an elevated @bdev->bd_openers count then
there's no @bdev->bd_holder or @bdev->bd_holder_ops.

So simply open-code what we need to do. This gets rid of one more
instance where we acquire s_umount under @disk->open_mutex.

Link: https://lore.kernel.org/r/20231016-fototermin-umriss-59f1ea6c1fe6@brauner
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231017184823.1383356-2-hch@lst.de
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:22 +02:00
Jan Kara
fd1464105c
fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
The implementation of bdev holder operations such as fs_bdev_mark_dead()
and fs_bdev_sync() grab sb->s_umount semaphore under
bdev->bd_holder_lock. This is problematic because it leads to
disk->open_mutex -> sb->s_umount lock ordering which is counterintuitive
(usually we grab higher level (e.g. filesystem) locks first and lower
level (e.g. block layer) locks later) and indeed makes lockdep complain
about possible locking cycles whenever we open a block device while
holding sb->s_umount semaphore. Implement a function
bdev_super_lock_shared() which safely transitions from holding
bdev->bd_holder_lock to holding sb->s_umount on alive superblock without
introducing the problematic lock dependency. We use this function
fs_bdev_sync() and fs_bdev_mark_dead().

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231018152924.3858-1-jack@suse.cz
Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:22 +02:00
Jan Kara
acb083b555
block: Use bdev_open_by_dev() in disk_scan_partitions() and blkdev_bszset()
Convert disk_scan_partitions() and blkdev_bszset() to use
bdev_open_by_dev().

Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-3-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:16 +02:00
Jan Kara
841dd789b8
block: Use bdev_open_by_dev() in blkdev_open()
Convert blkdev_open() to use bdev_open_by_dev(). To be able to propagate
handle from blkdev_open() to blkdev_release() we need to stop using
existence of file->private_data to determine exclusive block device
opens. Use bdev_handle->mode for this purpose since file->f_flags
isn't usable for this (O_EXCL is cleared from the flags during open).

Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-2-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:16 +02:00
Jan Kara
e719b4d156
block: Provide bdev_open_* functions
Create struct bdev_handle that contains all parameters that need to be
passed to blkdev_put() and provide bdev_open_* functions that return
this structure instead of plain bdev pointer. This will eventually allow
us to pass one more argument to blkdev_put() (renamed to bdev_release())
without too much hassle.

Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-1-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28 13:29:16 +02:00
Khazhismel Kumykov
2dd710d476 blk-throttle: check for overflow in calculate_bytes_allowed
Inexact, we may reject some not-overflowing values incorrectly, but
they'll be on the order of exabytes allowed anyways.

This fixes divide error crash on x86 if bps_limit is not configured or
is set too high in the rare case that jiffy_elapsed is greater than HZ.

Fixes: e8368b57c0 ("blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice()")
Fixes: 8d6bbaada2 ("blk-throttle: prevent overflow while calculating wait time")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231020223617.2739774-1-khazhy@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-20 18:38:17 -06:00
Alexey Dobriyan
68279f9c9f treewide: mark stuff as __ro_after_init
__read_mostly predates __ro_after_init. Many variables which are marked
__read_mostly should have been __ro_after_init from day 1.

Also, mark some stuff as "const" and "__init" while I'm at it.

[akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning]
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18 14:43:23 -07:00
Greg Joyce
ec8cf230ce powerpc/pseries: PLPKS SED Opal keystore support
Define operations for SED Opal to read/write keys
from POWER LPAR Platform KeyStore(PLPKS). This allows
non-volatile storage of SED Opal keys.

Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com>
Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev>
Link: https://lore.kernel.org/r/20231004201957.1451669-4-gjoyce@linux.vnet.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 09:10:06 -06:00
Greg Joyce
5dd339722f block: sed-opal: keystore access for SED Opal keys
Allow for permanent SED authentication keys by
reading/writing to the SED Opal non-volatile keystore.

Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com>
Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev>
Link: https://lore.kernel.org/r/20231004201957.1451669-3-gjoyce@linux.vnet.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-17 09:10:06 -06:00
Milan Broz
4eaf0932c6 block: Fix regression in sed-opal for a saved key.
The commit 3bfeb61256
introduced the use of keyring for sed-opal.

Unfortunately, there is also a possibility to save
the Opal key used in opal_lock_unlock().

This patch switches the order of operation, so the cached
key is used instead of failure for opal_get_key.

The problem was found by the cryptsetup Opal test recently
added to the cryptsetup tree.

Fixes: 3bfeb61256 ("block: sed-opal: keyring support for SED keys")
Tested-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Milan Broz <gmazyland@gmail.com>
Link: https://lore.kernel.org/r/20231003100209.380037-1-gmazyland@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-13 08:16:08 -06:00
Sarthak Kukreti
1364a3c391 block: Don't invalidate pagecache for invalid falloc modes
Only call truncate_bdev_range() if the fallocate mode is supported. This
fixes a bug where data in the pagecache could be invalidated if the
fallocate() was called on the block device with an invalid mode.

Fixes: 25f4c41415 ("block: implement (some of) fallocate for block devices")
Cc: stable@vger.kernel.org
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Fixes: line?  I've never seen those wrapped.
Link: https://lore.kernel.org/r/20231011201230.750105-1-sarthakkukreti@chromium.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-11 15:53:17 -06:00
Jan Höppner
a31281acc4 partitions/ibm: Introduce defines for magic string length values
The length values for volume label type and volume label id are
hard-coded in several places. Provide defines for those values and
replace all occurrences accordingly.

Note that the length is defined and used, and not the size since the
volume label type string and volume label id string are not
nul-terminated.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20230915131001.697070-4-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-04 08:04:08 -06:00
Jan Höppner
f5f43aae6f partitions/ibm: Replace strncpy() and improve readability
strncpy() is deprecated and needs to be replaced. The volume label
information strings are not nul-terminated and strncpy() can simply be
replaced with memcpy().

To enhance the readability of find_label() alongside this change, the
following improvements are made:
- Introduce the array dasd_vollabels[] containing all information
  necessary for the label detection.
- Provide a helper function to obtain an index value corresponding to a
  volume label type. This allows the use of a switch statement to reduce
  indentation levels.
- The 'temp' variable is used to check against valid volume label types.
  In the good case, this variable already contains the volume label type
  making it unnecessary to copy the information again from e.g.
  label->vol.vollbl. Remove the 'temp' variable and the second copy as
  all information are already provided.
- Remove the 'found' variable and replace it with early returns

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20230915131001.697070-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-04 08:04:08 -06:00
Jan Höppner
d323c1a947 partitions/ibm: Remove unnecessary memset
The data holding the volume label information is zeroed in case no valid
volume label was found. Since the label information isn't used in that
case, zeroing the data doesn't provide any value whatsoever.

Remove the unnecessary memset() call accordingly.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://lore.kernel.org/r/20230915131001.697070-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-04 08:04:08 -06:00
Coly Li
aa511ff821 badblocks: switch to the improved badblock handling code
This patch removes old code of badblocks_set(), badblocks_clear() and
badblocks_check(), and make them as wrappers to call _badblocks_set(),
_badblocks_clear() and _badblocks_check().

By this change now the badblock handing switch to the improved algorithm
in  _badblocks_set(), _badblocks_clear() and _badblocks_check().

This patch only contains the changes of old code deletion, new added
code for the improved algorithms are in previous patches.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Geliang Tang <geliang.tang@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Acked-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/20230811170513.2300-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-26 00:44:33 -06:00
Coly Li
3ea3354cb9 badblocks: improve badblocks_check() for multiple ranges handling
This patch rewrites badblocks_check() with similar coding style as
_badblocks_set() and _badblocks_clear(). The only difference is bad
blocks checking may handle multiple ranges in bad tables now.

If a checking range covers multiple bad blocks range in bad block table,
like the following condition (C is the checking range, E1, E2, E3 are
three bad block ranges in bad block table),
  +------------------------------------+
  |                C                   |
  +------------------------------------+
    +----+      +----+      +----+
    | E1 |      | E2 |      | E3 |
    +----+      +----+      +----+
The improved badblocks_check() algorithm will divide checking range C
into multiple parts, and handle them in 7 runs of a while-loop,
  +--+ +----+ +----+ +----+ +----+ +----+ +----+
  |C1| | C2 | | C3 | | C4 | | C5 | | C6 | | C7 |
  +--+ +----+ +----+ +----+ +----+ +----+ +----+
       +----+        +----+        +----+
       | E1 |        | E2 |        | E3 |
       +----+        +----+        +----+
And the start LBA and length of range E1 will be set as first_bad and
bad_sectors for the caller.

The return value rule is consistent for multiple ranges. For example if
there are following bad block ranges in bad block table,
   Index No.     Start        Len         Ack
       0          400          20          1
       1          500          50          1
       2          650          20          0
the return value, first_bad, bad_sectors by calling badblocks_set() with
different checking range can be the following values,
    Checking Start, Len     Return Value   first_bad    bad_sectors
               100, 100          0           N/A           N/A
               100, 310          1           400           10
               100, 440          1           400           10
               100, 540          1           400           10
               100, 600         -1           400           10
               100, 800         -1           400           10

In order to make code review easier, this patch names the improved bad
block range checking routine as _badblocks_check() and does not change
existing badblock_check() code yet. Later patch will delete old code of
badblocks_check() and make it as a wrapper to call _badblocks_check().
Then the new added code won't mess up with the old deleted code, it will
be more clear and easier for code review.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Geliang Tang <geliang.tang@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Acked-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/20230811170513.2300-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-26 00:44:33 -06:00
Coly Li
db448eb686 badblocks: improve badblocks_clear() for multiple ranges handling
With the fundamental ideas and helper routines from badblocks_set()
improvement, clearing bad block for multiple ranges is much simpler.

With a similar idea from badblocks_set() improvement, this patch
simplifies bad block range clearing into 5 situations. No matter how
complicated the clearing condition is, we just look at the head part
of clearing range with relative already set bad block range from the
bad block table. The rested part will be handled in next run of the
while-loop.

Based on existing helpers added from badblocks_set(), this patch adds
two more helpers,
- front_clear()
  Clear the bad block range from bad block table which is front
  overlapped with the clearing range.
- front_splitting_clear()
  Handle the condition that the clearing range hits middle of an
  already set bad block range from bad block table.

Similar as badblocks_set(), the first part of clearing range is handled
with relative bad block range which is find by prev_badblocks(). In most
cases a valid hint is provided to prev_badblocks() to avoid unnecessary
bad block table iteration.

This patch also explains the detail algorithm code comments at beginning
of badblocks.c, including which five simplified situations are
categrized and how all the bad block range clearing conditions are
handled by these five situations.

Again, in order to make the code review easier and avoid the code
changes mixed together, this patch does not modify badblock_clear() and
implement another routine called _badblock_clear() for the improvement.
Later patch will delete current code of badblock_clear() and make it as
a wrapper to _badblock_clear(), so the code change can be much clear for
review.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Geliang Tang <geliang.tang@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Acked-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/20230811170513.2300-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-26 00:44:33 -06:00
Coly Li
1726c77467 badblocks: improve badblocks_set() for multiple ranges handling
Recently I received a bug report that current badblocks code does not
properly handle multiple ranges. For example,
        badblocks_set(bb, 32, 1, true);
        badblocks_set(bb, 34, 1, true);
        badblocks_set(bb, 36, 1, true);
        badblocks_set(bb, 32, 12, true);
Then indeed badblocks_show() reports,
        32 3
        36 1
But the expected bad blocks table should be,
        32 12
Obviously only the first 2 ranges are merged and badblocks_set() returns
and ignores the rest setting range.

This behavior is improper, if the caller of badblocks_set() wants to set
a range of blocks into bad blocks table, all of the blocks in the range
should be handled even the previous part encountering failure.

The desired way to set bad blocks range by badblocks_set() is,
- Set as many as blocks in the setting range into bad blocks table.
- Merge the bad blocks ranges and occupy as less as slots in the bad
  blocks table.
- Fast.

Indeed the above proposal is complicated, especially with the following
restrictions,
- The setting bad blocks range can be acknowledged or not acknowledged.
- The bad blocks table size is limited.
- Memory allocation should be avoided.

The basic idea of the patch is to categorize all possible bad blocks
range setting combinations into much less simplified and more less
special conditions. Inside badblocks_set() there is an implicit loop
composed by jumping between labels 're_insert' and 'update_sectors'. No
matter how large the setting bad blocks range is, in every loop just a
minimized range from the head is handled by a pre-defined behavior from
one of the categorized conditions. The logic is simple and code flow is
manageable.

The different relative layout between the setting range and existing bad
block range are checked and handled (merge, combine, overwrite, insert)
by the helpers in previous patch. This patch is to make all the helpers
work together with the above idea.

This patch only has the algorithm improvement for badblocks_set(). There
are following patches contain improvement for badblocks_clear() and
badblocks_check(). But the algorithm in badblocks_set() is fundamental
and typical, other improvement in clear and check routines are based on
all the helpers and ideas in this patch.

In order to make the change to be more clear for code review, this patch
does not directly modify existing badblocks_set(), and just add a new
one named _badblocks_set(). Later patch will remove current existing
badblocks_set() code and make it as a wrapper of _badblocks_set(). So
the new added change won't be mixed with deleted code, the code review
can be easier.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Geliang Tang <geliang.tang@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Wols Lists <antlists@youngman.org.uk>
Cc: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Acked-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/20230811170513.2300-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-26 00:44:33 -06:00
Coly Li
c3c6a86e9e badblocks: add helper routines for badblock ranges handling
This patch adds several helper routines to improve badblock ranges
handling. These helper routines will be used later in the improved
version of badblocks_set()/badblocks_clear()/badblocks_check().

- Helpers prev_by_hint() and prev_badblocks() are used to find the bad
  range from bad table which the searching range starts at or after.

- The following helpers are to decide the relative layout between the
  manipulating range and existing bad block range from bad table.
  - can_merge_behind()
    Return 'true' if the manipulating range can backward merge with the
    bad block range.
  - can_merge_front()
    Return 'true' if the manipulating range can forward merge with the
    bad block range.
  - can_combine_front()
    Return 'true' if two adjacent bad block ranges before the
    manipulating range can be merged.
  - overlap_front()
    Return 'true' if the manipulating range exactly overlaps with the
    bad block range in front of its range.
  - overlap_behind()
    Return 'true' if the manipulating range exactly overlaps with the
    bad block range behind its range.
  - can_front_overwrite()
    Return 'true' if the manipulating range can forward overwrite the
    bad block range in front of its range.

- The following helpers are to add the manipulating range into the bad
  block table. Different routine is called with the specific relative
  layout between the manipulating range and other bad block range in the
  bad block table.
  - behind_merge()
    Merge the manipulating range with the bad block range behind its
    range, and return the number of merged length in unit of sector.
  - front_merge()
    Merge the manipulating range with the bad block range in front of
    its range, and return the number of merged length in unit of sector.
  - front_combine()
    Combine the two adjacent bad block ranges before the manipulating
    range into a larger one.
  - front_overwrite()
    Overwrite partial of whole bad block range which is in front of the
    manipulating range. The overwrite may split existing bad block range
    and generate more bad block ranges into the bad block table.
  - insert_at()
    Insert the manipulating range at a specific location in the bad
    block table.

All the above helpers are used in later patches to improve the bad block
ranges handling for badblocks_set()/badblocks_clear()/badblocks_check().

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Geliang Tang <geliang.tang@suse.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Acked-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/20230811170513.2300-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-26 00:44:33 -06:00
Randy Dunlap
a578a25339 block: fix kernel-doc for disk_force_media_change()
Drop one function parameter's kernel-doc comment since the parameter
was removed. This prevents a kernel-doc warning:

block/disk-events.c:300: warning: Excess function parameter 'events' description in 'disk_force_media_change'

Fixes: ab6860f62b ("block: simplify the disk_force_media_change interface")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: lore.kernel.org/r/202309060957.vfl0mUur-lkp@intel.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230926005232.23666-1-rdunlap@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-26 00:43:34 -06:00
Chengming Zhou
217b613a53 blk-mq: update driver tags request table when start request
Now we update driver tags request table in blk_mq_get_driver_tag(),
so the driver that support queue_rqs() have to update that inflight
table by itself.

Move it to blk_mq_start_request(), which is a better place where
we setup the deadline for request timeout check. And it's just
where the request becomes inflight.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Chengming Zhou
434097ee37 blk-mq: support batched queue_rqs() on shared tags queue
Since active requests have been accounted when allocate driver tags,
we can remove this limit now.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-4-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Chengming Zhou
48554df6bf blk-mq: remove RQF_MQ_INFLIGHT
Since the previous patch change to only account active requests when
we really allocate the driver tag, the RQF_MQ_INFLIGHT can be removed
and no double account problem.

1. none elevator: flush request will use the first pending request's
   driver tag, won't double account.

2. other elevator: flush request will be accounted when allocate driver
   tag when issue, and will be unaccounted when it put the driver tag.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Chengming Zhou
b8643d6826 blk-mq: account active requests when get driver tag
There is a limit that batched queue_rqs() can't work on shared tags
queue, since the account of active requests can't be done there.

Now we account the active requests only in blk_mq_get_driver_tag(),
which is not the time we get driver tag actually (with none elevator).

To support batched queue_rqs() on shared tags queue, we move the
account of active requests to where we get the driver tag:

1. none elevator: blk_mq_get_tags() and blk_mq_get_tag()
2. other elevator: __blk_mq_alloc_driver_tag()

This is clearer and match with the unaccount side, which just happen
when we put the driver tag.

The other good point is that we don't need RQF_MQ_INFLIGHT trick
anymore, which used to avoid double account of flush request.
Now we only account when actually get the driver tag, so all is good.
We will remove RQF_MQ_INFLIGHT in the next patch.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230913151616.3164338-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22 08:52:13 -06:00
Kemeng Shi
e599ed7866 block: correct stale comment in rq_qos_wait
The rq_qos_wait calls common wake-up function rq_qos_wake_function to get
token. Just replace stale wbt_wake_function with rq_qos_wake_function in
comment.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230914091508.36232-1-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-18 14:15:28 -06:00
Chengming Zhou
6be6d11241 blk-mq: fix tags UAF when shrinking q->nr_hw_queues
When nr_hw_queues shrink, we free the excess tags before realloc'ing
hw_ctxs for each queue. During that resize, we may need to access those
tags, like blk_mq_tag_idle(hctx) will access queue shared tags.

This can cause a slab use-after-free, as reported by KASAN. Fix it by
moving the releasing of excess tags to the end.

Fixes: e1dd7bc930 ("blk-mq: fix tags leak when shrink nr_hw_queues")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/all/CAHj4cs_CK63uoDpGBGZ6DN4OCTpzkR3UaVgK=LX8Owr8ej2ieQ@mail.gmail.com/
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230908005702.2183908-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-11 16:17:34 -06:00
Christoph Hellwig
5905afc2c7 block: fix pin count management when merging same-page segments
There is no need to unpin the added page when adding it to the bio fails
as that is done by the loop below.  Instead we want to unpin it when adding
a single page to the bio more than once as bio_release_pages will only
unpin it once.

Fixes: d1916c86cc ("block: move same page handling from __bio_add_pc_page to the callers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230905124731.328255-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-06 07:32:27 -06:00
Li Lingfeng
1a721de848 block: don't add or resize partition on the disk with GENHD_FL_NO_PART
Commit a33df75c63 ("block: use an xarray for disk->part_tbl") remove
disk_expand_part_tbl() in add_partition(), which means all kinds of
devices will support extended dynamic `dev_t`.
However, some devices with GENHD_FL_NO_PART are not expected to add or
resize partition.
Fix this by adding check of GENHD_FL_NO_PART before add or resize
partition.

Fixes: a33df75c63 ("block: use an xarray for disk->part_tbl")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230831075900.1725842-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-31 08:00:35 -06:00
Christoph Hellwig
0d997f1de8 block: remove the call to file_remove_privs in blkdev_write_iter
file_remove_privs instantly returns 0 when not called for regular files,
so don't bother.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230831121911.280155-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-31 08:00:23 -06:00
Yu Kuai
eead005664 blk-throttle: consider 'carryover_ios/bytes' in throtl_trim_slice()
Currently, 'carryover_ios/bytes' is not handled in throtl_trim_slice(),
for consequence, 'carryover_ios/bytes' will be used to throttle bio
multiple times, for example:

1) set iops limit to 100, and slice start is 0, slice end is 100ms;
2) current time is 0, and 10 ios are dispatched, those io won't be
   throttled and io_disp is 10;
3) still at current time 0, update iops limit to 1000, carryover_ios is
   updated to (0 - 10) = -10;
4) in this slice(0 - 100ms), io_allowed = 100 + (-10) = 90, which means
   only 90 ios can be dispatched without waiting;
5) assume that io is throttled in slice(0 - 100ms), and
   throtl_trim_slice() update silce to (100ms - 200ms). In this case,
   'carryover_ios/bytes' is not cleared and still only 90 ios can be
   dispatched between 100ms - 200ms.

Fix this problem by updating 'carryover_ios/bytes' in
throtl_trim_slice().

Fixes: a880ae93e5 ("blk-throttle: fix io hung due to configuration updates")
Reported-by: zhuxiaohui <zhuxiaohui.400@bytedance.com>
Link: https://lore.kernel.org/all/20230812072116.42321-1-zhuxiaohui.400@bytedance.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230816012708.1193747-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30 10:15:01 -06:00
Yu Kuai
e8368b57c0 blk-throttle: use calculate_io/bytes_allowed() for throtl_trim_slice()
There are no functional changes, just make the code cleaner.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230816012708.1193747-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30 10:15:01 -06:00
Yu Kuai
bb8d5587bd blk-throttle: fix wrong comparation while 'carryover_ios/bytes' is negative
carryover_ios/bytes[] can be negative in the case that ios are
dispatched in the slice in advance, and then configuration is updated.
For example:

1) set iops limit to 1000, and slice start is 0, slice end is 100ms;
2) current time is 0, and 100 ios are dispatched, those ios will not be
   throttled, hence io_disp is 100;
3) still at current time 0, update iops limit to 100, then carryover_ios
   is (0 - 100) = -100;
4) then, dispatch a new io at time 0, the expected result is that this
   io will wait for 1s. The calculation in tg_within_iops_limit:

   io_disp = 0;
   io_allowed = calculate_io_allowed + carryover_ios
	      = 10 + (-100) = -90;
   io won't be throttled if (io_disp + 1 < io_allowed) passed.

Before this patch, in step 4) (io_disp + 1 < io_allowed) is passed,
because -90 for unsigned value is very huge, and such io won't be
throttled.

Fix this problem by checking if 'io/bytes_allowed' is negative first.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230816012708.1193747-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30 10:15:01 -06:00
Yu Kuai
ef100397fa blk-throttle: print signed value 'carryover_bytes/ios' for user
'carryover_bytes/ios' can be negative, indicate that some bio is
dispatched in advance within slice while configuration is updated.
Print a huge value is not user-friendly.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230816012708.1193747-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30 10:15:01 -06:00
Linus Torvalds
3d3dfeb3ae for-6.6/block-2023-08-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs08EQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqa4EACu/zKE+omGXBV0Q7kEpVsChjp0ElGtSDIJ
 tJfTuvnWqQjrqRv4ksmZvGdx8SkqFuXri4/7oBXlsaqeUVbIQdWJUpLErBye6nxa
 lUb6nXOFWwyG94cMRYs71lN0loosjb7aiVw7oVLAIhntq3p3doFl/cyy3ndMZrUE
 pZbsrWSt4QiOKhcO0TtIjfAwsr31AN51qFiNNITEiZl3UjXfkGRCK81X0yM2N8zZ
 7Y0h1ldPBsZ/olNWeRyaW1uB64nKM0buR7/nDxCV/NI05nndJ34bIgo/JIj4xy0v
 SiBj2+y86+oMJZt17yYENwOQdtX3hbyESGuVm9dCrO0t9/byVQxkUk0OMm65BM/l
 l2d+gmMQZTbHziqfLlgq9i3i9+B4C2hsb7iBpuo7SW/FPbM45POgi3lpiZycaZyu
 krQo1qwL4KSGXzGN9CabEuKDcJcXqLxqMDOyEDA3R5Kz06V9tNuM+Di/mr4vuZHK
 sVHUfHuWBO9ionLlGPdc3fH/CuMqic8SHjumiAm2menBZV6cSzRDxpm6H4CyLt7y
 tWmw7BNU7dfHFGd+Jw0Ld49sAuEybszEXq6qYv5uYBVfJNqDvOvEeVoQp0RN2jJA
 AG30hymcZgxn9n7gkIgkPQDgIGUjnzUR8B2mE2UFU1CYVHXYXAXU55CCI5oeTkbs
 d0Y/zCZf1A==
 =p1bd
 -----END PGP SIGNATURE-----

Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:
 "Pretty quiet round for this release. This contains:

   - Add support for zoned storage to ublk (Andreas, Ming)

   - Series improving performance for drivers that mark themselves as
     needing a blocking context for issue (Bart)

   - Cleanup the flush logic (Chengming)

   - sed opal keyring support (Greg)

   - Fixes and improvements to the integrity support (Jinyoung)

   - Add some exports for bcachefs that we can hopefully delete again in
     the future (Kent)

   - deadline throttling fix (Zhiguo)

   - Series allowing building the kernel without buffer_head support
     (Christoph)

   - Sanitize the bio page adding flow (Christoph)

   - Write back cache fixes (Christoph)

   - MD updates via Song:
      - Fix perf regression for raid0 large sequential writes (Jan)
      - Fix split bio iostat for raid0 (David)
      - Various raid1 fixes (Heinz, Xueshi)
      - raid6test build fixes (WANG)
      - Deprecate bitmap file support (Christoph)
      - Fix deadlock with md sync thread (Yu)
      - Refactor md io accounting (Yu)
      - Various non-urgent fixes (Li, Yu, Jack)

   - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li,
     Ming, Nitesh, Ruan, Tejun, Thomas, Xu)"

* tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits)
  block: use strscpy() to instead of strncpy()
  block: sed-opal: keyring support for SED keys
  block: sed-opal: Implement IOC_OPAL_REVERT_LSP
  block: sed-opal: Implement IOC_OPAL_DISCOVERY
  blk-mq: prealloc tags when increase tagset nr_hw_queues
  blk-mq: delete redundant tagset map update when fallback
  blk-mq: fix tags leak when shrink nr_hw_queues
  ublk: zoned: support REQ_OP_ZONE_RESET_ALL
  md: raid0: account for split bio in iostat accounting
  md/raid0: Fix performance regression for large sequential writes
  md/raid0: Factor out helper for mapping and submitting a bio
  md raid1: allow writebehind to work on any leg device set WriteMostly
  md/raid1: hold the barrier until handle_read_error() finishes
  md/raid1: free the r1bio before waiting for blocked rdev
  md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
  blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
  drivers/rnbd: restore sysfs interface to rnbd-client
  md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()
  raid6: test: only check for Altivec if building on powerpc hosts
  raid6: test: make sure all intermediate and artifact files are .gitignored
  ...
2023-08-29 20:21:42 -07:00
Linus Torvalds
511fb5bafe v6.6-vfs.super
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXpbgAKCRCRxhvAZXjc
 oi8PAQCtXelGZHmTcmevsO8p4Qz7hFpkonZ/TnxKf+RdnlNgPgD+NWi+LoRBpaAj
 xk4z8SqJaTTP4WXrG5JZ6o7EQkUL8gE=
 =2e9I
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull superblock updates from Christian Brauner:
 "This contains the super rework that was ready for this cycle. The
  first part changes the order of how we open block devices and allocate
  superblocks, contains various cleanups, simplifications, and a new
  mechanism to wait on superblock state changes.

  This unblocks work to ultimately limit the number of writers to a
  block device. Jan has already scheduled follow-up work that will be
  ready for v6.7 and allows us to restrict the number of writers to a
  given block device. That series builds on this work right here.

  The second part contains filesystem freezing updates.

  Overview:

  The generic superblock changes are rougly organized as follows
  (ignoring additional minor cleanups):

   (1) Removal of the bd_super member from struct block_device.

       This was a very odd back pointer to struct super_block with
       unclear rules. For all relevant places we have other means to get
       the same information so just get rid of this.

   (2) Simplify rules for superblock cleanup.

       Roughly, everything that is allocated during fs_context
       initialization and that's stored in fs_context->s_fs_info needs
       to be cleaned up by the fs_context->free() implementation before
       the superblock allocation function has been called successfully.

       After sget_fc() returned fs_context->s_fs_info has been
       transferred to sb->s_fs_info at which point sb->kill_sb() if
       fully responsible for cleanup. Adhering to these rules means that
       cleanup of sb->s_fs_info in fill_super() is to be avoided as it's
       brittle and inconsistent.

       Cleanup shouldn't be duplicated between sb->put_super() as
       sb->put_super() is only called if sb->s_root has been set aka
       when the filesystem has been successfully born (SB_BORN). That
       complexity should be avoided.

       This also means that block devices are to be closed in
       sb->kill_sb() instead of sb->put_super(). More details in the
       lower section.

   (3) Make it possible to lookup or create a superblock before opening
       block devices

       There's a subtle dependency on (2) as some filesystems did rely
       on fill_super() to be called in order to correctly clean up
       sb->s_fs_info. All these filesystems have been fixed.

   (4) Switch most filesystem to follow the same logic as the generic
       mount code now does as outlined in (3).

   (5) Use the superblock as the holder of the block device. We can now
       easily go back from block device to owning superblock.

   (6) Export and extend the generic fs_holder_ops and use them as
       holder ops everywhere and remove the filesystem specific holder
       ops.

   (7) Call from the block layer up into the filesystem layer when the
       block device is removed, allowing to shut down the filesystem
       without risk of deadlocks.

   (8) Get rid of get_super().

       We can now easily go back from the block device to owning
       superblock and can call up from the block layer into the
       filesystem layer when the device is removed. So no need to wade
       through all registered superblock to find the owning superblock
       anymore"

Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/

* tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits)
  super: use higher-level helper for {freeze,thaw}
  super: wait until we passed kill super
  super: wait for nascent superblocks
  super: make locking naming consistent
  super: use locking helpers
  fs: simplify invalidate_inodes
  fs: remove get_super
  block: call into the file system for ioctl BLKFLSBUF
  block: call into the file system for bdev_mark_dead
  block: consolidate __invalidate_device and fsync_bdev
  block: drop the "busy inodes on changed media" log message
  dasd: also call __invalidate_device when setting the device offline
  amiflop: don't call fsync_bdev in FDFMTBEG
  floppy: call disk_force_media_change when changing the format
  block: simplify the disk_force_media_change interface
  nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl
  xfs use fs_holder_ops for the log and RT devices
  xfs: drop s_umount over opening the log and RT devices
  ext4: use fs_holder_ops for the log device
  ext4: drop s_umount over opening the log device
  ...
2023-08-28 11:04:18 -07:00
Christian Brauner
3fb5a6562a New code for 6.6:
* Allow the kernel to initiate a freeze of a filesystem.  The kernel
    and userspace can both hold a freeze on a filesystem at the same
    time; the freeze is not lifted until /both/ holders lift it.  This
    will enable us to fix a longstanding bug in XFS online fsck.
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZLVnJwAKCRBKO3ySh0YR
 pqVIAP9u9CZEJ2Zcc7YpBj1MLUQGr2xBmz8RJEVJbQHKVgYcQwEA9BNb4eH4i2Af
 K7Qp0OGNgyzZw37lN23Uf/SDuBK2QgM=
 =seMl
 -----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXo2AAKCRCRxhvAZXjc
 ojDfAQDguc2saF8WLeXtn2O0pGOW8vTrhpwiFHNI6hwdzf07/AD+LGBpFEqYKyX5
 NHPzdR7YYpJoTsQzR4JFJVZqN9Q1xgU=
 =wDq0
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.6-merge-2' of ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull filesystem freezing updates from Darrick Wong:

New code for 6.6:

 * Allow the kernel to initiate a freeze of a filesystem.  The kernel
   and userspace can both hold a freeze on a filesystem at the same
   time; the freeze is not lifted until /both/ holders lift it.  This
   will enable us to fix a longstanding bug in XFS online fsck.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Message-Id: <20230822182604.GB11286@frogsfrogsfrogs>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-23 13:06:55 +02:00
Xu Panda
146afeb235 block: use strscpy() to instead of strncpy()
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL terminated strings.

Signed-off-by: Xu Panda <xu.panda@zte.com.cn>
Signed-off-by: Yang Yang <yang.yang29@zte.com>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/202212031422587503771@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-22 18:07:50 -06:00
Greg Joyce
3bfeb61256 block: sed-opal: keyring support for SED keys
Extend the SED block driver so it can alternatively
obtain a key from a sed-opal kernel keyring. The SED
ioctls will indicate the source of the key, either
directly in the ioctl data or from the keyring.

This allows the use of SED commands in scripts such as
udev scripts so that drives may be automatically unlocked
as they become available.

Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com>
Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20230721211534.3437070-4-gjoyce@linux.vnet.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-22 11:10:26 -06:00
Greg Joyce
5c82efc1ae block: sed-opal: Implement IOC_OPAL_REVERT_LSP
This is used in conjunction with IOC_OPAL_REVERT_TPR to return a drive to
Original Factory State without erasing the data. If IOC_OPAL_REVERT_LSP
is called with opal_revert_lsp.options bit OPAL_PRESERVE set prior
to calling IOC_OPAL_REVERT_TPR, the drive global locking range will not
be erased.

Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20230721211534.3437070-3-gjoyce@linux.vnet.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-22 11:10:26 -06:00
Greg Joyce
9fb10726ec block: sed-opal: Implement IOC_OPAL_DISCOVERY
Add IOC_OPAL_DISCOVERY ioctl to return raw discovery data to a SED Opal
application. This allows the application to display drive capabilities
and state.

Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lore.kernel.org/r/20230721211534.3437070-2-gjoyce@linux.vnet.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-22 11:10:26 -06:00
Chengming Zhou
7222657e51 blk-mq: prealloc tags when increase tagset nr_hw_queues
Just like blk_mq_alloc_tag_set(), it's better to prepare all tags before
using to map to queue ctxs in blk_mq_map_swqueue(), which now have to
consider empty set->tags[].

The good point is that we can fallback easily if increasing nr_hw_queues
fail, instead of just mapping to hctx[0] when fail in blk_mq_map_swqueue().

And the fallback path already has tags free & clean handling, so all
is good.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230821095602.70742-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-22 08:58:06 -06:00
Chengming Zhou
2bc4d7a355 blk-mq: delete redundant tagset map update when fallback
When we increase nr_hw_queues fail, the fallback path will use
blk_mq_update_queue_map() to clear and update all maps.
Obviously, this line of update of HCTX_TYPE_DEFAULT only is not
needed, so delete it.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230821095602.70742-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-22 08:58:05 -06:00
Chengming Zhou
e1dd7bc930 blk-mq: fix tags leak when shrink nr_hw_queues
Although we don't need to realloc set->tags[] when shrink nr_hw_queues,
we need to free them. Or these tags will be leaked.

How to reproduce:
1. mount -t configfs configfs /mnt
2. modprobe null_blk nr_devices=0 submit_queues=8
3. mkdir /mnt/nullb/nullb0
4. echo 1 > /mnt/nullb/nullb0/power
5. echo 4 > /mnt/nullb/nullb0/submit_queues
6. rmdir /mnt/nullb/nullb0

In step 4, will alloc 9 tags (8 submit queues and 1 poll queue), then
in step 5, new_nr_hw_queues = 5 (4 submit queues and 1 poll queue).
At last in step 6, only these 5 tags are freed, the other 4 tags leaked.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230821095602.70742-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-22 08:58:05 -06:00
Christoph Hellwig
2142b88c37 block: call into the file system for ioctl BLKFLSBUF
BLKFLSBUF is a historic ioctl that is called on a file handle to a
block device and syncs either the file system mounted on that block
device if there is one, or otherwise the just the data on the block
device.

Replace the get_super based syncing with a holder operation to remove
the last usage of get_super, and to also support syncing the file system
if the block device is not the main block device stored in s_dev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-16-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:32 +02:00
Christoph Hellwig
d8530de5a6 block: call into the file system for bdev_mark_dead
Combine the newly merged bdev_mark_dead helper with the existing
mark_dead holder operation so that all operations that invalidate
a device that is dead or being removed now go through the holder
ops.  This allows file systems to explicitly shutdown either ASAP
(for a surprise removal) or after writing back data (for an orderly
removal), and do so not only for the main device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-15-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:32 +02:00
Christoph Hellwig
560e20e4bf block: consolidate __invalidate_device and fsync_bdev
We currently have two interfaces that take a block_devices and the find
a mounted file systems to flush or invaldidate data on it.  Both are a
bit problematic because they only work for the "main" block devices
that is used as s_dev for the super_block, and because they don't call
into the file system at all.

Merge the two into a new bdev_mark_dead helper that does both the
syncing and invalidation and which is properly documented.  This is
in preparation of merging the functionality into the ->mark_dead
holder operation so that it will work on additional block devices
used by a file systems and give us a single entry point for invalidation
of dead devices or media.

Note that a single standalone fsync_bdev call for an obscure ioctl
remains for now, but that one will also be deal with in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-14-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:31 +02:00
Christoph Hellwig
127a5093c7 block: drop the "busy inodes on changed media" log message
This message isn't exactly helpful, and file systems already print way more
useful messages when shut down while active.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-13-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:31 +02:00
Christoph Hellwig
ab6860f62b block: simplify the disk_force_media_change interface
Hard code the events to DISK_EVENT_MEDIA_CHANGE as that is the only
useful use case, and drop the superfluous return value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Message-Id: <20230811100828.1897174-9-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21 14:35:30 +02:00
Linus Torvalds
2383ffc41a block-6.5-2023-08-19
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTg19oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmLjD/wKs0hA4JOQDqZPZK1p1aBU4f0vXwQxFGlL
 +gcnO4/MIB/5Ud+T+SXYuMrimLws7xsVbymcGatiRjH8LTfJVXFhuAzLILi0AcHw
 nhzjOUEzHokUex+tZLZRZxmavR+9SyGJoFNIbh+mY8JOLdNVzFDSqnLWO+D02Q2R
 OOBupA0mLRelYODEm2rI4xlQndwfrOAoAyEv+R7Ug0F6bFSno36QOg64pmZVI0Fl
 eudORXnIRYdtUajv+kNATWoqBbq/UCuBJdk0veM07Try6ZGRXRh6dQSA+GRh93pE
 Zg3JAHj4MKwlP3/wglw3SzoeECHpZrKQavIQQe9pTWKP4xGI/jdbVBcyFE0ERc66
 HijMo6CLeAzpOI1nEv+QhD8ntr4polEiWL4EVLuoXE9fVI1mYzavqmqrsDHeOHeF
 IJHadXZwsTG243msDvqedy0RFBwAkpnK0XdQuDtMnSa7UHwWWbxwUOwO5p4COJ3g
 vmrCfPQr7TTgkOtAXoMnwOZ1troEGxa/2CdUKaTdVG8RkMeM2qy8tmBBTV9Bx6+i
 rwQbB/JJm5SE6DX309TRaR6w+5YiwR6e7ECKx5hdYXia7M3OxlBBvl1NOfiWjWE3
 abC38/FReHLmFKHaDaN2AM1vLy+duc4NEc/yMQ4FDcfj/hUHQCoZBPYUsvlC+a4e
 Ws4qoMLU8A==
 =LnzH
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Main thing here is the fix for the regression in flush handling which
  caused IO hangs/stalls for a few reporters. Hopefully that should all
  be sorted out now. Outside of that, just a few minor fixes for issues
  that were introduced in this cycle"

* tag 'block-6.5-2023-08-19' of git://git.kernel.dk/linux:
  blk-mq: release scheduler resource when request completes
  blk-crypto: dynamically allocate fallback profile
  blk-cgroup: hold queue_lock when removing blkg->q_node
  drivers/rnbd: restore sysfs interface to rnbd-client
2023-08-19 17:31:46 +02:00
Chengming Zhou
e5c0ca1365 blk-mq: release scheduler resource when request completes
Chuck reported [1] an IO hang problem on NFS exports that reside on SATA
devices and bisected to commit 615939a2ae ("blk-mq: defer to the normal
submission path for post-flush requests").

We analysed the IO hang problem, found there are two postflush requests
waiting for each other.

The first postflush request completed the REQ_FSEQ_DATA sequence, so go to
the REQ_FSEQ_POSTFLUSH sequence and added in the flush pending list, but
failed to blk_kick_flush() because of the second postflush request which
is inflight waiting in scheduler queue.

The second postflush waiting in scheduler queue can't be dispatched because
the first postflush hasn't released scheduler resource even though it has
completed by itself.

Fix it by releasing scheduler resource when the first postflush request
completed, so the second postflush can be dispatched and completed, then
make blk_kick_flush() succeed.

While at it, remove the check for e->ops.finish_request, as all
schedulers set that. Reaffirm this requirement by adding a WARN_ON_ONCE()
at scheduler registration time, just like we do for insert_requests and
dispatch_request.

[1] https://lore.kernel.org/all/7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com/

Link: https://lore.kernel.org/linux-block/20230819031206.2744005-1-chengming.zhou@linux.dev/
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202308172100.8ce4b853-oliver.sang@intel.com
Fixes: 615939a2ae ("blk-mq: defer to the normal submission path for post-flush requests")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230813152325.3017343-1-chengming.zhou@linux.dev
[axboe: folded in incremental fix and added tags]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-19 07:47:17 -06:00
Sweet Tea Dorminy
c984ff1423 blk-crypto: dynamically allocate fallback profile
blk_crypto_profile_init() calls lockdep_register_key(), which warns and
does not register if the provided memory is a static object.
blk-crypto-fallback currently has a static blk_crypto_profile and calls
blk_crypto_profile_init() thereupon, resulting in the warning and
failure to register.

Fortunately it is simple enough to use a dynamically allocated profile
and make lockdep function correctly.

Fixes: 2fb48d88e7 ("blk-crypto: use dynamic lock class for blk_crypto_profile::lock")
Cc: stable@vger.kernel.org
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230817141615.15387-1-sweettea-kernel@dorminy.me
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-18 15:00:39 -06:00
Ming Lei
c164c7bc97 blk-cgroup: hold queue_lock when removing blkg->q_node
When blkg is removed from q->blkg_list from blkg_free_workfn(), queue_lock
has to be held, otherwise, all kinds of bugs(list corruption, hard lockup,
..) can be triggered from blkg_destroy_all().

Fixes: f1c006f1c6 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: xiaoli feng <xifeng@redhat.com>
Cc: Chunyu Hu <chuhu@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230817141751.1128970-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-18 15:00:39 -06:00
Tejun Heo
ec14a87ee1 blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
blk-iocost sometimes causes the following crash:

  BUG: kernel NULL pointer dereference, address: 00000000000000e0
  ...
  RIP: 0010:_raw_spin_lock+0x17/0x30
  Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00
  RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001
  RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0
  RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003
  R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000
  R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600
  FS:  00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0
  Call Trace:
   <TASK>
   ioc_weight_write+0x13d/0x410
   cgroup_file_write+0x7a/0x130
   kernfs_fop_write_iter+0xf5/0x170
   vfs_write+0x298/0x370
   ksys_write+0x5f/0xb0
   __x64_sys_write+0x1b/0x20
   do_syscall_64+0x3d/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

This happens because iocg->ioc is NULL. The field is initialized by
ioc_pd_init() and never cleared. The NULL deref is caused by
blkcg_activate_policy() installing blkg_policy_data before initializing it.

blkcg_activate_policy() was doing the following:

1. Allocate pd's for all existing blkg's and install them in blkg->pd[].
2. Initialize all pd's.
3. Online all pd's.

blkcg_activate_policy() only grabs the queue_lock and may release and
re-acquire the lock as allocation may need to sleep. ioc_weight_write()
grabs blkcg->lock and iterates all its blkg's. The two can race and if
ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a
pd which is not initialized yet, leading to crash.

The crash can be reproduced with the following script:

  #!/bin/bash

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct
  echo 100 > /sys/fs/cgroup/system.slice/io.weight
  bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" &
  sleep .2
  echo 100 > /sys/fs/cgroup/system.slice/io.weight

with the following patch applied:

> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index fc49be622e05..38d671d5e10c 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
> 		pd->online = false;
> 	}
>
> +       if (system_state == SYSTEM_RUNNING) {
> +               spin_unlock_irq(&q->queue_lock);
> +               ssleep(1);
> +               spin_lock_irq(&q->queue_lock);
> +       }
> +
> 	/* all allocated, init in the same order */
> 	if (pol->pd_init_fn)
> 		list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)

I don't see a reason why all pd's should be allocated, initialized and
onlined together. The only ordering requirement is that parent blkgs to be
initialized and onlined before children, which is guaranteed from the
walking order. Let's fix the bug by allocating, initializing and onlining pd
for each blkg and holding blkcg->lock over initialization and onlining. This
ensures that an installed blkg is always fully initialized and onlined
removing the the race window.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Breno Leitao <leitao@debian.org>
Fixes: 9d179b8654 ("blkcg: Fix multiple bugs in blkcg_activate_policy()")
Link: https://lore.kernel.org/r/ZN0p5_W-Q9mAHBVY@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-17 19:21:05 -06:00
Kent Overstreet
649f070e69 block: Bring back zero_fill_bio_iter
This reverts 6f822e1b5d - this helper is
used by bcachefs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20230813182636.2966159-4-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-14 15:40:42 -06:00
Kent Overstreet
168145f617 block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
bio_iov_iter_get_pages() trims the IO based on the block size of the
block device the IO will be issued to.

However, bcachefs is a multi device filesystem; when we're creating the
bio we don't yet know which block device the bio will be submitted to -
we have to handle the alignment checks elsewhere.

Thus this is needed to avoid a null ptr deref.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/20230813182636.2966159-3-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-14 15:40:42 -06:00
Kent Overstreet
7ba3792718 block: Add some exports for bcachefs
- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
 - blk_status_to_str() - error messages
 - bio_add_folio() - this should definitely be exported for everyone,
   it's the modern version of bio_add_page()

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-block@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20230813182636.2966159-2-kent.overstreet@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-14 15:40:42 -06:00
Linus Torvalds
360e694282 block-6.5-2023-08-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTWfLQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpg3nEACROhaeX6cpeDCSTqDVDW/ontbyn15eX7ep
 tGPLn/TVtKv2AztIobEinS08MdywqBO/VcB7XkxQV9Ov4JqCHIAKhndWI6/HqD9P
 DH3h6tE5JA8RQlNw1aHRrqWWIl1lpDQI6263um1tB2TuaxRa4xuR560jju0VZzAm
 9541ceKlJT8Qc7yG0aiiCv6Bxz+b6Htv3DqCf1mY2yznl3BpN52RQHKhiA0sfnlF
 WKqNsvSJ9/kz3vJbNpFucO7ch8a7W+MzmBx0vf2ickTBpL/3hbhUOrE7dGeKI9rS
 cWh1HaULWqjnKY1uxF9nnapZxm8QoxkT/5T0DgmprKjwuZivfLASAhYpHBc3mT1S
 eQQ0AK8hqx7sPnPeO/kxWtxM2nzRLkeVd19ClbIwux/zDbRrpHWk2/wgnSUUd3/H
 HBbjbgPWbkgLvTOUKhIA5VPBcgkC1efom1+ePzkH/H4TRRuVJwg6s6utGXdgc1PX
 +B4TA8GtXRH/7L0tsblFyJRmd0Y6G7gYE/yy0DYZTMie3oaWrKx3lmz48AQUtEzh
 DG46VRA4wnthHRlw3mkLP7C6z4PJvK9WWBiK11eZ9VfJMF643FNpXQ3/bviR9pfF
 kXdwYXoi1mlnsQ0VUhu2f+JeV4hHalrjwD/VE2H0E8Ogb4ezmJteLyiZKcw5xwaA
 Hmtmbb7Qxw==
 =+1Vt
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - Fixes for request_queue state (Ming)
      - Another uuid quirk (August)

 - RCU poll fix for NVMe (Ming)

 - Fix for an IO stall with polled IO (me)

 - Fix for blk-iocost stats enable/disable accounting (Chengming)

 - Regression fix for large pages for zram (Christoph)

* tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux:
  nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll
  blk-iocost: fix queue stats accounting
  block: don't make REQ_POLLED imply REQ_NOWAIT
  block: get rid of unused plug->nowait flag
  zram: take device and not only bvec offset into account
  nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G
  nvme-rdma: fix potential unbalanced freeze & unfreeze
  nvme-tcp: fix potential unbalanced freeze & unfreeze
  nvme: fix possible hang when removing a controller during error recovery
2023-08-11 12:14:08 -07:00
Jens Axboe
18267a0365 block: fix bad lockdep annotation in blk-iolatency
A previous commit added a lockdep annotation, but botched it. Use the
right type.

Fixes: 4eb44d1076 ("block: remove init_mutex and open-code blk_iolatency_try_init")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10 17:24:53 -06:00
Li Lingfeng
4eb44d1076 block: remove init_mutex and open-code blk_iolatency_try_init
Commit a13696b83d ("blk-iolatency: Make initialization lazy") adds
a mutex named "init_mutex" in blk_iolatency_try_init for the race
condition of initializing RQ_QOS_LATENCY.
Now a new lock has been add to struct request_queue by commit a13bd91be2
("block/rq_qos: protect rq_qos apis with a new lock"). And it has been
held in blkg_conf_open_bdev before calling blk_iolatency_init.
So it's not necessary to keep init_mutex in blk_iolatency_try_init, just
remove it.

Since init_mutex has been removed, blk_iolatency_try_init can be
open-coded back to iolatency_set_limit() like ioc_qos_write().

Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Link: https://lore.kernel.org/r/20230810035111.2236335-1-lilingfeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-10 07:20:31 -06:00
Jinyoung Choi
0ece1d649b bio-integrity: create multi-page bvecs in bio_integrity_add_page()
In general, the bvec data structure consists of one for physically
continuous pages. But, in the bvec configuration for bip, physically
continuous integrity pages are composed of each bvec.

Allow bio_integrity_add_page() to create multi-page bvecs, just like
the bio payloads. This simplifies adding larger payloads, and fixes
support for non-tiny workloads with nvme, which stopped using
scatterlist for metadata a while ago.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>

Fixes: 783b94bd92 ("nvme-pci: do not build a scatterlist to map metadata")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803025202epcms2p82f57cbfe32195da38c776377b55aed59@epcms2p8
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:05:35 -06:00
Jinyoung Choi
d1f04c2e23 bio-integrity: cleanup adding integrity pages to bip's bvec.
bio_integrity_add_page() returns the add length if successful, else 0,
just as bio_add_page.  Simply check return value checking in
bio_integrity_prep to not deal with a > 0 but < len case that can't
happen.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803025058epcms2p5a4d0db5da2ad967668932d463661c633@epcms2p5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:05:35 -06:00
Jinyoung Choi
80814b8e35 bio-integrity: update the payload size in bio_integrity_add_page()
Previously, the bip's bi_size has been set before an integrity pages
were added. If a problem occurs in the process of adding pages for
bip, the bi_size mismatch problem must be dealt with.

When the page is successfully added to bvec, the bi_size is updated.

The parts affected by the change were also contained in this commit.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:05:35 -06:00
Jinyoung Choi
7c8998f75d block: make bvec_try_merge_hw_page() non-static
This will be used for multi-page configuration for integrity payload.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803024827epcms2p838d9e9131492c86a159fff25d195658f@epcms2p8
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:05:35 -06:00
Chengming Zhou
f099a108ca blk-iocost: fix queue stats accounting
The q->stats->accounting is not only used by iocost, but iocost only
increase this counter, never decrease it. So queue stats accounting
will always enabled after using iocost once.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230804070609.31623-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:04:14 -06:00
Jens Axboe
2bc0576925 block: don't make REQ_POLLED imply REQ_NOWAIT
Normally these two flags do go together, as the issuer of polled IO
generally cannot wait for resources that will get freed as part of IO
completion. This is because that very task is the one that will complete
the request and free those resources, hence that would introduce a
deadlock.

But it is possible to have someone else issue the polled IO, eg via
io_uring if the request is punted to io-wq. For that case, it's fine to
have the task block on IO submission, as it is not the same task that
will be completing the IO.

It's completely up to the caller to ask for both polled and nowait IO
separately! If we don't allow polled IO where IOCB_NOWAIT isn't set in
the kiocb, then we can run into repeated -EAGAIN submissions and not
make any progress.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-09 16:04:07 -06:00
Jens Axboe
d74f714896 block: get rid of unused plug->nowait flag
This was introduced to add a plug based way of signaling nowait issues,
but we have since moved on from that. Kill the old dead code, nobody is
setting it anymore.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08 15:50:37 -06:00
Zhiguo Niu
d47f9717e5 block/mq-deadline: use correct way to throttling write requests
The original formula was inaccurate:
dd->async_depth = max(1UL, 3 * q->nr_requests / 4);

For write requests, when we assign a tags from sched_tags,
data->shallow_depth will be passed to sbitmap_find_bit,
see the following code:

nr = sbitmap_find_bit_in_word(&sb->map[index],
			min_t (unsigned int,
			__map_depth(sb, index),
			depth),
			alloc_hint, wrap);

The smaller of data->shallow_depth and __map_depth(sb, index)
will be used as the maximum range when allocating bits.

For a mmc device (one hw queue, deadline I/O scheduler):
q->nr_requests = sched_tags = 128, so according to the previous
calculation method, dd->async_depth = data->shallow_depth = 96,
and the platform is 64bits with 8 cpus, sched_tags.bitmap_tags.sb.shift=5,
sb.maps[]=32/32/32/32, 32 is smaller than 96, whether it is a read or
a write I/O, tags can be allocated to the maximum range each time,
which has not throttling effect.

In addition, refer to the methods of bfg/kyber I/O scheduler,
limit ratiois are calculated base on sched_tags.bitmap_tags.sb.shift.

This patch can throttle write requests really.

Fixes: 07757588e5 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")

Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/1691061162-22898-1-git-send-email-zhiguo.niu@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-08 15:46:41 -06:00
Christoph Hellwig
925c86a19b fs: add CONFIG_BUFFER_HEAD
Add a new config option that controls building the buffer_head code, and
select it from all file systems and stacking drivers that need it.

For the block device nodes and alternative iomap based buffered I/O path
is provided when buffer_head support is not enabled, and iomap needs a
a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call
into the buffer_head code when it doesn't exist.

Otherwise this is just Kconfig and ifdef changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02 09:13:09 -06:00
Christoph Hellwig
487c607df7 block: use iomap for writes to block devices
Use iomap in buffer_head compat mode to write to block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02 09:13:09 -06:00
Christoph Hellwig
a05f7bd957 block: stop setting ->direct_IO
Direct I/O on block devices now nevers goes through aops->direct_IO.
Stop setting it and set the FMODE_CAN_ODIRECT in ->open instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230801172201.1923299-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02 09:13:09 -06:00
Christoph Hellwig
727cfe9767 block: open code __generic_file_write_iter for blkdev writes
Open code __generic_file_write_iter to remove the indirect call into
->direct_IO and to prepare using the iomap based write code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-02 09:13:09 -06:00
Jinyoung Choi
51d74ec9b6 block: cleanup bio_integrity_prep
If a problem occurs in the process of creating an integrity payload, the
status of bio is always BLK_STS_RESOURCE.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230725051839epcms2p8e4d20ad6c51326ad032e8406f59d0aaa@epcms2p8
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-25 20:30:54 -06:00
Bart Van Assche
65a558f66c block: Improve performance for BLK_MQ_F_BLOCKING drivers
blk_mq_run_queue() runs the queue asynchronously if BLK_MQ_F_BLOCKING
has been set. This is suboptimal since running the queue asynchronously
is slower than running the queue synchronously. This patch modifies
blk_mq_run_queue() as follows if BLK_MQ_F_BLOCKING has been set:
- Run the queue synchronously if it is allowed to sleep.
- Run the queue asynchronously if it is not allowed to sleep.
Additionally, blk_mq_run_hw_queue(hctx, false) calls are modified into
blk_mq_run_hw_queue(hctx, hctx->flags & BLK_MQ_F_BLOCKING) if the caller
may be invoked from atomic context.

The following caller chains have been reviewed:

blk_mq_run_hw_queue(hctx, false)
  blk_mq_get_tag()      /* may sleep, hence the functions it calls may also sleep */
  blk_execute_rq()             /* may sleep */
  blk_mq_run_hw_queues(q, async=false)
    blk_freeze_queue_start()   /* may sleep */
    blk_mq_requeue_work()      /* may sleep */
    scsi_kick_queue()
      scsi_requeue_run_queue() /* may sleep */
      scsi_run_host_queues()
        scsi_ioctl_reset()     /* may sleep */
  blk_mq_insert_requests(hctx, ctx, list, run_queue_async=false)
    blk_mq_dispatch_plug_list(plug, from_sched=false)
      blk_mq_flush_plug_list(plug, from_schedule=false)
        __blk_flush_plug(plug, from_schedule=false)
	blk_add_rq_to_plug()
	  blk_mq_submit_bio()  /* may sleep if REQ_NOWAIT has not been set */
  blk_mq_plug_issue_direct()
    blk_mq_flush_plug_list()   /* see above */
  blk_mq_dispatch_plug_list(plug, from_sched=false)
    blk_mq_flush_plug_list()   /* see above */
  blk_mq_try_issue_directly()
    blk_mq_submit_bio()        /* may sleep if REQ_NOWAIT has not been set */
  blk_mq_try_issue_list_directly(hctx, list)
    blk_mq_insert_requests() /* see above */

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230721172731.955724-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 20:13:12 -06:00
Christoph Hellwig
ae42f0b3bf block: don't pass a bio to bio_try_merge_hw_seg
There is no good reason to pass the bio to bio_try_merge_hw_seg.  Just
pass the current bvec and rename the function to bvec_try_merge_hw_page.
This will allow reusing this function for supporting multi-page integrity
payload bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 19:55:16 -06:00
Christoph Hellwig
858c708d9e block: move the bi_size update out of __bio_try_merge_page
The update of bi_size is the only thing in __bio_try_merge_page that
needs a bio.  Move it to the callers, and merge __bio_try_merge_page
and page_is_mergeable into a single bvec_try_merge_page that only takes
the current bvec instead of a full bio.  This will allow reusing this
function for supporting multi-page integrity payload bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 19:55:16 -06:00
Christoph Hellwig
80232b5203 block: downgrade a bio_full call in bio_add_page
bio_add_page already checks that there is space in bi_size a little
earlier.  So after we failed to add to an existing segment, just check
that there is another one available instead of duplicating the bi_size
check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 19:55:16 -06:00
Christoph Hellwig
613699050a block: move the bi_size overflow check in __bio_try_merge_page
Checking for availability in bi_size in a function that attempts to
merge into an existing segment is a bit odd, as the limit also applies
when adding a new segment.  This code works fine as we always call
__bio_try_merge_page, but contributes to sub-optimal calling conventions
and doesn't lead to clear code.

Move it to two of the callers instead, the third one already has a more
strict check that includes max_hw_segments anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 19:55:16 -06:00
Christoph Hellwig
0eca8b6f97 block: move the bi_vcnt check out of __bio_try_merge_page
Move the bi_vcnt out of __bio_try_merge_page and into the two callers
that don't already have it in preparation for additional changes to
__bio_try_merge_page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 19:55:16 -06:00
Christoph Hellwig
939e1a3703 block: move the BIO_CLONED checks out of __bio_try_merge_page
__bio_try_merge_page is a way too low-level helper to assert that the
bio is not cloned.  Move the check into bio_add_page and
bio_iov_iter_get_pages instead, which are the high level entry points
that should enforce this variant.  bio_add_hw_page already this
check, coverig the third (indirect) caller of __bio_try_merge_page.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 19:55:16 -06:00
Christoph Hellwig
6850b2dd5c block: use SECTOR_SHIFT bio_add_hw_page
Use the SECTOR_SHIFT magic constant instead of the magic number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230724165433.117645-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 19:55:16 -06:00
Christoph Hellwig
cd1d83e24e block: tidy up the bio full checks in bio_add_hw_page
bio_add_hw_page already checks if the number of bytes trying to be added
even fit into max_hw_sectors limit of the queue.   Remove the call to
bio_full and just do a check for the smaller of the number of segments
in the bio and the queue max segments limit, and do this cheap check
before the more expensive gap to previous check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jinyoung Choi <j-young.choi@samsung.com>
Link: https://lore.kernel.org/r/20230724165433.117645-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-24 19:55:16 -06:00
Linus Torvalds
f036d67c02 block-6.5-2023-07-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmS629wQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv/7D/99ysE5ZszmjxNOmyy1lGfqtQnaTLuToRsl
 wB16umIPAFfye5r4TV8l9GZuUyI7FU8LySglu0Y0qMKmCp+kJKLh90kB281Co4Dn
 yp1AbqlTorAlG4ElQJBRaQr4kaqqvI2tzeVmFdUhIE1oX2e9OX/O+YKa8k1JfsKI
 oecChQgodlPxX3wusItgiyvZKl2q2+mivg5E6cqiGIgP3uF8fmOQCbio4Vm8ZSxb
 TO8JEfBTiXslR+CvJD3Gi96pzexN1qCUed8/7FDiIUufhETmwqSIOo89GxzGAQ6O
 7o/83IkqgXPHjKLYs3R4/jhHPXZmXmvDZHWIiSg+KLOFqxxWmRPNJ6V6igIBP8SG
 eu5PTA7SDGtvIXePpu38FTPmSiUW7MbGhnjqY8u64Je6MaQ8l28KN7xkFtmxV+n4
 hgB0gr6uKBnXMKZHobk0yJeUUI/L/0ESzbVPDHY8JM/rQCsp1eSNQDpZoVjPWZmg
 lMGYmOq57oPA20LVch7U3gUFhD4CJ7c3e2/EzJdJVjsTveTYieBCEESQErFbMcEr
 VuRZSAGnPyXQ4yF4wG93x4sDye28ZFS/Q9c6Q3DCUxctDkCz4eY1+vmdX+NJXwDA
 aYXCyyKzk18udbKvV0QvTuDTb6PrJDPxbFagCveibPTtP4XDMv1LvpdZPUPJ/HGX
 4xA1mrsGJA==
 =e2OR
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - Fix for loop regressions (Mauricio)

 - Fix a potential stall with batched wakeups in sbitmap (David)

 - Fix for stall with recursive plug flushes (Ross)

 - Skip accounting of empty requests for blk-iocost (Chengming)

 - Remove a dead field in struct blk_mq_hw_ctx (Chengming)

* tag 'block-6.5-2023-07-21' of git://git.kernel.dk/linux:
  loop: do not enforce max_loop hard limit by (new) default
  loop: deprecate autoloading callback loop_probe()
  sbitmap: fix batching wakeup
  blk-iocost: skip empty flush bio in iocost
  blk-mq: delete dead struct blk_mq_hw_ctx->queued field
  blk-mq: Fix stall due to recursive flush plug
2023-07-22 11:05:15 -07:00
Nitesh Shetty
8f63fef586 block: refactor to use helper
Reduce some code by making use of bio_integrity_bytes().

Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230719121608.32105-1-nj.shetty@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-20 16:27:52 -06:00
Chengming Zhou
013adcbef1 blk-iocost: skip empty flush bio in iocost
The flush bio may have data, may have no data (empty flush), we couldn't
calculate cost for empty flush bio. So we'd better just skip it for now.

Another side effect is that empty flush bio's bio_end_sector() is 0, cause
iocg->cursor reset to 0, may break the cost calculation of other bios.

This isn't good enough, since flush bio still consume the device bandwidth,
but flush request is special, can be merged randomly in the flush state
machine, we don't know how to calculate cost for it for now.

Its completion time also has flaws, which may include the pre-flush or
post-flush completion time, but I don't know if we need to fix that and
how to fix it.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230720121441.1408522-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-20 14:02:17 -06:00
Darrick J. Wong
880b957785 fs: distinguish between user initiated freeze and kernel initiated freeze
Userspace can freeze a filesystem using the FIFREEZE ioctl or by
suspending the block device; this state persists until userspace thaws
the filesystem with the FITHAW ioctl or resuming the block device.
Since commit 18e9e5104f ("Introduce freeze_super and thaw_super for
the fsfreeze ioctl") we only allow the first freeze command to succeed.

The kernel may decide that it is necessary to freeze a filesystem for
its own internal purposes, such as suspends in progress, filesystem fsck
activities, or quiescing a device prior to removal.  Userspace thaw
commands must never break a kernel freeze, and kernel thaw commands
shouldn't undo userspace's freeze command.

Introduce a couple of freeze holder flags and wire it into the
sb_writers state.  One kernel and one userspace freeze are allowed to
coexist at the same time; the filesystem will not thaw until both are
lifted.

I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but
for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing
behaviors.

Cc: mcgrof@kernel.org
Cc: jack@suse.cz
Cc: hch@infradead.org
Cc: ruansy.fnst@fujitsu.com
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2023-07-17 09:00:09 -07:00
Chengming Zhou
81ada09cc2 blk-flush: reuse rq queuelist in flush state machine
Since we don't need to maintain inflight flush_data requests list
anymore, we can reuse rq->queuelist for flush pending list.

Note in mq_flush_data_end_io(), we need to re-initialize rq->queuelist
before reusing it in the state machine when end, since the rq->rq_next
also reuse it, may have corrupted rq->queuelist by the driver.

This patch decrease the size of struct request by 16 bytes.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230717040058.3993930-5-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 08:18:21 -06:00
Chengming Zhou
b175c86739 blk-flush: count inflight flush_data requests
The flush state machine use a double list to link all inflight
flush_data requests, to avoid issuing separate post-flushes for
these flush_data requests which shared PREFLUSH.

So we can't reuse rq->queuelist, this is why we need rq->flush.list

In preparation of the next patch that reuse rq->queuelist for flush
state machine, we change the double linked list to unsigned long
counter, which count all inflight flush_data requests.

This is ok since we only need to know if there is any inflight
flush_data request, so unsigned long counter is good.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-4-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 08:18:21 -06:00
Chengming Zhou
28b2412374 blk-flush: fix rq->flush.seq for post-flush requests
If the policy == (REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH), it means that the
data sequence and post-flush sequence need to be done for this request.

The rq->flush.seq should record what sequences have been done (or don't
need to be done). So in this case, pre-flush doesn't need to be done,
we should init rq->flush.seq to REQ_FSEQ_PREFLUSH not REQ_FSEQ_POSTFLUSH.

Fixes: 615939a2ae ("blk-mq: defer to the normal submission path for post-flush requests")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-3-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 08:18:21 -06:00
Chengming Zhou
660e802c76 blk-mq: use percpu csd to remote complete instead of per-rq csd
If request need to be completed remotely, we insert it into percpu llist,
and smp_call_function_single_async() if llist is empty previously.

We don't need to use per-rq csd, percpu csd is enough. And the size of
struct request is decreased by 24 bytes.

This way is cleaner, and looks correct, given block softirq is guaranteed
to be scheduled to consume the list if one new request is added to this
percpu list, either smp_call_function_single_async() returns -EBUSY or 0.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230717040058.3993930-2-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 08:18:21 -06:00
Christoph Hellwig
43c9835b14 block: don't allow enabling a cache on devices that don't support it
Currently the write_cache attribute allows enabling the QUEUE_FLAG_WC
flag on devices that never claimed the capability.

Fix that by adding a QUEUE_FLAG_HW_WC flag that is set by
blk_queue_write_cache and guards re-enabling the cache through sysfs.

Note that any rescan that calls blk_queue_write_cache will still
re-enable the write cache as in the current code.

Fixes: 93e9d8e836 ("block: add ability to flag write back caching on a device")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230707094239.107968-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 08:18:18 -06:00
Christoph Hellwig
c4e21bcd0f block: cleanup queue_wc_store
Get rid of the local queue_wc_store variable and handling setting and
clearing the QUEUE_FLAG_WC flag diretly instead the if / else if.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230707094239.107968-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17 08:18:17 -06:00
Linus Torvalds
be522ac7cd SCSI fixes on 20230714
This is a bunch of small driver fixes and a larger rework of zone disk
 handling (which reaches into blk and nvme).  The aacraid array-bounds
 fix is now critical since the security people turned on -Werror for
 some build tests, which now fail without it.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZLGSiCYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishd/BAPwO2i4t
 5uzhcWihoYaIZ6x07oEhgOP/o1h5n5mM908AyAEA6s2hQKDoIxjJexqvkS7lPjni
 P8VMcfvOmdsLDCD3nJ4=
 =+10g
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "This is a bunch of small driver fixes and a larger rework of zone disk
  handling (which reaches into blk and nvme).

  The aacraid array-bounds fix is now critical since the security people
  turned on -Werror for some build tests, which now fail without it"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: storvsc: Handle SRB status value 0x30
  scsi: block: Improve checks in blk_revalidate_disk_zones()
  scsi: block: virtio_blk: Set zone limits before revalidating zones
  scsi: block: nullblk: Set zone limits before revalidating zones
  scsi: nvme: zns: Set zone limits before revalidating zones
  scsi: sd_zbc: Set zone limits before revalidating zones
  scsi: ufs: core: Add support for qTimestamp attribute
  scsi: aacraid: Avoid -Warray-bounds warning
  scsi: ufs: ufs-mediatek: Add dependency for RESET_CONTROLLER
  scsi: ufs: core: Update contact email for monitor sysfs nodes
  scsi: scsi_debug: Remove dead code
  scsi: qla2xxx: Use vmalloc_array() and vcalloc()
  scsi: fnic: Use vmalloc_array() and vcalloc()
  scsi: qla2xxx: Fix error code in qla2x00_start_sp()
  scsi: qla2xxx: Silence a static checker warning
  scsi: lpfc: Fix a possible data race in lpfc_unregister_fcf_rescan()
2023-07-14 19:57:29 -07:00
Ross Lagerwall
7090426351 blk-mq: Fix stall due to recursive flush plug
We have seen rare IO stalls as follows:

* blk_mq_plug_issue_direct() is entered with an mq_list containing two
requests.
* For the first request, it sets last == false and enters the driver's
queue_rq callback.
* The driver queue_rq callback indirectly calls schedule() which calls
blk_flush_plug(). This may happen if the driver has the
BLK_MQ_F_BLOCKING flag set and is allowed to sleep in ->queue_rq.
* blk_flush_plug() handles the remaining request in the mq_list. mq_list
is now empty.
* The original call to queue_rq resumes (with last == false).
* The loop in blk_mq_plug_issue_direct() terminates because there are no
remaining requests in mq_list.

The IO is now stalled because the last request submitted to the driver
had last == false and there was no subsequent call to commit_rqs().

Fix this by returning early in blk_mq_flush_plug_list() if rq_count is 0
which it will be in the recursive case, rather than checking if the
mq_list is empty. At the same time, adjust one of the callers to skip
the mq_list empty check as it is not necessary.

Fixes: dc5fc361d8 ("block: attempt direct issue of plug list")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230714101106.3635611-1-ross.lagerwall@citrix.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-14 13:44:24 -06:00
Christoph Hellwig
9f87fc4d72 block: queue data commands from the flush state machine at the head
We used to insert the data commands following a pre-flush to the head
of the queue until commit 1e82fadfc6 ("blk-mq: do not do head insertions
post-pre-flush commands").  Not doing this seems to cause hangs of
such commands on NFS workloads when exported from file systems with
SATA SSDs.  I have no idea why this would starve these workloads,
but doing a semantic revert of this patch (which looks quite different
due to various other changes) fixes the hangs.

Fixes: 1e82fadfc6 ("blk-mq: do not do head insertions post-pre-flush commands")
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Link: https://lore.kernel.org/r/20230714143014.11879-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-14 08:42:58 -06:00
Chengming Zhou
5c17f45e91 blk-mq: fix start_time_ns and alloc_time_ns for pre-allocated rq
The iocost rely on rq start_time_ns and alloc_time_ns to tell saturation
state of the block device. Most of the time request is allocated after
rq_qos_throttle() and its alloc_time_ns or start_time_ns won't be affected.

But for plug batched allocation introduced by the commit 47c122e35d
("block: pre-allocate requests if plug is started and is a batch"), we can
rq_qos_throttle() after the allocation of the request. This is what the
blk_mq_get_cached_request() does.

In this case, the cached request alloc_time_ns or start_time_ns is much
ahead if blocked in any qos ->throttle().

Fix it by setting alloc_time_ns and start_time_ns to now when the allocated
request is actually used.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230710105516.2053478-1-chengming.zhou@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-13 12:30:57 -06:00
Bart Van Assche
f673b4f5bd block/mq-deadline: Fix a bug in deadline_from_pos()
A bug was introduced in deadline_from_pos() while implementing the
suggestion to use round_down() in the following code:

	pos -= bdev_offset_from_zone_start(rq->q->disk->part0, pos);

This patch makes deadline_from_pos() use round_down() such that 'pos' is
rounded down.

Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/all/5zthzi3lppvcdp4nemum6qck4gpqbdhvgy4k3qwguhgzxc4quj@amulvgycq67h/
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Fixes: 0effb390c4 ("block: mq-deadline: Handle requeued requests correctly")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230712173344.2994513-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-12 11:37:41 -06:00
Martin K. Petersen
e96277a570 Merge branch '6.5/scsi-staging' into 6.5/scsi-fixes
Pull in the currently staged SCSI fixes for 6.5.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-11 12:15:15 -04:00
Damien Le Moal
03e51c4a74 scsi: block: Improve checks in blk_revalidate_disk_zones()
blk_revalidate_disk_zones() implements checks of the zones of a zoned
block device, verifying that the zone size is a power of 2 number of
sectors, that all zones (except possibly the last one) have the same
size and that zones cover the entire addressing space of the device.

While these checks are appropriate to verify that well tested hardware
devices have an adequate zone configurations, they lack in certain areas
which may result in issues with emulated devices implemented with user
drivers such as ublk or tcmu. Specifically, this function does not
check if the device driver indicated support for the mandatory zone
append writes, that is, if the device max_zone_append_sectors queue
limit is set to a non-zero value. Additionally, invalid zones such as
a zero length zone with a start sector equal to the device capacity will
not be detected and result in out of bounds use of the zone bitmaps
prepared with the callback function blk_revalidate_zone_cb().

Improve blk_revalidate_disk_zones() to address these inadequate checks,
relying on the fact that all device drivers supporting zoned block
devices must set the device zone size (chunk_sectors queue limit) and
the max_zone_append_sectors queue limit before executing this function.

The check for a non-zero max_zone_append_sectors value is done in
blk_revalidate_disk_zones() before executing the zone report. The zone
report callback function blk_revalidate_zone_cb() is also modified to
add a check that a zone start is below the device capacity.

The check that the zone size is a power of 2 number of sectors is moved
to blk_revalidate_disk_zones() as the zone size is already known.
Similarly, the number of zones of the device can be calculated in
blk_revalidate_disk_zones() before executing the zone report.

The kdoc comment for blk_revalidate_disk_zones() is also updated to
mention that device drivers must set the device zone size and the
max_zone_append_sectors queue limit before calling this function.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230703024812.76778-6-dlemoal@kernel.org
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-07-05 21:58:10 -04:00
Eric Biggers
2fb48d88e7 blk-crypto: use dynamic lock class for blk_crypto_profile::lock
When a device-mapper device is passing through the inline encryption
support of an underlying device, calls to blk_crypto_evict_key() take
the blk_crypto_profile::lock of the device-mapper device, then take the
blk_crypto_profile::lock of the underlying device (nested).  This isn't
a real deadlock, but it causes a lockdep report because there is only
one lock class for all instances of this lock.

Lockdep subclasses don't really work here because the hierarchy of block
devices is dynamic and could have more than 2 levels.

Instead, register a dynamic lock class for each blk_crypto_profile, and
associate that with the lock.

This avoids false-positive lockdep reports like the following:

    ============================================
    WARNING: possible recursive locking detected
    6.4.0-rc5 #2 Not tainted
    --------------------------------------------
    fscryptctl/1421 is trying to acquire lock:
    ffffff80829ca418 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0x44/0x1c0

                   but task is already holding lock:
    ffffff8086b68ca8 (&profile->lock){++++}-{3:3}, at: __blk_crypto_evict_key+0xc8/0x1c0

                   other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&profile->lock);
      lock(&profile->lock);

                    *** DEADLOCK ***

     May be due to missing lock nesting notation

Fixes: 1b26283970 ("block: Keyslot Manager for Inline Encryption")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230610061139.212085-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-05 16:36:12 -06:00
Michael Schmitz
7eb1e47696 block/partition: fix signedness issue for Amiga partitions
Making 'blk' sector_t (i.e. 64 bit if LBD support is active) fails the
'blk>0' test in the partition block loop if a value of (signed int) -1 is
used to mark the end of the partition block list.

Explicitly cast 'blk' to signed int to allow use of -1 to terminate the
partition block linked list.

Fixes: b6f3f28f60 ("block: add overflow checks for Amiga partition support")
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Link: https://lore.kernel.org/r/024ce4fa-cc6d-50a2-9aae-3701d0ebf668@xenosoft.de
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Reviewed-by: Martin Steigerwald <martin@lichtvoll.de>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-05 16:34:56 -06:00
Linus Torvalds
e50df24979 block-6.5-2023-07-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSjJ2IQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsQMEACQiUBw81tXvetYhz3P/4KrrjvUobgqMU0w
 jtrxqMgPee9FbqCShpj76c+La5wu23DnlrCXoHZxFQuiQLnsX5xFV66NYVi+W1CN
 k5MHP7f2e9V0T7qJ9UoHFRV1k22LF4X6T8njEZimxsm/uXfpav/knkhI7nUDnB1K
 wxlu9akD2Bo/X9O2NTS+X6qjoawZ6rDWN15THMXlC45VzJPLmIcs07Ev+mvw21KE
 XqasoZrxEO0S8dWxmJgJGqnRIOQptTS5U+0OPBZT8H220Qp/1q0pQHPw6iLXNrkc
 w1a2W1Bge012gjJt7gCMkdDnZb76sKiyGuMbFME7DoRbLCQeaOtoSfmg7NoRI2gp
 74TCSr7dPWZUVUy5Tmsy0DCv0552vIbnlQ69W6Xwx8YkplM3FPiMpWrQ5JWEHdvv
 Zl84mLP6Yyo54JVuk9zi8q/2L0HfyfMDj4UM/mNs8hwmcUSbPO2TKdIWDaq8xPuS
 Ed+D+kg6XFux8tLnCSDLNbaD5JE+ak9gTVhNdRa/zFE04o/OeidscKEqRSYTkdXL
 2p34qtw5kEQocO4Pa3eUGO6KJCDTR36Rms5p6ZFybL4O2oZYrAbRi1TGDxaG2Hag
 GCr2vaFbmz1zbGuMpFhLha5B7HeDLs+PHOn+B1iUNjEr9RC0EOHV7moJKqjxlnCh
 4mBkK/Nlyg==
 =kSeX
 -----END PGP SIGNATURE-----

Merge tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:
 "Mostly items that came in a bit late for the initial pull request,
  wanted to make sure they had the appropriate amount of linux-next soak
  before going upstream.

  Outside of stragglers, just generic fixes for either merge window
  items, or longer standing bugs"

* tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux: (25 commits)
  md/raid0: add discard support for the 'original' layout
  nvme: disable controller on reset state failure
  nvme: sync timeout work on failed reset
  nvme: ensure unquiesce on teardown
  cdrom/gdrom: Fix build error
  nvme: improved uring polling
  block: add request polling helper
  nvme-mpath: fix I/O failure with EAGAIN when failing over I/O
  nvme: host: fix command name spelling
  blk-sysfs: add a new attr_group for blk_mq
  blk-iocost: move wbt_enable/disable_default() out of spinlock
  blk-wbt: cleanup rwb_enabled() and wbt_disabled()
  blk-wbt: remove dead code to handle wbt enable/disable with io inflight
  blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled
  blk-mq: fix two misuses on RQF_USE_SCHED
  blk-throttle: Fix io statistics for cgroup v1
  bcache: Fix bcache device claiming
  bcache: Alloc holder object before async registration
  raid10: avoid spin_lock from fastpath from raid10_unplug()
  md: fix 'delete_mutex' deadlock
  ...
2023-07-03 18:48:38 -07:00
Linus Torvalds
ca7ce08d6a SCSI misc on 20230629
Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
 lpfc, qla2xxx).  We have a couple of major core changes impacting
 other systems: Command Duration Limits, which spills into block and
 ATA and block level Persistent Reservation Operations, which touches
 block, nvme, target and dm (both of which are added with merge commits
 containing a cover letter explaining what's going on).
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCZJ19cSYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishfZpAQCQBuWR
 ELcOhsaG5KzO6xLWcH8mjsOoxffKvazZjTKXlAD5ATEv7++E250oKS3t+yfjae5I
 Lc195MlDju85ItUQgfk=
 =U9ik
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
  lpfc, qla2xxx).

  We have a couple of major core changes impacting other systems:

   - Command Duration Limits, which spills into block and ATA

   - block level Persistent Reservation Operations, which touches block,
     nvme, target and dm

  Both of these are added with merge commits containing a cover letter
  explaining what's going on"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits)
  scsi: core: Improve warning message in scsi_device_block()
  scsi: core: Replace scsi_target_block() with scsi_block_targets()
  scsi: core: Don't wait for quiesce in scsi_device_block()
  scsi: core: Don't wait for quiesce in scsi_stop_queue()
  scsi: core: Merge scsi_internal_device_block() and device_block()
  scsi: sg: Increase number of devices
  scsi: bsg: Increase number of devices
  scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue
  scsi: ufs: ufs-pci: Add support for Intel Arrow Lake
  scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT
  scsi: ufs: wb: Add explicit flush_threshold sysfs attribute
  scsi: ufs: ufs-qcom: Switch to the new ICE API
  scsi: ufs: dt-bindings: qcom: Add ICE phandle
  scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk
  scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk
  scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC
  scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR
  scsi: ufs: core: Remove dedicated hwq for dev command
  scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command
  scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes
  ...
2023-06-30 11:57:07 -07:00
Keith Busch
f6c80cffcd block: add request polling helper
Provide a direct request polling will for drivers. The interface does
not require a bio, and can skip the overhead associated with polling
those. The biggest gain from skipping the relatively expensive xarray
lookup unnecessary when you already have the request.

With this, the simple rq/qc conversion functions have only one caller
each, so open code this and remove the helpers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230612190343.2087040-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28 16:09:41 -06:00
Jens Axboe
3a08284ff2 Merge branch 'for-6.5/block-late' into block-6.5
* for-6.5/block-late:
  blk-sysfs: add a new attr_group for blk_mq
  blk-iocost: move wbt_enable/disable_default() out of spinlock
  blk-wbt: cleanup rwb_enabled() and wbt_disabled()
  blk-wbt: remove dead code to handle wbt enable/disable with io inflight
  blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled
  blk-mq: fix two misuses on RQF_USE_SCHED
  blk-throttle: Fix io statistics for cgroup v1
  bcache: Fix bcache device claiming
  bcache: Alloc holder object before async registration
  raid10: avoid spin_lock from fastpath from raid10_unplug()
  md: fix 'delete_mutex' deadlock
  md: use mddev->external to select holder in export_rdev()
  md/raid1-10: fix casting from randomized structure in raid1_submit_write()
  md/raid10: fix the condition to call bio_end_io_acct()
2023-06-28 16:08:19 -06:00
Linus Torvalds
6e17c6de3d - Yosry Ahmed brought back some cgroup v1 stats in OOM logs.
- Yosry has also eliminated cgroup's atomic rstat flushing.
 
 - Nhat Pham adds the new cachestat() syscall.  It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability.
 
 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning.
 
 - Lorenzo Stoakes has done some maintanance work on the get_user_pages()
   interface.
 
 - Liam Howlett continues with cleanups and maintenance work to the maple
   tree code.  Peng Zhang also does some work on maple tree.
 
 - Johannes Weiner has done some cleanup work on the compaction code.
 
 - David Hildenbrand has contributed additional selftests for
   get_user_pages().
 
 - Thomas Gleixner has contributed some maintenance and optimization work
   for the vmalloc code.
 
 - Baolin Wang has provided some compaction cleanups,
 
 - SeongJae Park continues maintenance work on the DAMON code.
 
 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting.
 
 - Christoph Hellwig has some cleanups for the filemap/directio code.
 
 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the provided
   APIs rather than open-coding accesses.
 
 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings.
 
 - John Hubbard has a series of fixes to the MM selftesting code.
 
 - ZhangPeng continues the folio conversion campaign.
 
 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock.
 
 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from
   128 to 8.
 
 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management.
 
 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code.
 
 - Vishal Moola also has done some folio conversion work.
 
 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZJejewAKCRDdBJ7gKXxA
 joggAPwKMfT9lvDBEUnJagY7dbDPky1cSYZdJKxxM2cApGa42gEA6Cl8HRAWqSOh
 J0qXCzqaaN8+BuEyLGDVPaXur9KirwY=
 =B7yQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

 - Yosry Ahmed brought back some cgroup v1 stats in OOM logs

 - Yosry has also eliminated cgroup's atomic rstat flushing

 - Nhat Pham adds the new cachestat() syscall. It provides userspace
   with the ability to query pagecache status - a similar concept to
   mincore() but more powerful and with improved usability

 - Mel Gorman provides more optimizations for compaction, reducing the
   prevalence of page rescanning

 - Lorenzo Stoakes has done some maintanance work on the
   get_user_pages() interface

 - Liam Howlett continues with cleanups and maintenance work to the
   maple tree code. Peng Zhang also does some work on maple tree

 - Johannes Weiner has done some cleanup work on the compaction code

 - David Hildenbrand has contributed additional selftests for
   get_user_pages()

 - Thomas Gleixner has contributed some maintenance and optimization
   work for the vmalloc code

 - Baolin Wang has provided some compaction cleanups,

 - SeongJae Park continues maintenance work on the DAMON code

 - Huang Ying has done some maintenance on the swap code's usage of
   device refcounting

 - Christoph Hellwig has some cleanups for the filemap/directio code

 - Ryan Roberts provides two patch series which yield some
   rationalization of the kernel's access to pte entries - use the
   provided APIs rather than open-coding accesses

 - Lorenzo Stoakes has some fixes to the interaction between pagecache
   and directio access to file mappings

 - John Hubbard has a series of fixes to the MM selftesting code

 - ZhangPeng continues the folio conversion campaign

 - Hugh Dickins has been working on the pagetable handling code, mainly
   with a view to reducing the load on the mmap_lock

 - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
   from 128 to 8

 - Domenico Cerasuolo has improved the zswap reclaim mechanism by
   reorganizing the LRU management

 - Matthew Wilcox provides some fixups to make gfs2 work better with the
   buffer_head code

 - Vishal Moola also has done some folio conversion work

 - Matthew Wilcox has removed the remnants of the pagevec code - their
   functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-28 10:28:11 -07:00
Linus Torvalds
a0433f8cae for-6.5/block-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8dwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpilGD/9Yys1oxIXJpRf00fzrylAlBthRxMjFQVWw
 zAut106hAQiBHvU8IkmGA3MvEFVHxtzwYhHI7IR8K3aZBIqscweCqmVI9JyogJw9
 U9Twnzel47VmuKdM94FeoN+hbj1fP8EWTjzmy67/zEEfFCdmHvNlMi3lSrGYIpFy
 39LxTB99Y4UarM5PtWbes37GYYljzMSWKuo4AfBkvq1eQa+sZ0Vq2xAABKq3UM7f
 apqhgHtkJooRePDP0eQp+kAyyVMgW2jIK+oIdJDxNF3CKTu2w40RzaYz6fp+jVSU
 H4R/xS59GW4/xql+VBJDh/qJg9K62DPPYjlW8BmSR8+IjvfFpsyH3/MacE50CD3P
 20fs/Mnj49H79fDrQEHJI53cOOb2EmUitbwLbvOcColNTPpt8loBtdQxjF2RMU8R
 Nyort9DJPFclYCxky1LYg1CNEC2Ln4Zy/jD47wPvqRmOQphOoVlV/hPnOEqvjaZC
 49Vn70W2DeE9cXvYI7ha+XIg6/oj+Gs3iusEbV08Ci7EAtXgI+ZUUsQ97K8UNiUh
 h2lqSJtuI7lBpYP9sf+BeCch5UCC+xGYyTdoM5f58lehWBBPtbs0g7S9RyRyOYxe
 n+yxEUo3dAGzJ/xsKAjinbZfeWIpr0b1TkAh4w3Cq/BKzRr9Bp8lBAxYuancbQ+Y
 1ADPteUOTA==
 =zP4Y
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull request via Keith:
      - Various cleanups all around (Irvin, Chaitanya, Christophe)
      - Better struct packing (Christophe JAILLET)
      - Reduce controller error logs for optional commands (Keith)
      - Support for >=64KiB block sizes (Daniel Gomez)
      - Fabrics fixes and code organization (Max, Chaitanya, Daniel
        Wagner)

 - bcache updates via Coly:
      - Fix a race at init time (Mingzhe Zou)
      - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)

 - use page pinning in the block layer for dio (David)

 - convert old block dio code to page pinning (David, Christoph)

 - cleanups for pktcdvd (Andy)

 - cleanups for rnbd (Guoqing)

 - use the unchecked __bio_add_page() for the initial single page
   additions (Johannes)

 - fix overflows in the Amiga partition handling code (Michael)

 - improve mq-deadline zoned device support (Bart)

 - keep passthrough requests out of the IO schedulers (Christoph, Ming)

 - improve support for flush requests, making them less special to deal
   with (Christoph)

 - add bdev holder ops and shutdown methods (Christoph)

 - fix the name_to_dev_t() situation and use cases (Christoph)

 - decouple the block open flags from fmode_t (Christoph)

 - ublk updates and cleanups, including adding user copy support (Ming)

 - BFQ sanity checking (Bart)

 - convert brd from radix to xarray (Pankaj)

 - constify various structures (Thomas, Ivan)

 - more fine grained persistent reservation ioctl capability checks
   (Jingbo)

 - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
   Jordy, Li, Min, Yu, Zhong, Waiman)

* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
  scsi/sg: don't grab scsi host module reference
  ext4: Fix warning in blkdev_put()
  block: don't return -EINVAL for not found names in devt_from_devname
  cdrom: Fix spectre-v1 gadget
  block: Improve kernel-doc headers
  blk-mq: don't insert passthrough request into sw queue
  bsg: make bsg_class a static const structure
  ublk: make ublk_chr_class a static const structure
  aoe: make aoe_class a static const structure
  block/rnbd: make all 'class' structures const
  block: fix the exclusive open mask in disk_scan_partitions
  block: add overflow checks for Amiga partition support
  block: change all __u32 annotations to __be32 in affs_hardblocks.h
  block: fix signed int overflow in Amiga partition support
  block: add capacity validation in bdev_add_partition()
  block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
  block: disallow Persistent Reservation on partitions
  reiserfs: fix blkdev_put() warning from release_journal_dev()
  block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
  block: document the holder argument to blkdev_get_by_path
  ...
2023-06-26 12:47:20 -07:00
Linus Torvalds
0aa69d53ac for-6.5/io_uring-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8cEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnvZD/0QWstFCe1CSLWaycdC9fhWftFt3hyEIST5
 CYEL56UZrDWNkv9xTLe855xvMavjd4sdHlUa8NUPghRQeJyKYgRxHBLXRWmy0uNN
 l47Zjiwsolmbr3Nt6qViLdCDYmG39ZGNwWo8b6p3ybWYLtzxeOblocOBTPzoCtkS
 hjo7Z0eMONvsvLX+l0o9IDdWtZIQ2fGo4VYkMIVb6CyxRPpuUuPKbE25qaTx+uBg
 Fy6Qa3SlaTwzqcg3dttggjIP792L/eETUCWGndg5pJNbrkj/fI4Vm4bEljID76eS
 HODl+pWHmyM6avVkypr7N3Tp5HKF0OTUa4vJTLIZo1QiRu6zlXphtuvGn+McEmgV
 hbYmQMYWzqJ22k2iEpCR58pdhmZJC9uB8r4Rwgr/t9GKqt4E+15EzmqkG9cUVMGV
 rfbBwVLwBUd5+0WHwQ8RzdtaUPt17vSIW/8WhU5zoMGVotqVBHO/H+5BtmKPWWpq
 fx1etQ8XJVPIxziJvgsEitb1s6KZzJspcONDlLEitmZkflv3gGdVm99KNbXwJpcp
 m6+FcYQ5d5FivfLPGgpx8go+4M2QuoW2yRGwZHu54buCnpxgNjIk898OjrUrdXCg
 3/0m99GXmOWQQl0VrrTr+Fv99nVsQ2hMQzOFJGMYRtHEEc5xiTcJiZmoxmF7T7/C
 TipyW3czsw==
 =5Me8
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:
 "Nothing major in this release, just a bunch of cleanups and some
  optimizations around networking mostly.

   - clean up file request flags handling (Christoph)

   - clean up request freeing and CQ locking (Pavel)

   - support for using pre-registering the io_uring fd at setup time
     (Josh)

   - Add support for user allocated ring memory, rather than having the
     kernel allocate it. Mostly for packing rings into a huge page (me)

   - avoid an unnecessary double retry on receive (me)

   - maintain ordering for task_work, which also improves performance
     (me)

   - misc cleanups/fixes (Pavel, me)"

* tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits)
  io_uring: merge conditional unlock flush helpers
  io_uring: make io_cq_unlock_post static
  io_uring: inline __io_cq_unlock
  io_uring: fix acquire/release annotations
  io_uring: kill io_cq_unlock()
  io_uring: remove IOU_F_TWQ_FORCE_NORMAL
  io_uring: don't batch task put on reqs free
  io_uring: move io_clean_op()
  io_uring: inline io_dismantle_req()
  io_uring: remove io_free_req_tw
  io_uring: open code io_put_req_find_next
  io_uring: add helpers to decode the fixed file file_ptr
  io_uring: use io_file_from_index in io_msg_grab_file
  io_uring: use io_file_from_index in __io_sync_cancel
  io_uring: return REQ_F_ flags from io_file_get_flags
  io_uring: remove io_req_ffs_set
  io_uring: remove a confusing comment above io_file_get_flags
  io_uring: remove the mode variable in io_file_get_flags
  io_uring: remove __io_file_supports_nowait
  io_uring: wait interruptibly for request completions on exit
  ...
2023-06-26 12:30:26 -07:00
Linus Torvalds
3eccc0c886 for-6.5/splice-2023-06-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSV8QgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpupIEADKEZvpxDyaxHjYZFFeoSJRkh+AEJHe0Xtr
 J5vUL8t8zmAV3F7i8XaoAEcR0dC0VQcoTc8fAOty71+5hsc7gvtyyNjqU/YWRVqK
 Xr+VJuSJ+OGx3MzpRWEkepagfPyqP5cyyCOK6gqIgqzc3IwqkR/3QHVRc6oR8YbY
 AQd7tqm2fQXK9WDHEy5hcaQeqb9uKZjQQoZejpPPerpJM+9RMgKxpCGtnLLIUhr/
 sgl7KyLIQPBmveO2vfOR+dmsJBqsLqneqkXDKMAIfpeVEEkHHAlCH4E5Ne1XUS+s
 ie4If+reuyn1Ktt5Ry1t7w2wr8cX1fcay3K28tgwjE2Bvremc5YnYgb3pyUDW38f
 tXXkpg/eTXd/Pn0Crpagoa9zJ927tt5JXIO1/PagPEP1XOqUuthshDFsrVqfqbs+
 36gqX2JWB4NJTg9B9KBHA3+iVCJyZLjUqOqws7hOJOvhQytZVm/IwkGBg1Slhe1a
 J5WemBlqX8lTgXz0nM7cOhPYTZeKe6hazCcb5VwxTUTj9SGyYtsMfqqTwRJO9kiF
 j1VzbOAgExDYe+GvfqOFPh9VqZho66+DyOD/Xtca4eH7oYyHSmP66o8nhRyPBPZA
 maBxQhUkPQn4/V/0fL2TwIdWYKsbj8bUyINKPZ2L35YfeICiaYIctTwNJxtRmItB
 M3VxWD3GZQ==
 =KhW4
 -----END PGP SIGNATURE-----

Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux

Pull splice updates from Jens Axboe:
 "This kills off ITER_PIPE to avoid a race between truncate,
  iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
  with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
  memory corruption.

  Instead, we either use (a) filemap_splice_read(), which invokes the
  buffered file reading code and splices from the pagecache into the
  pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
  into it and then pushes the filled pages into the pipe; or (c) handle
  it in filesystem-specific code.

  Summary:

   - Rename direct_splice_read() to copy_splice_read()

   - Simplify the calculations for the number of pages to be reclaimed
     in copy_splice_read()

   - Turn do_splice_to() into a helper, vfs_splice_read(), so that it
     can be used by overlayfs and coda to perform the checks on the
     lower fs

   - Make vfs_splice_read() jump to copy_splice_read() to handle
     direct-I/O and DAX

   - Provide shmem with its own splice_read to handle non-existent pages
     in the pagecache. We don't want a ->read_folio() as we don't want
     to populate holes, but filemap_get_pages() requires it

   - Provide overlayfs with its own splice_read to call down to a lower
     layer as overlayfs doesn't provide ->read_folio()

   - Provide coda with its own splice_read to call down to a lower layer
     as coda doesn't provide ->read_folio()

   - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
     and random files as they just copy to the output buffer and don't
     splice pages

   - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
     ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation

   - Make cifs use filemap_splice_read()

   - Replace pointers to generic_file_splice_read() with pointers to
     filemap_splice_read() as DIO and DAX are handled in the caller;
     filesystems can still provide their own alternate ->splice_read()
     op

   - Remove generic_file_splice_read()

   - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
     was the only user"

* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
  splice: kdoc for filemap_splice_read() and copy_splice_read()
  iov_iter: Kill ITER_PIPE
  splice: Remove generic_file_splice_read()
  splice: Use filemap_splice_read() instead of generic_file_splice_read()
  cifs: Use filemap_splice_read()
  trace: Convert trace/seq to use copy_splice_read()
  zonefs: Provide a splice-read wrapper
  xfs: Provide a splice-read wrapper
  orangefs: Provide a splice-read wrapper
  ocfs2: Provide a splice-read wrapper
  ntfs3: Provide a splice-read wrapper
  nfs: Provide a splice-read wrapper
  f2fs: Provide a splice-read wrapper
  ext4: Provide a splice-read wrapper
  ecryptfs: Provide a splice-read wrapper
  ceph: Provide a splice-read wrapper
  afs: Provide a splice-read wrapper
  9p: Add splice_read wrapper
  net: Make sock_splice_read() use copy_splice_read() by default
  tty, proc, kernfs, random: Use copy_splice_read()
  ...
2023-06-26 11:52:12 -07:00
Yu Kuai
6d85ebf95c blk-sysfs: add a new attr_group for blk_mq
Currently wbt sysfs entry is created for bio based device, and wbt can
be enabled for such device through sysfs while it doesn't make sense
because wbt can only work for rq based device. In the meantime, there
are other similar sysfs entries.

Fix this by adding a new attr_group for blk_mq, and sysfs entries will
only be created when the device is rq based.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230527010644.647900-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-26 09:53:36 -06:00
Yu Kuai
eebc21d12f blk-iocost: move wbt_enable/disable_default() out of spinlock
There are following smatch warning:

block/blk-wbt.c:843 wbt_init() warn: sleeping in atomic context
ioc_qos_write() <- disables preempt
-> wbt_enable_default()
   -> wbt_init()

wbt_init() will be called from wbt_enable_default() if wbt is not
initialized, currently this is only possible in blk_register_queue(), hence
wbt_init() will never be called from iocost and this warning is false
positive.

However, we might support rq_qos destruction dynamically in the future,
and it's better to prevent that, hence move wbt_enable_default() outside
'ioc->lock'. This is safe because queue is still freezed.

Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/lkml/Y+Ja5SRs886CEz7a@kadam/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230527010644.647900-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-26 09:53:36 -06:00
Yu Kuai
06257fda83 blk-wbt: cleanup rwb_enabled() and wbt_disabled()
'wb_normal' will set to 0 if 'min_lat_nsec' is 0, and 'min_lat_nsec' can
only be set to 0 through sysfs configuration where 'WBT_STATE_OFF_MANUAL'
is set together, in the meantime, they can only be cleared together
through sysfs afterwards. Hence 'wb_normal != 0' is the same as
'rwb->enable_state != WBT_STATE_OFF_MANUAL'.

The code is redundan, hence replace the checking of 'wb_normal' to
'enable_state' in rwb_enabled() and reuse rwb_enabled() for
wbt_disabled().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230527010644.647900-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-26 09:53:36 -06:00
Yu Kuai
71b8642e79 blk-wbt: remove dead code to handle wbt enable/disable with io inflight
enable or disable wbt is always called with queue freezed, so that wbt
can never be enabled or disabled while io is still inflight, and this
behaviour should always hold to avoid io hang(There have been reported
several times).

Therefor, the code to handle wbt enable/diskble with io inflight is not
and never will be used, hence remove such dead code.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230527010644.647900-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-26 09:53:36 -06:00
Yu Kuai
645a829e03 blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled
sysfs entry /sys/block/[device]/queue/wbt_lat_usec will be created even
if CONFIG_BLK_WBT is disabled, while read and write will always fail.
It doesn't make sense to create a sysfs entry that can't be accessed,
so don't create such entry.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230527010644.647900-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-26 09:53:36 -06:00
Ming Lei
c6b7a3a26e blk-mq: fix two misuses on RQF_USE_SCHED
Request allocated from sched tags can't be issued via ->queue_rqs()
directly, since driver tag isn't allocated yet. This is the 1st misuse
of RQF_USE_SCHED for figuring out plug->has_elevator.

Request allocated from sched tags can't be ended by
blk_mq_end_request_batch() too, fix the 2nd RQF_USE_SCHED misuse
in blk_mq_add_to_batch().

Without this patch, NVMe uring cmd passthrough IO workload can run into
hang easily with real io scheduler.

Fixes: dd6216bb16 ("blk-mq: make sure elevator callbacks aren't called for passthrough request")
Reported-by: Guangwu Zhang <guazhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAGS2=YrBjpLPOKa-gzcKuuOG60AGth5794PNCDwatdnnscB9ug@mail.gmail.com/
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230624130105.1443879-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-25 08:01:05 -06:00
Jinke Han
ad7c3b41e8 blk-throttle: Fix io statistics for cgroup v1
After commit f382fb0bce ("block: remove legacy IO schedulers"),
blkio.throttle.io_serviced and blkio.throttle.io_service_bytes become
the only stable io stats interface of cgroup v1, and these statistics
are done in the blk-throttle code. But the current code only counts the
bios that are actually throttled. When the user does not add the throttle
limit, the io stats for cgroup v1 has nothing. I fix it according to the
statistical method of v2, and made it count all ios accurately.

Fixes: a7b36ee6ba ("block: move blk-throtl fast path inline")
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230507170631.89607-1-hanjinke.666@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-25 08:00:39 -06:00
Christoph Hellwig
648fa60fa7 block: don't return -EINVAL for not found names in devt_from_devname
When we didn't find a device and didn't guess it might be a partition,
it might still show up later, so don't disable rootwait for it by
returning -EINVAL.

Fixes: 079caa35f7 ("init: clear root_wait on all invalid root= strings")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230622150644.600327-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-22 09:09:33 -06:00
Ming Lei
9c39b7a905 block: make sure local irq is disabled when calling __blkcg_rstat_flush
When __blkcg_rstat_flush() is called from cgroup_rstat_flush*() code
path, interrupt is always disabled.

When we start to flush blkcg per-cpu stats list in __blkg_release()
for avoiding to leak blkcg_gq's reference in commit 20cb1c2fb7
("blk-cgroup: Flush stats before releasing blkcg_gq"), local irq
isn't disabled yet, then lockdep warning may be triggered because
the dependent cgroup locks may be acquired from irq(soft irq) handler.

Fix the issue by disabling local irq always.

Fixes: 20cb1c2fb7 ("blk-cgroup: Flush stats before releasing blkcg_gq")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/pz2wzwnmn5tk3pwpskmjhli6g3qly7eoknilb26of376c7kwxy@qydzpvt6zpis/T/#u
Cc: stable@vger.kernel.org
Cc: Jay Shin <jaeshin@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20230622084249.1208005-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-22 07:44:00 -06:00
Bart Van Assche
017fb83ee0 block: Improve kernel-doc headers
Fix the documentation of the devt_from_partuuid() return value.

Fix the following two recently introduced kernel-doc warnings:

block/bdev.c:570: warning: Function parameter or member 'hops' not described in 'bd_finish_claiming'
block/early-lookup.c:46: warning: Function parameter or member 'devt' not described in 'devt_from_partuuid'

Cc: Christoph Hellwig <hch@lst.de>
Fixes: 0718afd47f ("block: introduce holder ops")
Fixes: cf056a4312 ("init: improve the name_to_dev_t interface")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230621165054.743815-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 13:18:15 -06:00
Ming Lei
2293cae703 blk-mq: don't insert passthrough request into sw queue
In case of real io scheduler, q->elevator is set, so blk_mq_run_hw_queue()
may just check if scheduler queue has request to dispatch, see
__blk_mq_sched_dispatch_requests(). Then IO hang may be caused because
all passthorugh requests may stay in sw queue.

And any passthrough request should have been inserted to hctx->dispatch
always.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Fixes: d97217e7f0 ("blk-mq: don't queue plugged passthrough requests into scheduler")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230621132208.1142318-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:48:34 -06:00
Ivan Orlov
72ef02b8df bsg: make bsg_class a static const structure
Now that the driver core allows for struct class to be in read-only
memory, move the bsg_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.

Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-scsi@vger.kernel.org
Cc: linux-block@vger.kernel.org
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230620180129.645646-8-gregkh@linuxfoundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:46:17 -06:00
Christoph Hellwig
56e71bdf32 block: fix the exclusive open mask in disk_scan_partitions
FMODE_EXEC has nothing to do with exclusive opens, and even is of
the wrong type.  We need to check for BLK_OPEN_EXCL here.

Fixes: 985958b858 ("block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230621124914.185992-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21 07:37:52 -06:00
Michael Schmitz
b6f3f28f60 block: add overflow checks for Amiga partition support
The Amiga partition parser module uses signed int for partition sector
address and count, which will overflow for disks larger than 1 TB.

Use u64 as type for sector address and size to allow using disks up to
2 TB without LBD support, and disks larger than 2 TB with LBD. The RBD
format allows to specify disk sizes up to 2^128 bytes (though native
OS limitations reduce this somewhat, to max 2^68 bytes), so check for
u64 overflow carefully to protect against overflowing sector_t.

Bail out if sector addresses overflow 32 bits on kernels without LBD
support.

This bug was reported originally in 2012, and the fix was created by
the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been
discussed and reviewed on linux-m68k at that time but never officially
submitted (now resubmitted as patch 1 in this series).
This patch adds additional error checking and warning messages.

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Message-ID: <201206192146.09327.Martin@lichtvoll.de>
Cc: <stable@vger.kernel.org> # 5.2
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/r/20230620201725.7020-4-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 14:28:17 -06:00
Michael Schmitz
fc3d092c6b block: fix signed int overflow in Amiga partition support
The Amiga partition parser module uses signed int for partition sector
address and count, which will overflow for disks larger than 1 TB.

Use sector_t as type for sector address and size to allow using disks
up to 2 TB without LBD support, and disks larger than 2 TB with LBD.

This bug was reported originally in 2012, and the fix was created by
the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been
discussed and reviewed on linux-m68k at that time but never officially
submitted. This patch differs from Joanne's patch only in its use of
sector_t instead of unsigned int. No checking for overflows is done
(see patch 3 of this series for that).

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Message-ID: <201206192146.09327.Martin@lichtvoll.de>
Cc: <stable@vger.kernel.org> # 5.2
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Tested-by: Martin Steigerwald <Martin@lichtvoll.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620201725.7020-2-schmitzmic@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 14:28:17 -06:00
Min Li
6d4e80db4e block: add capacity validation in bdev_add_partition()
In the function bdev_add_partition(),there is no check that the start
and end sectors exceed the size of the disk before calling add_partition.
When we call the block's ioctl interface directly to add a partition,
and the capacity of the disk is set to 0 by driver,the command will
continue to execute.

Signed-off-by: Min Li <min15.li@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230619091214.31615-1-min15.li@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 12:50:51 -06:00
Jingbo Xu
9a72a02456 block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
Allow of unprivileged Persistent Reservation operations on devices
if the write permission check on the device node has passed.

brw-rw---- 1 root disk 259, 0 Jun 13 07:09 /dev/nvme0n1

In the example above, the "disk" group of nvme0n1 is also allowed to
make reservations on the device even without CAP_SYS_ADMIN.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230613084008.93795-3-jefflexu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 12:49:23 -06:00
Jingbo Xu
1262962166 block: disallow Persistent Reservation on partitions
Refuse Persistent Reservation operations on partitions as reservation
on partitions doesn't make sense.

Besides, introduce blkdev_pr_allowed() helper, where more policies could
be placed here later.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230613084008.93795-2-jefflexu@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 12:49:23 -06:00
Yu Kuai
985958b858 block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
After commit 2736e8eeb0 ("block: use the holder as indication for
exclusive opens"), blkdev_get_by_dev() will warn if holder is NULL and
mode contains 'FMODE_EXCL'.

holder from blkdev_get_by_dev() from disk_scan_partitions() is always NULL,
hence it should not use 'FMODE_EXCL', which is broben by the commit. For
consequence, WARN_ON_ONCE() will be triggered from blkdev_get_by_dev()
if user scan partitions with device opened exclusively.

Fix this problem by removing 'FMODE_EXCL' from disk_scan_partitions(),
as it used to be.

Reported-by: syzbot+00cd27751f78817f167b@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=00cd27751f78817f167b
Fixes: 2736e8eeb0 ("block: use the holder as indication for exclusive opens")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230618140402.7556-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:17:46 -06:00
Christoph Hellwig
e89e001f24 block: document the holder argument to blkdev_get_by_path
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230620043536.707249-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:16:38 -06:00
Demi Marie Obenour
b90ecc0379 block: increment diskseq on all media change events
Currently, associating a loop device with a different file descriptor
does not increment its diskseq.  This allows the following race
condition:

1. Program X opens a loop device
2. Program X gets the diskseq of the loop device.
3. Program X associates a file with the loop device.
4. Program X passes the loop device major, minor, and diskseq to
   something.
5. Program X exits.
6. Program Y detaches the file from the loop device.
7. Program Y attaches a different file to the loop device.
8. The opener finally gets around to opening the loop device and checks
   that the diskseq is what it expects it to be.  Even though the
   diskseq is the expected value, the result is that the opener is
   accessing the wrong file.

From discussions with Christoph Hellwig, it appears that
disk_force_media_change() was supposed to call inc_diskseq(), but in
fact it does not.  Adding a Fixes: tag to indicate this.  Christoph's
Reported-by is because he stated that disk_force_media_change()
calls inc_diskseq(), which is what led me to discover that it should but
does not.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Fixes: e6138dc12d ("block: add a helper to raise a media changed event")
Cc: stable@vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230607170837.1559-1-demi@invisiblethingslab.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-20 07:16:24 -06:00
Martin K. Petersen
af92c02fb2 Merge patch series "scsi: fixes for targets with many LUNs, and scsi_target_block rework"
Martin Wilck <mwilck@suse.com> says:

This patch series addresses some issues we saw in a test setup with a
large number of SCSI LUNs. The first two patches simply increase the
number of available sg and bsg devices. 3-5 fix a large delay we
encountered between blocking a Fibre Channel remote port and the
dev_loss_tmo. 6 renames scsi_target_block() to scsi_block_targets(),
and makes additional changes to this API, as suggested in the review
of the v2 series. 7 improves a warning message.

Link: https://lore.kernel.org/r/20230614103616.31857-1-mwilck@suse.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-06-16 12:21:04 -04:00
Hannes Reinecke
9077fb2ab7 scsi: bsg: Increase number of devices
Larger setups may need to allocate more than 32k bsg devices, so increase
the number of devices to the full range of minor device numbers.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin Wilck <mwilck@suse.com>
Link: https://lore.kernel.org/r/20230614103616.31857-2-mwilck@suse.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-06-16 12:19:59 -04:00
Ming Lei
245165658e blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none
After grabbing q->sysfs_lock, q->elevator may become NULL because of
elevator switch.

Fix the NULL dereference on q->elevator by checking it with lock.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:12:25 -06:00
Christoph Hellwig
e4cc64657b block: remove BIO_PAGE_REFFED
Now that all block direct I/O helpers use page pinning, this flag is
unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16 10:08:09 -06:00
Damien Le Moal
01584c1e23 scsi: block: Improve ioprio value validity checks
The introduction of the macro IOPRIO_PRIO_LEVEL() in commit eca2040972
("scsi: block: ioprio: Clean up interface definition") results in an
iopriority level to always be masked using the macro IOPRIO_LEVEL_MASK, and
thus to the kernel always seeing an acceptable value for an I/O priority
level when checked in ioprio_check_cap().  Before this patch, this function
would return an error for some (but not all) invalid values for a level
valid range of [0..7].

Restore and improve the detection of invalid priority levels by introducing
the inline function ioprio_value() to check an ioprio class, level and hint
value before combining these fields into a single value to be used with
ioprio_set() or AIOs. If an invalid value for the class, level or hint of
an ioprio is detected, ioprio_value() returns an ioprio using the class
IOPRIO_CLASS_INVALID, indicating an invalid value and causing
ioprio_check_cap() to return -EINVAL.

Fixes: 6c91325722 ("scsi: block: Introduce ioprio hints")
Fixes: eca2040972 ("scsi: block: ioprio: Clean up interface definition")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230608095556.124001-1-dlemoal@kernel.org
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-06-16 12:04:30 -04:00
Yu Kuai
dd7de3704a block: fix blktrace debugfs entries leakage
Commit 99d055b4fd ("block: remove per-disk debugfs files in
blk_unregister_queue") moves blk_trace_shutdown() from
blk_release_queue() to blk_unregister_queue(), this is safe if blktrace
is created through sysfs, however, there is a regression in corner
case.

blktrace can still be enabled after del_gendisk() through ioctl if
the disk is opened before del_gendisk(), and if blktrace is not shutdown
through ioctl before closing the disk, debugfs entries will be leaked.

Fix this problem by shutdown blktrace in disk_release(), this is safe
because blk_trace_remove() is reentrant.

Fixes: 99d055b4fd ("block: remove per-disk debugfs files in blk_unregister_queue")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 20:24:03 -06:00
Ed Tsai
30654614f3 blk-mq: check on cpu id when there is only one ctx mapping
commit f168420c62 ("blk-mq: don't redirect completion for hctx withs
only one ctx mapping") When nvme applies a 1:1 mapping of hctx and ctx,
there will be no remote request.

But for ufs, the submission and completion queues could be asymmetric.
(e.g. Multiple SQs share one CQ) Therefore, 1:1 mapping of hctx and
ctx won't complete request on the submission cpu. In this situation,
this nr_ctx check could violate the QUEUE_FLAG_SAME_FORCE, as a result,
check on cpu id when there is only one ctx mapping.

Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com>
Suggested-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com
[axboe: fixed up indentation]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14 11:11:25 -06:00
Yu Kuai
4f1731df60 blk-mq: fix potential io hang by wrong 'wake_batch'
In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating
'wake_batch' is not atomic:

t1:			t2:
_blk_mq_tag_busy	blk_mq_tag_busy
inc active_queues
// assume 1->2
			inc active_queues
			// 2 -> 3
			blk_mq_update_wake_batch
			// calculate based on 3
blk_mq_update_wake_batch
/* calculate based on 2, while active_queues is actually 3. */

Fix this problem by protecting them wih 'tags->lock', this is not a hot
path, so performance should not be concerned. And now that all writers
are inside the lock, switch 'actives_queues' from atomic to unsigned
int.

Fixes: 180dccb0db ("blk-mq: fix tag_get wait task can't be awakened")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 09:55:53 -06:00
Christoph Hellwig
ee3249a8ce block: store the holder in file->private_data
Store the file struct used as the holder in file->private_data as an
indicator that this file descriptor was opened exclusively to  remove
the last use of FMODE_EXCL.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230608110258.189493-30-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig
4e762d8623 block: always use I_BDEV on file->f_mapping->host to find the bdev
Always use I_BDEV(file->f_mapping->host) to find the bdev for a file to
free up file->private_data for other uses.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-29-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig
05bdb99653 block: replace fmode_t with a block-specific type for block open flags
The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE.  Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:05 -06:00
Christoph Hellwig
5e4ea83467 block: remove unused fmode_t arguments from ioctl handlers
A few ioctl handlers have fmode_t arguments that are entirely unused,
remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20230608110258.189493-27-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
cfb425761c block: move a few internal definitions out of blkdev.h
All these helpers are only used in core block code, so move them out of
the public header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-26-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
1991299e49 scsi: replace the fmode_t argument to ->sg_io_fn with a simple bool
Instead of passing a fmode_t and only checking it for FMODE_WRITE, pass
a bool open_for_write to prepare for callers that won't have the fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
2736e8eeb0 block: use the holder as indication for exclusive opens
The current interface for exclusive opens is rather confusing as it
requires both the FMODE_EXCL flag and a holder.  Remove the need to pass
FMODE_EXCL and just key off the exclusive open off a non-NULL holder.

For blkdev_put this requires adding the holder argument, which provides
better debug checking that only the holder actually releases the hold,
but at the same time allows removing the now superfluous mode argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>		[btrfs]
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
7ee34cbc29 block: rename blkdev_close to blkdev_release
Make the function name match the method name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
ae220766d8 block: remove the unused mode argument to ->release
The mode argument to the ->release block_device_operation is never used,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>			[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
d32e2bf837 block: pass a gendisk to ->open
->open is only called on the whole device.  Make that explicit by
passing a gendisk instead of the block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Jack Wang <jinpu.wang@ionos.com>		[rnbd]
Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:04 -06:00
Christoph Hellwig
444aa2c58c block: pass a gendisk on bdev_check_media_change
bdev_check_media_change should only ever be called for the whole device.
Pass a gendisk to make that explicit and rename the function to
disk_check_media_change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:03 -06:00
Christoph Hellwig
9d1c92872e block: also call ->open for incremental partition opens
For whole devices ->open is called for each open, but for partitions it
is only called on the first open of a partition, e.g.:

  open("/dev/vdb", ...)
  open("/dev/vdb", ...)
    - 2 call to ->open

  open("/dev/vdb1", ...)
  open("/dev/vdb", ...)
    - 2 call to ->open

  open("/dev/vdb", ...)
  open("/dev/vdb", ...)
    - just open call to ->open

This is problematic as various block drivers look at open flags and
might not do all the required setup if the earlier open was with an
odd flag like O_NDELAY or the magic 3 ioctl-only open mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Phillip Potter <phil@philpotter.co.uk>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12 08:04:03 -06:00
Ming Lei
20cb1c2fb7 blk-cgroup: Flush stats before releasing blkcg_gq
As noted by Michal, the blkg_iostat_set's in the lockless list hold
reference to blkg's to protect against their removal. Those blkg's
hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which
is called when the blkcg reference count reaches 0. This circular
dependency will prevent blkcg and some blkgs from being freed after
they are made offline.

It is less a problem if the cgroup to be destroyed also has other
controllers like memory that will call cgroup_rstat_flush() which will
clean up the reference count. If block is the only controller that uses
rstat, these offline blkcg and blkgs may never be freed leaking more
and more memory over time.

To prevent this potential memory leak:

- flush blkcg per-cpu stats list in __blkg_release(), when no new stat
can be added

- add global blkg_stat_lock for covering concurrent parent blkg stat
update

- don't grab bio->bi_blkg reference when adding the stats into blkcg's
per-cpu stat list since all stats are guaranteed to be consumed before
releasing blkg instance, and grabbing blkg reference for stats was the
most fragile part of original patch

Based on Waiman's patch:

https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@redhat.com/

Fixes: 3b8cc62987 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Cc: stable@vger.kernel.org
Reported-by: Jay Shin <jaeshin@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: mkoutny@suse.com
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230609234249.1412858-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-11 19:49:29 -06:00
Christoph Hellwig
3c435a0fe3 filemap: add a kiocb_write_and_wait helper
Factor out a helper that does filemap_write_and_wait_range for the range
covered by a read kiocb, or returns -EAGAIN if the kiocb is marked as
nowait and there would be pages to write.

Link: https://lkml.kernel.org/r/20230601145904.1385409-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09 16:25:52 -07:00
Christoph Hellwig
bb91a7d96a block: fix rootwait= again
The previous rootwait fix added an -EINVAL return to a completely
bogus superflous branch, fix this.

Fixes: 1341c7d2cc ("block: fix rootwait=")
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Fabio Estevam <festevam@gmail.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230609051737.328930-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-09 11:38:10 -06:00
Christoph Hellwig
1341c7d2cc block: fix rootwait=
Failures to look up the gendisk must return -ENODEV so that rootwait
retries the lookup instead of -EINVAL which exits early.

Fixes: cf056a4312 ("init: improve the name_to_dev_t interface")
Reported-by: Fabio Estevam <festevam@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Fabio Estevam <festevam@gmail.com>
Link: https://lore.kernel.org/r/20230607135746.92995-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07 08:00:14 -06:00
Waiman Long
3d2af77e31 blk-cgroup: Reinit blkg_iostat_set after clearing in blkcg_reset_stats()
When blkg_alloc() is called to allocate a blkcg_gq structure
with the associated blkg_iostat_set's, there are 2 fields within
blkg_iostat_set that requires proper initialization - blkg & sync.
The former field was introduced by commit 3b8cc62987 ("blk-cgroup:
Optimize blkcg_rstat_flush()") while the later one was introduced by
commit f733164829 ("blk-cgroup: reimplement basic IO stats using
cgroup rstat").

Unfortunately those fields in the blkg_iostat_set's are not properly
re-initialized when they are cleared in v1's blkcg_reset_stats(). This
can lead to a kernel panic due to NULL pointer access of the blkg
pointer. The missing initialization of sync is less problematic and
can be a problem in a debug kernel due to missing lockdep initialization.

Fix these problems by re-initializing them after memory clearing.

Fixes: 3b8cc62987 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Fixes: f733164829 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230606180724.2455066-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07 07:51:34 -06:00
Yu Kuai
a7cfa0af0c blk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue()
Recursive spin_lock/unlock_irq() is not safe, because spin_unlock_irq()
will enable irq unconditionally:

spin_lock_irq	queue_lock	-> disable irq
spin_lock_irq	ioc->lock
spin_unlock_irq ioc->lock	-> enable irq
/*
 * AA dead lock will be triggered if current context is preempted by irq,
 * and irq try to hold queue_lock again.
 */
spin_unlock_irq queue_lock

Fix this problem by using spin_lock/unlock() directly for 'ioc->lock'.

Fixes: 5a0ac57c48 ("blk-ioc: protect ioc_destroy_icq() by 'queue_lock'")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230606011438.3743440-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07 07:51:00 -06:00
Hou Tao
ddf63516d8 blk-ioprio: Introduce promote-to-rt policy
Since commit a78418e6a0 ("block: Always initialize bio IO priority on
submit"), bio->bi_ioprio will never be IOPRIO_CLASS_NONE when calling
blkcg_set_ioprio(), so there will be no way to promote the io-priority
of one cgroup to IOPRIO_CLASS_RT, because bi_ioprio will always be
greater than or equals to IOPRIO_CLASS_RT.

It seems possible to call blkcg_set_ioprio() first then try to
initialize bi_ioprio later in bio_set_ioprio(), but this doesn't work
for bio in which bi_ioprio is already initialized (e.g., direct-io), so
introduce a new promote-to-rt policy to promote the iopriority of bio to
IOPRIO_CLASS_RT if the ioprio is not already RT.

For none-to-rt policy, although it doesn't work now, but considering
that its purpose was also to override the io-priority to RT and allowing
for a smoother transition, just keep it and treat it as an alias of
the promote-to-rt policy.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20230428074404.280532-1-houtao@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-06 22:26:26 -06:00
Li Nan
8d21155467 blk-iocost: use spin_lock_irqsave in adjust_inuse_and_calc_cost
adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled
when unlock. DEADLOCK might happen if we have held other locks and disabled
IRQ before invoking it.

Fix it by using spin_lock_irqsave() instead, which can keep IRQ state
consistent with before when unlock.

  ================================
  WARNING: inconsistent lock state
  5.10.0-02758-g8e5f91fd772f #26 Not tainted
  --------------------------------
  inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
  kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes:
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390
  {IN-HARDIRQ-W} state was registered at:
    __lock_acquire+0x3d7/0x1070
    lock_acquire+0x197/0x4a0
    __raw_spin_lock_irqsave
    _raw_spin_lock_irqsave+0x3b/0x60
    bfq_idle_slice_timer_body
    bfq_idle_slice_timer+0x53/0x1d0
    __run_hrtimer+0x477/0xa70
    __hrtimer_run_queues+0x1c6/0x2d0
    hrtimer_interrupt+0x302/0x9e0
    local_apic_timer_interrupt
    __sysvec_apic_timer_interrupt+0xfd/0x420
    run_sysvec_on_irqstack_cond
    sysvec_apic_timer_interrupt+0x46/0xa0
    asm_sysvec_apic_timer_interrupt+0x12/0x20
  irq event stamp: 837522
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50
  softirqs last  enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec
  softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&bfqd->lock);
    <Interrupt>
      lock(&bfqd->lock);

   *** DEADLOCK ***

  3 locks held by kworker/2:3/388:
   #0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: process_one_work+0x742/0x13f0
   #1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390

  stack backtrace:
  CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  Workqueue: kthrotld blk_throtl_dispatch_work_fn
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x107/0x167
   print_usage_bug
   valid_state
   mark_lock_irq.cold+0x32/0x3a
   mark_lock+0x693/0xbc0
   mark_held_locks+0x9e/0xe0
   __trace_hardirqs_on_caller
   lockdep_hardirqs_on_prepare.part.0+0x151/0x360
   trace_hardirqs_on+0x5b/0x180
   __raw_spin_unlock_irq
   _raw_spin_unlock_irq+0x24/0x40
   spin_unlock_irq
   adjust_inuse_and_calc_cost+0x4fb/0x970
   ioc_rqos_merge+0x277/0x740
   __rq_qos_merge+0x62/0xb0
   rq_qos_merge
   bio_attempt_back_merge+0x12c/0x4a0
   blk_mq_sched_try_merge+0x1b6/0x4d0
   bfq_bio_merge+0x24a/0x390
   __blk_mq_sched_bio_merge+0xa6/0x460
   blk_mq_sched_bio_merge
   blk_mq_submit_bio+0x2e7/0x1ee0
   __submit_bio_noacct_mq+0x175/0x3b0
   submit_bio_noacct+0x1fb/0x270
   blk_throtl_dispatch_work_fn+0x1ef/0x2b0
   process_one_work+0x83e/0x13f0
   process_scheduled_works
   worker_thread+0x7e3/0xd80
   kthread+0x353/0x470
   ret_from_fork+0x1f/0x30

Fixes: b0853ab4a2 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Li Nan <linan122@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 12:08:32 -06:00
Christoph Hellwig
2577f53f42 block: mark early_lookup_bdev as __init
early_lookup_bdev is now only used during the early boot code as it
should, so mark it __init to not waste run time memory on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 11:10:14 -06:00
Christoph Hellwig
7cadcaf1d8 block: move more code to early-lookup.c
blk_lookup_devt is only used by code in early-lookup.c, so move it
there.

printk_all_partitions and it's helper bdevt_str are only used by the
early init code in init/do_mounts.c, so they should go there as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:57:40 -06:00
Christoph Hellwig
702f3189e4 block: move the code to do early boot lookup of block devices to block/
Create a new block/early-lookup.c to keep the early block device lookup
code instead of having this code sit with the early mount code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:57:40 -06:00
Christoph Hellwig
f55e017c64 block: add a mark_dead holder operation
Add a mark_dead method to blk_holder_ops that is called from blk_mark_disk_dead
to notify the holder that the block device it is using has been marked dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
0718afd47f block: introduce holder ops
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims.  It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
00080f7fb7 block: remove blk_drop_partitions
There is only a single caller left, so fold the loop into that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
eec1be4c30 block: delete partitions later in del_gendisk
Delay dropping the block_devices for partitions in del_gendisk until
after the call to blk_mark_disk_dead, so that we can implementat
notification of removed devices in blk_mark_disk_dead.

This requires splitting a lower-level drop_partition helper out of
delete_partition and using that from del_gendisk, while having a
common loop for the whole device and partitions that calls
remove_inode_hash, fsync_bdev and __invalidate_device before the
call to blk_mark_disk_dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
69f90b70bd block: unhash the inode earlier in delete_partition
Move the call to remove_inode_hash to the beginning of delete_partition,
as we want to prevent opening a block_device that is about to be removed
ASAP.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
a4f75764d1 block: avoid repeated work in blk_mark_disk_dead
Check if GD_DEAD is already set in blk_mark_disk_dead, and don't
duplicate the work already done.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00
Christoph Hellwig
66fddc25fe block: consolidate the shutdown logic in blk_mark_disk_dead and del_gendisk
blk_mark_disk_dead does very similar work a a section of del_gendisk:

 - set the GD_DEAD flag
 - set the capacity to zero
 - start a queue drain

but del_gendisk also sets QUEUE_FLAG_DYING on the queue if it is owned by
the disk, sets the capacity to zero before starting the drain, and both
with sending a uevent and kernel message for this fake capacity change.

Move the exact logic from the more heavily used del_gendisk into
blk_mark_disk_dead and then call blk_mark_disk_dead from del_gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05 10:53:04 -06:00