Commit Graph

6331 Commits

Author SHA1 Message Date
Ming Lei
850fd2abbe block: cleanup q->srcu
srcu structure has to be cleanup via cleanup_srcu_struct(), so fix it.

Reported-by: syzbot+4f789823c1abc5accf13@syzkaller.appspotmail.com
Fixes: 704b914f15 ("blk-mq: move srcu from blk_mq_hw_ctx to request_queue")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220111123401.520192-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-17 07:24:45 -07:00
GuoYong Zheng
e6a2e5116e block: Remove unnecessary variable assignment
The parameter "ret" should be zero when running to this line,
no need to set to zero again, remove it.

Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1642414957-6785-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-17 07:24:04 -07:00
Yury Norov
9b51d9d866 cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
cpumask_first() is a more effective analogue of 'next' version if n == -1
(which means start == 0). This patch replaces 'next' with 'first' where
things look trivial.

There's no cpumask_first_zero() function, so create it.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2022-01-15 08:47:31 -08:00
Linus Torvalds
e1a7aa25ff SCSI misc on 20220113
This series consists of the usual driver updates (ufs, pm80xx, lpfc,
 mpi3mr, mpt3sas, hisi_sas, libsas) and minor updates and bug fixes.
 The most impactful change is likely the switch from GFP_DMA to
 GFP_KERNEL in a bunch of drivers, but even that shouldn't affect too
 many people.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYeCJZyYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishfMnAQCsERG9
 V4yX8LDpBjD7leIccf+6krJNNWaIWYYkEdxpzQD9FShB7/yDakFq3erW2y5mVqac
 dZ065M0ckE4bxk9uMIE=
 =gPHF
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This series consists of the usual driver updates (ufs, pm80xx, lpfc,
  mpi3mr, mpt3sas, hisi_sas, libsas) and minor updates and bug fixes.

  The most impactful change is likely the switch from GFP_DMA to
  GFP_KERNEL in a bunch of drivers, but even that shouldn't affect too
  many people"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (121 commits)
  scsi: mpi3mr: Bump driver version to 8.0.0.61.0
  scsi: mpi3mr: Fixes around reply request queues
  scsi: mpi3mr: Enhanced Task Management Support Reply handling
  scsi: mpi3mr: Use TM response codes from MPI3 headers
  scsi: mpi3mr: Add io_uring interface support in I/O-polled mode
  scsi: mpi3mr: Print cable mngnt and temp threshold events
  scsi: mpi3mr: Support Prepare for Reset event
  scsi: mpi3mr: Add Event acknowledgment logic
  scsi: mpi3mr: Gracefully handle online FW update operation
  scsi: mpi3mr: Detect async reset that occurred in firmware
  scsi: mpi3mr: Add IOC reinit function
  scsi: mpi3mr: Handle offline FW activation in graceful manner
  scsi: mpi3mr: Code refactor of IOC init - part2
  scsi: mpi3mr: Code refactor of IOC init - part1
  scsi: mpi3mr: Fault IOC when internal command gets timeout
  scsi: mpi3mr: Display IOC firmware package version
  scsi: mpi3mr: Handle unaligned PLL in unmap cmnds
  scsi: mpi3mr: Increase internal cmnds timeout to 60s
  scsi: mpi3mr: Do access status validation before adding devices
  scsi: mpi3mr: Add support for PCIe Managed Switch SES device
  ...
2022-01-14 14:37:34 +01:00
Laibin Qiu
180dccb0db blk-mq: fix tag_get wait task can't be awakened
In case of shared tags, there might be more than one hctx which
allocates from the same tags, and each hctx is limited to allocate at
most:
        hctx_max_depth = max((bt->sb.depth + users - 1) / users, 4U);

tag idle detection is lazy, and may be delayed for 30sec, so there
could be just one real active hctx(queue) but all others are actually
idle and still accounted as active because of the lazy idle detection.
Then if wake_batch is > hctx_max_depth, driver tag allocation may wait
forever on this real active hctx.

Fix this by recalculating wake_batch when inc or dec active_queues.

Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Suggested-by: Ming Lei <ming.lei@redhat.com>
Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20220113025536.1479653-1-qiulaibin@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-13 12:52:14 -07:00
Linus Torvalds
f079ab01b5 Convert xfs/iomap to use folios
This should be all that is needed for XFS to use large folios.
 There is no code in this pull request to create large folios, but
 no additional changes should be needed to XFS or iomap once they
 are created.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmHcpaUACgkQDpNsjXcp
 gj4MUAf+ItcKfgFo1QCMT+6Y0mohVqPme/vdyOCNv6yOOfZZqN5ZQc+2hmxXrRz9
 XPOPwZKL0TttlHSYEJmrm8mqwN8UXl0kqMu4kQqOXMziiD9qpVlaLXOZ7iLdkQxu
 z/xe1iACcGfJUaQCsaMP6BZqp6iETA4qP72dBE4jc6PC4H3OI0pN/900gEbAcLxD
 Yn0a5NhrdS/EySU2aHLB6OcwhqnSiHBVjUbFiuXxuvOYyzLaERIh00Kx3jLdj4DR
 82K4TF8h2IZpALfIDSt0JG+gHLCc+EfF7Yd/xkeEv0md3ncyi+jWvFCFPNJbyFjm
 cYoDTSunfbxwszA2n01R4JM8/KkGwA==
 =IeFX
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.17' of git://git.infradead.org/users/willy/linux

Pull iomap updates from Matthew Wilcox:
 "Convert xfs/iomap to use folios.

  This should be all that is needed for XFS to use large folios. There
  is no code in this pull request to create large folios, but no
  additional changes should be needed to XFS or iomap once they are
  created.

  Usually this would have come from Darrick, and we had intended that it
  would come that route. Between the holidays and various things which
  Darrick needed to work on, he asked if I could send things directly.

  There weren't any other iomap patches pending for this release, which
  probably also played a role"

* tag 'iomap-5.17' of git://git.infradead.org/users/willy/linux: (26 commits)
  iomap: Inline __iomap_zero_iter into its caller
  xfs: Support large folios
  iomap: Support large folios in invalidatepage
  iomap: Convert iomap_migrate_page() to use folios
  iomap: Convert iomap_add_to_ioend() to take a folio
  iomap: Simplify iomap_do_writepage()
  iomap: Simplify iomap_writepage_map()
  iomap,xfs: Convert ->discard_page to ->discard_folio
  iomap: Convert iomap_write_end_inline to take a folio
  iomap: Convert iomap_write_begin() and iomap_write_end() to folios
  iomap: Convert __iomap_zero_iter to use a folio
  iomap: Allow iomap_write_begin() to be called with the full length
  iomap: Convert iomap_page_mkwrite to use a folio
  iomap: Convert readahead and readpage to use a folio
  iomap: Convert iomap_read_inline_data to take a folio
  iomap: Use folio offsets instead of page offsets
  iomap: Convert bio completions to use folios
  iomap: Pass the iomap_page into iomap_set_range_uptodate
  iomap: Add iomap_invalidate_folio
  iomap: Convert iomap_releasepage to use a folio
  ...
2022-01-12 12:51:41 -08:00
Linus Torvalds
d3c8108035 for-5.17/block-2022-01-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmHd8DAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnhRD/wMAjsNO65PCA+o/bPpVi4ulx9EejAzrJnB
 5vHFvREAoOOGKvRpYGe4w3TcKyW+zPb+GtlXFjPfK+wuVzWhrQtW/+vkjKlBt8wK
 o7rzeMwTKJ9ZGvYaaQpp1yC0WURBB3qnCRQhb8dOQzhJgEXinhIOznZsut4mniLv
 fTqcDmKAb/+G6K6CQCCqnH0I/+OJZyUeSFo1kk2i4ZqCBepQpBkOL6H2rBOtGxUg
 bt1jiGHbbhCRYEE3u2kV0HP10qAChNaMQC705jV4Qpf4+3EntSxs+6nSb74dvMkX
 3+Wmp8Ctq6lpPnDL1nrAFGz3jZnB0Y+GdgOclQn3ViQd1FCXZzuYWQ3fTaBfURCZ
 /RE5nc047SqpwCFLOynM++OkaeQZ1zSxeyoFTtzDaPF4tLuaX3JHswvTzNGPw8SN
 BnexseNnNBCjJliZSEE7fOkjJDcev2dvRxPtI8/wkF4lHUgETc5IW563C53xo/Tx
 32yFjZwCVIpNWk21su/0H3iEq80wZ7PnriiN/E3JA6XbnevlRPu0NPMb0D258GCm
 yCcdPVDNZsQCB8hluqZcu0g6LSgZRo90Yg1oqKqEpAllJJMBaEAPPPuUIJh998mo
 iKGxZzgr7d9jrbGJTInp0F8b3B3/oV/hxgzy0Hu/mHP3AsnaAk9o/oEQZ7rX4Khr
 6biloqkIMA==
 =RWnJ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Unify where the struct request handling code is located in the blk-mq
   code (Christoph)

 - Header cleanups (Christoph)

 - Clean up the io_context handling code (Christoph, me)

 - Get rid of ->rq_disk in struct request (Christoph)

 - Error handling fix for add_disk() (Christoph)

 - request allocation cleanusp (Christoph)

 - Documentation updates (Eric, Matthew)

 - Remove trivial crypto unregister helper (Eric)

 - Reduce shared tag overhead (John)

 - Reduce poll_stats memory overhead (me)

 - Known indirect function call for dio (me)

 - Use atomic references for struct request (me)

 - Support request list issue for block and NVMe (me)

 - Improve queue dispatch pinning (Ming)

 - Improve the direct list issue code (Keith)

 - BFQ improvements (Jan)

 - Direct completion helper and use it in mmc block (Sebastian)

 - Use raw spinlock for the blktrace code (Wander)

 - fsync error handling fix (Ye)

 - Various fixes and cleanups (Lukas, Randy, Yang, Tetsuo, Ming, me)

* tag 'for-5.17/block-2022-01-11' of git://git.kernel.dk/linux-block: (132 commits)
  MAINTAINERS: add entries for block layer documentation
  docs: block: remove queue-sysfs.rst
  docs: sysfs-block: document virt_boundary_mask
  docs: sysfs-block: document stable_writes
  docs: sysfs-block: fill in missing documentation from queue-sysfs.rst
  docs: sysfs-block: add contact for nomerges
  docs: sysfs-block: sort alphabetically
  docs: sysfs-block: move to stable directory
  block: don't protect submit_bio_checks by q_usage_counter
  block: fix old-style declaration
  nvme-pci: fix queue_rqs list splitting
  block: introduce rq_list_move
  block: introduce rq_list_for_each_safe macro
  block: move rq_list macros to blk-mq.h
  block: drop needless assignment in set_task_ioprio()
  block: remove unnecessary trailing '\'
  bio.h: fix kernel-doc warnings
  block: check minor range in device_add_disk()
  block: use "unsigned long" for blk_validate_block_size().
  block: fix error unwinding in device_add_disk
  ...
2022-01-12 10:26:52 -08:00
Ming Lei
9d497e2941 block: don't protect submit_bio_checks by q_usage_counter
Commit cc9c884dd7 ("block: call submit_bio_checks under q_usage_counter")
uses q_usage_counter to protect submit_bio_checks for avoiding IO after
disk is deleted by del_gendisk().

Turns out the protection isn't necessary, because once
blk_mq_freeze_queue_wait() in del_gendisk() returns:

1) all in-flight IO has been done

2) all new IO will be failed in __bio_queue_enter() because
   q_usage_counter is dead, and GD_DEAD is set

3) both disk and request queue instance are safe since caller of
submit_bio() guarantees that the disk can't be closed.

Once submit_bio_checks() needn't the protection of q_usage_counter, we can
move submit_bio_checks before calling blk_mq_submit_bio() and
->submit_bio(). With this change, we needn't to throttle queue with
holding one allocated request, then precise driver tag or request won't be
wasted in throttling. Meantime we can unify the bio check for both bio
based and request based driver.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220104134223.590803-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-09 18:54:52 -07:00
Lukas Bulwahn
669a064625 block: drop needless assignment in set_task_ioprio()
Commit 5fc11eebb4 ("block: open code create_task_io_context in
set_task_ioprio") introduces a needless assignment
'ioc = task->io_context', as the local variable ioc is not further
used before returning.

Even after the further fix, commit a957b61254 ("block: fix error in
handling dead task for ioprio setting"), the assignment still remains
needless.

Drop this needless assignment in set_task_ioprio().

This code smell was identified with 'make clang-analyzer'.

Fixes: 5fc11eebb4 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211223125300.20691-1-lukas.bulwahn@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-23 07:10:07 -07:00
Alan Stern
6e1fcab00a scsi: block: pm: Always set request queue runtime active in blk_post_runtime_resume()
John Garry reported a deadlock that occurs when trying to access a
runtime-suspended SATA device.  For obscure reasons, the rescan procedure
causes the link to be hard-reset, which disconnects the device.

The rescan tries to carry out a runtime resume when accessing the device.
scsi_rescan_device() holds the SCSI device lock and won't release it until
it can put commands onto the device's block queue.  This can't happen until
the queue is successfully runtime-resumed or the device is unregistered.
But the runtime resume fails because the device is disconnected, and
__scsi_remove_device() can't do the unregistration because it can't get the
device lock.

The best way to resolve this deadlock appears to be to allow the block
queue to start running again even after an unsuccessful runtime resume.
The idea is that the driver or the SCSI error handler will need to be able
to use the queue to resolve the runtime resume failure.

This patch removes the err argument to blk_post_runtime_resume() and makes
the routine act as though the resume was successful always.  This fixes the
deadlock.

Link: https://lore.kernel.org/r/1639999298-244569-4-git-send-email-chenxiang66@hisilicon.com
Fixes: e27829dc92 ("scsi: serialize ->rescan against ->remove")
Reported-and-tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-12-22 23:38:29 -05:00
Tetsuo Handa
e338924bd0 block: check minor range in device_add_disk()
ioctl(fd, LOOP_CTL_ADD, 1048576) causes

  sysfs: cannot create duplicate filename '/dev/block/7:0'

message because such request is treated as if ioctl(fd, LOOP_CTL_ADD, 0)
due to MINORMASK == 1048575. Verify that all minor numbers for that device
fit in the minor range.

Reported-by: wangyangbo <wangyangbo@uniontech.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/b1b19379-23ee-5379-0eb5-94bf5f79f1b4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-21 09:34:29 -07:00
Christoph Hellwig
99d8690aae block: fix error unwinding in device_add_disk
One device_add is called disk->ev will be freed by disk_release, so we
should free it twice.  Fix this by allocating disk->ev after device_add
so that the extra local unwinding can be removed entirely.

Based on an earlier patch from Tetsuo Handa.

Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Fixes: 83cbce9574 ("block: add error handling for device_add_disk / add_disk")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211221161851.788424-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-21 09:31:51 -07:00
Ming Lei
37e11c3616 block: call blk_exit_queue() before freeing q->stats
blk_stat_disable_accounting() is added in commit 68497092bd
("block: make queue stat accounting a reference"), and called in
kyber_exit_sched().

So we have to free q->stats after elevator is unloaded from
blk_exit_queue() in blk_release_queue(). Otherwise kernel panic
is caused.

Fixes: 68497092bd ("block: make queue stat accounting a reference")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211221040436.1333880-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-20 21:07:51 -07:00
Jens Axboe
a957b61254 block: fix error in handling dead task for ioprio setting
Don't combine the task exiting and "already have io_context" case, we
need to just abort if the task is marked as dead. Return -ESRCH, which
is the documented value for ioprio_set() if the specified task could not
be found.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reported-by: syzbot+8836466a79f4175961b0@syzkaller.appspotmail.com
Fixes: 5fc11eebb4 ("block: open code create_task_io_context in set_task_ioprio")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-20 20:32:24 -07:00
Keith Busch
518579a9af blk-mq: blk-mq: check quiesce state before queue_rqs
The low level drivers don't expect to see new requests after a
successful quiesce completes. Check the queue quiesce state within the
rcu protected area prior to calling the driver's queue_rqs().

Fixes: 3c67d44de7 ("block: add mq_ops->queue_rqs hook")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211220205919.180191-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-20 14:03:59 -07:00
Linus Torvalds
2da09da4ae block-5.16-2021-12-19
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmG/Z/MQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjo7EADdal+Z26fAabpI/wHJJQQKOE9WIy3hP5kA
 G466lPP4FvW+2JIclxRGWmBfkvnzzKLjEFKg2OJZQgOvSycut217qBNxG4xpxy6O
 a0B0jKV3Dh8uLGaaOwvwcYmeBzuEnzmr+qh5hQJQDr+9Cpx491Uh7+GABOt+sceY
 +j+KNNNShcK1qUYtYHsKqSvxEAyTmKuN0uz6rOTbGr/Z2TjWXVVBjRw5g3J52U7y
 DjueqM0Y7qebCN+DexdOXZDC4h3A5/j1kCABPe3Vx3yRRWR1RJDLGDfQes/QZDpF
 sf5kkr/1iPki7Dh2laLin4686sNa4HyIJgS8wlrVXGBnRl38SGTShcbuP94m5okJ
 JsdBD38pmmBsSaokKGpxfsG1N5kf3MpBD9zc2WWeiYpwZH+Mr3cRwSvrz3jUpVqL
 MfrHoCHtgqOgIdC7Jdv3ilet3ujfp9yX9QcgIljgkxlBHZHQxX2mXcobROdWbMgz
 v+sn0Ot9+bEf4FrwSeq43fWgBGTK3IbiLajtxWozTKvy3Re+hnDnfoHFqUZE9yfe
 nFzKJPXeplKdM9Y5J75OaJQ+Ohp0F+0jcPNGHeseEWqvJhK0A5mgssM8TqrTQkAo
 GJj8DZQU8Szc0q7yO7gqVJ398sjyKEtL8Rg3qTGLmKJrNdFis+bmjtaDzyDd4HeW
 Uij+cetUuw==
 =dbu4
 -----END PGP SIGNATURE-----

Merge tag 'block-5.16-2021-12-19' of git://git.kernel.dk/linux-block

Pull block revert from Jens Axboe:
 "It turns out that the fix for not hammering on the delayed work timer
  too much caused a performance regression for BFQ, so let's revert the
  change for now.

  I've got some ideas on how to fix it appropriately, but they should
  wait for 5.17"

* tag 'block-5.16-2021-12-19' of git://git.kernel.dk/linux-block:
  Revert "block: reduce kblockd_mod_delayed_work_on() CPU consumption"
2021-12-19 12:38:53 -08:00
Jens Axboe
87959fa16c Revert "block: reduce kblockd_mod_delayed_work_on() CPU consumption"
This reverts commit cb2ac2912a.

Alex and the kernel test robot report that this causes a significant
performance regression with BFQ. I can reproduce that result, so let's
revert this one as we're close to -rc6 and we there's no point in trying
to rush a fix.

Link: https://lore.kernel.org/linux-block/1639853092.524jxfaem2.none@localhost/
Link: https://lore.kernel.org/lkml/20211219141852.GH14057@xsang-OptiPlex-9020/
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-19 07:58:44 -07:00
Linus Torvalds
fa09ca5ebc block-5.16-2021-12-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmG8wJ4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp+aD/9C6yo/8I2rbP0iJqIWKaADKoU9WGo4E4Q5
 NFPsEF17IY2mxZGfvpjMMx2fWPbgfvMuXnWPyfHnFv/00Bm3sr79eQ4Y7Ak8pZWQ
 r0YzNz1lbgHn8S+EsqWGxWbDZisptaYN2d0skT81dTE62s2dcZc7WYHBJnCW5JZb
 OTwx68e5QoHrWnoHVnCx8toUszFr0OMiQVRVwbrRtJKVh+2rQODIKz2qs2ajceHo
 IN2Bk35W1xsv6TWGLKCK4zk9k+/Nz5Cqx+FF59bKAIs3kPy3dt10PDjN30jQGR17
 DuGZe08wZ+lCMQGXd1XNcjz3GS9qWlc+mbzVpqUATNMQSGbqwW8JyheiNZwfNClw
 SzWasQD0KipdBiU0hOxBbDko2n4TVFHbnISfBwz0vHEz54HHHtltNd4iBowJuaq4
 VG/zuD2CyWaj7p9Xpymrir1xMonVNTRFRL3mkkU8feTxqTEhy2zIix9CbKdGkU5e
 rEhAAb5i1+/58A8dcCh9X+zULvXjFL6FfbnCGSMrLEq0ymyysc/+w0BPNzq93Jva
 F8qSx2nUGq0250u+Boc+wL9uB7AXWwR8addyWWlwaAZ52nqVhciWus4gh1jn8CAV
 EhO+NFbx/9K1rmPkvbJau1/fgnz4+gJS8iBC0pxlyhvt6rNzMFCJB719jcdr05tG
 JO0131r51g==
 =Huz8
 -----END PGP SIGNATURE-----

Merge tag 'block-5.16-2021-12-17' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Fix for hammering on the delayed run queue timer (me)

 - bcache regression fix for this merge window (Lin)

 - Fix a divide-by-zero in the blk-iocost code (Tejun)

* tag 'block-5.16-2021-12-17' of git://git.kernel.dk/linux-block:
  bcache: fix NULL pointer reference in cached_dev_detach_finish
  block: reduce kblockd_mod_delayed_work_on() CPU consumption
  iocost: Fix divide-by-zero on donation from low hweight cgroup
2021-12-17 11:46:07 -08:00
Matthew Wilcox (Oracle)
85f5a74c2b block: Add bio_add_folio()
This is a thin wrapper around bio_add_page().  The main advantage here
is the documentation that folios larger than 2GiB are not supported.
It's not currently possible to allocate folios that large, but if it
ever becomes possible, this function will fail gracefully instead of
doing I/O to the wrong bytes.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-16 15:49:51 -05:00
Christoph Hellwig
5ef1630586 block: only build the icq tracking code when needed
Only bfq needs to code to track icq, so make it conditional.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:02 -07:00
Christoph Hellwig
90b627f542 block: fold create_task_io_context into ioc_find_get_icq
Fold create_task_io_context into the only remaining caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:02 -07:00
Christoph Hellwig
5fc11eebb4 block: open code create_task_io_context in set_task_ioprio
The flow in set_task_ioprio can be simplified by simply open coding
create_task_io_context, which removes a refcount roundtrip on the I/O
context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:02 -07:00
Christoph Hellwig
8472161b77 block: fold get_task_io_context into set_task_ioprio
Fold get_task_io_context into its only caller, and simplify the code
as no reference to the I/O context is required to just set the ioprio
field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:02 -07:00
Christoph Hellwig
a411cd3cfd block: move set_task_ioprio to blk-ioc.c
Keep set_task_ioprio with the other low-level code that accesses the
io_context structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
Christoph Hellwig
091abcb3ef block: cleanup ioc_clear_queue
Fold __ioc_clear_queue into ioc_clear_queue and switch to always
use plain _irq locking instead of the more expensive _irqsave that
is not needed here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
Christoph Hellwig
edf70ff5a1 block: refactor put_io_context
Move the code to delay freeing the icqs into a separate helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
Christoph Hellwig
8a20c0c7e0 block: remove the NULL ioc check in put_io_context
No caller passes in a NULL pointer, so remove the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
Christoph Hellwig
4be8a2eaff block: refactor put_iocontext_active
Factor out a ioc_exit_icqs helper to tear down the icqs and the fold
the rest of put_iocontext_active into exit_io_context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
Christoph Hellwig
0aed2f162b block: simplify struct io_context refcounting
Don't hold a reference to ->refcount for each active reference, but
just one for all active references.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
Christoph Hellwig
8a2ba1785c block: remove the nr_task field from struct io_context
Nothing ever looks at ->nr_tasks, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211209063131.18537-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 10:59:01 -07:00
Jens Axboe
3c67d44de7 block: add mq_ops->queue_rqs hook
If we have a list of requests in our plug list, send it to the driver in
one go, if possible. The driver must set mq_ops->queue_rqs() to support
this, if not the usual one-by-one path is used.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 08:49:17 -07:00
Jens Axboe
fcade2ce06 block: use singly linked list for bio cache
Pointless to maintain a head/tail for the list, as we never need to
access the tail. Entries are always LIFO for cache hotness reasons.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 08:43:09 -07:00
Jens Axboe
5581a5ddfe block: add completion handler for fast path
The batched completions only deal with non-partial requests anyway,
and it doesn't deal with any requests that have errors. Add a completion
handler that assumes it's a full request and that it's all being ended
successfully.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-16 08:42:06 -07:00
Jens Axboe
cb2ac2912a block: reduce kblockd_mod_delayed_work_on() CPU consumption
Dexuan reports that he's seeing spikes of very heavy CPU utilization when
running 24 disks and using the 'none' scheduler. This happens off the
sched restart path, because SCSI requires the queue to be restarted async,
and hence we're hammering on mod_delayed_work_on() to ensure that the work
item gets run appropriately.

Avoid hammering on the timer and just use queue_work_on() if no delay
has been specified.

Reported-and-tested-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-14 20:08:05 -07:00
Jens Axboe
68497092bd block: make queue stat accounting a reference
kyber turns on IO statistics when it is loaded on a queue, which means
that even if kyber is then later unloaded, we're still stuck with stats
enabled on the queue.

Change the account enabled from a bool to an int, and pair the enable call
with the equivalent disable call. This ensures that stats gets turned off
again appropriately.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-14 17:23:05 -07:00
Tejun Heo
edaa26334c iocost: Fix divide-by-zero on donation from low hweight cgroup
The donation calculation logic assumes that the donor has non-zero
after-donation hweight, so the lowest active hweight a donating cgroup can
have is 2 so that it can donate 1 while keeping the other 1 for itself.
Earlier, we only donated from cgroups with sizable surpluses so this
condition was always true. However, with the precise donation algorithm
implemented, f1de2439ec ("blk-iocost: revamp donation amount
determination") made the donation amount calculation exact enabling even low
hweight cgroups to donate.

This means that in rare occasions, a cgroup with active hweight of 1 can
enter donation calculation triggering the following warning and then a
divide-by-zero oops.

 WARNING: CPU: 4 PID: 0 at block/blk-iocost.c:1928 transfer_surpluses.cold+0x0/0x53 [884/94867]
 ...
 RIP: 0010:transfer_surpluses.cold+0x0/0x53
 Code: 92 ff 48 c7 c7 28 d1 ab b5 65 48 8b 34 25 00 ae 01 00 48 81 c6 90 06 00 00 e8 8b 3f fe ff 48 c7 c0 ea ff ff ff e9 95 ff 92 ff <0f> 0b 48 c7 c7 30 da ab b5 e8 71 3f fe ff 4c 89 e8 4d 85 ed 74 0
4
 ...
 Call Trace:
  <IRQ>
  ioc_timer_fn+0x1043/0x1390
  call_timer_fn+0xa1/0x2c0
  __run_timers.part.0+0x1ec/0x2e0
  run_timer_softirq+0x35/0x70
 ...
 iocg: invalid donation weights in /a/b: active=1 donating=1 after=0

Fix it by excluding cgroups w/ active hweight < 2 from donating. Excluding
these extreme low hweight donations shouldn't affect work conservation in
any meaningful way.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: f1de2439ec ("blk-iocost: revamp donation amount determination")
Cc: stable@vger.kernel.org # v5.10+
Link: https://lore.kernel.org/r/Ybfh86iSvpWKxhVM@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-14 06:58:15 -07:00
Matthew Wilcox (Oracle)
0ba4566cd8 bdev: Improve lookup_bdev documentation
Add a Context section and rewrite the rest to be clearer.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Link: https://lore.kernel.org/r/20211213171113.3097631-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-13 11:29:16 -07:00
Linus Torvalds
eccea80be2 block-5.16-2021-12-10
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmG0PvAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjiDEACP9AWpsoLiN5CCRsrQpneErFRaYsov59OD
 n299Gy9hsccVVbl1hjgp3lF/RrCLkdQ3K1WNlIZ+pG4OO20us9b8zdgTO7u9ZOvi
 sKyucmPhzpUnbUidmdoytW0KdvQ7SXO95OI2C7OSJhNCzJdpemiu6WpFJKwYiB2V
 Agv73knEtjhNIM8biWgy5B8KekYulkGUy9JO3tLgnpLbnnln210WFvUNamW0JJKg
 671OwT7TSbK0tZF1wqtaxv4iaf1CqHPLDxJaABZvo6FGTLc878xD8KFmCFMsWNGq
 dl+R9HdG+fGvaaZiEBMoI6Gy02M7BcQimGnra33nWghe+XCpR10efZ3K/Tl3HrwX
 eiw2GGBASejrgioMqa15XVw36e6MMQwZeqisHHNiULUm0UG9SipJYTWlpy0FqfbG
 vRdkArHNXCJgNN6kacRqnfspe2vf0VGOMLODcwGKzl7iwf4shAHlWsvcvE/bPpe1
 GwQosO/6XUiF6jIFu05QsqU7FHjk7rMknDVrNX6oxxnn84ZOMUHfZDMkFbK5vDsZ
 n5f0uEohgJWzOCMrHbrtW252lwWLGxxa0XZdTZLsD8xnS3axZf9JpgAXINf1lHey
 YGRsvo3veZNFRI1S/HgF2/IcqyP4rMeFZDDJK/L5Tq5IvDo2usjcFEeyzNTOosKd
 hhdETVuZHg==
 =aqTJ
 -----END PGP SIGNATURE-----

Merge tag 'block-5.16-2021-12-10' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few block fixes that should go into this release:

   - NVMe pull request:
        - set ana_log_size to 0 after freeing ana_log_buf (Hou Tao)
        - show subsys nqn for duplicate cntlids (Keith Busch)
        - disable namespace access for unsupported metadata (Keith
          Busch)
        - report write pointer for a full zone as zone start + zone len
          (Niklas Cassel)
        - fix use after free when disconnecting a reconnecting ctrl
          (Ruozhu Li)
        - fix a list corruption in nvmet-tcp (Sagi Grimberg)

   - Fix for a regression on DIO single bio async IO (Pavel)

   - ioprio seteuid fix (Davidlohr)

   - mtd fix that subsequently got reverted as it was broken, will get
     re-done and submitted for the next round

   - Two MD fixes via Song (Markus, zhangyue)"

* tag 'block-5.16-2021-12-10' of git://git.kernel.dk/linux-block:
  Revert "mtd_blkdevs: don't scan partitions for plain mtdblock"
  block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2)
  md: fix double free of mddev->private in autorun_array()
  md: fix update super 1.0 on rdev size change
  nvmet-tcp: fix possible list corruption for unexpected command failure
  block: fix single bio async DIO error handling
  nvme: fix use after free when disconnecting a reconnecting ctrl
  nvme-multipath: set ana_log_size to 0 after free ana_log_buf
  mtd_blkdevs: don't scan partitions for plain mtdblock
  nvme: report write pointer for a full zone as zone start + zone len
  nvme: disable namespace access for unsupported metadata
  nvme: show subsys nqn for duplicate cntlids
2021-12-11 09:25:07 -08:00
Davidlohr Bueso
e6a59aac8a block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2)
do_each_pid_thread(PIDTYPE_PGID) can race with a concurrent
change_pid(PIDTYPE_PGID) that can move the task from one hlist
to another while iterating. Serialize ioprio_get to take
the tasklist_lock in this case, just like it's set counterpart.

Fixes: d69b78ba1d (ioprio: grab rcu_read_lock in sys_ioprio_{set,get}())
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Link: https://lore.kernel.org/r/20211210182058.43417-1-dave@stgolabs.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-10 11:26:07 -07:00
Jakub Kicinski
6efcdadc15 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
bpf 2021-12-08

We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 29 files changed, 659 insertions(+), 80 deletions(-).

The main changes are:

1) Fix an off-by-two error in packet range markings and also add a batch of
   new tests for coverage of these corner cases, from Maxim Mikityanskiy.

2) Fix a compilation issue on MIPS JIT for R10000 CPUs, from Johan Almbladh.

3) Fix two functional regressions and a build warning related to BTF kfunc
   for modules, from Kumar Kartikeya Dwivedi.

4) Fix outdated code and docs regarding BPF's migrate_disable() use on non-
   PREEMPT_RT kernels, from Sebastian Andrzej Siewior.

5) Add missing includes in order to be able to detangle cgroup vs bpf header
   dependencies, from Jakub Kicinski.

6) Fix regression in BPF sockmap tests caused by missing detachment of progs
   from sockets when they are removed from the map, from John Fastabend.

7) Fix a missing "no previous prototype" warning in x86 JIT caused by BPF
   dispatcher, from Björn Töpel.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add selftests to cover packet access corner cases
  bpf: Fix the off-by-two error in range markings
  treewide: Add missing includes masked by cgroup -> bpf dependency
  tools/resolve_btfids: Skip unresolved symbol warning for empty BTF sets
  bpf: Fix bpf_check_mod_kfunc_call for built-in modules
  bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL
  mips, bpf: Fix reference to non-existing Kconfig symbol
  bpf: Make sure bpf_disable_instrumentation() is safe vs preemption.
  Documentation/locking/locktypes: Update migrate_disable() bits.
  bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap
  bpf, sockmap: Attach map progs to psock early for feature probes
  bpf, x86: Fix "no previous prototype" warning
====================

Link: https://lore.kernel.org/r/20211208155125.11826-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-08 16:06:44 -08:00
Pavel Begunkov
75feae73a2 block: fix single bio async DIO error handling
BUG: KASAN: use-after-free in io_submit_one+0x496/0x2fe0 fs/aio.c:1882
CPU: 2 PID: 15100 Comm: syz-executor873 Not tainted 5.16.0-rc1-syzk #1
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
04/01/2014
Call Trace:
  [...]
  refcount_dec_and_test include/linux/refcount.h:333 [inline]
  iocb_put fs/aio.c:1161 [inline]
  io_submit_one+0x496/0x2fe0 fs/aio.c:1882
  __do_sys_io_submit fs/aio.c:1938 [inline]
  __se_sys_io_submit fs/aio.c:1908 [inline]
  __x64_sys_io_submit+0x1c7/0x4a0 fs/aio.c:1908
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x3a/0x80 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x44/0xae

__blkdev_direct_IO_async() returns errors from bio_iov_iter_get_pages()
directly, in which case upper layers won't be expecting ->ki_complete
to be called by the block layer and will terminate the request. However,
there is also bio_endio() leading to a second ->ki_complete and a double
free.

Fixes: 54a88eb838 ("block: add single bio async direct IO helper")
Reported-by: George Kennedy <george.kennedy@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c9eb786f6cef041e159e6287de131bec0719ad5c.1638907997.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-07 15:07:40 -07:00
John Garry
fea9f92f17 blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags
Kashyap reports high CPU usage in blk_mq_queue_tag_busy_iter() and callees
using megaraid SAS RAID card since moving to shared tags [0].

Previously, when shared tags was shared sbitmap, this function was less
than optimum since we would iter through all tags for all hctx's,
yet only ever match upto tagset depth number of rqs.

Since the change to shared tags, things are even less efficient if we have
parallel callers of blk_mq_queue_tag_busy_iter(). This is because in
bt_iter() -> blk_mq_find_and_get_req() there would be more contention on
accessing each request ref and tags->lock since they are now shared among
all HW queues.

Optimise by having separate calls to bt_for_each() for when we're using
shared tags. In this case no longer pass a hctx, as it is no longer
relevant, and teach bt_iter() about this.

Ming suggested something along the lines of this change, apart from a
different implementation.

[0] https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reported-and-tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Fixes: e155b0c238 ("blk-mq: Use shared tags for shared sbitmap support")
Link: https://lore.kernel.org/r/1638794990-137490-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-06 13:18:47 -07:00
John Garry
fc39f8d2d1 blk-mq: Delete busy_iter_fn
Typedefs busy_iter_fn and busy_tag_iter_fn are now identical, so delete
busy_iter_fn to reduce duplication.

It would be nicer to delete busy_tag_iter_fn, as the name busy_iter_fn is
less specific.

However busy_tag_iter_fn is used in many different parts of the tree,
unlike busy_iter_fn which is just use in block/, so just take the
straightforward path now, so that we could rename later treewide.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1638794990-137490-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-06 13:18:47 -07:00
John Garry
8ab30a3319 blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument
The only user of blk_mq_hw_ctx blk_mq_hw_ctx argument is
blk_mq_rq_inflight().

Function blk_mq_rq_inflight() uses the hctx to find the associated request
queue to match against the request. However this same check is already
done in caller bt_iter(), so drop this check.

With that change there are no more users of busy_iter_fn blk_mq_hw_ctx
argument, so drop the argument.

Reviewed-by Hannes Reinecke <hare@suse.de>

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1638794990-137490-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-06 13:18:47 -07:00
Ming Lei
73f3760edd blk-mq: don't use plug->mq_list->q directly in blk_mq_run_dispatch_ops()
blk_mq_run_dispatch_ops() is defined as one macro, and plug->mq_list
will be changed when running 'dispatch_ops', so add one local variable
for holding request queue.

Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 4cafe86c92 ("blk-mq: run dispatch lock once in case of issuing from list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-06 09:41:40 -07:00
Ming Lei
41adf531e3 blk-mq: don't run might_sleep() if the operation needn't blocking
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue
won't sleep, so don't run might_sleep() for it.

Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-06 09:40:42 -07:00
Ming Lei
4cafe86c92 blk-mq: run dispatch lock once in case of issuing from list
It isn't necessary to call blk_mq_run_dispatch_ops() once for issuing
single request directly, and enough to do it one time when issuing from
whole list.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Ming Lei
bcc330f42f blk-mq: pass request queue to blk_mq_run_dispatch_ops
We have switched to allocate srcu into request queue, so it is fine
to pass request queue to blk_mq_run_dispatch_ops().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Ming Lei
704b914f15 blk-mq: move srcu from blk_mq_hw_ctx to request_queue
In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect dispatch
critical area. However, this srcu instance stays at the end of hctx, and
it often takes standalone cacheline, often cold.

Inside srcu_read_lock() and srcu_read_unlock(), WRITE is always done on
the indirect percpu variable which is allocated from heap instead of
being embedded, srcu->srcu_idx is read only in srcu_read_lock(). It
doesn't matter if srcu structure stays in hctx or request queue.

So switch to per-request-queue srcu for protecting dispatch, and this
way simplifies quiesce a lot, not mention quiesce is always done on the
request queue wide.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Ming Lei
2a904d0085 blk-mq: remove hctx_lock and hctx_unlock
Remove hctx_lock and hctx_unlock, and add one helper of
blk_mq_run_dispatch_ops() to run code block defined in dispatch_ops
with rcu/srcu read held.

Compared with hctx_lock()/hctx_unlock():

1) remove 2 branch to 1, so we just need to check
(hctx->flags & BLK_MQ_F_BLOCKING) once when running one dispatch_ops

2) srcu_idx needn't to be touched in case of non-blocking

3) might_sleep_if() can be moved to the blocking branch

Also put the added blk_mq_run_dispatch_ops() in private header, so that
the following patch can use it out of blk-mq.c.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211203131534.3668411-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Jens Axboe
0a467d0fdd block: switch to atomic_t for request references
refcount_t is not as expensive as it used to be, but it's still more
expensive than the io_uring method of using atomic_t and just checking
for potential over/underflow.

This borrows that same implementation, which in turn is based on the
mm implementation from Linus.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Jens Axboe
ceaa762527 block: move direct_IO into our own read_iter handler
Don't call into generic_file_read_iter() if we know it's O_DIRECT, just
set it up ourselves and call our own handler. This avoids an indirect call
for O_DIRECT.

Fall back to filemap_read() if we fail.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-03 14:51:29 -07:00
Jakub Kicinski
8581fd402a treewide: Add missing includes masked by cgroup -> bpf dependency
cgroup.h (therefore swap.h, therefore half of the universe)
includes bpf.h which in turn includes module.h and slab.h.
Since we're about to get rid of that dependency we need
to clean things up.

v2: drop the cpu.h include from cacheinfo.h, it's not necessary
and it makes riscv sensitive to ordering of include files.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Krzysztof Wilczyński <kw@linux.com>
Acked-by: Peter Chen <peter.chen@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20211120035253.72074-1-kuba@kernel.org/  # v1
Link: https://lore.kernel.org/all/20211120165528.197359-1-kuba@kernel.org/ # cacheinfo discussion
Link: https://lore.kernel.org/bpf/20211202203400.1208663-1-kuba@kernel.org
2021-12-03 10:58:13 -08:00
Jens Axboe
a08ed9aae8 block: fix double bio queue when merging in cached request path
When we attempt to merge off the cached request path, we return NULL
if successful. This makes the caller believe that it's should allocate
a new request, and hence we end up with the bio both merged and associated
with a new request. This, predictably, leads to all sorts of crashes.

Pass in a pointer to the bio pointer, and clear it for the merge case.
Then the caller knows that the bio is already queued, and no new requests
need to get allocated.

Fixes: 5b13bc8a3f ("blk-mq: cleanup request allocation")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-02 19:39:01 -07:00
Jens Axboe
373b5416b4 block: get rid of useless goto and label in blk_mq_get_new_requests()
Expected case is returning a request, just check for success and return
the request rather than having an error label.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-02 12:42:58 -07:00
Ming Lei
18d78171c0 blk-mq: check q->poll_stat in queue_poll_stat_show
Without checking q->poll_stat in queue_poll_stat_show(), kernel panic
may be caused if q->poll_stat isn't allocated.

Fixes: 48b5c1fbcd ("block: only allocate poll_stats if there's a user of them")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211202090716.3292244-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-02 08:20:49 -07:00
Ye Bin
8a7518931b block: Fix fsync always failed if once failed
We do test with inject error fault base on v4.19, after test some time we found
sync /dev/sda always failed.
[root@localhost] sync /dev/sda
sync: error syncing '/dev/sda': Input/output error

scsi log as follows:
[19069.812296] sd 0:0:0:0: [sda] tag#64 Send: scmd 0x00000000d03a0b6b
[19069.812302] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[19069.812533] sd 0:0:0:0: [sda] tag#64 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[19069.812536] sd 0:0:0:0: [sda] tag#64 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[19069.812539] sd 0:0:0:0: [sda] tag#64 scsi host busy 1 failed 0
[19069.812542] sd 0:0:0:0: Notifying upper driver of completion (result 0)
[19069.812546] sd 0:0:0:0: [sda] tag#64 sd_done: completed 0 of 0 bytes
[19069.812549] sd 0:0:0:0: [sda] tag#64 0 sectors total, 0 bytes done.
[19069.812564] print_req_error: I/O error, dev sda, sector 0

ftrace log as follows:
 rep-306069 [007] .... 19654.923315: block_bio_queue: 8,0 FWS 0 + 0 [rep]
 rep-306069 [007] .... 19654.923333: block_getrq: 8,0 FWS 0 + 0 [rep]
 kworker/7:1H-250   [007] .... 19654.923352: block_rq_issue: 8,0 FF 0 () 0 + 0 [kworker/7:1H]
 <idle>-0     [007] ..s. 19654.923562: block_rq_complete: 8,0 FF () 18446744073709551615 + 0 [0]
 <idle>-0     [007] d.s. 19654.923576: block_rq_complete: 8,0 WS () 0 + 0 [-5]

As 8d6996630c introduce 'fq->rq_status', this data only update when 'flush_rq'
reference count isn't zero. If flush request once failed and record error code
in 'fq->rq_status'. If there is no chance to update 'fq->rq_status',then do fsync
will always failed.
To address this issue reset 'fq->rq_status' after return error code to upper layer.

Fixes: 8d6996630c03("block: fix null pointer dereference in blk_mq_rq_timed_out()")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211129012659.1553733-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:46:04 -07:00
Christoph Hellwig
b84ba30b6c block: remove the gendisk argument to blk_execute_rq
Remove the gendisk aregument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now.  Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
f3fa33acca block: remove the ->rq_disk field in struct request
Just use the disk attached to the request_queue instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
79bb1dbd12 block: don't check ->rq_disk in merges
There is a 1:1 relationship between request_queues and gendisks now, so
no need for these extra checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Colin Ian King
af22fef3e7 block: Remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never
read, it is being updated later on. The assignment is redundant and
can be removed.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20211126230652.1175636-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
eca5892a5d block: simplify ioc_lookup_icq
Remove the ioc argument as it always points to current->io_context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
18b74c4dca block: simplify ioc_create_icq
Remove the ioc and gfp_mask argument, which are hard coded by the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
d538ea4cb8 block: return the io_context from create_task_io_context
Grab a reference to the newly allocated or existing io_context in
create_task_io_context and return it.  This simplifies the callers and
removes the need for double lookups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
8ffc13680e block: use alloc_io_context in __copy_io
In __copy_io we know that the newly allocate task_struct does not have
an I/O context yet and is not exiting.  So just allocate the I/O context
struct and install it directly.  There is no need to lock the task
either as it is just being created.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
a0f14d8baa block: factor out a alloc_io_context helper
Factor out a helper that just allocate an I/O context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
50569c24be block: remove get_io_context_active
Fold it into it's only caller, and remove a lof of the debug checks
that are not needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
222ee581b8 block: move the remaining elv.icq handling to the I/O scheduler
After the prepare side has been moved to the only I/O scheduler that
cares, do the same for the cleanup and the NULL initialization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
87dd1d63dc block: move blk_mq_sched_assign_ioc to blk-ioc.c
Move blk_mq_sched_assign_ioc so that many interfaces from the file can
be marked static.  Rename the function to ioc_find_get_icq as well and
return the icq to simplify the interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
3304742562 block: mark put_io_context_active static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
c2a32464f4 Revert "block: Provide blk_mq_sched_get_icq()"
This reverts commit 4896c4e64ba5d5d5acdbcf68c5910dd4f6d8fa62.

The helper is not needed any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
a0725c22cd bfq: use bfq_bic_lookup in bfq_limit_depth
No need to create a new I/O context if there is none present yet in
->limit_depth.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
836b394b63 bfq: simplify bfq_bic_lookup
Remove the unused bfqd argument, and hardcode ioc to current->io_context.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Christoph Hellwig
88c9a2ce52 fork: move copy_io to block/blk-ioc.c
Move the copying of the I/O context to the block layer as that is where
we can use the proper low-level interfaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:29 -07:00
Ming Lei
5f480b1a63 blk-mq: use bio->bi_opf after bio is checked
bio->bi_opf isn't finalized before checking the bio, so use it after
submit_bio_checks() returns.

Fixes: 5b13bc8a3f ("blk-mq: cleanup request allocation")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:41:23 -07:00
Jan Kara
c65e6fd460 bfq: Do not let waker requests skip proper accounting
Commit 7cc4ffc555 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue and maybe it's
current IO is not needed for current queue to make progress. Furthermore
injecting IO like this completely bypasses any service accounting within
bfq and thus we do not properly track how much service is waker's queue
getting or that the waker is actually doing any IO. Depending on the
conditions this can result in the waker getting too much or too few
service.

Consider for example the following job file:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
prioclass=2
prio=7
fsync=200

[fastwriter]
numjobs=1
prioclass=2
prio=0
fsync=200

Despite processes have very different IO priorities, they get the same
about of service. The reason is that bfq identifies these processes as
having waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.

Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.

Fixes: 7cc4ffc555 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:39:31 -07:00
Jan Kara
1eb17f5e15 bfq: Log waker detections
Waker - wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:39:31 -07:00
Jan Kara
582f04e19a bfq: Provide helper to generate bfqq name
Instead of having helper formating bfqq pid, provide a helper to
generate full bfqq name as used in the traces. It saves some code
duplication and will save more in the coming tracepoints.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:52 -07:00
Jan Kara
1f18b7005b bfq: Limit waker detection in time
Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A meaning that completing IO of B is needed for A to make
progress and generally stop separating A's and B's IO much. This logic
is useful to avoid unnecessary idling and thus throughput loss for cases
where workload needs to switch e.g. between the process and the
journaling thread doing IO. However the detection heuristic tends to
frequently give false positives when A and B are fighting IO bandwidth
and other processes aren't doing much IO as we are basically deemed to
eventually accumulate three occurences of a situation where one process
starts issuing requests after the other has completed some IO. To reduce
these false positives, cancel the waker detection also if we didn't
accumulate three detected wakeups within given timeout. The rationale is
that if wakeups are really rare, the pointless idling doesn't hurt
throughput that much anyway.

This significantly reduces false waker detection for workload like:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
fsync=200

[fastwriter]
numjobs=1
fsync=200

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Jan Kara
76f1df88bb bfq: Limit number of requests consumed by each cgroup
When cgroup IO scheduling is used with BFQ it does not really provide
service differentiation if the cgroup drives a big IO depth. That for
example happens with writeback which asynchronously submits lots of IO
but it can happen with AIO as well. The problem is that if we have two
cgroups that submit IO with different weights, the cgroup with higher
weight properly gets more IO time and is able to dispatch more IO.
However this causes lower weight cgroup to accumulate more requests
inside BFQ and eventually lower weight cgroup consumes most of IO
scheduler tags. At that point higher weight cgroup stops getting better
service as it is mostly blocked waiting for a scheduler tag while its
queues inside BFQ are empty and thus lower weight cgroup gets served.

Check how many requests submitting cgroup has allocated in
bfq_limit_depth() and if it consumes more requests than what would
correspond to its weight limit available depth to 1 so that the cgroup
cannot consume many more requests. With this limitation the higher
weight cgroup gets proper service even with writeback.

Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Jan Kara
44dfa279f1 bfq: Store full bitmap depth in bfq_data
Store bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting number of available
request tags for a cgroup.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Jan Kara
98f044999b bfq: Track number of allocated requests in bfq_entity
When we want to limit number of requests used by each bfqq and also
cgroup, we need to track also number of requests used by each cgroup.
So track number of allocated requests for each bfq_entity.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Jan Kara
790cf9c848 block: Provide blk_mq_sched_get_icq()
Currently we lookup ICQ only after the request is allocated. However BFQ
will want to decide how many scheduler tags it allows a given bfq queue
(effectively a process) to consume based on cgroup weight. So provide a
function blk_mq_sched_get_icq() so that BFQ can lookup ICQ earlier.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Eric Biggers
72cd9df2ef blk-crypto: remove blk_crypto_unregister()
This function is trivial and is only used in one place.  Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124013733.347612-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Christoph Hellwig
5b13bc8a3f blk-mq: cleanup request allocation
Refactor the request alloction so that blk_mq_get_cached_request tries
to find a cached request first, and the entirely separate and now
self contained blk_mq_get_new_requests allocates one or more requests
if that is not possible.

There is a small change in behavior as submit_bio_checks is called
twice now if a cached request is present but can't be used, but that
is a small price to pay for unwinding this code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124062856.1444266-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:51 -07:00
Christoph Hellwig
82d981d423 block: don't include <linux/part_stat.h> in blk.h
Not needed, shift it into the source files that need it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
ca5b304cab block: don't include <linux/idr.h> in blk.h
Not needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
a2ff7781cf block: don't include <linux/blk-mq.h> in blk.h
Not needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
e4a19f7289 block: don't include blk-mq.h in blk.h
No needed, shift a blk-stat.h include into the source file that needs it
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
2aa7745bf6 block: don't include blk-mq-sched.h in blk.h
No needed, shift it into the source files that need it instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
0c6cb3a293 block: remove the e argument to elevator_exit
All callers pass q->elevator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
f46b81c54b block: remove elevator_exit
Open code elevator_exit in it's only caller, and rename __elevator_exit to
elevator_exit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Christoph Hellwig
0281ed3cf4 block: move blk_get_flush_queue to blk-flush.c
blk_get_flush_queue is only used in blk-flush.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Guo Zhengkui
35c90e6ec9 blk_mq: remove repeated includes
Remove a repeated "#include<linux/sched/sysctl.h>".

Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211123063340.25882-1-guozhengkui@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Jens Axboe
5a9d041ba2 block: move io_context creation into where it's needed
The only user of the io_context for IO is BFQ, yet we put the checking
and logic of it into the normal IO path.

Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that
helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:44 -07:00
Jens Axboe
48b5c1fbcd block: only allocate poll_stats if there's a user of them
This is essentially never used, yet it's about 1/3rd of the total
queue size. Allocate it when needed, and don't embed it in the queue.

Kill the queue flag for this while at it, since we can just check the
assigned pointer now.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Jens Axboe
25c4b5e058 blk-ioprio: don't set bio priority if not needed
We don't need to write to the bio if:

1) No ioprio value has ever been assigned to the blkcg
2) We wouldn't anyway, depending on bio and blkcg IO priority

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
1e9c23034d blk-mq: move more plug handling from blk_mq_submit_bio into blk_add_rq_to_plug
Keep all the functionality for adding a request to a plug in a single place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
0c5bcc92d9 blk-mq: simplify the plug handling in blk_mq_submit_bio
blk_mq_submit_bio has two different plug cases, one that uses full
plugging and a limited plugging one.

The limited plugging case is only used for a corner case that does
not matter in real life:

 - no ->commit_rqs (so not NVMe)
 - no shared tags (so not SCSI)
 - not rotational (so no old disk or floppy driver)
 - must have multiple queues (so no eMMC)

Remove the limited merging case and all the related junk to simplify
blk_mq_submit_bio and the functions called from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
9f18db572c block: don't set GENHD_FL_NO_PART for hidden gendisks
Hidden gendisks can't be opened using blkdev_get_*, so we can't really
reach any of the partition scanning paths or partitioning ioctls except
for the initial partition scan from add_disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
1ebe2e5f9d block: remove GENHD_FL_EXT_DEVT
All modern drivers can support extra partitions using the extended
dev_t.  In fact except for the ioctl method drivers never even see
partitions in normal operation.

So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicit disallow them using
GENHD_FL_NO_PART.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
3b5149ac50 block: remove GENHD_FL_SUPPRESS_PARTITION_INFO
This flag is not set directly anywhere and only inherited from
GENHD_FL_HIDDEN.  Just check for GENHD_FL_HIDDEN instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:35 -07:00
Christoph Hellwig
140862805a block: remove the GENHD_FL_HIDDEN check in blkdev_get_no_open
Hidden gendisks never hash the block device inode, so this can't happen.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:38:04 -07:00
Christoph Hellwig
46e7eac647 block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART
The GENHD_FL_NO_PART_SCAN controls more than just partitions canning,
so rename it to GENHD_FL_NO_PART.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20211122130625.1136848-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
e16e506ccd block: merge disk_scan_partitions and blkdev_reread_part
Unify the functionality that implements a partition rescan for a
gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
e3b3bad3f2 block: remove a dead check in show_partition
disk_max_parts never returns 0 given that ->minors for devices not using
the extended dev_t must be non-zero, and disk_max_parts always returns
DISK_MAX_PARTS for the latter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
1545e0b419 block: move GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE to disk->event_flags
GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE is all about the event reporting
mechanism, so move it to the event_flags field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
8641691646 block: move GENHD_FL_NATIVE_CAPACITY to disk->state
The flag to indicate an unlocked native capacity is dynamic state,
not a driver capability flag, so move it to disk->state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
d9337a420a block: don't include blk-mq headers in blk-core.c
All request based code is in the blk-mq files now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
0d7a29a2b5 block: move blk_print_req_error to blk-mq.c
This function is only used by the request completion path.  Factor out
a blk_status_to_str to keep blk_errors private in blk-core.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:35:21 -07:00
Christoph Hellwig
22350ad7f1 block: move blk_dump_rq_flags to blk-mq.c
blk_dump_rq_flags deals with a request, so move it to blk-mq.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:51 -07:00
Christoph Hellwig
450b7879e3 block: move blk_account_io_{start,done} to blk-mq.c
These are only used for request based I/O, so move them where they are
used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:51 -07:00
Christoph Hellwig
f2b8f3ce98 block: move blk_steal_bios to blk-mq.c
Keep all the request based code together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:51 -07:00
Christoph Hellwig
52fdbbcc83 block: move blk_rq_init to blk-mq.c
blk_rq_init deals with a request structure, so move it to blk-mq.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:51 -07:00
Christoph Hellwig
06c8c691e2 block: move request based cloning helpers to blk-mq.c
Keep all the request based code together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Christoph Hellwig
b84c5b50d3 blk-mq: move blk_mq_flush_plug_list
Move blk_mq_flush_plug_list and blk_mq_plug_issue_direct down in blk-mq.c
to prepare for marking blk_mq_request_issue_directly static without the
need of a forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Christoph Hellwig
4054cff92c block: remove blk-exec.c
All this code is tightly coupled to the blk-mq core, so move it
there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-4-hch@lst.de
[axboe: remove doc generation for blk-exec.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Christoph Hellwig
786d4e01c5 block: remove rq_flush_dcache_pages
This function is trivial, and flush_dcache_page is always defined, so
just open code it in the 2.5 callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Christoph Hellwig
79478bf9ea block: move blk_rq_err_bytes to scsi
blk_rq_err_bytes is only used by the scsi midlayer, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29 06:34:50 -07:00
Jens Axboe
98b26a0e76 block: call rq_qos_done() before ref check in batch completions
We need to call rq_qos_done() regardless of whether or not we're freeing
the request or not, as the reference count doesn't cover the IO completion
tracking.

Fixes: f794f3351f ("block: add support for blk_mq_end_request_batch()")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reported-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-26 09:53:23 -07:00
Yang Guang
e30028ace8 block: fix parameter not described warning
The build warning:
block/blk-core.c:968: warning: Function parameter or member 'iob'
not described in 'bio_poll'.

Fixes: 5a72e899ce ("block: add a struct io_comp_batch argument to fops->iopoll()")
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Guang <yang.guang5@zte.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-25 09:32:19 -07:00
Ming Lei
efcf593223 block: avoid to touch unloaded module instance when opening bdev
disk->fops->owner is grabbed in blkdev_get_no_open() after the disk
kobject refcount is increased. This way can't make sure that
disk->fops->owner is still alive since del_gendisk() still can move
on if the kobject refcount of disk is grabbed by open() and
disk->fops->open() isn't called yet.

Fixes the issue by moving try_module_get() into blkdev_get_by_dev()
with ->open_mutex() held, then we can drain the in-progress open()
in del_gendisk(). Meantime new open() won't succeed because disk
becomes not alive.

This way is reasonable because blkdev_get_no_open() needn't to touch
disk->fops or defined callbacks.

Cc: Christoph Hellwig <hch@lst.de>
Cc: czhong@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111020343.316126-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-22 18:35:37 -07:00
Ming Lei
2b504bd484 blk-mq: don't insert FUA request with data into scheduler queue
We never insert flush request into scheduler queue before.

Recently commit d92ca9d834 ("blk-mq: don't handle non-flush requests in
blk_insert_flush") tries to handle FUA data request as normal request.
This way has caused warning[1] in mq-deadline dd_exit_sched() or io hang in
case of kyber since RQF_ELVPRIV isn't set for flush request, then
->finish_request won't be called.

Fix the issue by inserting FUA data request with blk_mq_request_bypass_insert()
when the device supports FUA, just like what we did before.

[1] https://lore.kernel.org/linux-block/CAHj4cs-_vkTW=dAzbZYGxpEWSpzpcmaNeY1R=vH311+9vMUSdg@mail.gmail.com/

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: d92ca9d834 ("blk-mq: don't handle non-flush requests in blk_insert_flush")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20211118153041.2163228-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-19 06:28:18 -07:00
Yu Kuai
15c3010496 blk-cgroup: fix missing put device in error path from blkg_conf_pref()
If blk_queue_enter() failed due to queue is dying, the
blkdev_put_no_open() is needed because blkcg_conf_open_bdev() succeeded.

Fixes: 0c9d338c84 ("blk-cgroup: synchronize blkg creation against policy deactivation")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211102020705.2321858-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-19 06:26:45 -07:00
Ming Lei
245a489e81 block: avoid to quiesce queue in elevator_init_mq
elevator_init_mq() is only called before adding disk, when there isn't
any FS I/O, only passthrough requests can be queued, so freezing queue
plus canceling dispatch work is enough to drain any dispatch activities,
then we can avoid synchronize_srcu() in blk_mq_quiesce_queue().

Long boot latency issue can be fixed in case of lots of disks added
during booting.

Fixes: 737eb78e82 ("block: Delay default elevator initialization")
Reported-by: yangerkun <yangerkun@huawei.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211117115502.1600950-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-17 07:43:26 -07:00
Ming Lei
2a19b28f79 blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release()
For avoiding to slow down queue destroy, we don't call
blk_mq_quiesce_queue() in blk_cleanup_queue(), instead of delaying to
cancel dispatch work in blk_release_queue().

However, this way has caused kernel oops[1], reported by Changhui. The log
shows that scsi_device can be freed before running blk_release_queue(),
which is expected too since scsi_device is released after the scsi disk
is closed and the scsi_device is removed.

Fixes the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
and disk_release():

1) when disk_release() is run, the disk has been closed, and any sync
dispatch activities have been done, so canceling dispatch work is enough to
quiesce filesystem I/O dispatch activity.

2) in blk_cleanup_queue(), we only focus on passthrough request, and
passthrough request is always explicitly allocated & freed by
its caller, so once queue is frozen, all sync dispatch activity
for passthrough request has been done, then it is enough to just cancel
dispatch work for avoiding any dispatch activity.

[1] kernel panic log
[12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
[12622.777186] #PF: supervisor read access in kernel mode
[12622.782918] #PF: error_code(0x0000) - not-present page
[12622.788649] PGD 0 P4D 0
[12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
[12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
[12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[12622.813321] Workqueue: kblockd blk_mq_run_work_fn
[12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
[12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
[12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
[12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
[12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
[12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
[12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
[12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
[12622.889926] FS:  0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
[12622.898956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
[12622.913328] Call Trace:
[12622.916055]  <TASK>
[12622.918394]  scsi_mq_get_budget+0x1a/0x110
[12622.922969]  __blk_mq_do_dispatch_sched+0x1d4/0x320
[12622.928404]  ? pick_next_task_fair+0x39/0x390
[12622.933268]  __blk_mq_sched_dispatch_requests+0xf4/0x140
[12622.939194]  blk_mq_sched_dispatch_requests+0x30/0x60
[12622.944829]  __blk_mq_run_hw_queue+0x30/0xa0
[12622.949593]  process_one_work+0x1e8/0x3c0
[12622.954059]  worker_thread+0x50/0x3b0
[12622.958144]  ? rescuer_thread+0x370/0x370
[12622.962616]  kthread+0x158/0x180
[12622.966218]  ? set_kthread_struct+0x40/0x40
[12622.970884]  ret_from_fork+0x22/0x30
[12622.974875]  </TASK>
[12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]

Reported-by: ChanghuiZhong <czhong@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-15 19:22:13 -07:00
Jens Axboe
95febeb61b block: fix missing queue put in error path
If we fail the submission queue checks, we don't put the queue afterwards.
This can cause various issues like stalls on scheduler switch or failure
to remove the device, or like in the original bug report, timeout waiting
for the device on reboot/restart.

While in there, fix a few whitespace discrepancies in the surrounding
code.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215039
Fixes: b637108a40 ("blk-mq: fix filesystem I/O request allocation")
Reported-and-tested-by: Stephen Smith <stephenmsmith@blueyonder.co.uk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-15 17:00:54 -07:00
Alistair Delva
94c4b4fd25 block: Check ADMIN before NICE for IOPRIO_CLASS_RT
Booting to Android userspace on 5.14 or newer triggers the following
SELinux denial:

avc: denied { sys_nice } for comm="init" capability=23
     scontext=u:r:init:s0 tcontext=u:r:init:s0 tclass=capability
     permissive=0

Init is PID 0 running as root, so it already has CAP_SYS_ADMIN. For
better compatibility with older SEPolicy, check ADMIN before NICE.

Fixes: 9d3a39a5f1 ("block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE")
Signed-off-by: Alistair Delva <adelva@google.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: selinux@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: kernel-team@android.com
Cc: stable@vger.kernel.org # v5.14+
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Link: https://lore.kernel.org/r/20211115181655.3608659-1-adelva@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-15 14:28:59 -07:00
Ming Lei
b637108a40 blk-mq: fix filesystem I/O request allocation
submit_bio_checks() may update bio->bi_opf, so we have to initialize
blk_mq_alloc_data.cmd_flags with bio->bi_opf after submit_bio_checks()
returns when allocating new request.

In case of using cached request, fallback to allocate new request if
cached rq isn't compatible with the incoming bio, otherwise change
rq->cmd_flags with incoming bio->bi_opf.

Fixes: 900e080752 ("block: move queue enter logic into blk_mq_submit_bio()")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-12 09:31:13 -07:00
Laibin Qiu
b781d8db58 blkcg: Remove extra blkcg_bio_issue_init
KASAN reports a use-after-free report when doing block test:

==================================================================
[10050.967049] BUG: KASAN: use-after-free in
submit_bio_checks+0x1539/0x1550

[10050.977638] Call Trace:
[10050.978190]  dump_stack+0x9b/0xce
[10050.979674]  print_address_description.constprop.6+0x3e/0x60
[10050.983510]  kasan_report.cold.9+0x22/0x3a
[10050.986089]  submit_bio_checks+0x1539/0x1550
[10050.989576]  submit_bio_noacct+0x83/0xc80
[10050.993714]  submit_bio+0xa7/0x330
[10050.994435]  mpage_readahead+0x380/0x500
[10050.998009]  read_pages+0x1c1/0xbf0
[10051.002057]  page_cache_ra_unbounded+0x4c2/0x6f0
[10051.007413]  do_page_cache_ra+0xda/0x110
[10051.008207]  force_page_cache_ra+0x23d/0x3d0
[10051.009087]  page_cache_sync_ra+0xca/0x300
[10051.009970]  generic_file_buffered_read+0xbea/0x2130
[10051.012685]  generic_file_read_iter+0x315/0x490
[10051.014472]  blkdev_read_iter+0x113/0x1b0
[10051.015300]  aio_read+0x2ad/0x450
[10051.023786]  io_submit_one+0xc8e/0x1d60
[10051.029855]  __se_sys_io_submit+0x125/0x350
[10051.033442]  do_syscall_64+0x2d/0x40
[10051.034156]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[10051.048733] Allocated by task 18598:
[10051.049482]  kasan_save_stack+0x19/0x40
[10051.050263]  __kasan_kmalloc.constprop.1+0xc1/0xd0
[10051.051230]  kmem_cache_alloc+0x146/0x440
[10051.052060]  mempool_alloc+0x125/0x2f0
[10051.052818]  bio_alloc_bioset+0x353/0x590
[10051.053658]  mpage_alloc+0x3b/0x240
[10051.054382]  do_mpage_readpage+0xddf/0x1ef0
[10051.055250]  mpage_readahead+0x264/0x500
[10051.056060]  read_pages+0x1c1/0xbf0
[10051.056758]  page_cache_ra_unbounded+0x4c2/0x6f0
[10051.057702]  do_page_cache_ra+0xda/0x110
[10051.058511]  force_page_cache_ra+0x23d/0x3d0
[10051.059373]  page_cache_sync_ra+0xca/0x300
[10051.060198]  generic_file_buffered_read+0xbea/0x2130
[10051.061195]  generic_file_read_iter+0x315/0x490
[10051.062189]  blkdev_read_iter+0x113/0x1b0
[10051.063015]  aio_read+0x2ad/0x450
[10051.063686]  io_submit_one+0xc8e/0x1d60
[10051.064467]  __se_sys_io_submit+0x125/0x350
[10051.065318]  do_syscall_64+0x2d/0x40
[10051.066082]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[10051.067455] Freed by task 13307:
[10051.068136]  kasan_save_stack+0x19/0x40
[10051.068931]  kasan_set_track+0x1c/0x30
[10051.069726]  kasan_set_free_info+0x1b/0x30
[10051.070621]  __kasan_slab_free+0x111/0x160
[10051.071480]  kmem_cache_free+0x94/0x460
[10051.072256]  mempool_free+0xd6/0x320
[10051.072985]  bio_free+0xe0/0x130
[10051.073630]  bio_put+0xab/0xe0
[10051.074252]  bio_endio+0x3a6/0x5d0
[10051.074984]  blk_update_request+0x590/0x1370
[10051.075870]  scsi_end_request+0x7d/0x400
[10051.076667]  scsi_io_completion+0x1aa/0xe50
[10051.077503]  scsi_softirq_done+0x11b/0x240
[10051.078344]  blk_mq_complete_request+0xd4/0x120
[10051.079275]  scsi_mq_done+0xf0/0x200
[10051.080036]  virtscsi_vq_done+0xbc/0x150
[10051.080850]  vring_interrupt+0x179/0x390
[10051.081650]  __handle_irq_event_percpu+0xf7/0x490
[10051.082626]  handle_irq_event_percpu+0x7b/0x160
[10051.083527]  handle_irq_event+0xcc/0x170
[10051.084297]  handle_edge_irq+0x215/0xb20
[10051.085122]  asm_call_irq_on_stack+0xf/0x20
[10051.085986]  common_interrupt+0xae/0x120
[10051.086830]  asm_common_interrupt+0x1e/0x40

==================================================================

Bio will be checked at beginning of submit_bio_noacct(). If bio needs
to be throttled, it will start the timer and stop submit bio directly.
Bio will submit in blk_throtl_dispatch_work_fn() when the timer expires.
But in the current process, if bio is throttled, it will still set bio
issue->value by blkcg_bio_issue_init(). This is redundant and may cause
the above use-after-free.

CPU0                                   CPU1
submit_bio
submit_bio_noacct
  submit_bio_checks
    blk_throtl_bio()
      <=mod_timer(&sq->pending_timer
                                      blk_throtl_dispatch_work_fn
                                        submit_bio_noacct() <= bio have
                                        throttle tag, will throw directly
                                        and bio issue->value will be set
                                        here

                                      bio_endio()
                                      bio_put()
                                      bio_free() <= free this bio

    blkcg_bio_issue_init(bio)
      <= bio has been freed and
      will lead to UAF
  return BLK_QC_T_NONE

Fix this by remove extra blkcg_bio_issue_init.

Fixes: e439bedf6b (blkcg: consolidate bio_issue_init() to be a part of core)
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Link: https://lore.kernel.org/r/20211112093354.3581504-1-qiulaibin@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-12 05:46:07 -07:00
Shin'ichiro Kawasaki
86399ea071 block: Hold invalidate_lock in BLKRESETZONE ioctl
When BLKRESETZONE ioctl and data read race, the data read leaves stale
page cache. The commit e511350590 ("block: Discard page cache of zone
reset target range") added page cache truncation to avoid stale page
cache after the ioctl. However, the stale page cache still can be read
during the reset zone operation for the ioctl. To avoid the stale page
cache completely, hold invalidate_lock of the block device file mapping.

Fixes: e511350590 ("block: Discard page cache of zone reset target range")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211111085238.942492-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-11 11:52:46 -07:00
Ming Lei
b131f20111 blk-mq: rename blk_attempt_bio_merge
It is very annoying to have two block layer functions which share same
name, so rename blk_attempt_bio_merge in blk-mq.c as
blk_mq_attempt_bio_merge.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111085134.345235-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-11 11:52:33 -07:00
Ming Lei
10f7335e36 blk-mq: don't grab ->q_usage_counter in blk_mq_sched_bio_merge
blk_mq_sched_bio_merge is only called from blk-mq.c:blk_attempt_bio_merge(),
which is called when queue usage counter is grabbed already:

1) blk_mq_get_new_requests()

2) blk_mq_get_request()
- cached request in current plug owns one queue usage counter

So don't grab ->q_usage_counter in blk_mq_sched_bio_merge(), and more
importantly this nest way causes hang in blk_mq_freeze_queue_wait().

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111085134.345235-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-11 11:52:33 -07:00
Jens Axboe
438cd74223 block: fix kerneldoc for disk_register_independent_access__ranges()
The naming got changed as part of a revision of the patchset, but the
kerneldoc apparently never got updated. Fix it.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: a2247f19ee ("block: Add independent access ranges support")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-11 11:52:30 -07:00
Luis Chamberlain
278167fd2f block: add __must_check for *add_disk*() callers
Now that we have done a spring cleaning on all drivers and added
error checking / handling, let's keep it that way and ensure
no new drivers fail to stick with it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20211110002949.999380-1-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 19:19:34 -07:00
Jens Axboe
ecaf97f474 block: use enum type for blk_mq_alloc_data->rq_flags
kernel test robot reports that we now trigger some sparse warnings:

block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer
block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer
block/blk-mq.h:169:32: sparse: sparse: restricted req_flags_t degrades to integer

which is due to ->rq_flags being an unsigned int, rather than the
stronger type req_flags_t enum.

Change the type to req_flags_t to silence this warning.

Fixes: 56f8da642b ("block: add rq_flags to struct blk_mq_alloc_data")
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 19:19:15 -07:00
Shin'ichiro Kawasaki
35e4c6c1a2 block: Hold invalidate_lock in BLKZEROOUT ioctl
When BLKZEROOUT ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is modified to call "blkdiscard -z" command
and repeated hundreds of times.

This patch can be applied back to the stable kernel version v5.15.y.
Rework is required for older stable kernels.

Fixes: 22dd6d3566 ("block: invalidate the page cache when issuing BLKZEROOUT")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-3-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 12:41:12 -07:00
Shin'ichiro Kawasaki
7607c44c15 block: Hold invalidate_lock in BLKDISCARD ioctl
When BLKDISCARD ioctl and data read race, the data read leaves stale
page cache. To avoid the stale page cache, hold invalidate_lock of the
block device file mapping. The stale page cache is observed when
blktests test case block/009 is repeated hundreds of times.

This patch can be applied back to the stable kernel version v5.15.y
with slight patch edit. Rework is required for older stable kernels.

Fixes: 351499a172 ("block: Invalidate cache on discard v2")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.15
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211109104723.835533-2-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 12:41:12 -07:00
Linus Torvalds
cb690f5238 for-5.16/drivers-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqOcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr3yEADD9Cx8oNk3KzWV3c3JlIR4JQtvpczS3dho
 KkGU0D5fOh1sViXbLBNr6VxypcEIKQoHxDQQ6qid1kOu/B3mCNM1duLsVjyj3Qa0
 7nbm2dVUsD/EVDuXedRmMvcfCUx6Z23DbpI182wXtIPaCsEEmsANzHnZNg38OV44
 25SYG0QUvb9ViSz1Y1GORu0ttEJNF2GhZfiBpb0WveRnY7eTSL/PnHNDzHsSeFv4
 zD0W205g7jKbt0+57kgNElTz7DbdM3p8XVex+aXPlFaHz2qx4ZoJJIsaMv/P8tT5
 14b50cB41xnPvlGTvqr1WfZZfJocDNq2rG+fh6N5D1sO86ogWpj7psiiADfa0pb6
 ZWoJqhk3BvEUMPQ5N/BJ/8j3FWGIYWtKQf4QcyxrJYpqDwtwbBfMlzKkc7JMPFYk
 JAi6uq1uF5SbA4x99G90tK85LvxsbkseyIYXgBJ/GIyW5doIPkD9TPDEzJMCdHOe
 laynHS5PMHzuhPLuEDDn9sTVXpZWAMBnoy4j1L4wGmBjiogYWLTSJVobODzCAqHY
 1Va2oP6SXfCdVRkCysFbcrdsjJuoIWlMKrdE40tNvkmU0v7sEX0Zd+GLHiaWdIZa
 fgxC9fmZtDDOowCp+Iw0VaAqPeeptmyUrof06ZktJleOAscX7kSwbxPdmr1FM0jy
 dbnLDyaq/A==
 =QaFI
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block

Pull more block driver updates from Jens Axboe:

 - Last series adding error handling support for add_disk() in drivers.
   After this one, and once the SCSI side has been merged, we can
   finally annotate add_disk() as must_check. (Luis)

 - bcache fixes (Coly)

 - zram fixes (Ming)

 - ataflop locking fix (Tetsuo)

 - nbd fixes (Ye, Yu)

 - MD merge via Song
      - Cleanup (Yang)
      - sysfs fix (Guoqing)

 - Misc fixes (Geert, Wu, luo)

* tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block: (34 commits)
  bcache: Revert "bcache: use bvec_virt"
  ataflop: Add missing semicolon to return statement
  floppy: address add_disk() error handling on probe
  ataflop: address add_disk() error handling on probe
  block: update __register_blkdev() probe documentation
  ataflop: remove ataflop_probe_lock mutex
  mtd/ubi/block: add error handling support for add_disk()
  block/sunvdc: add error handling support for add_disk()
  z2ram: add error handling support for add_disk()
  nvdimm/pmem: use add_disk() error handling
  nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned
  nvdimm/blk: add error handling support for add_disk()
  nvdimm/blk: avoid calling del_gendisk() on early failures
  nvdimm/btt: add error handling support for add_disk()
  nvdimm/btt: use goto error labels on btt_blk_init()
  loop: Remove duplicate assignments
  drbd: Fix double free problem in drbd_create_device
  nvdimm/btt: do not call del_gendisk() if not needed
  bcache: fix use-after-free problem in bcache_device_free()
  zram: replace fsync_bdev with sync_blockdev
  ...
2021-11-09 11:24:08 -08:00
Linus Torvalds
3e28850cbd for-5.16/block-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqAcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpojrD/4yA+GgV+jWeIepYWvU81TQFpt9AJmzWbrY
 uryj4dy7EdMjun+JkAP8k4qreqvTZRsJMkr9dhmS4qaM8/Vt8K/RU/0n/lxNVmqc
 1//ZaTS6DURVAc52GHIXD3q4cv8pHofTZZlrj1Hgz35shlOayStGJtktH5f8uQl4
 5Yxjh+HKr15Chym+fKlbR6T7BgVxxNyhT9q89BgUwMAJX+1KRVtwtkyVK5IbObFy
 zOeiC+n9niQ6iJHcLoqb7LjfBOs/VjdNOQYGSCAnrBxuQ8GnEP2xDw2nvFlOPE12
 5tWEwTgAX7381ilbL6VvNTlTafIs/Axt8mI0cY/OMW7ApiHwO3rXjQSqA4yrnKCJ
 h6M1QavqThd2DtMnOi0U5wwgtD2UjS+CMpK5XFxeIyl6GqTgZcaWm3VqRnG68KZD
 r5+o99GKWCHy0cckxq2WiWJouReeNZ9u9R6HNDw0Vb8UNyWgBR+v2MkX+SHS/c85
 2gXm10hwBH7BFnC4X8ceiuT/bm7xm9S6D/3LCVitlUTBRfqobsQEQjSciPeoOtL0
 rRSTKob7jtokiB2q01wx3q1jnUMpxE1fqJkpLjUvebTzw+a+xfPwy0nNTGq0XXIv
 WMVRRpSWCZm04Ru0q/K8cj0GOyur5x+ilefZ1V+/sRU5dVmGuJgbJUxei1HPC6eV
 z9Rn0aFv4g==
 =1GPi
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Set of fixes for the batched tag allocation (Ming, me)

 - add_disk() error handling fix (Luis)

 - Nested queue quiesce fixes (Ming)

 - Shared tags init error handling fix (Ye)

 - Misc cleanups (Jean, Ming, me)

* tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block:
  nvme: wait until quiesce is done
  scsi: make sure that request queue queiesce and unquiesce balanced
  scsi: avoid to quiesce sdev->request_queue two times
  blk-mq: add one API for waiting until quiesce is done
  blk-mq: don't free tags if the tag_set is used by other device in queue initialztion
  block: fix device_add_disk() kobject_create_and_add() error handling
  block: ensure cached plug request matches the current queue
  block: move queue enter logic into blk_mq_submit_bio()
  block: make bio_queue_enter() fast-path available inline
  block: split request allocation components into helpers
  block: have plug stored requests hold references to the queue
  blk-mq: update hctx->nr_active in blk_mq_end_request_batch()
  blk-mq: add RQF_ELV debug entry
  blk-mq: only try to run plug merge if request has same queue with incoming bio
  block: move RQF_ELV setting into allocators
  dm: don't stop request queue after the dm device is suspended
  block: replace always false argument with 'false'
  block: assign correct tag before doing prefetch of request
  blk-mq: fix redundant check of !e expression
2021-11-09 11:20:07 -08:00
Linus Torvalds
1dc1f92e24 for-5.16/bdev-size-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKp6YQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphBaD/9Wm1lUI5qPB9tIQERWNsx4phr2Xb5WV3rg
 6ZFPDnjw0EEdfJmHLMKT5w/lguARtTKJAPeMmvX38RhB8U+44jeok0Bc+mJajjBD
 7OkE2KHQ/GnnZxzHQVyoYYo390zLOwNqALDLE55bfQsxKjJXXM03Stwcubzg9MZ+
 JRR4mFISU+7qhYu6Qx1RAvEqrGqWzstgrkbsnf/MJ7dKAy3cyqOiXKc/gDKGipIr
 skSrwbsB9AE6zW5UMdWN0r/wz1fm2bfJOR1meYgUj83exi1EoWDt52oceYWPky2D
 ievHRWUXBkcGgDNvKO6ZO99spJqs/MUy0jhzAW46qYDYUfMcbzthewWzX5mxQBDx
 YtDgH949DLVB/jPg0eoTeB8j1hcfO/gU6jdR9bJl4ZQKE/9UB9RMfusXHklmcSTL
 nZY8SfWp2ecu6HkGUPXJf051mEwCHCaROIslVeNeZBAbh/duVz3VG9EIO+seLOFR
 JJaqsB3xVk1HhqUYoQ+IiAHAkROa73jSv4hsvDu8BX923pv4JWAbTslRFU8lCG1w
 heJZ/QiP20QtEcEVCg14wEMiH5tBMvM7OhslQbhOSs+wcm8zyShjRhoa2idq/aTw
 k289DEfI52k8SIyI0+rClbWR+hEQ9y2K5opur8gR1SuIv9rbM+UzBc5hDbfT6t3W
 4xPpYESkhg==
 =bjgx
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/bdev-size-2021-11-09' of git://git.kernel.dk/linux-block

Pull more bdev size updates from Jens Axboe:
 "Two followup changes for the bdev-size series from this merge window:

   - Add loff_t cast to bdev_nr_bytes() (Christoph)

   - Use bdev_nr_bytes() consistently for the block parts at least (me)"

* tag 'for-5.16/bdev-size-2021-11-09' of git://git.kernel.dk/linux-block:
  block: use new bdev_nr_bytes() helper for blkdev_{read,write}_iter()
  block: add a loff_t cast to bdev_nr_bytes
2021-11-09 11:16:20 -08:00
Ming Lei
9ef4d0209c blk-mq: add one API for waiting until quiesce is done
Some drivers(NVMe, SCSI) need to call quiesce and unquiesce in pair, but it
is hard to switch to this style, so these drivers need one atomic flag for
helping to balance quiesce and unquiesce.

When quiesce is in-progress, the driver still needs to wait until
the quiesce is done, so add API of blk_mq_wait_quiesce_done() for
these drivers.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211109071144.181581-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 08:14:27 -07:00
Ye Bin
a846a8e6c9 blk-mq: don't free tags if the tag_set is used by other device in queue initialztion
We got UAF report on v5.10 as follows:
[ 1446.674930] ==================================================================
[ 1446.675970] BUG: KASAN: use-after-free in blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.676902] Read of size 8 at addr ffff8880185afd10 by task kworker/1:2/12348
[ 1446.677851]
[ 1446.678073] CPU: 1 PID: 12348 Comm: kworker/1:2 Not tainted 5.10.0-10177-gc9c81b1e346a #2
[ 1446.679168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1446.680692] Workqueue: kthrotld blk_throtl_dispatch_work_fn
[ 1446.681448] Call Trace:
[ 1446.681800]  dump_stack+0x9b/0xce
[ 1446.682916]  print_address_description.constprop.6+0x3e/0x60
[ 1446.685999]  kasan_report.cold.9+0x22/0x3a
[ 1446.687186]  blk_mq_get_driver_tag+0x9a4/0xa90
[ 1446.687785]  blk_mq_dispatch_rq_list+0x21a/0x1d40
[ 1446.692576]  __blk_mq_do_dispatch_sched+0x394/0x830
[ 1446.695758]  __blk_mq_sched_dispatch_requests+0x398/0x4f0
[ 1446.698279]  blk_mq_sched_dispatch_requests+0xdf/0x140
[ 1446.698967]  __blk_mq_run_hw_queue+0xc0/0x270
[ 1446.699561]  __blk_mq_delay_run_hw_queue+0x4cc/0x550
[ 1446.701407]  blk_mq_run_hw_queue+0x13b/0x2b0
[ 1446.702593]  blk_mq_sched_insert_requests+0x1de/0x390
[ 1446.703309]  blk_mq_flush_plug_list+0x4b4/0x760
[ 1446.705408]  blk_flush_plug_list+0x2c5/0x480
[ 1446.708471]  blk_finish_plug+0x55/0xa0
[ 1446.708980]  blk_throtl_dispatch_work_fn+0x23b/0x2e0
[ 1446.711236]  process_one_work+0x6d4/0xfe0
[ 1446.711778]  worker_thread+0x91/0xc80
[ 1446.713400]  kthread+0x32d/0x3f0
[ 1446.714362]  ret_from_fork+0x1f/0x30
[ 1446.714846]
[ 1446.715062] Allocated by task 1:
[ 1446.715509]  kasan_save_stack+0x19/0x40
[ 1446.716026]  __kasan_kmalloc.constprop.1+0xc1/0xd0
[ 1446.716673]  blk_mq_init_tags+0x6d/0x330
[ 1446.717207]  blk_mq_alloc_rq_map+0x50/0x1c0
[ 1446.717769]  __blk_mq_alloc_map_and_request+0xe5/0x320
[ 1446.718459]  blk_mq_alloc_tag_set+0x679/0xdc0
[ 1446.719050]  scsi_add_host_with_dma.cold.3+0xa0/0x5db
[ 1446.719736]  virtscsi_probe+0x7bf/0xbd0
[ 1446.720265]  virtio_dev_probe+0x402/0x6c0
[ 1446.720808]  really_probe+0x276/0xde0
[ 1446.721320]  driver_probe_device+0x267/0x3d0
[ 1446.721892]  device_driver_attach+0xfe/0x140
[ 1446.722491]  __driver_attach+0x13a/0x2c0
[ 1446.723037]  bus_for_each_dev+0x146/0x1c0
[ 1446.723603]  bus_add_driver+0x3fc/0x680
[ 1446.724145]  driver_register+0x1c0/0x400
[ 1446.724693]  init+0xa2/0xe8
[ 1446.725091]  do_one_initcall+0x9e/0x310
[ 1446.725626]  kernel_init_freeable+0xc56/0xcb9
[ 1446.726231]  kernel_init+0x11/0x198
[ 1446.726714]  ret_from_fork+0x1f/0x30
[ 1446.727212]
[ 1446.727433] Freed by task 26992:
[ 1446.727882]  kasan_save_stack+0x19/0x40
[ 1446.728420]  kasan_set_track+0x1c/0x30
[ 1446.728943]  kasan_set_free_info+0x1b/0x30
[ 1446.729517]  __kasan_slab_free+0x111/0x160
[ 1446.730084]  kfree+0xb8/0x520
[ 1446.730507]  blk_mq_free_map_and_requests+0x10b/0x1b0
[ 1446.731206]  blk_mq_realloc_hw_ctxs+0x8cb/0x15b0
[ 1446.731844]  blk_mq_init_allocated_queue+0x374/0x1380
[ 1446.732540]  blk_mq_init_queue_data+0x7f/0xd0
[ 1446.733155]  scsi_mq_alloc_queue+0x45/0x170
[ 1446.733730]  scsi_alloc_sdev+0x73c/0xb20
[ 1446.734281]  scsi_probe_and_add_lun+0x9a6/0x2d90
[ 1446.734916]  __scsi_scan_target+0x208/0xc50
[ 1446.735500]  scsi_scan_channel.part.3+0x113/0x170
[ 1446.736149]  scsi_scan_host_selected+0x25a/0x360
[ 1446.736783]  store_scan+0x290/0x2d0
[ 1446.737275]  dev_attr_store+0x55/0x80
[ 1446.737782]  sysfs_kf_write+0x132/0x190
[ 1446.738313]  kernfs_fop_write_iter+0x319/0x4b0
[ 1446.738921]  new_sync_write+0x40e/0x5c0
[ 1446.739429]  vfs_write+0x519/0x720
[ 1446.739877]  ksys_write+0xf8/0x1f0
[ 1446.740332]  do_syscall_64+0x2d/0x40
[ 1446.740802]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1446.741462]
[ 1446.741670] The buggy address belongs to the object at ffff8880185afd00
[ 1446.741670]  which belongs to the cache kmalloc-256 of size 256
[ 1446.743276] The buggy address is located 16 bytes inside of
[ 1446.743276]  256-byte region [ffff8880185afd00, ffff8880185afe00)
[ 1446.744765] The buggy address belongs to the page:
[ 1446.745416] page:ffffea0000616b00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x185ac
[ 1446.746694] head:ffffea0000616b00 order:2 compound_mapcount:0 compound_pincount:0
[ 1446.747719] flags: 0x1fffff80010200(slab|head)
[ 1446.748337] raw: 001fffff80010200 ffffea00006a3208 ffffea000061bf08 ffff88801004f240
[ 1446.749404] raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
[ 1446.750455] page dumped because: kasan: bad access detected
[ 1446.751227]
[ 1446.751445] Memory state around the buggy address:
[ 1446.752102]  ffff8880185afc00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.753090]  ffff8880185afc80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.754079] >ffff8880185afd00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.755065]                          ^
[ 1446.755589]  ffff8880185afd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 1446.756574]  ffff8880185afe00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1446.757566] ==================================================================

Flag 'BLK_MQ_F_TAG_QUEUE_SHARED' will be set if the second device on the
same host initializes it's queue successfully. However, if the second
device failed to allocate memory in blk_mq_alloc_and_init_hctx() from
blk_mq_realloc_hw_ctxs() from blk_mq_init_allocated_queue(),
__blk_mq_free_map_and_rqs() will be called on error path, and if
'BLK_MQ_TAG_HCTX_SHARED' is not set, 'tag_set->tags' will be freed
while it's still used by the first device.

To fix this issue we move release newly allocated hardware context from
blk_mq_realloc_hw_ctxs to __blk_mq_update_nr_hw_queues. As there is needn't to
release hardware context in blk_mq_init_allocated_queue.

Fixes: 868f2f0b72 ("blk-mq: dynamic h/w context count")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211108074019.1058843-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-08 06:23:49 -07:00
Jens Axboe
138c1a3811 block: use new bdev_nr_bytes() helper for blkdev_{read,write}_iter()
We have new helpers for this, use them rather than the slower inode
size reads. This makes the read/write path consistent with most of
the rest of block as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/a72767cd-3c6d-47f7-80f4-aa025a17b2cb@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-05 08:32:05 -06:00
Luis Chamberlain
fe7d064fa3 block: fix device_add_disk() kobject_create_and_add() error handling
Commit 83cbce9574 ("block: add error handling for device_add_disk /
add_disk") added error handling to device_add_disk(), however the goto
label for the kobject_create_and_add() failure did not set the return
value correctly, and so we can end up in a situation where
kobject_create_and_add() fails but we report success.

Fixes: 83cbce9574 ("block: add error handling for device_add_disk / add_disk")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211103164023.1384821-1-mcgrof@kernel.org
[axboe: fold in followup fix from Wu Bo <wubo40@huawei.com>]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-04 23:21:03 -06:00
Jens Axboe
10c4787015 block: ensure cached plug request matches the current queue
If we're driving multiple devices, we could have pre-populated the cache
for a different device. Ensure that the empty request matches the current
queue.

Fixes: 47c122e35d ("block: pre-allocate requests if plug is started and is a batch")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-04 23:21:01 -06:00
Jens Axboe
900e080752 block: move queue enter logic into blk_mq_submit_bio()
Retain the old logic for the fops based submit, but for our internal
blk_mq_submit_bio(), move the queue entering logic into the core
function itself.

We need to be a bit careful if going into the scheduler, as a scheduler
or queue mappings can arbitrarily change before we have entered the queue.
Have the bio scheduler mapping do that separately, it's a very cheap
operation compared to actually doing merging locking and lookups.

Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: update to check merge post submit_bio_checks() doing remap...]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-04 23:20:10 -06:00
Jens Axboe
c98cb5bbda block: make bio_queue_enter() fast-path available inline
Just a prep patch for shifting the queue enter logic. This moves the
expected fast path inline, and leaves __bio_queue_enter() as an
out-of-line function call. We don't want to inline the latter, as it's
mostly slow path code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-04 12:54:33 -06:00
Jens Axboe
71539717c1 block: split request allocation components into helpers
This is in preparation for a fix, but serves as a cleanup as well moving
the cached vs regular alloc logic out of blk_mq_submit_bio().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-04 12:50:51 -06:00
Jens Axboe
c5fc7b9317 block: have plug stored requests hold references to the queue
Requests that were stored in the cache deliberately didn't hold an enter
reference to the queue, instead we grabbed one every time we pulled a
request out of there. That made for awkward logic on freeing the remainder
of the cached list, if needed, where we had to artificially raise the
queue usage count before each free.

Grab references up front for cached plug requests. That's safer, and also
more efficient.

Fixes: 47c122e35d ("block: pre-allocate requests if plug is started and is a batch")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-04 12:50:46 -06:00
Luis Chamberlain
26e06f5b13 block: update __register_blkdev() probe documentation
__register_blkdev() is used to register a probe callback, and
that callback is typically used to call add_disk(). Now that
we are able to capture errors for add_disk(), we need to fix
those probe calls where add_disk() fails and clean up resources.

We don't extend the probe call to return the error given:

1) we'd have to always special-case the case where the disk
   was already present, as otherwise concurrent requests to
   open an existing block device would fail, and this would be
   a userspace visible change
2) the error from ilookup() on blkdev_get_no_open() is sufficient
3) The only thing the probe call is used for is to support
   pre-devtmpfs, pre-udev semantics that want to create disks when
   their pre-created device node is accessed, and so we don't care
   for failures on probe there.

Expand documentation for the probe callback to ensure users cleanup
resources if add_disk() is used and to clarify this interface may be
removed in the future.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20211103230437.1639990-12-mcgrof@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-04 05:48:50 -06:00
Ming Lei
3b87c6ea67 blk-mq: update hctx->nr_active in blk_mq_end_request_batch()
In case of shared tags and none io sched, batched completion still may
be run into, and hctx->nr_active is accounted when getting driver tag,
so it has to be updated in blk_mq_end_request_batch().

Otherwise, hctx->nr_active may become same with queue depth, then
hctx_may_queue() always return false, then io hang is caused.

Fixes the issue by updating the counter in batched way.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: f794f3351f ("block: add support for blk_mq_end_request_batch()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211102153619.3627505-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-03 09:27:57 -06:00
Ming Lei
62ba0c008f blk-mq: add RQF_ELV debug entry
Looks it is missed so add it.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211102133502.3619184-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-03 09:27:57 -06:00
Ming Lei
a1cb65377e blk-mq: only try to run plug merge if request has same queue with incoming bio
It is obvious that io merge can't be done between two different queues, so
just try to run io merge in case of same queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211102133502.3619184-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-03 09:27:57 -06:00
Jens Axboe
781dd830ec block: move RQF_ELV setting into allocators
It's not safe to do this before blk_queue_enter(), as the scheduler state
could have changed in between. Hence move the RQF_ELV setting into the
allocators, where we know the queue is already entered.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Steffen Maier <maier@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-03 09:26:26 -06:00
Jens Axboe
b22809092c block: replace always false argument with 'false'
A previous commit fixed up the condition for doing direct issue, but that
left the 'from_schedule' argument dead inside the branch. Replace it with
'false'.

Fixes: ff1552232b ("blk-mq: don't issue request directly in case that current is to be blocked")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-02 06:57:27 -06:00
Jens Axboe
a22c00be90 block: assign correct tag before doing prefetch of request
Ensure that current tag is correctly assigned before attempting
to prefetch the first cacheline of the request.

Fixes: 92aff191cc ("block: prefetch request to be initialized")
Reported-and-tested-by: syzbot+cd20829ac44b92bf6ed0@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-02 06:57:20 -06:00
Linus Torvalds
19901165d9 for-5.16/inode-sync-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8MEkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkWyEACBp3TltQu/jvyFlCzuOQJqpIqVw6ZeRn9h
 0cYZaYsRzNBTzIOKogpmhT3lWYOMxIbFMq6RyzLCPaQz6juEP+tmQIdLdPMxC5ON
 XdzItF0bMaLzoW0IRK21/aF1s/7UFcr1OLT0BT8F0umeQQXcEOOSim4kZuK9u6mS
 4pOvh61yXeB7UZxDOpMqH3aVlwrLjIr51j0ECGx/Qz1OZtXREQSeptlRUKEKVTXB
 uYPCB9FLL6ZWFyiDAuaiO4Gi//dhpoOe7Yich9m0tbtfei8gl74TqgzeaCBu+gFj
 aRyfwhyvFcm69MJqPGmRBDVxtXVC6ofjd4G6PSG8R/cAuAgPFywL/s0ETmjUJBvY
 HqnExUnMcr8FUHGIfYHmX7EWCAtD+FbpUSnCgWH2ulUhziKFR/LLE/ZYayPbhrgL
 aA89BYpeDS/POc94KXJJON/Ux612vGwhJxVsngYBEboYNeiP7YwsaQapU9RsKp0o
 YTlhz8zFuToUPEh6BQLYuOZek5AsEue5o7525Aj0vdjpxH/qH6JhjE790c7yWhL+
 hbxlTAAdqdVO2Xxrr3qdMXBUI3wnFKKu8Z6+oqi7ujQRKJZmLnXYn4ZkNRs6C858
 3NEW0mySPHxNRCZrt2M7zWmoq/eZtcJIzPy4JMW3xkQgqgdImuT1z7PrgRDw6/h8
 GB382CO2AQ==
 =AKpp
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block

Pull block inode sync updates from Jens Axboe:
 "This contains improvements to how bdev inode syncing is handled,
  unifying the API"

* tag 'for-5.16/inode-sync-2021-10-29' of git://git.kernel.dk/linux-block:
  block: simplify the block device syncing code
  ntfs3: use sync_blockdev_nowait
  fat: use sync_blockdev_nowait
  btrfs: use sync_blockdev
  xen-blkback: use sync_blockdev
  block: remove __sync_blockdev
  fs: remove __sync_filesystem
2021-11-01 10:25:27 -07:00
Linus Torvalds
b6773cdb0e for-5.16/ki_complete-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8MOUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmeqEACrayLMDMdlb1FduTYw29QAL7XxS375r92T
 bwLippmKQIFNi8p5ScHraelV5ixgxse2j68MexlQHpl9aHIn/oL7qHACIMgDP05m
 KaSy8Hr2abqr+zz+rLMhkm21zAva6aWjQu7NoEjBE4dC5L4l9p885LaA+jmqQUno
 1wvpaEcype8cITJ+sSCb3kD6nZx7y1Lt5zEefUfk6ruMm9x9FwvU6uc4rIHi+Zve
 Hwo8yGbTvlU8rGSi9naC/U8pIZ4bqEuTAcV5VHNrWG+b4aA/aFPpSjpIiSBZSXo0
 HXa+jmcr6gkejfPeOZkBbRub6Fm9Wq2pDAZskPWFX6zyX0pIV05GjJ2J/ba8rovn
 QrcfxaBv8XitKgrjFZeR0ZBqD2iJjPA/Yq5/r1ZmZ0wSHI3W4UuTGhQYEPyDLceH
 ZWq/wcfVFek4kAoCxCqy9kWiOujY90WWKQW3yD7b8FPZ0d+/R1Mn+drlYaSKN1Pk
 /9/+z1DaLtBWbJ2G+BQ9oUkYmNSapAiYc2YXVss86hmhLX+prFtSj3zECZUvhyAz
 b42A2DVsjU+65yT2zdPBXlMrbI91qNnvIXcz5szNdTfHTn9FiLQb4BffMV0FHT3g
 vap8N3Rb8UkZ3v4NCVAtlfcGr0kvYHQH+Qgh6oAlXB4NQoKJCVadzpTFPMWjx788
 oHBUjA0UTQ==
 =4vl/
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block

Pull kiocb->ki_complete() cleanup from Jens Axboe:
 "This removes the res2 argument from kiocb->ki_complete().

  Only the USB gadget code used it, everybody else passes 0. The USB
  guys checked the user gadget code they could find, and everybody just
  uses res as expected for the async interface"

* tag 'for-5.16/ki_complete-2021-10-29' of git://git.kernel.dk/linux-block:
  fs: get rid of the res2 iocb->ki_complete argument
  usb: remove res2 argument from gadget code completions
2021-11-01 10:17:11 -07:00
Linus Torvalds
71ae42629e for-5.16/passthrough-flag-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8MnsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpuBpEACzrzbUfkTQ33bwF60mZQaqbR0ha7TrP/hp
 oAqthmf1S2U+7mzXHQ+6MN7p4+TVPa/ITxQZtLTw7U/68+w68tTUZfZHJ5H6tSXu
 92OHFDDP4ZeqATRTcJBij/5Si9BiKBHexMqeyVYPw0DWdEukAko9f7Z81GonFbTu
 EIdIWivBc76bLiK/X3w7lhLcaNyUv9cKalwjbI4xtwcHtcIYj5d2jIc9PF2I9Xtl
 3oqNT4GOSv7s3mW7syB1UEPrzbhVIzCSNbMSviCoK7GA5g8EN5KMEGQQoUJ942Zv
 bHMjMpGrXsWebPto9maXycGY/9WsVcpNB7opyQRpyG8yDDZq0AFNJxD/NBMkQo4S
 Sfp0fxpVXDRWu7zX0EktwGyOp4YNwfS6pDeAhqhnSl2uPWTsxGZ0kXvlMpR9Rt/t
 TjEKZe6lmcC7s42rPVRBRw5HEzEsVovf0z4lyvC4M223CV3c5cuYkAAtCcqLdVWq
 JkceHSb7EKu7QY6jf3sBud14HaAj+sub7kffOWhhAxObg3Ytsql61AGzbhandnxT
 AtN3n9PBHNGmrSv4MiiuP+Dq5jeT5NspFkf1FvnRcfmZMJtH1VXHKr84JbAy4VHr
 5cZoDJzL9Zm1d865f+VWkZeYd3b2kKP8C0dm6tAn4VweT6eb8bu6tgB7wFQwLIFK
 aRxz5vQ1AQ==
 =dLYJ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block

Pull QUEUE_FLAG_SCSI_PASSTHROUGH removal from Jens Axboe:
 "This contains a series leading to the removal of the
  QUEUE_FLAG_SCSI_PASSTHROUGH queue flag"

* tag 'for-5.16/passthrough-flag-2021-10-29' of git://git.kernel.dk/linux-block:
  block: remove blk_{get,put}_request
  block: remove QUEUE_FLAG_SCSI_PASSTHROUGH
  block: remove the initialize_rq_fn blk_mq_ops method
  scsi: add a scsi_alloc_request helper
  bsg-lib: initialize the bsg_job in bsg_transport_sg_io_fn
  nfsd/blocklayout: use ->get_unique_id instead of sending SCSI commands
  sd: implement ->get_unique_id
  block: add a ->get_unique_id method
2021-11-01 10:12:44 -07:00
Linus Torvalds
3f01727f75 for-5.16/bdev-size-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8L70QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo9YEAC17yEJ0xwwtUUwZW8avzss4vdcIreFdiZu
 gaS+9Oi1bLxj0d2SjaZXJxjT9K+W2LftEsLuQ4oM6VHiLQkcEDbjJdVm3goftTt5
 aOvVormDdKbWNcGSbgxA/OcyUT39DH7y17NRVdqYzQSpnrhCod/1tb2ssck0OoYb
 VEyBKogMwYeYR55Z3I8yL5pNcEhR8TihZv3rL1iQ7DNpvh5I0I9naSEtGNC84aLP
 s4nwRIG+TYll+mg0sfSB29KF7xkoFQO7X7s1rnC/on+gsFEzbJcgkJPDIWeVLnLm
 ma8F1i+vJliCGaztyXoleAdg5QDiFmwTQwXRPAk2u8njJhcKi/RwIk2QYMZBZmEJ
 bB5EJnlnEaWxjgpCD7JDrtKgIgpbbQHc5QVHRZccsu43UqvDqOZIlvZNYY+h3ivz
 jT1zKuKDaTf8YWbfdOJwqm9e+qyR0AFm3rLMdHO58QEh1DBvSLIIdRCNE8wX7nFM
 Wx/GmQEkPqNTIZwJOQJMygK+sIuFUDybt3oAH2pjX1zyMx7kTJkrXvj0dhSS/B5u
 +gfMs3otWqxQ4P1qfnaUd9mYl8JabV7le2NHzhjdARm4NKFJEtcJe5BJBwiMbo0n
 vodqt7aUIAXwMrZXnWZL+w8CobhJBp8I5XHUgng147gDBuCjYQjBQT334auAXxgz
 MUCgbjBDqw==
 =Vadi
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block

Pull bdev size cleanups from Jens Axboe:
 "Clean up the bdev size handling with new bdev_nr_bytes() helper"

* tag 'for-5.16/bdev-size-2021-10-29' of git://git.kernel.dk/linux-block: (34 commits)
  partitions/ibm: use bdev_nr_sectors instead of open coding it
  partitions/efi: use bdev_nr_bytes instead of open coding it
  block/ioctl: use bdev_nr_sectors and bdev_nr_bytes
  block: cache inode size in bdev
  udf: use sb_bdev_nr_blocks
  reiserfs: use sb_bdev_nr_blocks
  ntfs: use sb_bdev_nr_blocks
  jfs: use sb_bdev_nr_blocks
  ext4: use sb_bdev_nr_blocks
  block: add a sb_bdev_nr_blocks helper
  block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate
  squashfs: use bdev_nr_bytes instead of open coding it
  reiserfs: use bdev_nr_bytes instead of open coding it
  pstore/blk: use bdev_nr_bytes instead of open coding it
  ntfs3: use bdev_nr_bytes instead of open coding it
  nilfs2: use bdev_nr_bytes instead of open coding it
  nfs/blocklayout: use bdev_nr_bytes instead of open coding it
  jfs: use bdev_nr_bytes instead of open coding it
  hfsplus: use bdev_nr_sectors instead of open coding it
  hfs: use bdev_nr_sectors instead of open coding it
  ...
2021-11-01 09:50:37 -07:00
Linus Torvalds
33c8846c81 for-5.16/block-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KDgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQ2D/wO0nH3U+3+OZChi3XUwYck9Dev3o6BANCF
 ClATiK/kivZY0xY1r8J4ixirZo2gcjIMpWSC3JGYZ5LdspfmYGLUbMjfZsaeU23i
 lAKaX1IqfArmHN76k3IU1bKCg7B0/LFwC0q9QTFWTSwNSs8RK/EZLJ61U1hEXUb3
 OfIpaMmvPiMaU7yuPqhcZK14m1cg1srrLM4rFB/PqsWWStF07pHq32WeArGDAU0e
 Fe0YSnYD7qqA5Qc37KwqjCTmmxKX5YZf7etIcA6p3DNmwcuQrVNzKoCH/ZEDijaD
 E2bS/BWbN1x96+rtoEZfBYEaNIrkmJzmW6+fJ53OITbJF3KqP6V66erhqNcFYCzC
 mhFlRe7voXb/8AP7zQqSIhK529BUBM36sQ6nF7EiQcDrfLc1z39mq6eblUxbknIA
 DDPISD5Tseik9N9x0bc7vINseKyHI1E90VAU/XKADcuGbzLvehPx+2p+Iq5ch5Ah
 oa1G3RdlWWQOZxphJHWJhu1qMfo5+FP9dFZj1aoo7b8Kbc/CedyoQe71cpIE5wNh
 Jj/EpWJnuyKXwuTic2VYGC+6ezM9O5DSdqCfP3YuZky95VESyvRCKJYMMgBYRVdC
 /LuxhnBXIY2G8An7ZTnX0kLCCvLbapIwa0NyA98/xeOngO843coJ6wn8ZmE9LJNH
 kMmpCygUrA==
 =QWC+
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - mq-deadline accounting improvements (Bart)

 - blk-wbt timer fix (Andrea)

 - Untangle the block layer includes (Christoph)

 - Rework the poll support to be bio based, which will enable adding
   support for polling for bio based drivers (Christoph)

 - Block layer core support for multi-actuator drives (Damien)

 - blk-crypto improvements (Eric)

 - Batched tag allocation support (me)

 - Request completion batching support (me)

 - Plugging improvements (me)

 - Shared tag set improvements (John)

 - Concurrent queue quiesce support (Ming)

 - Cache bdev in ->private_data for block devices (Pavel)

 - bdev dio improvements (Pavel)

 - Block device invalidation and block size improvements (Xie)

 - Various cleanups, fixes, and improvements (Christoph, Jackie,
   Masahira, Tejun, Yu, Pavel, Zheng, me)

* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
  blk-mq-debugfs: Show active requests per queue for shared tags
  block: improve readability of blk_mq_end_request_batch()
  virtio-blk: Use blk_validate_block_size() to validate block size
  loop: Use blk_validate_block_size() to validate block size
  nbd: Use blk_validate_block_size() to validate block size
  block: Add a helper to validate the block size
  block: re-flow blk_mq_rq_ctx_init()
  block: prefetch request to be initialized
  block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
  block: add rq_flags to struct blk_mq_alloc_data
  block: add async version of bio_set_polled
  block: kill DIO_MULTI_BIO
  block: kill unused polling bits in __blkdev_direct_IO()
  block: avoid extra iter advance with async iocb
  block: Add independent access ranges support
  blk-mq: don't issue request directly in case that current is to be blocked
  sbitmap: silence data race warning
  blk-cgroup: synchronize blkg creation against policy deactivation
  block: refactor bio_iov_bvec_set()
  block: add single bio async direct IO helper
  ...
2021-11-01 09:19:50 -07:00
Jean Sacren
ef1661ba6d blk-mq: fix redundant check of !e expression
In the if branch, e is checked.  In the else branch, ->dispatch_busy is
merely a number and has no effect on !e.  We should remove the check of
!e since it is always true.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Link: https://lore.kernel.org/r/20211029202945.3052-1-sakiwit@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-30 09:34:14 -06:00
Linus Torvalds
a379fbbcb8 block-5.15-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8DN8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptM8D/44F9YcY8qRhrZmsUFr0QFlvFHHUCVCWtDR
 JW3JQN3hV0zBEVIvc0P3NSKAih/1+rJ3WZmVZA0lczm5OHv4C+ESZSmcl3Muv4Tk
 skOWxwDTIoSCvC+DzDw8k5UluOucLU9V7uLHQYqDOsqHngLwUerGDMwGfkMXKkNb
 zRvVaqQMUJufY0tN5QaEjl+GsaXiZJ0pid0MOtXo8NeU+K0BDyoBUF5Gco3/8ZYa
 NtD4hwM48kYoCNJDAeJmRNo3vArPpZdiJ77jeVXHrj42Mp20LK/jD7PdguEbUzq5
 3uXhn0boZCKFGhWQntkL18WwaZbFRZzTBpBqpFIjQKIvicNRoGArIguwwmwdt53P
 lbsgGgyMqQ3KvuOIEgrAFieA/mQ8iw8Pf/QWiQRk2aYA5n+miex1XfmVX7dVipdm
 OcV5HtLrKPR1newr0/eZIvN31C3tgaViYxxQOunfW57fXPthCazal+ON6K5w9ZZ8
 y79P+K1czCS/edKLTB+idvmWWijoF4GRguUMoCKsD4uXOZO0tk/f/U/ds8/+LBxm
 KQv8T9wBd+r5h225cB+boM+zslkB4vqnCT+MyiIp070ZAbi/ohirYcFC2fM5j57B
 UZ58C+WjGjtC0nhb0xS8EGM43ow6hkS+8LPub3fb01AKKvfaw22grc9vHBaFgXgW
 5rqLDaAIcg==
 =73x5
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-29' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request:
      - fix nvmet-tcp header digest verification (Amit Engel)
      - fix a memory leak in nvmet-tcp when releasing a queue (Maurizio
        Lombardi)
      - fix nvme-tcp H2CData PDU send accounting again (Sagi Grimberg)
      - fix digest pointer calculation in nvme-tcp and nvmet-tcp (Varun
        Prakash)
      - fix possible nvme-tcp req->offset corruption (Varun Prakash)

 - Queue drain ordering fix (Ming)

 - Partition check regression for zoned devices (Shin'ichiro)

 - Zone queue restart fix (Naohiro)

* tag 'block-5.15-2021-10-29' of git://git.kernel.dk/linux-block:
  block: Fix partition check for host-aware zoned block devices
  nvmet-tcp: fix header digest verification
  nvmet-tcp: fix data digest pointer calculation
  nvme-tcp: fix data digest pointer calculation
  nvme-tcp: fix possible req->offset corruption
  block: schedule queue restart after BLK_STS_ZONE_RESOURCE
  block: drain queue after disk is removed from sysfs
  nvme-tcp: fix H2CData PDU send accounting (again)
  nvmet-tcp: fix a memory leak when releasing a queue
2021-10-29 11:10:29 -07:00
John Garry
9b84c629c9 blk-mq-debugfs: Show active requests per queue for shared tags
Currently we show the hctx.active value for the per-hctx "active" file.

However this is not maintained for shared tags, and we instead keep a
record of the number active requests per request queue - see commit
f1b49fdc1c ("blk-mq: Record active_queues_shared_sbitmap per tag_set for
when using shared sbitmap).

Change for the case of shared tags to show the active requests per request
queue by using __blk_mq_active_requests() helper.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1635496823-33515-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-29 06:53:34 -06:00
Christoph Hellwig
0bf6d96cb8 block: remove blk_{get,put}_request
These are now pointless wrappers around blk_mq_{alloc,free}_request,
so remove them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211025070517.1548584-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-29 06:50:52 -06:00
Jens Axboe
02f7eab009 block: improve readability of blk_mq_end_request_batch()
It's faster and easier to read if we tolerate cur_hctx being NULL in
the "when to flush" condition. Rename last_hctx to cur_hctx while at it,
as it better describes the role of that variable.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-28 12:08:34 -06:00
Jens Axboe
c7b84d4226 block: re-flow blk_mq_rq_ctx_init()
Now that we have flags passed in, we can do a final re-arrange of the
flow of blk_mq_rq_ctx_init() so we're always writing request in the
order in which it is laid out.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-5-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 08:43:15 -06:00
Jens Axboe
92aff191cc block: prefetch request to be initialized
Now we have the tags available in __blk_mq_alloc_requests_batch(), we
can start fetching the first request cacheline before calling into the
request initialization.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-4-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 08:43:15 -06:00
Jens Axboe
fe6134f669 block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
Instead of getting this from data for every invocation of request
initialization, pass it in as an argument instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-3-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 08:43:15 -06:00
Jens Axboe
56f8da642b block: add rq_flags to struct blk_mq_alloc_data
There's a hole here we can use, and it's faster to set this earlier
rather than need to check q->elevator multiple times.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20211019153300.623322-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 08:43:15 -06:00
Shin'ichiro Kawasaki
e0c60d0102 block: Fix partition check for host-aware zoned block devices
Commit a33df75c63 ("block: use an xarray for disk->part_tbl") modified
the method to check partition existence in host-aware zoned block
devices from disk_has_partitions() helper function call to empty check
of xarray disk->part_tbl. However, disk->part_tbl always has single
entry for disk->part0 and never becomes empty. This resulted in the
host-aware zoned devices always judged to have partitions, and it made
the sysfs queue/zoned attribute to be "none" instead of "host-aware"
regardless of partition existence in the devices.

This also caused DEBUG_LOCKS_WARN_ON(lock->magic != lock) for
sdkp->rev_mutex in scsi layer when the kernel detects host-aware zoned
device. Since block layer handled the host-aware zoned devices as non-
zoned devices, scsi layer did not have chance to initialize the mutex
for zone revalidation. Therefore, the warning was triggered.

To fix the issues, call the helper function disk_has_partitions() in
place of disk->part_tbl empty check. Since the function was removed with
the commit a33df75c63, reimplement it to walk through entries in the
xarray disk->part_tbl.

Fixes: a33df75c63 ("block: use an xarray for disk->part_tbl")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.14+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211026060115.753746-1-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 06:58:01 -06:00
Pavel Begunkov
842e39b013 block: add async version of bio_set_polled
If we know that a iocb is async we can optimise bio_set_polled() a bit,
add a new helper bio_set_polled_async().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8fa137885164a5d05fadcff4c3521da8d5a83d00.1635337135.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 06:54:58 -06:00
Pavel Begunkov
e71aa913e2 block: kill DIO_MULTI_BIO
Now __blkdev_direct_IO() serves only multi-bio I/O, thus remove
not used anymore single bio refcounting optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/88eb488aae9ed4852a30f3a7132f296f56e43b80.1635337135.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 06:54:58 -06:00
Pavel Begunkov
25d207dc22 block: kill unused polling bits in __blkdev_direct_IO()
With addition of __blkdev_direct_IO_async(), __blkdev_direct_IO() now
serves only multio-bio I/O, which we don't poll. Now we can remove
anything related to I/O polling from it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b8c597a6b7ee612df394853bfd24726aee5b898e.1635337135.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 06:54:58 -06:00
Pavel Begunkov
1bb6b81029 block: avoid extra iter advance with async iocb
Nobody cares about iov iterators state if we return -EIOCBQUEUED, so as
the we now have __blkdev_direct_IO_async(), which gets pages only once,
we can skip expensive iov_iter_advance(). It's around 1-2% of all CPU
spent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a6158edfbfa2ae3bc24aed29a72f035df18fad2f.1635337135.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-27 06:54:58 -06:00
Damien Le Moal
a2247f19ee block: Add independent access ranges support
The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
(for ATA) contain parameters describing the set of contiguous LBAs that
can be served independently by a single LUN multi-actuator hard-disk.
Similarly, a logically defined block device composed of multiple disks
can in some cases execute requests directed at different sector ranges
in parallel. A dm-linear device aggregating 2 block devices together is
an example.

This patch implements support for exposing a block device independent
access ranges to the user through sysfs to allow optimizing device
accesses to increase performance.

To describe the set of independent sector ranges of a device (actuators
of a multi-actuator HDDs or table entries of a dm-linear device),
The type struct blk_independent_access_ranges is introduced. This
structure describes the sector ranges using an array of
struct blk_independent_access_range structures. This range structure
defines the start sector and number of sectors of the access range.
The ranges in the array cannot overlap and must contain all sectors
within the device capacity.

The function disk_set_independent_access_ranges() allows a device
driver to signal to the block layer that a device has multiple
independent access ranges.  In this case, a struct
blk_independent_access_ranges is attached to the device request queue
by the function disk_set_independent_access_ranges(). The function
disk_alloc_independent_access_ranges() is provided for drivers to
allocate this structure.

struct blk_independent_access_ranges contains kobjects (struct kobject)
to expose to the user through sysfs the set of independent access ranges
supported by a device. When the device is initialized, sysfs
registration of the ranges information is done from blk_register_queue()
using the block layer internal function
disk_register_independent_access_ranges(). If a driver calls
disk_set_independent_access_ranges() for a registered queue, e.g. when a
device is revalidated, disk_set_independent_access_ranges() will execute
disk_register_independent_access_ranges() to update the sysfs attribute
files.  The sysfs file structure created starts from the
independent_access_ranges sub-directory and contains the start sector
and number of sectors of each range, with the information for each range
grouped in numbered sub-directories.

E.g. for a dual actuator HDD, the user sees:

$ tree /sys/block/sdk/queue/independent_access_ranges/
/sys/block/sdk/queue/independent_access_ranges/
|-- 0
|   |-- nr_sectors
|   `-- sector
`-- 1
    |-- nr_sectors
    `-- sector

For a regular device with a single access range, the
independent_access_ranges sysfs directory does not exist.

Device revalidation may lead to changes to this structure and to the
attribute values. When manipulated, the queue sysfs_lock and
sysfs_dir_lock mutexes are held for atomicity, similarly to how the
blk-mq and elevator sysfs queue sub-directories are protected.

The code related to the management of independent access ranges is
added in the new file block/blk-ia-ranges.c.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 20:36:47 -06:00
Naohiro Aota
9586e67b91 block: schedule queue restart after BLK_STS_ZONE_RESOURCE
When dispatching a zone append write request to a SCSI zoned block device,
if the target zone of the request is already locked, the device driver will
return BLK_STS_ZONE_RESOURCE and the request will be pushed back to the
hctx dipatch queue. The queue will be marked as RESTART in
dd_finish_request() and restarted in __blk_mq_free_request(). However, this
restart applies to the hctx of the completed request. If the requeued
request is on a different hctx, dispatch will no be retried until another
request is submitted or the next periodic queue run triggers, leading to up
to 30 seconds latency for the requeued request.

Fix this problem by scheduling a queue restart similarly to the
BLK_STS_RESOURCE case or when we cannot get the budget.

Also, consolidate the checks into the "need_resource" variable to simplify
the condition.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Link: https://lore.kernel.org/r/20211026165127.4151055-1-naohiro.aota@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 16:00:36 -06:00
Ming Lei
d308ae0d29 block: drain queue after disk is removed from sysfs
Before removing disk from sysfs, userspace still may change queue via
sysfs, such as switching elevator or setting wbt latency, both may
reinitialize wbt, then the warning in blk_free_queue_stats() will be
triggered since rq_qos_exit() is moved to del_gendisk().

Fixes the issue by moving draining queue & tearing down after disk is
removed from sysfs, at that time no one can come into queue's
store()/show().

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 8e141f9eb8 ("block: drain file system I/O on del_gendisk")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211026101204.2897166-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 08:44:38 -06:00
Ming Lei
ff1552232b blk-mq: don't issue request directly in case that current is to be blocked
When flushing plug list in case that current will be blocked, we can't
issue request directly because ->queue_rq() may sleep, otherwise scheduler
may complain.

Fixes: dc5fc361d8 ("block: attempt direct issue of plug list")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20211026082257.2889890-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-26 08:38:11 -06:00
Jens Axboe
6b19b766e8 fs: get rid of the res2 iocb->ki_complete argument
The second argument was only used by the USB gadget code, yet everyone
pays the overhead of passing a zero to be passed into aio, where it
ends up being part of the aio res2 value.

Now that everybody is passing in zero, kill off the extra argument.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 10:36:24 -06:00
Yu Kuai
0c9d338c84 blk-cgroup: synchronize blkg creation against policy deactivation
Our test reports a null pointer dereference:

[  168.534653] ==================================================================
[  168.535614] Disabling lock debugging due to kernel taint
[  168.536346] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  168.537274] #PF: supervisor read access in kernel mode
[  168.537964] #PF: error_code(0x0000) - not-present page
[  168.538667] PGD 0 P4D 0
[  168.539025] Oops: 0000 [#1] PREEMPT SMP KASAN
[  168.539656] CPU: 13 PID: 759 Comm: bash Tainted: G    B             5.15.0-rc2-next-202100
[  168.540954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_0738364
[  168.542736] RIP: 0010:bfq_pd_init+0x88/0x1e0
[  168.543318] Code: 98 00 00 00 e8 c9 e4 5b ff 4c 8b 65 00 49 8d 7c 24 08 e8 bb e4 5b ff 4d0
[  168.545803] RSP: 0018:ffff88817095f9c0 EFLAGS: 00010002
[  168.546497] RAX: 0000000000000001 RBX: ffff888101a1c000 RCX: 0000000000000000
[  168.547438] RDX: 0000000000000003 RSI: 0000000000000002 RDI: ffff888106553428
[  168.548402] RBP: ffff888106553400 R08: ffffffff961bcaf4 R09: 0000000000000001
[  168.549365] R10: ffffffffa2e16c27 R11: fffffbfff45c2d84 R12: 0000000000000000
[  168.550291] R13: ffff888101a1c098 R14: ffff88810c7a08c8 R15: ffffffffa55541a0
[  168.551221] FS:  00007fac75227700(0000) GS:ffff88839ba80000(0000) knlGS:0000000000000000
[  168.552278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  168.553040] CR2: 0000000000000008 CR3: 0000000165ce7000 CR4: 00000000000006e0
[  168.554000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  168.554929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  168.555888] Call Trace:
[  168.556221]  <TASK>
[  168.556510]  blkg_create+0x1c0/0x8c0
[  168.556989]  blkg_conf_prep+0x574/0x650
[  168.557502]  ? stack_trace_save+0x99/0xd0
[  168.558033]  ? blkcg_conf_open_bdev+0x1b0/0x1b0
[  168.558629]  tg_set_conf.constprop.0+0xb9/0x280
[  168.559231]  ? kasan_set_track+0x29/0x40
[  168.559758]  ? kasan_set_free_info+0x30/0x60
[  168.560344]  ? tg_set_limit+0xae0/0xae0
[  168.560853]  ? do_sys_openat2+0x33b/0x640
[  168.561383]  ? do_sys_open+0xa2/0x100
[  168.561877]  ? __x64_sys_open+0x4e/0x60
[  168.562383]  ? __kasan_check_write+0x20/0x30
[  168.562951]  ? copyin+0x48/0x70
[  168.563390]  ? _copy_from_iter+0x234/0x9e0
[  168.563948]  tg_set_conf_u64+0x17/0x20
[  168.564467]  cgroup_file_write+0x1ad/0x380
[  168.565014]  ? cgroup_file_poll+0x80/0x80
[  168.565568]  ? __mutex_lock_slowpath+0x30/0x30
[  168.566165]  ? pgd_free+0x100/0x160
[  168.566649]  kernfs_fop_write_iter+0x21d/0x340
[  168.567246]  ? cgroup_file_poll+0x80/0x80
[  168.567796]  new_sync_write+0x29f/0x3c0
[  168.568314]  ? new_sync_read+0x410/0x410
[  168.568840]  ? __handle_mm_fault+0x1c97/0x2d80
[  168.569425]  ? copy_page_range+0x2b10/0x2b10
[  168.570007]  ? _raw_read_lock_bh+0xa0/0xa0
[  168.570622]  vfs_write+0x46e/0x630
[  168.571091]  ksys_write+0xcd/0x1e0
[  168.571563]  ? __x64_sys_read+0x60/0x60
[  168.572081]  ? __kasan_check_write+0x20/0x30
[  168.572659]  ? do_user_addr_fault+0x446/0xff0
[  168.573264]  __x64_sys_write+0x46/0x60
[  168.573774]  do_syscall_64+0x35/0x80
[  168.574264]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  168.574960] RIP: 0033:0x7fac74915130
[  168.575456] Code: 73 01 c3 48 8b 0d 58 ed 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 444
[  168.577969] RSP: 002b:00007ffc3080e288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  168.578986] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fac74915130
[  168.579937] RDX: 0000000000000009 RSI: 000056007669f080 RDI: 0000000000000001
[  168.580884] RBP: 000056007669f080 R08: 000000000000000a R09: 00007fac75227700
[  168.581841] R10: 000056007655c8f0 R11: 0000000000000246 R12: 0000000000000009
[  168.582796] R13: 0000000000000001 R14: 00007fac74be55e0 R15: 00007fac74be08c0
[  168.583757]  </TASK>
[  168.584063] Modules linked in:
[  168.584494] CR2: 0000000000000008
[  168.584964] ---[ end trace 2475611ad0f77a1a ]---

This is because blkg_alloc() is called from blkg_conf_prep() without
holding 'q->queue_lock', and elevator is exited before blkg_create():

thread 1                            thread 2
blkg_conf_prep
 spin_lock_irq(&q->queue_lock);
 blkg_lookup_check -> return NULL
 spin_unlock_irq(&q->queue_lock);

 blkg_alloc
  blkcg_policy_enabled -> true
  pd = ->pd_alloc_fn
  blkg->pd[i] = pd
                                   blk_mq_exit_sched
                                    bfq_exit_queue
                                     blkcg_deactivate_policy
                                      spin_lock_irq(&q->queue_lock);
                                      __clear_bit(pol->plid, q->blkcg_pols);
                                      spin_unlock_irq(&q->queue_lock);
                                    q->elevator = NULL;
  spin_lock_irq(&q->queue_lock);
   blkg_create
    if (blkg->pd[i])
     ->pd_init_fn -> q->elevator is NULL
  spin_unlock_irq(&q->queue_lock);

Because blkcg_deactivate_policy() requires queue to be frozen, we can
grab q_usage_counter to synchoronize blkg_conf_prep() against
blkcg_deactivate_policy().

Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211020014036.2141723-1-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 08:06:27 -06:00
Pavel Begunkov
fa5fa8ec60 block: refactor bio_iov_bvec_set()
Combine bio_iov_bvec_set() and bio_iov_bvec_set_append() and let the
caller to do iov_iter_advance(). Also get rid of __bio_iov_bvec_set(),
which was duplicated in the final binary, and replace a weird
iov_iter_truncate() of a temporal iter copy with min() better reflecting
the intention.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/bcf1ac36fce769a514e19475f3623cd86a1d8b72.1635006010.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 08:00:48 -06:00
Pavel Begunkov
54a88eb838 block: add single bio async direct IO helper
As with __blkdev_direct_IO_simple(), we can implement direct IO more
efficiently if there is only one bio. Add __blkdev_direct_IO_async() and
blkdev_bio_end_io_async(). This patch brings me from 4.45-4.5 MIOPS with
nullblk to 4.7+.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f0ae4109b7a6934adede490f84d188d53b97051b.1635006010.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-25 08:00:48 -06:00
Linus Torvalds
9c0c4d24ac block-5.15-2021-10-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFzfzwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqvoEACZB9dFKiYyFcv6X6ARAhfKDuE4ImaJ/hLI
 VCt4M52P06nqywPQ7iyZMRWR/EVKW8ADDJiiAeXy3mDBgMqD6O27u894i1JP06sn
 5gHcSvfH1b9PNFMUw04aI3iXXFUzQU7pwn5z6o2nXlA4onPVW7vwTp8fcHd41Kep
 LqvZJbihW/iF9d0Wjs9LqPRBWXchtsVyxiNDBgC+kx5IYFn+oTnZOhlxw8ZiT/KH
 0v8FIq9HY+5n6UP7InZF2gtIQUyDTR5L1zKKvJu5LDoHvcNlM6Ke0m3DVPcgP79D
 2kKGyHOGfqC9Gr37qqOjgKqRO/Z/9SCvG39dmocAd/hh3AfUgKpDQs3HgLyx7ECT
 aRAe5n0XbfIVcHX1XaOc8cGrszan9YhJvt/dMCmkjaG/3hASlzl2kV4QF3f5IVjx
 oMgB1Kj8kyu6SqG8mCCjyGCxPpzNq8lVplJRlpifoz+ID/+hgt03aDoYVfPZkDRL
 nf4VdQCRSl3ZEXkHy1j6l6Nb2UgNEZP1B3a/9onSyBJ/WYqSfFMXrx29PSirz7m7
 x4jGOJvdqtNx09zjWHXc/d+I8BEXp4JDXe0GH0OHMiwCwz5PoMo99HRb+IuffKjR
 lWl4EimH0bfzOA/3vFr5TigfqbnDJ7HCRrGsodQX8gJhVaxVTWxeZG+7Y9qkLqnD
 JGlZeMQ37w==
 =uhGw
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-22' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Fix for the cgroup code not ussing irq safe stats updates, and one fix
  for an error handling condition in add_partition()"

* tag 'block-5.15-2021-10-22' of git://git.kernel.dk/linux-block:
  block: fix incorrect references to disk objects
  blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
2021-10-22 17:42:13 -10:00
John Garry
8bdf7b3fe1 blk-mq-sched: Don't reference queue tagset in blk_mq_sched_tags_teardown()
We should not reference the queue tagset in blk_mq_sched_tags_teardown()
(see function comment) for the blk-mq flags, so use the passed flags
instead.

This solves a use-after-free, similarly fixed earlier (and since broken
again) in commit f0c1c4d286 ("blk-mq: fix use-after-free in
blk_mq_exit_sched").

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: e155b0c238 ("blk-mq: Use shared tags for shared sbitmap support")
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1634890340-15432-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 09:16:12 -06:00
Pavel Begunkov
297db73184 block: fix req_bio_endio append error handling
Shinichiro Kawasaki reports that there is a bug in a recent
req_bio_endio() patch causing problems with zonefs. As Shinichiro
suggested, inverse the condition in zone append path to resemble how it
was before: fail when it's not fully completed.

Fixes: 478eb72b81 ("block: optimise req_bio_endio()")
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/344ea4e334aace9148b41af5f2426da38c8aa65a.1634914228.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 09:12:37 -06:00
Christoph Hellwig
1e03a36bdf block: simplify the block device syncing code
Get rid of the indirections and just provide a sync_bdevs
helper for the generic sync code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
70164eb6cc block: remove __sync_blockdev
Instead offer a new sync_blockdev_nowait helper for the !wait case.
This new helper is exported as it will grow modular callers in a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:36:55 -06:00
Christoph Hellwig
4845012eb5 block: remove QUEUE_FLAG_SCSI_PASSTHROUGH
Export scsi_device_from_queue for use with pktcdvd and use that instead
of the otherwise unused QUEUE_FLAG_SCSI_PASSTHROUGH queue flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20211021060607.264371-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:33:57 -06:00
Christoph Hellwig
4abafdc436 block: remove the initialize_rq_fn blk_mq_ops method
Entirely unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20211021060607.264371-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:33:57 -06:00
Christoph Hellwig
237ea1602f bsg-lib: initialize the bsg_job in bsg_transport_sg_io_fn
Directly initialize the bsg_job structure instead of relying on the
->.initialize_rq_fn indirection.  This also removes the superflous
initialization of the second request used for BIDI requests.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20211021060607.264371-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-22 08:33:57 -06:00
Eric Biggers
cb77cb5abe blk-crypto: rename blk_keyslot_manager to blk_crypto_profile
blk_keyslot_manager is misnamed because it doesn't necessarily manage
keyslots.  It actually does several different things:

  - Contains the crypto capabilities of the device.

  - Provides functions to control the inline encryption hardware.
    Originally these were just for programming/evicting keyslots;
    however, new functionality (hardware-wrapped keys) will require new
    functions here which are unrelated to keyslots.  Moreover,
    device-mapper devices already (ab)use "keyslot_evict" to pass key
    eviction requests to their underlying devices even though
    device-mapper devices don't have any keyslots themselves (so it
    really should be "evict_key", not "keyslot_evict").

  - Sometimes (but not always!) it manages keyslots.  Originally it
    always did, but device-mapper devices don't have keyslots
    themselves, so they use a "passthrough keyslot manager" which
    doesn't actually manage keyslots.  This hack works, but the
    terminology is unnatural.  Also, some hardware doesn't have keyslots
    and thus also uses a "passthrough keyslot manager" (support for such
    hardware is yet to be upstreamed, but it will happen eventually).

Let's stop having keyslot managers which don't actually manage keyslots.
Instead, rename blk_keyslot_manager to blk_crypto_profile.

This is a fairly big change, since for consistency it also has to update
keyslot manager-related function names, variable names, and comments --
not just the actual struct name.  However it's still a fairly
straightforward change, as it doesn't change any actual functionality.

Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-4-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 10:49:32 -06:00
Eric Biggers
1e8d44bddf blk-crypto: rename keyslot-manager files to blk-crypto-profile
In preparation for renaming struct blk_keyslot_manager to struct
blk_crypto_profile, rename the keyslot-manager.h and keyslot-manager.c
source files.  Renaming these files separately before making a lot of
changes to their contents makes it easier for git to understand that
they were renamed.

Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-3-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 10:49:32 -06:00
Eric Biggers
eebcafaebb blk-crypto-fallback: properly prefix function and struct names
For clarity, avoid using just the "blk_crypto_" prefix for functions and
structs that are specific to blk-crypto-fallback.  Instead, use
"blk_crypto_fallback_".  Some places already did this, but others
didn't.

This is also a prerequisite for using "struct blk_crypto_keyslot" to
mean a generic blk-crypto keyslot (which is what it sounds like).
Rename the fallback one to "struct blk_crypto_fallback_keyslot".

No change in behavior.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20211018180453.40441-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 10:49:32 -06:00
Xie Yongji
f059a1d2e2 block: Add invalidate_disk() helper to invalidate the gendisk
To hide internal implementation and simplify some driver code,
this adds a helper to invalidate the gendisk. It will clean the
gendisk's associated buffer/page caches and reset its internal
states.

Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210922123711.187-2-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 10:12:41 -06:00
Pavel Begunkov
e94f68527a block: kill extra rcu lock/unlock in queue enter
blk_try_enter_queue() already takes rcu_read_lock/unlock, so we can
avoid the second pair in percpu_ref_tryget_live(), use a newly added
percpu_ref_tryget_live_rcu().

As rcu_read_lock/unlock imply barrier()s, it's pretty noticeable,
especially for for !CONFIG_PREEMPT_RCU (default for some distributions),
where __rcu_read_lock/unlock() are not inlined.

3.20%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
3.05%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock

2.52%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
2.28%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6b11c67ea495ed9d44f067622d852de4a510ce65.1634822969.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 08:37:26 -06:00
Pavel Begunkov
6549a874fb block: convert fops.c magic constants to SHIFT_SECTOR
Don't use shifting by a magic number 9 but replace with a more
descriptive SHIFT_SECTOR.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/068782b9f7e97569fb59a99529b23bb17ea4c5e2.1634755800.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 08:27:17 -06:00
Pavel Begunkov
179ae84f7e block: clean up blk_mq_submit_bio() merging
Combine blk_mq_sched_bio_merge() and blk_attempt_plug_merge() under a
common if, so we don't check it twice.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/daedc90d4029a5d1d73344771632b1faca3aaf81.1634755800.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 08:27:17 -06:00
Pavel Begunkov
6450fe1f66 block: optimise boundary blkdev_read_iter's checks
Combine pos and len checks and mark unlikely. Also, don't reexpand if
it's not truncated.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fff34e613aeaae1ad12977dc4592cb1a1f5d3190.1634755800.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 08:27:17 -06:00
Jackie Liu
057178cf51 fs: bdev: fix conflicting comment from lookup_bdev
We switched to directly use dev_t to get block device, lookup changed the
meaning of use, now we fix this conflicting comment.

Fixes: 4e7b5671c6 ("block: remove i_bdev")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211021071344.1600362-1-liu.yun@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 08:24:14 -06:00
John Garry
0994c64eb4 blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags
Since it is now possible for a tagset to share a single set of tags, the
iter function should not re-iter the tags for the count of #hw queues in
that case. Rather it should just iter once.

Fixes: e155b0c238 ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Link: https://lore.kernel.org/r/1634550083-202815-1-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-21 08:21:52 -06:00
Christoph Hellwig
008f75a20e block: cleanup the flush plug helpers
Consolidate the various helpers into a single blk_flush_plug helper that
takes a plk_plug and the from_scheduler bool and switch all callsites to
call it directly.  Checks that the plug is non-NULL must be performed by
the caller, something that most already do anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 09:56:11 -06:00
Pavel Begunkov
b600455d84 block: optimise blk_flush_plug_list
Don't call flush_plug_callbacks if there are no plug callbacks.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 09:56:11 -06:00
Christoph Hellwig
dbb6f764a0 blk-mq: move blk_mq_flush_plug_list to block/blk-mq.h
This helper is internal to the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 09:56:11 -06:00
Christoph Hellwig
a214b949d8 blk-mq: only flush requests from the plug in blk_mq_submit_bio
Replace the call to blk_flush_plug_list in blk_mq_submit_bio with a
direct call to blk_mq_flush_plug_list.  This means we do not flush
plug callback from stackable devices, which doesn't really help with
the accumulated requests anyway, and it also means the cached requests
aren't freed here as they can still be used later on.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211020144119.142582-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 09:56:11 -06:00
Jens Axboe
037057a5a9 block: remove inaccurate requeue check
This check is meant to catch cases where a requeue is attempted on a
request that is still inserted. It's never really been useful to catch any
misuse, and now it's actively wrong. Outside of that, this should not be a
BUG_ON() to begin with.

Remove the check as it's now causing active harm, as requeue off the plug
path will trigger it even though the request state is just fine.

Reported-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/linux-block/CAHj4cs80zAUc2grnCZ015-2Rvd-=gXRfB_dFKy=RTm+wRo09HQ@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:21:40 -06:00
Pavel Begunkov
c809084ab0 block: inline a part of bio_release_pages()
Inline BIO_NO_PAGE_REF check of bio_release_pages() to avoid function
call.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:08:07 -06:00
Pavel Begunkov
1497a51a32 block: don't bloat enter_queue with percpu_ref
percpu_ref_put() are inlined for performance and bloat the binary, we
don't care about the fail case of blk_try_enter_queue(), so we can
replace it with a call to blk_queue_exit().

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:08:06 -06:00
Pavel Begunkov
478eb72b81 block: optimise req_bio_endio()
First, get rid of an extra branch and chain error checks. Also reshuffle
it with bio_advance(), so it goes closer to the final check, with that
the compiler loads rq->rq_flags only once, and also doesn't reload
bio->bi_iter.bi_size if bio_advance() didn't actually advanced the iter.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:08:05 -06:00
Pavel Begunkov
859897c3fb block: convert leftovers to bdev_get_queue
Convert bdev->bd_disk->queue to bdev_get_queue(), which is faster.
Apparently, there are a few such spots in block that got lost during
rebases.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-20 08:08:03 -06:00
Ming Lei
e70feb8b3e blk-mq: support concurrent queue quiesce/unquiesce
blk_mq_quiesce_queue() has been used a bit wide now, so far we don't support
concurrent/nested quiesce. One biggest issue is that unquiesce can happen
unexpectedly in case that quiesce/unquiesce are run concurrently from
more than one context.

This patch introduces q->mq_quiesce_depth to deal concurrent quiesce,
and we only unquiesce queue when it is the last/outer-most one of all
contexts.

Several kernel panic issue has been reported[1][2][3] when running stress
quiesce test. And this patch has been verified in these reports.

[1] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m1fc52431fad7f33b1ffc3f12c4450e4238540787
[2] https://lore.kernel.org/linux-block/9b21c797-e505-3821-4f5b-df7bf9380328@huawei.com/T/#m10ad90afeb9c8cc318334190a7c24c8b5c5e0722
[3] https://listman.redhat.com/archives/dm-devel/2021-September/msg00189.html

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Zheng Liang
2fc428f6b7 block, bfq: fix UAF problem in bfqg_stats_init()
In bfq_pd_alloc(), the function bfqg_stats_init() init bfqg. If
blkg_rwstat_init() init bfqg_stats->bytes successful and init
bfqg_stats->ios failed, bfqg_stats_init() return failed, bfqg will
be freed. But blkg_rwstat->cpu_cnt is not deleted from the list of
percpu_counters. If we traverse the list of percpu_counters, It will
have UAF problem.

we should use blkg_rwstat_exit() to cleanup bfqg_stats bytes in the
above scenario.

Fixes: commit fd41e60331 ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
Signed-off-by: Zheng Liang <zhengliang6@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20211018024225.1493938-1-zhengliang6@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 15:18:30 -06:00
Jens Axboe
a808a9d545 block: inline fast path of driver tag allocation
If we don't use an IO scheduler or have shared tags, then we don't need
to call into this external function at all. This saves ~2% for such
a setup.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 15:18:25 -06:00
Christoph Hellwig
d92ca9d834 blk-mq: don't handle non-flush requests in blk_insert_flush
Return to the normal blk_mq_submit_bio flow if the bio did not end up
actually being a flush because the device didn't support it.  Note that
this is basically impossible to hit without special instrumentation given
that submit_bio_checks already clears these flags usually, so we'd need a
tight race to actually hit this code path.

With this the call to blk_mq_run_hw_queue for the flush requests can be
removed given that the actual flush requests are always issued via the
requeue workqueue which runs the queue unconditionally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019122553.2467817-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 11:10:09 -06:00
Jens Axboe
dc5fc361d8 block: attempt direct issue of plug list
If we have just one queue type in the plug list, then we can extend our
direct issue to cover a full plug list as well. This allows sending a
batch of requests for direct issue, which is more efficient than doing
one-at-a-time kind of issue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 09:22:04 -06:00
Jens Axboe
bc490f8173 block: change plugging to use a singly linked list
Use a singly linked list for the blk_plug. This saves 8 bytes in the
blk_plug struct, and makes for faster list manipulations than doubly
linked lists. As we don't use the doubly linked lists for anything,
singly linked is just fine.

This yields a bump in default (merging enabled) performance from 7.0
to 7.1M IOPS, and ~7.5M IOPS with merging disabled.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 09:21:42 -06:00
Christoph Hellwig
97eeb5fc14 partitions/ibm: use bdev_nr_sectors instead of open coding it
Use the proper helper to read the block device size and switch various
places to pass the size in terms of sectors which is more practical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062024.2171074-4-hch@lst.de
[axboe: fix comment typo]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 06:17:33 -06:00
Christoph Hellwig
f9831b8857 partitions/efi: use bdev_nr_bytes instead of open coding it
Use the proper helper to read the block device size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062024.2171074-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 06:16:50 -06:00
Christoph Hellwig
946e993730 block/ioctl: use bdev_nr_sectors and bdev_nr_bytes
Use the proper helper to read the block device size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062024.2171074-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 06:16:50 -06:00
Andrea Righi
480d42dc00 blk-wbt: prevent NULL pointer dereference in wb_timer_fn
The timer callback used to evaluate if the latency is exceeded can be
executed after the corresponding disk has been released, causing the
following NULL pointer dereference:

[ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ 119.987617] #PF: supervisor read access in kernel mode
[ 119.987971] #PF: error_code(0x0000) - not-present page
[ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
[ 119.988697] Oops: 0000 [#1] SMP NOPTI
[ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi #rc5+arighi
[ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
[ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 <4c> 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
[ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
[ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
[ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
[ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
[ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
[ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
[ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
[ 119.995906] Call Trace:
[ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
[ 119.996505] blk_stat_timer_fn+0x138/0x140
[ 119.996830] call_timer_fn+0x2b/0x100
[ 119.997136] __run_timers.part.0+0x1d1/0x240
[ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
[ 119.997826] ? ktime_get+0x3e/0xa0
[ 119.998110] ? native_apic_msr_write+0x2c/0x30
[ 119.998456] ? lapic_next_event+0x20/0x30
[ 119.998779] ? clockevents_program_event+0x94/0xf0
[ 119.999150] run_timer_softirq+0x2a/0x50
[ 119.999465] __do_softirq+0xcb/0x26f
[ 119.999764] irq_exit_rcu+0x8c/0xb0
[ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
[ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
[ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20

In this case simply return from the timer callback (no action
required) to prevent the NULL pointer dereference.

BugLink: https://bugs.launchpad.net/bugs/1947557
Link: https://lore.kernel.org/linux-mm/YWRNVTk9N8K0RMst@arighi-desktop/
Fixes: 34dbad5d26 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Link: https://lore.kernel.org/r/YW6N2qXpBU3oc50q@arighi-desktop
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 06:13:41 -06:00
Jens Axboe
6155631a0c block: align blkdev_dio inlined bio to a cacheline
We get all sorts of unreliable and funky results since the bio is
designed to align on a cacheline, which it does not when inlined like
this.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:55:52 -06:00
Jens Axboe
e028f167ec block: move blk_mq_tag_to_rq() inline
This is in the fast path of driver issue or completion, and it's a single
array index operation. Move it inline to avoid a function call for it.

This does mean making struct blk_mq_tags block layer public, but there's
not really much in there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:55:41 -06:00
Jens Axboe
df87eb0fce block: get rid of plug list sorting
Even if we have multiple queues in the plug list, chances that they
are very interspersed is minimal. Don't bother spending CPU cycles
sorting the list.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:55:26 -06:00
Jens Axboe
87c037d11b block: return whether or not to unplug through boolean
Instead of returning the same queue request through a request pointer,
use a boolean to accomplish the same.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:55:04 -06:00
Christoph Hellwig
8a7d267b4a block: don't call blk_status_to_errno in blk_update_request
We only need to call it to resolve the blk_status_t -> errno mapping for
tracing, so move the conversion into the tracepoints that are not called
at all when tracing isn't enabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:54:57 -06:00
Jens Axboe
db9a02baa2 block: move bdev_read_only() into the header
This is called for every write in the fast path, move it inline next
to get_disk_ro() which is called internally.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:53:22 -06:00
Jens Axboe
e0d78afeb8 block: fix too broad elevator check in blk_mq_free_request()
We added RQF_ELV to tell whether there's an IO scheduler attached, and
RQF_ELVPRIV tells us whether there's an IO scheduler with private data
attached. Don't check RQF_ELV in blk_mq_free_request(), what we care
about here is just if we have scheduler private data attached.

This fixes a boot crash

Fixes: 2ff0682da6 ("block: store elevator state in request")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Reported-by: syzbot+eb8104072aeab6cc1195@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 05:48:15 -06:00
Jens Axboe
f09313c57a block: cache inode size in bdev
Reading the inode size brings in a new cacheline for IO submit, and
it's in the hot path being checked for every single IO. When doing
millions of IOs per core per second, this is noticeable overhead.

Cache the nr_sectors in the bdev itself.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:43:23 -06:00
Christoph Hellwig
2a93ad8fcb block: use bdev_nr_bytes instead of open coding it in blkdev_fallocate
Use the proper helper to read the block device size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:43:23 -06:00
Jens Axboe
f794f3351f block: add support for blk_mq_end_request_batch()
Instead of calling blk_mq_end_request() on a single request, add a helper
that takes the new struct io_comp_batch and completes any request stored
in there.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:43 -06:00
Jens Axboe
5a72e899ce block: add a struct io_comp_batch argument to fops->iopoll()
struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:40 -06:00
Jens Axboe
013a7f9543 block: provide helpers for rq_list manipulation
Instead of open-coding the list additions, traversal, and removal,
provide a basic set of helpers.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:39 -06:00
Jens Axboe
afd7de03c5 block: remove some blk_mq_hw_ctx debugfs entries
Just like the blk_mq_ctx counterparts, we've got a bunch of counters
in here that are only for debugfs and are of questionnable value. They
are:

- dispatched, index of how many requests were dispatched in one go

- poll_{considered,invoked,success}, which track poll sucess rates. We're
  confident in the iopoll implementation at this point, don't bother
  tracking these.

As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes,
dropping a whole cacheline.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:33 -06:00
Jens Axboe
9a14d6ce41 block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes
These were added as part of early days debugging for blk-mq, and they
are not really useful anymore. Rather than spend cycles updating them,
just get rid of them.

As a bonus, this shrinks the per-cpu software queue size from 256b
to 192b. That's a whole cacheline less.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:39:48 -06:00
Pavel Begunkov
128459062b block: cache rq_flags inside blk_mq_rq_ctx_init()
Add a local variable for rq_flags, it helps to compile out some of
rq_flags reloads.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:38:45 -06:00
Pavel Begunkov
605f784e4f block: blk_mq_rq_ctx_init cache ctx/q/hctx
We should have enough of registers in blk_mq_rq_ctx_init(), store them
in local vars, so we don't keep reloading them.

note: keeping q->elevator may look unnecessary, but it's also used
inside inlined blk_mq_tags_from_data().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:38:43 -06:00
Pavel Begunkov
4f266f2be8 block: skip elevator fields init for non-elv queue
Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is
no elevator. Also, move some other initialisers that imply barriers to
the end, so the compiler is free to rearrange and optimise other the
rest of them.

note: fold in a change from Jens leaving queue_list unconditional, as
it might lead to problems otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:38:42 -06:00
Zqiang
9fbfabfda2 block: fix incorrect references to disk objects
When adding partitions to the disk, the reference count of the disk
object is increased. then alloc partition device and called
device_add(), if the device_add() return error, the reference
count of the disk object will be reduced twice, at put_device(pdev)
and put_disk(disk). this leads to the end of the object's life cycle
prematurely, and trigger following calltrace.

  __init_work+0x2d/0x50 kernel/workqueue.c:519
  synchronize_rcu_expedited+0x3af/0x650 kernel/rcu/tree_exp.h:847
  bdi_remove_from_list mm/backing-dev.c:938 [inline]
  bdi_unregister+0x17f/0x5c0 mm/backing-dev.c:946
  release_bdi+0xa1/0xc0 mm/backing-dev.c:968
  kref_put include/linux/kref.h:65 [inline]
  bdi_put+0x72/0xa0 mm/backing-dev.c:976
  bdev_free_inode+0x11e/0x220 block/bdev.c:408
  i_callback+0x3f/0x70 fs/inode.c:226
  rcu_do_batch kernel/rcu/tree.c:2508 [inline]
  rcu_core+0x76d/0x16c0 kernel/rcu/tree.c:2743
  __do_softirq+0x1d7/0x93b kernel/softirq.c:558
  invoke_softirq kernel/softirq.c:432 [inline]
  __irq_exit_rcu kernel/softirq.c:636 [inline]
  irq_exit_rcu+0xf2/0x130 kernel/softirq.c:648
  sysvec_apic_timer_interrupt+0x93/0xc0

making disk is NULL when calling put_disk().

Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211018103422.2043-1-qiang.zhang1211@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 11:20:38 -06:00
Jens Axboe
2ff0682da6 block: store elevator state in request
Add an rq private RQF_ELV flag, which tells the block layer that this
request was initialized on a queue that has an IO scheduler attached.
This allows for faster checking in the fast path, rather than having to
deference rq->q later on.

Elevator switching does full quiesce of the queue before detaching an
IO scheduler, so it's safe to cache this in the request itself.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 08:51:52 -06:00
Jens Axboe
90b8faa0e8 block: only mark bio as tracked if it really is tracked
We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even
though we may not even have an rq_qos handler. Only mark it as TRACKED if
it really is potentially tracked.

This saves considerable time for the case where the bio isn't tracked:

     2.64%     -1.65%  [kernel.vmlinux]  [k] bio_endio

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 08:50:47 -06:00
Jens Axboe
9be3e06fb7 block: move update request helpers into blk-mq.c
For some reason we still have them in blk-core, with the rest of the
request completion being in blk-mq. That causes and out-of-line call
for each completion.

Move them into blk-mq.c instead, where they belong.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 08:50:28 -06:00
Jens Axboe
c477b79778 block: remove useless caller argument to print_req_error()
We have exactly one caller of this, just get rid of adding the useless
function name to the output.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 08:50:22 -06:00
Jens Axboe
d4aa57a1ca block: don't bother iter advancing a fully done bio
If we're completing nbytes and nbytes is the size of the bio, don't bother
with calling into the iterator increment helpers. Just clear the bio
size and we're done.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 08:50:16 -06:00
Pavel Begunkov
ed6cddefdf block: convert the rest of block to bdev_get_queue
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/addf6ea988c04213697ba3684c853e4ed7642a39.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:37 -06:00
Pavel Begunkov
eab4e02733 block: use bdev_get_queue() in blk-core.c
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/efc41f880262517c8dc32f932f1b23112f21b255.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Pavel Begunkov
3caee4634b block: use bdev_get_queue() in bio.c
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/85c36ea784d285a5075baa10049e6b59e15fb484.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Pavel Begunkov
025a38651b block: use bdev_get_queue() in bdev.c
Convert bdev->bd_disk->queue to bdev_get_queue(), it's uses a cached
queue pointer and so is faster.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a352936ce5d9ac719645b1e29b173d931ebcdc02.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Pavel Begunkov
17220ca5ce block: cache request queue in bdev
There are tons of places where we need to get a request_queue only
having bdev, which turns into bdev->bd_disk->queue. There are probably a
hundred of such places considering inline helpers, and enough of them
are in hot paths.

Cache queue pointer in struct block_device and make use of it in
bdev_get_queue().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a3bfaecdd28956f03629d0ca5c63ebc096e1c809.1634219547.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Jens Axboe
abd45c159d block: handle fast path of bio splitting inline
The fast path is no splitting needed. Separate the handling into a
check part we can inline, and an out-of-line handling path if we do
need to split.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Jens Axboe
09ce874425 block: use flags instead of bit fields for blkdev_dio
This generates a lot better code for me, and bumps performance from
7650K IOPS to 7750K IOPS. Looking at profiles for the run and running
perf diff, it confirms that we're now sending a lot less time there:

     6.38%     -2.80%  [kernel.vmlinux]  [k] blkdev_direct_IO

Taking it from the 2nd most cycle consumer to only the 9th most at
3.35% of the CPU time.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Pavel Begunkov
fac7c6d529 block: cache bdev in struct file for raw bdev IO
bdev = &BDEV_I(file->f_mapping->host)->bdev

Getting struct block_device from a file requires 2 memory dereferences
as illustrated above, that takes a toll on performance, so cache it in
yet unused file->private_data. That gives a noticeable peak performance
improvement.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/8415f9fe12e544b9da89593dfbca8de2b52efe03.1634115360.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
a614dd2280 block: don't allow writing to the poll queue attribute
The poll attribute is a historic artefact from before when we had
explicit poll queues that require driver specific configuration.
Just print a warning when writing to the attribute.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
3e08773c38 block: switch polling to be bio based
Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio allows to trivially
   support polling BIOs remapping by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
1a7e76e4f1 block: use SLAB_TYPESAFE_BY_RCU for the bio slab
This flags ensures that the pages will not be reused for non-bio
allocations before the end of an RCU grace period.  With that we can
safely use a RCU lookup for bio polling as long as we are fine with
occasionally polling the wrong device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
6ce913fe3e block: rename REQ_HIPRI to REQ_POLLED
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
d729cf9acb io_uring: don't sleep when polling for I/O
There is no point in sleeping for the expected I/O completion timeout
in the io_uring async polling model as we never poll for a specific
I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
ef99b2d376 block: replace the spin argument to blk_iopoll with a flags argument
Switch the boolean spin argument to blk_poll to passing a set of flags
instead.  This will allow to control polling behavior in a more fine
grained way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
28a1ae6b9d blk-mq: remove blk_qc_t_valid
Move the trivial check into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
efbabbe121 blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal
Merge both functions into their only caller to keep the blk-mq tag to
blk_qc_t mapping as private as possible in blk-mq.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
c6699d6fe0 blk-mq: factor out a "classic" poll helper
Factor the code to do the classic full metal polling out of blk_poll into
a separate blk_mq_poll_classic helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
f70299f0d5 blk-mq: factor out a blk_qc_to_hctx helper
Add a helper to get the hctx from a request_queue and cookie, and fold
the blk_qc_t_to_queue_num helper into it as no other callers are left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
71fc3f5e2c block: don't try to poll multi-bio I/Os in __blkdev_direct_IO
If an iocb is split into multiple bios we can't poll for both.  So don't
even bother to try to poll in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012111226.760968-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Jens Axboe
d38a9c04c0 block: only check previous entry for plug merge attempt
Currently we scan the entire plug list, which is potentially very
expensive. In an IOPS bound workload, we can drive about 5.6M IOPS with
merging enabled, and profiling shows that the plug merge check is the
(by far) most expensive thing we're doing:

  Overhead  Command   Shared Object     Symbol
  +   20.89%  io_uring  [kernel.vmlinux]  [k] blk_attempt_plug_merge
  +    4.98%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
  +    4.78%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO
  +    4.61%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio

Instead of browsing the whole list, just check the previously inserted
entry. That is enough for a naive merge check and will catch most cases,
and for devices that need full merging, the IO scheduler attached to
such devices will do that anyway. The plug merge is meant to be an
inexpensive check to avoid getting a request, but if we repeatedly
scan the list for every single insert, it is very much not a cheap
check.

With this patch, the workload instead runs at ~7.0M IOPS, providing
a 25% improvement. Disabling merging entirely yields another 5%
improvement.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Masahiro Yamada
4c928904ff block: move CONFIG_BLOCK guard to top Makefile
Every object under block/ depends on CONFIG_BLOCK.

Move the guard to the top Makefile since there is no point to
descend into block/ if CONFIG_BLOCK=n.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-5-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Masahiro Yamada
b8b98a6225 block: move menu "Partition type" to block/partitions/Kconfig
Move the menu to the relevant place.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-4-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Masahiro Yamada
c50fca55d4 block: simplify Kconfig files
Everything under block/ depends on BLOCK. BLOCK_HOLDER_DEPRECATED is
selected from drivers/md/Kconfig, which is entirely dependent on BLOCK.

Extend the 'if BLOCK' ... 'endif' so it covers the whole block/Kconfig.

Also, clean up the definition of BLOCK_COMPAT and BLK_MQ_PCI because
COMPAT and PCI are boolean.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-3-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Masahiro Yamada
df252bde82 block: remove redundant =y from BLK_CGROUP dependency
CONFIG_BLK_CGROUP is a boolean option, that is, its value is 'y' or 'n'.
The comparison to 'y' is redundant.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210927140000.866249-2-masahiroy@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Jens Axboe
349302da83 block: improve batched tag allocation
Add a blk_mq_get_tags() helper, which uses the new sbitmap API for
allocating a batch of tags all at once. This both simplifies the block
code for batched allocation, and it is also more efficient than just
doing repeated calls into __sbitmap_queue_get().

This reduces the sbitmap overhead in peak runs from ~3% to ~1% and
yields a performanc increase from 6.6M IOPS to 6.8M IOPS for a single
CPU core.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Pavel Begunkov
8971a3b7f1 blk-mq: optimise *end_request non-stat path
We already have a blk_mq_need_time_stamp() check in
__blk_mq_end_request() to get a timestamp, hide all the statistics
accounting under it. It cuts some cycles for requests that don't need
stats, and is free otherwise.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e0f2ea812e93a8adcd07101212e7d7e70ca304e7.1634115360.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
4f7ab09a1c block: mark bio_truncate static
bio_truncate is only used in bio.c, so mark it static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
ff18d77b5f block: move bio_get_{first,last}_bvec out of bio.h
bio_get_first_bvec and bio_get_last_bvec are only used in blk-merge.c,
so move them there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
9774b39175 block: mark __bio_try_merge_page static
Mark __bio_try_merge_page static and move it up a bit to avoid the need
for a forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
9a6083becb block: move bio_full out of bio.h
bio_full is only used in bio.c, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
8addffd657 block: move bio_mergeable out of bio.h
bio_mergeable is only needed by I/O schedulers, so move it to
blk-mq-sched.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:35 -06:00
Christoph Hellwig
9e8c0d0d4d block: remove BIO_BUG_ON
BIO_DEBUG is always defined, so just switch the two instances to use
BUG_ON directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012161804.991559-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Pavel Begunkov
e9ea15963f blk-mq: inline hot part of __blk_mq_sched_restart
Extract a fast check out of __block_mq_sched_restart() and inline it for
performance reasons.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/894abaa0998e5999f2fe18f271e5efdfc2c32bd2.1633781740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Pavel Begunkov
be6bfe36db block: inline hot paths of blk_account_io_*()
Extract hot paths of __blk_account_io_start() and
__blk_account_io_done() into inline functions, so we don't always pay
for function calls.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b0662a636bd4cc7b4f84c9d0a41efa46a688ef13.1633781740.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Christoph Hellwig
8a709512ea block: merge block_ioctl into blkdev_ioctl
Simplify the ioctl path and match the code structure on the compat side.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Christoph Hellwig
84b8514b46 block: move the *blkdev_ioctl declarations out of blkdev.h
These are only used inside of block/.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Christoph Hellwig
fea349b037 block: unexport blkdev_ioctl
With the raw driver gone, there is no modular user left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104450.659013-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:34 -06:00
Jens Axboe
4a60f360a5 block: don't dereference request after flush insertion
We could have a race here, where the request gets freed before we call
into blk_mq_run_hw_queue(). If this happens, we cannot rely on the state
of the request.

Grab the hardware context before inserting the flush.

Fixes: 0f38d76646 ("blk-mq: cleanup blk_mq_submit_bio")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:18 -06:00
Christoph Hellwig
0f38d76646 blk-mq: cleanup blk_mq_submit_bio
Move the blk_mq_alloc_data stack allocation only into the branch
that actually needs it, and use rq->mq_hctx instead of data.hctx
to refer to the hctx.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
Christoph Hellwig
b90cfaed37 blk-mq: cleanup and rename __blk_mq_alloc_request
The newly added loop for the cached requests in __blk_mq_alloc_request
is a little too convoluted for my taste, so unwind it a bit.  Also
rename the function to __blk_mq_alloc_requests now that it can allocate
more than a single request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012104045.658051-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
Jens Axboe
47c122e35d block: pre-allocate requests if plug is started and is a batch
The caller typically has a good (or even exact) idea of how many requests
it needs to submit. We can make the request/tag allocation a lot more
efficient if we just allocate N requests/tags upfront when we queue the
first bio from the batch.

Provide a new plug start helper that allows the caller to specify how many
IOs are expected. This sets plug->nr_ios, and we can use that for smarter
request allocation. The plug provides a holding spot for requests, and
request allocation will check it before calling into the normal request
allocation path.

The blk_finish_plug() is called, check if there are unused requests and
free them. This should not happen in normal operations. The exception is
if we get merging, then we may be left with requests that need freeing
when done.

This raises the per-core performance on my setup from ~5.8M to ~6.1M
IOPS.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
Jens Axboe
ba0ffdd8ce block: bump max plugged deferred size from 16 to 32
Particularly for NVMe with efficient deferred submission for many
requests, there are nice benefits to be seen by bumping the default max
plug count from 16 to 32. This is especially true for virtualized setups,
where the submit part is more expensive. But can be noticed even on
native hardware.

Reduce the multiple queue factor from 4 to 2, since we're changing the
default size.

While changing it, move the defines into the block layer private header.
These aren't values that anyone outside of the block layer uses, or
should use.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
Jens Axboe
0006707723 block: inherit request start time from bio for BLK_CGROUP
Doing high IOPS testing with blk-cgroups enabled spends ~15-20% of the
time just doing ktime_get_ns() -> readtsc. We essentially read and
set the start time twice, one for the bio and then again when that bio
is mapped to a request.

Given that the time between the two is very short, inherit the bio
start time instead of reading it again. This cuts 1/3rd of the overhead
of the time keeping.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
Jens Axboe
a7b36ee6ba block: move blk-throtl fast path inline
Even if no policies are defined, we spend ~2% of the total IO time
checking. Move the fast path inline.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
079a2e3e86 blk-mq: Change shared sbitmap naming to shared tags
Now that shared sbitmap support really means shared tags, rename symbols
to match that.

Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-15-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
ae0f1a732f blk-mq: Stop using pointers for blk_mq_tags bitmap tags
Now that we use shared tags for shared sbitmap support, we don't require
the tags sbitmap pointers, so drop them.

This essentially reverts commit 222a5ae03c ("blk-mq: Use pointers for
blk_mq_tags bitmap tags").

Function blk_mq_init_bitmap_tags() is removed also, since it would be only
a wrappper for blk_mq_init_bitmaps().

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
e155b0c238 blk-mq: Use shared tags for shared sbitmap support
Currently we use separate sbitmap pairs and active_queues atomic_t for
shared sbitmap support.

However a full sets of static requests are used per HW queue, which is
quite wasteful, considering that the total number of requests usable at
any given time across all HW queues is limited by the shared sbitmap depth.

As such, it is considerably more memory efficient in the case of shared
sbitmap to allocate a set of static rqs per tag set or request queue, and
not per HW queue.

So replace the sbitmap pairs and active_queues atomic_t with a shared
tags per tagset and request queue, which will hold a set of shared static
rqs.

Since there is now no valid HW queue index to be passed to the blk_mq_ops
.init and .exit_request callbacks, pass an invalid index token. This
changes the semantics of the APIs, such that the callback would need to
validate the HW queue index before using it. Currently no user of shared
sbitmap actually uses the HW queue index (as would be expected).

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-13-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
645db34e50 blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()
Refactor blk_mq_free_map_and_requests() such that it can be used at many
sites at which the tag map and rqs are freed.

Also rename to blk_mq_free_map_and_rqs(), which is shorter and matches the
alloc equivalent.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-12-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
63064be150 blk-mq: Add blk_mq_alloc_map_and_rqs()
Add a function to combine allocating tags and the associated requests,
and factor out common patterns to use this new function.

Some function only call blk_mq_alloc_map_and_rqs() now, but more
functionality will be added later.

Also make blk_mq_alloc_rq_map() and blk_mq_alloc_rqs() static since they
are only used in blk-mq.c, and finally rename some functions for
conciseness and consistency with other function names:
- __blk_mq_alloc_map_and_{request -> rqs}()
- blk_mq_alloc_{map_and_requests -> set_map_and_rqs}()

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-11-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:03 -06:00
John Garry
a7e7388dce blk-mq: Add blk_mq_tag_update_sched_shared_sbitmap()
Put the functionality to update the sched shared sbitmap size in a common
function.

Since the same formula is always used to resize, and it can be got from
the request queue argument, so just pass the request queue pointer.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-10-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
John Garry
4f245d5bf0 blk-mq: Don't clear driver tags own mapping
Function blk_mq_clear_rq_mapping() is required to clear the sched tags
mappings in driver tags rqs[].

But there is no need for a driver tags to clear its own mapping, so skip
clearing the mapping in this scenario.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-9-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
John Garry
f32e4eafaf blk-mq: Pass driver tags to blk_mq_clear_rq_mapping()
Function blk_mq_clear_rq_mapping() will be used for shared sbitmap tags
in future, so pass a driver tags pointer instead of the tagset container
and HW queue index.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-8-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
John Garry
1820f4f0a5 blk-mq-sched: Rename blk_mq_sched_free_{requests -> rqs}()
To be more concise and consistent in naming, rename
blk_mq_sched_free_requests() -> blk_mq_sched_free_rqs().

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1633429419-228500-7-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
John Garry
d99a6bb337 blk-mq-sched: Rename blk_mq_sched_alloc_{tags -> map_and_rqs}()
Function blk_mq_sched_alloc_tags() does same as
__blk_mq_alloc_map_and_request(), so give a similar name to be consistent.

Similarly rename label err_free_tags -> err_free_map_and_rqs.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-6-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
John Garry
f6adcef5f3 blk-mq: Invert check in blk_mq_update_nr_requests()
It's easier to read:

if (x)
	X;
else
	Y;

over:

if (!x)
	Y;
else
	X;

No functional change intended.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-5-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
John Garry
8fa044640f blk-mq: Relocate shared sbitmap resize in blk_mq_update_nr_requests()
For shared sbitmap, if the call to blk_mq_tag_update_depth() was
successful for any hctx when hctx->sched_tags is not set, then it would be
successful for all (due to nature in which blk_mq_tag_update_depth()
fails).

As such, there is no need to call blk_mq_tag_resize_shared_sbitmap() for
each hctx. So relocate the call until after the hctx iteration under the
!q->elevator check, which is equivalent (to !hctx->sched_tags).

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-4-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
John Garry
d2a27964e6 block: Rename BLKDEV_MAX_RQ -> BLKDEV_DEFAULT_RQ
It is a bit confusing that there is BLKDEV_MAX_RQ and MAX_SCHED_RQ, as
the name BLKDEV_MAX_RQ would imply the max requests always, which it is
not.

Rename to BLKDEV_MAX_RQ to BLKDEV_DEFAULT_RQ, matching its usage - that being
the default number of requests assigned when allocating a request queue.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
John Garry
65de57bb2e blk-mq: Change rqs check in blk_mq_free_rqs()
The original code in commit 24d2f90309 ("blk-mq: split out tag
initialization, support shared tags") would check tags->rqs is non-NULL and
then dereference tags->rqs[].

Then in commit 2af8cbe305 ("blk-mq: split tag ->rqs[] into two"), we
started to dereference tags->static_rqs[], but continued to check non-NULL
tags->rqs.

Check tags->static_rqs as non-NULL instead, which is more logical.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/1633429419-228500-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
8a3ee6778e block: print the current process in handle_bad_sector
Make the bad sector information a little more useful by printing
current->comm to identify the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20210928052755.113016-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Bart Van Assche
322cff70d4 block/mq-deadline: Prioritize high-priority requests
In addition to reverting commit 7b05bf7710 ("Revert "block/mq-deadline:
Prioritize high-priority requests""), this patch uses 'jiffies' instead
of ktime_get() in the code for aging lower priority requests.

This patch has been tested as follows:

Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: 555 K IOPS.

Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: about 380 K IOPS.

Ran the following script:

set -e
scriptdir=$(dirname "$0")
if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
modprobe scsi_debug ndelay=1000000 max_queue=16
sd=''
while [ -z "$sd" ]; do
  sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done
echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire"
if [ -e /sys/fs/cgroup/io.prio.class ]; then
  cd /sys/fs/cgroup
  echo restrict-to-be >io.prio.class
  echo +io > cgroup.subtree_control
else
  cd /sys/fs/cgroup/blkio/
  echo restrict-to-be >blkio.prio.class
fi
echo $$ >cgroup.procs
mkdir -p hipri
cd hipri
if [ -e io.prio.class ]; then
  echo none-to-rt >io.prio.class
else
  echo none-to-rt >blkio.prio.class
fi
{ "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
echo $$ >cgroup.procs
"${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt

Result:
* 11000 IOPS for the high-priority job
*    40 IOPS for the low-priority job

If the prio aging expiry time is changed from 100s into 0, the IOPS results
change into 6712 and 6796 IOPS.

The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Link: https://lore.kernel.org/r/20210927220328.1410161-5-bvanassche@acm.org
[axboe: @latest -> @latest_start]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Bart Van Assche
bce0363ed8 block/mq-deadline: Stop using per-CPU counters
Calculating the sum over all CPUs of per-CPU counters frequently is
inefficient. Hence switch from per-CPU to individual counters. Three
counters are protected by the mq-deadline spinlock since these are
only accessed from contexts that already hold that spinlock. The fourth
counter is atomic because protecting it with the mq-deadline spinlock
would trigger lock contention.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210927220328.1410161-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Bart Van Assche
32f64cad97 block/mq-deadline: Add an invariant check
Check a statistics invariant at module unload time. When running
blktests, the invariant is verified every time a request queue is
removed and hence is verified at least once per test.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210927220328.1410161-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Bart Van Assche
e2c7275dc0 block/mq-deadline: Improve request accounting further
The scheduler .insert_requests() callback is called when a request is
queued for the first time and also when it is requeued. Only count a
request the first time it is queued. Additionally, since the mq-deadline
scheduler only performs zone locking for requests that have been
inserted, skip the zone unlock code for requests that have not been
inserted into the mq-deadline scheduler.

Fixes: 38ba64d12d ("block/mq-deadline: Track I/O statistics")
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210927220328.1410161-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
24b83deb29 block: move struct request to blk-mq.h
struct request is only used by blk-mq drivers, so move it and all
related declarations to blk-mq.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
fe45e630a1 block: move integrity handling out of <linux/blkdev.h>
Split the integrity/metadata handling definitions out into a new header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
badf7f6437 block: move a few merge helpers out of <linux/blkdev.h>
These are block-layer internal helpers, so move them to block/blk.h and
block/blk-merge.c.  Also update a comment a bit to use better grammar.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
b81e0c2372 block: drop unused includes in <linux/genhd.h>
Drop various include not actually used in genhd.h itself, and
move the remaning includes closer together.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Christoph Hellwig
2e9bc3465a block: move elevator.h to block/
Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer.  Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Christoph Hellwig
e41d12f539 mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so
break the include chain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Tejun Heo
3c08b0931e blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
c3df5fb57f ("cgroup: rstat: fix A-A deadlock on 32bit around
u64_stats_sync") made u64_stats updates irq-safe to avoid A-A deadlocks.
Unfortunately, the conversion missed one in blk_cgroup_bio_start(). Fix it.

Fixes: 2d146aa3aa ("mm: memcontrol: switch to rstat")
Cc: stable@vger.kernel.org # v5.13+
Reported-by: syzbot+9738c8815b375ce482a1@syzkaller.appspotmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/YWi7NrQdVlxD6J9W@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:01 -06:00
Linus Torvalds
f2b3420b92 block-5.15-2021-10-17
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFsIqAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppbBEACDewLUv7bg1VFIGdroRN51OGiOv1oV+8HP
 ruY7O9CPtV7wcb3lA1Zy9igICuzuC5culHjbRJrNIUeTdWCQHCFk/sfKSD6VGMoT
 cFTqpKxV7M3vYr9G2m5TFWgY2mfS+I5fxyDZxK2z2esHCFw6TZ7A5W13xScVXKP+
 QdNFSlTrGkpggsSIEeHApG+NLsIecnkT4qzm8zPfUodUtQ3A8JMjQjnYUFEAWfWv
 l9x9zDIzaGjPtXf5soFEvmdh1ALh3WWiYb1kIwK1FeP/PYX0JV/3zCMgqOwpK+4b
 69OM3Q0NPHvu2TgSRK+ghekAtz5qgPDMCrzdhSgLYJEL/PGAOboqjrB9E+wWoEjd
 IKrYLx4Xao2TUZLJF2y34hHfODGdasx7d+wS191UpVFEZHFhDhIaazZ2rDd5xnQK
 LdzQw1JQF/igJovHauhSkGFIdJWBSDneLQoMimBnitZlsWARUmFSZej34FFRLZsW
 8ZXfqipn/x+fh4sQ/HdEfWxnGHtveDpU+0Ka5bMUe/tJ9RPtmn/Ye7nFjYecC6NY
 4UzFSNn+4e9DpHaDuP3I/eA1YBmVlcB5Hum3ve7X6ovwpjArYg3dgJOEi8uCZjfb
 hdMANmkVptcPiEO9njEHhC7S8+Nm3t+8o3qQceN81j6Vcjgzt/Y/n3Z6UkKeSlkn
 Ila+cZI1oA==
 =J/e4
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Bigger than usual for this point in time, the majority is fixing some
  issues around BDI lifetimes with the move from the request_queue to
  the disk in this release. In detail:

   - Series on draining fs IO for del_gendisk() (Christoph)

   - NVMe pull request via Christoph:
        - fix the abort command id (Keith Busch)
        - nvme: fix per-namespace chardev deletion (Adam Manzanares)

   - brd locking scope fix (Tetsuo)

   - BFQ fix (Paolo)"

* tag 'block-5.15-2021-10-17' of git://git.kernel.dk/linux-block:
  block, bfq: reset last_bfqq_created on group change
  block: warn when putting the final reference on a registered disk
  brd: reduce the brd_devices_mutex scope
  kyber: avoid q->disk dereferences in trace points
  block: keep q_usage_counter in atomic mode after del_gendisk
  block: drain file system I/O on del_gendisk
  block: split bio_queue_enter from blk_queue_enter
  block: factor out a blk_try_enter_queue helper
  block: call submit_bio_checks under q_usage_counter
  nvme: fix per-namespace chardev deletion
  block/rnbd-clt-sysfs: fix a couple uninitialized variable bugs
  nvme-pci: Fix abort command id
2021-10-17 19:25:20 -10:00
Tejun Heo
5370b0f490 blk-cgroup: blk_cgroup_bio_start() should use irq-safe operations on blkg->iostat_cpu
c3df5fb57f ("cgroup: rstat: fix A-A deadlock on 32bit around
u64_stats_sync") made u64_stats updates irq-safe to avoid A-A deadlocks.
Unfortunately, the conversion missed one in blk_cgroup_bio_start(). Fix it.

Fixes: 2d146aa3aa ("mm: memcontrol: switch to rstat")
Cc: stable@vger.kernel.org # v5.13+
Reported-by: syzbot+9738c8815b375ce482a1@syzkaller.appspotmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/YWi7NrQdVlxD6J9W@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-17 10:40:10 -06:00
Paolo Valente
d29bd41428 block, bfq: reset last_bfqq_created on group change
Since commit 430a67f9d6 ("block, bfq: merge bursts of newly-created
queues"), BFQ maintains a per-group pointer to the last bfq_queue
created. If such a queue, say bfqq, happens to move to a different
group, then bfqq is no more a valid last bfq_queue created for its
previous group. That pointer must then be cleared. Not resetting such
a pointer may also cause UAF, if bfqq happens to also be freed after
being moved to a different group. This commit performs this missing
reset. As such it fixes commit 430a67f9d6 ("block, bfq: merge bursts
of newly-created queues").

Such a missing reset is most likely the cause of the crash reported in [1].
With some analysis, we found that this crash was due to the
above UAF. And such UAF did go away with this commit applied [1].

Anyway, before this commit, that crash happened to be triggered in
conjunction with commit 2d52c58b9c ("block, bfq: honor already-setup
queue merges"). The latter was then reverted by commit ebc69e897e
("Revert "block, bfq: honor already-setup queue merges""). Yet commit
2d52c58b9c ("block, bfq: honor already-setup queue merges") contains
no error related with the above UAF, and can then be restored.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503

Fixes: 430a67f9d6 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Grzegorz Kowal <custos.mentis@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211015144336.45894-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-17 07:03:02 -06:00
Christoph Hellwig
a20417611b block: warn when putting the final reference on a registered disk
Warn when the last reference on a live disk is put without calling
del_gendisk first.  There are some BDI related bug reports that look
like a case of this, so make sure we have the proper instrumentation
to catch it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014130231.1468538-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-17 06:59:45 -06:00
Christoph Hellwig
c41108049d kyber: avoid q->disk dereferences in trace points
q->disk becomes invalid after the gendisk is removed.  Work around this
by caching the dev_t for the tracepoints.  The real fix would be to
properly tear down the I/O schedulers with the gendisk, but that is
a much more invasive change.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211012093301.GA27795@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:57 -06:00
Christoph Hellwig
aec89dc5d4 block: keep q_usage_counter in atomic mode after del_gendisk
Don't switch back to percpu mode to avoid the double RCU grace period
when tearing down SCSI devices.  After removing the disk only passthrough
commands can be send anyway.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-6-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:54 -06:00
Christoph Hellwig
8e141f9eb8 block: drain file system I/O on del_gendisk
Instead of delaying draining of file system I/O related items like the
blk-qos queues, the integrity read workqueue and timeouts only when the
request_queue is removed, do that when del_gendisk is called.  This is
important for SCSI where the upper level drivers that control the gendisk
are separate entities, and the disk can be freed much earlier than the
request_queue, or can even be unbound without tearing down the queue.

Fixes: edb0872f44 ("block: move the bdi from the request_queue to the gendisk")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-5-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:50 -06:00
Christoph Hellwig
a6741536f4 block: split bio_queue_enter from blk_queue_enter
To prepare for fixing a gendisk shutdown race, open code the
blk_queue_enter logic in bio_queue_enter.  This also removes the
pointless flags translation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-4-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:47 -06:00
Christoph Hellwig
1f14a09890 block: factor out a blk_try_enter_queue helper
Factor out the code to try to get q_usage_counter without blocking into
a separate helper.  Both to improve code readability and to prepare for
splitting bio_queue_enter from blk_queue_enter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20210929071241.934472-3-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:44 -06:00
Christoph Hellwig
cc9c884dd7 block: call submit_bio_checks under q_usage_counter
Ensure all bios check the current values of the queue under freeze
protection, i.e. to make sure the zero capacity set by del_gendisk
is actually seen before dispatching to the driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210929071241.934472-2-hch@lst.de
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-15 21:02:36 -06:00
Linus Torvalds
50eb0a06e6 block-5.15-2021-10-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFh4SQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphKJEACwasArWhI3zxFrl4QQfGsILTbqn7Ent7qp
 XsEunsQ/27wp8Pv2nTrQT/7Z2qYVuIs+EzPR1+DTFTVWycG3ZZCMp6ACIuAVH73b
 nD2vNQdHGcWqeipxbRYA/JBON+tVcFGbsFPREQUqn++3FIBHhKUvd/NGRPrim62s
 a63WEKZuHo/HHptvG9zmtqQAJ1N9wdA75VBn9GNgpqpW/lLePufNykEdox7/HHFg
 jJzvW8rppRBaLzoDj4RC4hp7Xgf0CPWScKqI3R20CHg5WjXkjDSvIWT3VQuFA/kU
 aRBO2cvoVYUw7w4xRosxTdJObaDXZZnXewA0SOGwGVAb9pIW5dhzby+bpuRx/Wkw
 3WrYGxndu1eis3YAoCVxllzo0zFMXGLjM3nfERketPHFPicLVVd0Pye+YAemyvMp
 Y9RWIRTRg6w/t0NicfYdK8LBTqEqdoys5OZZgaCyIPDV5dwhHCSDuACAwpR4qTAF
 PodCBG4DAc5TpE0Vy5QPywIe1cRvF8lnTZ+ZYXR7g3Ub1KoIl24gXuLvNMYrm0qb
 92l9S1+Bk1lT3nVThxz+rJSHsumlZHROd5TLkQs1S2bb7E9Pxyc6IW+H0hviutgc
 bN+aBf7O+U+oDoqkOGWFXB4aJMqFjjVW6z3DmhqE7MyVPE5jkpFSXY19ge2HOVwd
 AVXocaNAdw==
 =AbHC
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Two small fixes for this release:

   - Add missing QUEUE_FLAG_HCTX_ACTIVE in the debugfs handling
     (Johannes)

   - Fix double free / UAF issue in __alloc_disk_node (Tetsuo)"

* tag 'block-5.15-2021-10-09' of git://git.kernel.dk/linux-block:
  block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output
  block: genhd: fix double kfree() in __alloc_disk_node()
2021-10-09 14:51:59 -07:00
Johannes Thumshirn
1dbdd99b51 block: decode QUEUE_FLAG_HCTX_ACTIVE in debugfs output
While debugging an issue we've found that $DEBUGFS/block/$disk/state
doesn't decode QUEUE_FLAG_HCTX_ACTIVE but only displays its numerical
value.

Add QUEUE_FLAG(HCTX_ACTIVE) to the blk_queue_flag_name array so it'll get
decoded properly.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/4351076388918075bd80ef07756f9d2ce63be12c.1633332053.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-04 06:58:39 -06:00
Linus Torvalds
ab2a7a35c4 block-5.15-2021-10-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmFXvGIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprAgD/9WKQIVPyzXC3Ui8OkSb9DM+dnU47+4b6sL
 Naud4khgrvMjTIdZC9YjwAhKBRNsXbcTXRLbBsxVzk7HhGI31jR711qaOZ/jMRbc
 dPwcpF82/Yr8BlIl5i3SBvxFNW8FAJWLTVvPvJXSjUBPECTgY6MnchojOjYz1CrN
 a0TmtHs5Feu+nLyzKgQlPMv3P8tvrS1hQNpfvbK1oWeEQuyONGacK82nl853ak30
 c0uywVsvO+yiBY2xWhiTqperbPwnLxCYOMOdBqbNvlPlGIkEbkW1YWhm2/5f3vRi
 37C3ibmt2SCnJaAyiNUKrYSTCK9BoeWFqtxVwyGB8WxWsUXrdocEOoBdohxDHrcp
 gHIqxPVUD2hyEroRQseq4NTvERkRvOiF4jAFA3lHOw5zLtPGNEvC0EUFRMm3Ur3/
 DyviDBPy321pvrbqGg5uIL3o6wuXDQNnwxdXa2wElWDLuTYp8tB25rqAe2IRJ+iS
 k32AsgfL+y5fhjsA6ros/hpQ6FQTAKSW3i6S/0L+QB6H0jYb1ue9JPrA+3MoizvI
 p+PdmpLeFVQxnfthT2J+6pzQADoxcMwg5ALgyLFvhhL1YAqyhHkKgdAIS3gKXxYm
 GbQLd2hkv7wPZHEsS/XiGCj+zzyawjsr5hSHRqkbZ1QlyfnuTz5uI65WzST68o4d
 KOAvQTx4aQ==
 =SMoX
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-10-01' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few block fixes for this release:

   - Revert a BFQ commit that causes breakage for people. Unfortunately
     it was auto-selected for stable as well, so now 5.14.7 suffers from
     it too. Hopefully stable will pick up this revert quickly too, so
     we can remove the issue on that end as well.

   - Add a quirk for Apple NVMe controllers, which due to their
     non-compliance broke due to the introduction of command sequences
     (Keith)

   - Use shifts in nbd, fixing a __divdi3 issue (Nick)"

* tag 'block-5.15-2021-10-01' of git://git.kernel.dk/linux-block:
  nbd: use shifts rather than multiplies
  Revert "block, bfq: honor already-setup queue merges"
  nvme: add command id quirk for apple controllers
2021-10-02 11:00:36 -07:00
Tetsuo Handa
06cc978d3f block: genhd: fix double kfree() in __alloc_disk_node()
syzbot is reporting use-after-free read at bdev_free_inode() [1], for
kfree() from __alloc_disk_node() is called before bdev_free_inode()
(which is called after RCU grace period) reads bdev->bd_disk and calls
kfree(bdev->bd_disk).

Fix use-after-free read followed by double kfree() problem
by making sure that bdev->bd_disk is NULL when calling iput().

Link: https://syzkaller.appspot.com/bug?extid=8281086e8a6fbfbd952a [1]
Reported-by: syzbot <syzbot+8281086e8a6fbfbd952a@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/e6dd13c5-8db0-4392-6e78-a42ee5d2a1c4@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-02 07:29:20 -06:00
Jens Axboe
ebc69e897e Revert "block, bfq: honor already-setup queue merges"
This reverts commit 2d52c58b9c.

We have had several folks complain that this causes hangs for them, which
is especially problematic as the commit has also hit stable already.

As no resolution seems to be forthcoming right now, revert the patch.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 2d52c58b9c ("block, bfq: honor already-setup queue merges")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-28 06:33:15 -06:00
Linus Torvalds
bb19237bf6 SCSI fixes on 20210925
Thirty Three fixes, I'm afraid.  Essentially the build up from the
 last couple of weeks while I've been dealling with Linux Plumbers
 conference infrastructure issues.  It's mostly the usual assortment of
 spelling fixes and minor corrections.  The only core relevant changes
 are to the sd driver to reduce the spin up message spew and fix a
 small memory leak on the freeing path.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJsEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYU+VfiYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishVuMAP9Y2+rE
 AyQ3Y9dubie/AyKQ/2ZEFO/G2wz+jEJvfppXJwD4ulhhDFh/l4iNPVa7GW9M/ti3
 gd+6gMzzRC9/B+u2gQ==
 =+K+l
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Thirty-three fixes, I'm afraid.

  Essentially the build up from the last couple of weeks while I've been
  dealling with Linux Plumbers conference infrastructure issues. It's
  mostly the usual assortment of spelling fixes and minor corrections.

  The only core relevant changes are to the sd driver to reduce the spin
  up message spew and fix a small memory leak on the freeing path"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (33 commits)
  scsi: ses: Retry failed Send/Receive Diagnostic commands
  scsi: target: Fix spelling mistake "CONFLIFT" -> "CONFLICT"
  scsi: lpfc: Fix gcc -Wstringop-overread warning, again
  scsi: lpfc: Use correct scnprintf() limit
  scsi: lpfc: Fix sprintf() overflow in lpfc_display_fpin_wwpn()
  scsi: core: Remove 'current_tag'
  scsi: acornscsi: Remove tagged queuing vestiges
  scsi: fas216: Kill scmd->tag
  scsi: qla2xxx: Restore initiator in dual mode
  scsi: ufs: core: Unbreak the reset handler
  scsi: sd_zbc: Support disks with more than 2**32 logical blocks
  scsi: ufs: core: Revert "scsi: ufs: Synchronize SCSI and UFS error handling"
  scsi: bsg: Fix device unregistration
  scsi: sd: Make sd_spinup_disk() less noisy
  scsi: ufs: ufs-pci: Fix Intel LKF link stability
  scsi: mpt3sas: Clean up some inconsistent indenting
  scsi: megaraid: Clean up some inconsistent indenting
  scsi: sr: Fix spelling mistake "does'nt" -> "doesn't"
  scsi: Remove SCSI CDROM MAINTAINERS entry
  scsi: megaraid: Fix Coccinelle warning
  ...
2021-09-25 16:05:56 -07:00
Ming Lei
f278eb3d81 block: hold ->invalidate_lock in blkdev_fallocate
When running ->fallocate(), blkdev_fallocate() should hold
mapping->invalidate_lock to prevent page cache from being accessed,
otherwise stale data may be read in page cache.

Without this patch, blktests block/009 fails sometimes. With this patch,
block/009 can pass always.

Also as Jan pointed out, no pages can be created in the discarded area
while you are holding the invalidate_lock, so remove the 2nd
truncate_bdev_range().

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210923023751.1441091-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 11:06:58 -06:00
Ming Lei
a647a524a4 block: don't call rq_qos_ops->done_bio if the bio isn't tracked
rq_qos framework is only applied on request based driver, so:

1) rq_qos_done_bio() needn't to be called for bio based driver

2) rq_qos_done_bio() needn't to be called for bio which isn't tracked,
such as bios ended from error handling code.

Especially in bio_endio():

1) request queue is referred via bio->bi_bdev->bd_disk->queue, which
may be gone since request queue refcount may not be held in above two
cases

2) q->rq_qos may be freed in blk_cleanup_queue() when calling into
__rq_qos_done_bio()

Fix the potential kernel panic by not calling rq_qos_ops->done_bio if
the bio isn't tracked. This way is safe because both ioc_rqos_done_bio()
and blkcg_iolatency_done_bio() are nop if the bio isn't tracked.

Reported-by: Yu Kuai <yukuai3@huawei.com>
Cc: tj@kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210924110704.1541818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 11:04:39 -06:00
Li Jinlin
858560b276 blk-cgroup: fix UAF by grabbing blkcg lock before destroying blkg pd
KASAN reports a use-after-free report when doing fuzz test:

[693354.104835] ==================================================================
[693354.105094] BUG: KASAN: use-after-free in bfq_io_set_weight_legacy+0xd3/0x160
[693354.105336] Read of size 4 at addr ffff888be0a35664 by task sh/1453338

[693354.105607] CPU: 41 PID: 1453338 Comm: sh Kdump: loaded Not tainted 4.18.0-147
[693354.105610] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.81 07/02/2018
[693354.105612] Call Trace:
[693354.105621]  dump_stack+0xf1/0x19b
[693354.105626]  ? show_regs_print_info+0x5/0x5
[693354.105634]  ? printk+0x9c/0xc3
[693354.105638]  ? cpumask_weight+0x1f/0x1f
[693354.105648]  print_address_description+0x70/0x360
[693354.105654]  kasan_report+0x1b2/0x330
[693354.105659]  ? bfq_io_set_weight_legacy+0xd3/0x160
[693354.105665]  ? bfq_io_set_weight_legacy+0xd3/0x160
[693354.105670]  bfq_io_set_weight_legacy+0xd3/0x160
[693354.105675]  ? bfq_cpd_init+0x20/0x20
[693354.105683]  cgroup_file_write+0x3aa/0x510
[693354.105693]  ? ___slab_alloc+0x507/0x540
[693354.105698]  ? cgroup_file_poll+0x60/0x60
[693354.105702]  ? 0xffffffff89600000
[693354.105708]  ? usercopy_abort+0x90/0x90
[693354.105716]  ? mutex_lock+0xef/0x180
[693354.105726]  kernfs_fop_write+0x1ab/0x280
[693354.105732]  ? cgroup_file_poll+0x60/0x60
[693354.105738]  vfs_write+0xe7/0x230
[693354.105744]  ksys_write+0xb0/0x140
[693354.105749]  ? __ia32_sys_read+0x50/0x50
[693354.105760]  do_syscall_64+0x112/0x370
[693354.105766]  ? syscall_return_slowpath+0x260/0x260
[693354.105772]  ? do_page_fault+0x9b/0x270
[693354.105779]  ? prepare_exit_to_usermode+0xf9/0x1a0
[693354.105784]  ? enter_from_user_mode+0x30/0x30
[693354.105793]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.105875] Allocated by task 1453337:
[693354.106001]  kasan_kmalloc+0xa0/0xd0
[693354.106006]  kmem_cache_alloc_node_trace+0x108/0x220
[693354.106010]  bfq_pd_alloc+0x96/0x120
[693354.106015]  blkcg_activate_policy+0x1b7/0x2b0
[693354.106020]  bfq_create_group_hierarchy+0x1e/0x80
[693354.106026]  bfq_init_queue+0x678/0x8c0
[693354.106031]  blk_mq_init_sched+0x1f8/0x460
[693354.106037]  elevator_switch_mq+0xe1/0x240
[693354.106041]  elevator_switch+0x25/0x40
[693354.106045]  elv_iosched_store+0x1a1/0x230
[693354.106049]  queue_attr_store+0x78/0xb0
[693354.106053]  kernfs_fop_write+0x1ab/0x280
[693354.106056]  vfs_write+0xe7/0x230
[693354.106060]  ksys_write+0xb0/0x140
[693354.106064]  do_syscall_64+0x112/0x370
[693354.106069]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.106114] Freed by task 1453336:
[693354.106225]  __kasan_slab_free+0x130/0x180
[693354.106229]  kfree+0x90/0x1b0
[693354.106233]  blkcg_deactivate_policy+0x12c/0x220
[693354.106238]  bfq_exit_queue+0xf5/0x110
[693354.106241]  blk_mq_exit_sched+0x104/0x130
[693354.106245]  __elevator_exit+0x45/0x60
[693354.106249]  elevator_switch_mq+0xd6/0x240
[693354.106253]  elevator_switch+0x25/0x40
[693354.106257]  elv_iosched_store+0x1a1/0x230
[693354.106261]  queue_attr_store+0x78/0xb0
[693354.106264]  kernfs_fop_write+0x1ab/0x280
[693354.106268]  vfs_write+0xe7/0x230
[693354.106271]  ksys_write+0xb0/0x140
[693354.106275]  do_syscall_64+0x112/0x370
[693354.106280]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[693354.106329] The buggy address belongs to the object at ffff888be0a35580
                 which belongs to the cache kmalloc-1k of size 1024
[693354.106736] The buggy address is located 228 bytes inside of
                 1024-byte region [ffff888be0a35580, ffff888be0a35980)
[693354.107114] The buggy address belongs to the page:
[693354.107273] page:ffffea002f828c00 count:1 mapcount:0 mapping:ffff888107c17080 index:0x0 compound_mapcount: 0
[693354.107606] flags: 0x17ffffc0008100(slab|head)
[693354.107760] raw: 0017ffffc0008100 ffffea002fcbc808 ffffea0030bd3a08 ffff888107c17080
[693354.108020] raw: 0000000000000000 00000000001c001c 00000001ffffffff 0000000000000000
[693354.108278] page dumped because: kasan: bad access detected

[693354.108511] Memory state around the buggy address:
[693354.108671]  ffff888be0a35500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[693354.116396]  ffff888be0a35580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.124473] >ffff888be0a35600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.132421]                                                        ^
[693354.140284]  ffff888be0a35680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.147912]  ffff888be0a35700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[693354.155281] ==================================================================

blkgs are protected by both queue and blkcg locks and holding
either should stabilize them. However, the path of destroying
blkg policy data is only protected by queue lock in
blkcg_activate_policy()/blkcg_deactivate_policy(). Other tasks
can get the blkg policy data before the blkg policy data is
destroyed, and use it after destroyed, which will result in a
use-after-free.

CPU0                             CPU1
blkcg_deactivate_policy
  spin_lock_irq(&q->queue_lock)
                                 bfq_io_set_weight_legacy
                                   spin_lock_irq(&blkcg->lock)
                                   blkg_to_bfqg(blkg)
                                     pd_to_bfqg(blkg->pd[pol->plid])
                                     ^^^^^^blkg->pd[pol->plid] != NULL
                                           bfqg != NULL
  pol->pd_free_fn(blkg->pd[pol->plid])
    pd_to_bfqg(blkg->pd[pol->plid])
    bfqg_put(bfqg)
      kfree(bfqg)
  blkg->pd[pol->plid] = NULL
  spin_unlock_irq(q->queue_lock);
                                   bfq_group_set_weight(bfqg, val, 0)
                                     bfqg->entity.new_weight
                                     ^^^^^^trigger uaf here
                                   spin_unlock_irq(&blkcg->lock);

Fix by grabbing the matching blkcg lock before trying to
destroy blkg policy data.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210914042605.3260596-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15 12:03:18 -06:00
Yanfei Xu
6f5ddde410 blkcg: fix memory leak in blk_iolatency_init
BUG: memory leak
unreferenced object 0xffff888129acdb80 (size 96):
  comm "syz-executor.1", pid 12661, jiffies 4294962682 (age 15.220s)
  hex dump (first 32 bytes):
    20 47 c9 85 ff ff ff ff 20 d4 8e 29 81 88 ff ff   G...... ..)....
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff82264ec8>] kmalloc include/linux/slab.h:591 [inline]
    [<ffffffff82264ec8>] kzalloc include/linux/slab.h:721 [inline]
    [<ffffffff82264ec8>] blk_iolatency_init+0x28/0x190 block/blk-iolatency.c:724
    [<ffffffff8225b8c4>] blkcg_init_queue+0xb4/0x1c0 block/blk-cgroup.c:1185
    [<ffffffff822253da>] blk_alloc_queue+0x22a/0x2e0 block/blk-core.c:566
    [<ffffffff8223b175>] blk_mq_init_queue_data block/blk-mq.c:3100 [inline]
    [<ffffffff8223b175>] __blk_mq_alloc_disk+0x25/0xd0 block/blk-mq.c:3124
    [<ffffffff826a9303>] loop_add+0x1c3/0x360 drivers/block/loop.c:2344
    [<ffffffff826a966e>] loop_control_get_free drivers/block/loop.c:2501 [inline]
    [<ffffffff826a966e>] loop_control_ioctl+0x17e/0x2e0 drivers/block/loop.c:2516
    [<ffffffff81597eec>] vfs_ioctl fs/ioctl.c:51 [inline]
    [<ffffffff81597eec>] __do_sys_ioctl fs/ioctl.c:874 [inline]
    [<ffffffff81597eec>] __se_sys_ioctl fs/ioctl.c:860 [inline]
    [<ffffffff81597eec>] __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
    [<ffffffff843fa745>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<ffffffff843fa745>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Once blk_throtl_init() queue init failed, blkcg_iolatency_exit() will
not be invoked for cleanup. That leads a memory leak. Swap the
blk_throtl_init() and blk_iolatency_init() calls can solve this.

Reported-by: syzbot+01321b15cc98e6bf96d6@syzkaller.appspotmail.com
Fixes: 19688d7f95 (block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls)
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210915072426.4022924-1-yanfei.xu@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-15 11:57:44 -06:00
Lihong Kou
3df49967f6 block: flush the integrity workqueue in blk_integrity_unregister
When the integrity profile is unregistered there can still be integrity
reads queued up which could see a NULL verify_fn as shown by the race
window below:

CPU0                                    CPU1
  process_one_work                      nvme_validate_ns
    bio_integrity_verify_fn                nvme_update_ns_info
	                                     nvme_update_disk_info
	                                       blk_integrity_unregister
                                               ---set queue->integrity as 0
	bio_integrity_process
	--access bi->profile->verify_fn(bi is a pointer of queue->integity)

Before calling blk_integrity_unregister in nvme_update_disk_info, we must
make sure that there is no work item in the kintegrityd_wq. Just call
blk_flush_integrity to flush the work queue so the bug can be resolved.

Signed-off-by: Lihong Kou <koulihong@huawei.com>
[hch: split up and shortened the changelog]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 20:03:30 -06:00
Christoph Hellwig
783a40a1b3 block: check if a profile is actually registered in blk_integrity_unregister
While clearing the profile itself is harmless, we really should not clear
the stable writes flag if it wasn't set due to a registered integrity
profile.

Reported-by: Lihong Kou <koulihong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 20:03:30 -06:00
Zenghui Yu
1a0db7744e scsi: bsg: Fix device unregistration
device_initialize() is used to take a refcount on the device. However,
put_device() is not called during device teardown. This leads to a
leak of private data of the driver core, dev_name(), etc. This is
reported by kmemleak at boot time if we compile kernel with
DEBUG_TEST_DRIVER_REMOVE.

Fix memory leaks during unregistration and implement a release
function.

Link: https://lore.kernel.org/r/20210911105306.1511-1-yuzenghui@huawei.com
Fixes: ead09dd3ae ("scsi: bsg: Simplify device registration")
Reviewed-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-14 00:22:15 -04:00
Ming Lei
67f3b2f822 blk-mq: avoid to iterate over stale request
blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
tag is released.

So there is chance to iterating over one stale request just after the
tag is allocated and before updating ->rqs[tag].

scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight requests after scsi host is blocked, so no new scsi command can
be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
but this request may have been kept in another slot of ->rqs[], meantime
the slot can be allocated out but ->rqs[] isn't updated yet. Then this
in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
trouble in handling scsi error.

Fixes the issue by not iterating over stale request.

Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Reported-by: luojiaxing <luojiaxing@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210906065003.439019-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-12 19:32:43 -06:00
Linus Torvalds
c0f7e49fc4 block-5.15-2021-09-11
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE8ueIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkSYD/9eaQ1Hxc+X+4eVb3A9Cpy36Qy/uY/hArnT
 kSUDtQitrRigqhStaD0MGpknWFnZE4cSojbYN0OoEWL7GC8idSZXx7KrVJpSHGbM
 XGVEflohvjDLNPkV99gmlzF2o6zPlWESApU1/HO2x+Ws1oKaYDAfFVf0CPGPe2C6
 MRerU5v3HSmTC0eFZxU246bwwX/phNuNDokndR27rrsjK0mLF5UoMKySeqy3INp5
 6mj3R+HNIW5j8eQk/HJPW7dgiKpWYneWV2Z90DuOLbcJ+wnx7s07wT1yRnOFUTsb
 p2ojVWmXtCJ1kRex6bK/eeIJC5TYvT3bNwsnIRmJHd9btHqhm2uKy77m3S1AuE7w
 K8bN581aXlr/3pUbFyYZDZQbYshUn25YP9OlyS9r4pklCh9C5KneL1b4xswWTDTB
 whvPZlkot3rGD8LHDpV5xVVzeaAcbSXanIRROjxHqQSRRTA9BjG3E4A2cDh8nmYD
 mRGEimfZcoojF2EQJYswPOQ24cZwpnihPpJO9NkOodRqfasn6XakAGg6SONFYyQ0
 Ewa6QzIOCebBgOVGbzMtpoDpnySE12ONmrDCbSEiYFJLXBMMiqgNON/Xaq0tmXHT
 lsDpyz3ytWAB9OZ3M0/9arZzlFf/E+FRqt4ExelmwxiutKRb1dIKQq8xip/YxdA+
 Y86kwUoAXQ==
 =1ajD
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph:
     - fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
     - update a MAINTAINERS email address (Chaitanya Kulkarni)
     - set QUEUE_FLAG_NOWAIT for nvme-multipth (me)
     - handle errors from add_disk() (Luis Chamberlain)
     - update the keep alive interval when kato is modified (Tatsuya Sasaki)
     - fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
     - do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
     - only call synchronize_srcu when clearing current path (Daniel Wagner)
     - revalidate paths during rescan (Hannes Reinecke)

 - Split out the fs/block_dev into block/fops.c and block/bdev.c, which
   has been long overdue. Do this now before -rc1, to avoid annoying
   conflicts due to this (Christoph)

 - blk-throtl use-after-free fix (Li)

 - Improve plug depth for multi-device plugs, greatly increasing md
   resync performance (Song)

 - blkdev_show() locking fix (Tetsuo)

 - n64cart error check fix (Yang)

* tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
  n64cart: fix return value check in n64cart_probe()
  blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
  block: move fs/block_dev.c to block/bdev.c
  block: split out operations on block special files
  blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
  block: genhd: don't call blkdev_show() with major_names_lock held
  nvme: update MAINTAINERS email address
  nvme: add error handling support for add_disk()
  nvme: only call synchronize_srcu when clearing current path
  nvme: update keep alive interval when kato is modified
  nvme-tcp: Do not reset transport on data digest errors
  nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
  nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
  nvmet: looks at the passthrough controller when initializing CAP
  nvme: move nvme_multi_css into nvme.h
  nvme-multipath: revalidate paths during rescan
  nvme-multipath: set QUEUE_FLAG_NOWAIT
2021-09-11 10:19:51 -07:00
Song Liu
7f2a6a69f7 blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
Limiting number of request to BLK_MAX_REQUEST_COUNT at blk_plug hurts
performance for large md arrays. [1] shows resync speed of md array drops
for md array with more than 16 HDDs.

Fix this by allowing more request at plug queue. The multiple_queue flag
is used to only apply higher limit to multiple queue cases.

[1] https://lore.kernel.org/linux-raid/CAFDAVznS71BXW8Jxv6k9dXc2iR3ysX3iZRBww_rzA8WifBFxGg@mail.gmail.com/
Tested-by: Marcin Wanat <marcin.wanat@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 17:05:12 -06:00
Christoph Hellwig
0dca4462ed block: move fs/block_dev.c to block/bdev.c
Move it together with the rest of the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 08:39:40 -06:00
Christoph Hellwig
cd82cca7eb block: split out operations on block special files
Add a new block/fops.c for all the file and address_space operations
that provide the block special file support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-2-hch@lst.de
[axboe: correct trailing whitespace while at it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 08:39:29 -06:00
Li Jinlin
884f0e84f1 blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
The pending timer has been set up in blk_throtl_init(). However, the
timer is not deleted in blk_throtl_exit(). This means that the timer
handler may still be running after freeing the timer, which would
result in a use-after-free.

Fix by calling del_timer_sync() to delete the timer in blk_throtl_exit().

Signed-off-by: Li Jinlin <lijinlin3@huawei.com>
Link: https://lore.kernel.org/r/20210907121242.2885564-1-lijinlin3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 08:36:56 -06:00
Tetsuo Handa
dfbb3409b2 block: genhd: don't call blkdev_show() with major_names_lock held
If CONFIG_BLK_DEV_LOOP && CONFIG_MTD (at least; there might be other
combinations), lockdep complains circular locking dependency at
__loop_clr_fd(), for major_names_lock serves as a locking dependency
aggregating hub across multiple block modules.

 ======================================================
 WARNING: possible circular locking dependency detected
 5.14.0+ #757 Tainted: G            E
 ------------------------------------------------------
 systemd-udevd/7568 is trying to acquire lock:
 ffff88800f334d48 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x70/0x560

 but task is already holding lock:
 ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #6 (&lo->lo_mutex){+.+.}-{3:3}:
        lock_acquire+0xbe/0x1f0
        __mutex_lock_common+0xb6/0xe10
        mutex_lock_killable_nested+0x17/0x20
        lo_open+0x23/0x50 [loop]
        blkdev_get_by_dev+0x199/0x540
        blkdev_open+0x58/0x90
        do_dentry_open+0x144/0x3a0
        path_openat+0xa57/0xda0
        do_filp_open+0x9f/0x140
        do_sys_openat2+0x71/0x150
        __x64_sys_openat+0x78/0xa0
        do_syscall_64+0x3d/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #5 (&disk->open_mutex){+.+.}-{3:3}:
        lock_acquire+0xbe/0x1f0
        __mutex_lock_common+0xb6/0xe10
        mutex_lock_nested+0x17/0x20
        bd_register_pending_holders+0x20/0x100
        device_add_disk+0x1ae/0x390
        loop_add+0x29c/0x2d0 [loop]
        blk_request_module+0x5a/0xb0
        blkdev_get_no_open+0x27/0xa0
        blkdev_get_by_dev+0x5f/0x540
        blkdev_open+0x58/0x90
        do_dentry_open+0x144/0x3a0
        path_openat+0xa57/0xda0
        do_filp_open+0x9f/0x140
        do_sys_openat2+0x71/0x150
        __x64_sys_openat+0x78/0xa0
        do_syscall_64+0x3d/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #4 (major_names_lock){+.+.}-{3:3}:
        lock_acquire+0xbe/0x1f0
        __mutex_lock_common+0xb6/0xe10
        mutex_lock_nested+0x17/0x20
        blkdev_show+0x19/0x80
        devinfo_show+0x52/0x60
        seq_read_iter+0x2d5/0x3e0
        proc_reg_read_iter+0x41/0x80
        vfs_read+0x2ac/0x330
        ksys_read+0x6b/0xd0
        do_syscall_64+0x3d/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #3 (&p->lock){+.+.}-{3:3}:
        lock_acquire+0xbe/0x1f0
        __mutex_lock_common+0xb6/0xe10
        mutex_lock_nested+0x17/0x20
        seq_read_iter+0x37/0x3e0
        generic_file_splice_read+0xf3/0x170
        splice_direct_to_actor+0x14e/0x350
        do_splice_direct+0x84/0xd0
        do_sendfile+0x263/0x430
        __se_sys_sendfile64+0x96/0xc0
        do_syscall_64+0x3d/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 -> #2 (sb_writers#3){.+.+}-{0:0}:
        lock_acquire+0xbe/0x1f0
        lo_write_bvec+0x96/0x280 [loop]
        loop_process_work+0xa68/0xc10 [loop]
        process_one_work+0x293/0x480
        worker_thread+0x23d/0x4b0
        kthread+0x163/0x180
        ret_from_fork+0x1f/0x30

 -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
        lock_acquire+0xbe/0x1f0
        process_one_work+0x280/0x480
        worker_thread+0x23d/0x4b0
        kthread+0x163/0x180
        ret_from_fork+0x1f/0x30

 -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
        validate_chain+0x1f0d/0x33e0
        __lock_acquire+0x92d/0x1030
        lock_acquire+0xbe/0x1f0
        flush_workqueue+0x8c/0x560
        drain_workqueue+0x80/0x140
        destroy_workqueue+0x47/0x4f0
        __loop_clr_fd+0xb4/0x400 [loop]
        blkdev_put+0x14a/0x1d0
        blkdev_close+0x1c/0x20
        __fput+0xfd/0x220
        task_work_run+0x69/0xc0
        exit_to_user_mode_prepare+0x1ce/0x1f0
        syscall_exit_to_user_mode+0x26/0x60
        do_syscall_64+0x4c/0xb0
        entry_SYSCALL_64_after_hwframe+0x44/0xae

 other info that might help us debug this:

 Chain exists of:
   (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&lo->lo_mutex);
                                lock(&disk->open_mutex);
                                lock(&lo->lo_mutex);
   lock((wq_completion)loop0);

  *** DEADLOCK ***

 2 locks held by systemd-udevd/7568:
  #0: ffff888012554128 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_put+0x4c/0x1d0
  #1: ffff888014a7d4a0 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x4d/0x400 [loop]

 stack backtrace:
 CPU: 0 PID: 7568 Comm: systemd-udevd Tainted: G            E     5.14.0+ #757
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
 Call Trace:
  dump_stack_lvl+0x79/0xbf
  print_circular_bug+0x5d6/0x5e0
  ? stack_trace_save+0x42/0x60
  ? save_trace+0x3d/0x2d0
  check_noncircular+0x10b/0x120
  validate_chain+0x1f0d/0x33e0
  ? __lock_acquire+0x953/0x1030
  ? __lock_acquire+0x953/0x1030
  __lock_acquire+0x92d/0x1030
  ? flush_workqueue+0x70/0x560
  lock_acquire+0xbe/0x1f0
  ? flush_workqueue+0x70/0x560
  flush_workqueue+0x8c/0x560
  ? flush_workqueue+0x70/0x560
  ? sched_clock_cpu+0xe/0x1a0
  ? drain_workqueue+0x41/0x140
  drain_workqueue+0x80/0x140
  destroy_workqueue+0x47/0x4f0
  ? blk_mq_freeze_queue_wait+0xac/0xd0
  __loop_clr_fd+0xb4/0x400 [loop]
  ? __mutex_unlock_slowpath+0x35/0x230
  blkdev_put+0x14a/0x1d0
  blkdev_close+0x1c/0x20
  __fput+0xfd/0x220
  task_work_run+0x69/0xc0
  exit_to_user_mode_prepare+0x1ce/0x1f0
  syscall_exit_to_user_mode+0x26/0x60
  do_syscall_64+0x4c/0xb0
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f0fd4c661f7
 Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 13 fc ff ff
 RSP: 002b:00007ffd1c9e9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
 RAX: 0000000000000000 RBX: 00007f0fd46be6c8 RCX: 00007f0fd4c661f7
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000006
 RBP: 0000000000000006 R08: 000055fff1eaf400 R09: 0000000000000000
 R10: 00007f0fd46be6c8 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000000000002f08 R15: 00007ffd1c9ea050

Commit 1c500ad706 ("loop: reduce the loop_ctl_mutex scope") is for
breaking "loop_ctl_mutex => &lo->lo_mutex" dependency chain. But enabling
a different block module results in forming circular locking dependency
due to shared major_names_lock mutex.

The simplest fix is to call probe function without holding
major_names_lock [1], but Christoph Hellwig does not like such idea.
Therefore, instead of holding major_names_lock in blkdev_show(),
introduce a different lock for blkdev_show() in order to break
"sb_writers#$N => &p->lock => major_names_lock" dependency chain.

Link: https://lkml.kernel.org/r/b2af8a5b-3c1b-204e-7f56-bea0b15848d6@i-love.sakura.ne.jp [1]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Link: https://lore.kernel.org/r/18a02da2-0bf3-550e-b071-2b4ab13c49f0@i-love.sakura.ne.jp
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-07 08:36:21 -06:00
Linus Torvalds
1dbe7e386f block-5.15-2021-09-05
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmE1hQcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprjfEACdG+medwQOPpKNSoAvQmYyQnZRMbPjiruv
 A4nW2L6MKaExO59qQLVbBYHaH2+ng2UR/p5jNi2AKm+hrQEYllxlNvuCkRBIn97J
 r45R48mzBbHjR4kE3Fdu1mOFpBWOuU9JrtzHI+JF/Sl/qPIxKYNHf5E66T6l90Fz
 0hJkorAoVB7+hQYixdmkM9quZy11D5SY3aM+bG8r2uNjZTBEHMfmOen8o1giR0vC
 EOHzObuC6WLjLGQInNW+Cq2//vVVybQa79mhOUMp93z5nhDMtwUu7MH4B4kmGpix
 GLjDa1DukUZe7nGcnsRKmjjXQ+BpG6YF52Z2RfVZpWZn83t5c4YQsq++TPZ8KfpK
 4NAFFuSbGM/+QWwEiiyWu00syvpzrEJ4ZIJyZX3FYEeKyKWVRGHqlMDcS9LstYOk
 4OfgQUcJ7f/fXeedwi0OGJS1BLr6fi8RnazIafCNIIJLe1XIwTsNufPCNxWYqDAi
 0XhH+uYGD38VoUiR5JymZku6frwY4kxssA1khPPE5jWbzCZXiHprwwzaP4hBNNeZ
 c5cn9/1ZQSoTE3ebrX9pzTn5wRZwAL+iDhZ2SpLlN2Ji1BJ4EM9H8qFGj3U/CSM4
 OWKY2c0VwJYQUhjO4QDBx0MblJgNy8HsvmqGETuxUlk56j3Q1Mx3ViPV43amP9eM
 OM4mGige3Q==
 =4SCA
 -----END PGP SIGNATURE-----

Merge tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "Was going to send this one in later this week, but given that -Werror
  is now enabled (or at least available), the mq-deadline fix really
  should go in for the folks hitting that.

   - Ensure dd_queued() is only there if needed (Geert)

   - Fix a kerneldoc warning for bio_alloc_kiocb()

   - BFQ fix for queue merging

   - loop locking fix (Tetsuo)"

* tag 'block-5.15-2021-09-05' of git://git.kernel.dk/linux-block:
  loop: reduce the loop_ctl_mutex scope
  bio: fix kerneldoc documentation for bio_alloc_kiocb()
  block, bfq: honor already-setup queue merges
  block/mq-deadline: Move dd_queued() to fix defined but not used warning
2021-09-06 10:06:26 -07:00
Linus Torvalds
14726903c8 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "173 patches.

  Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
  pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
  bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
  hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
  oom-kill, migration, ksm, percpu, vmstat, and madvise)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (173 commits)
  mm/madvise: add MADV_WILLNEED to process_madvise()
  mm/vmstat: remove unneeded return value
  mm/vmstat: simplify the array size calculation
  mm/vmstat: correct some wrong comments
  mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
  selftests: vm: add COW time test for KSM pages
  selftests: vm: add KSM merging time test
  mm: KSM: fix data type
  selftests: vm: add KSM merging across nodes test
  selftests: vm: add KSM zero page merging test
  selftests: vm: add KSM unmerge test
  selftests: vm: add KSM merge test
  mm/migrate: correct kernel-doc notation
  mm: wire up syscall process_mrelease
  mm: introduce process_mrelease system call
  memblock: make memblock_find_in_range method private
  mm/mempolicy.c: use in_task() in mempolicy_slab_node()
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  mm/mempolicy: advertise new MPOL_PREFERRED_MANY
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  ...
2021-09-03 10:08:28 -07:00
Christoph Hellwig
f358afc52c mm: remove flush_kernel_dcache_page
flush_kernel_dcache_page is a rather confusing interface that implements a
subset of flush_dcache_page by not being able to properly handle page
cache mapped pages.

The only callers left are in the exec code as all other previous callers
were incorrect as they could have dealt with page cache pages.  Replace
the calls to flush_kernel_dcache_page with calls to flush_dcache_page,
which for all architectures does either exactly the same thing, can
contains one or more of the following:

 1) an optimization to defer the cache flush for page cache pages not
    mapped into userspace
 2) additional flushing for mapped page cache pages if cache aliases
    are possible

Link: https://lkml.kernel.org/r/20210712060928.4161649-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Shi <alexs@kernel.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:13 -07:00
Jens Axboe
0ef47db1cb bio: fix kerneldoc documentation for bio_alloc_kiocb()
Apparently the last fixup got butter fingered a bit, the correct variable
name is 'nr_vecs', not 'nr_iovecs'.

Link: https://lore.kernel.org/lkml/20210903164939.02f6e8c5@canb.auug.org.au/
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-03 07:42:13 -06:00
Linus Torvalds
a9c9a6f741 SCSI misc on 20210902
This series consists of the usual driver updates (ufs, qla2xxx,
 target, smartpqi, lpfc, mpt3sas).  The core change causing the most
 churn was replacing the command request field request with a macro,
 allowing us to offset map to it and remove the redundant field; the
 same was also done for the tag field.  The most impactful change is
 the final removal of scsi_ioctl, which has been deprecated for over a
 decade.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYTD/TiYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishdUkAQCjb3Ux
 4K9438mMelHlzM4er1S1IJ0WNnvObaVMNO9LBwD+JUz+rHsrKvuEX9j3g3C3u6JH
 hC3BUEW8f2LLnujWanQ=
 =lC5o
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This series consists of the usual driver updates (ufs, qla2xxx,
  target, smartpqi, lpfc, mpt3sas).

  The core change causing the most churn was replacing the command
  request field request with a macro, allowing us to offset map to it
  and remove the redundant field; the same was also done for the tag
  field.

  The most impactful change is the final removal of scsi_ioctl, which
  has been deprecated for over a decade"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
  scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
  scsi: ufs: ufs-exynos: Fix static checker warning
  scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
  scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
  scsi: lpfc: Copyright updates for 14.0.0.1 patches
  scsi: lpfc: Update lpfc version to 14.0.0.1
  scsi: lpfc: Add bsg support for retrieving adapter cmf data
  scsi: lpfc: Add cmf_info sysfs entry
  scsi: lpfc: Add debugfs support for cm framework buffers
  scsi: lpfc: Add support for maintaining the cm statistics buffer
  scsi: lpfc: Add rx monitoring statistics
  scsi: lpfc: Add support for the CM framework
  scsi: lpfc: Add cmfsync WQE support
  scsi: lpfc: Add support for cm enablement buffer
  scsi: lpfc: Add cm statistics buffer support
  scsi: lpfc: Add EDC ELS support
  scsi: lpfc: Expand FPIN and RDF receive logging
  scsi: lpfc: Add MIB feature enablement support
  scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
  scsi: fc: Add EDC ELS definition
  ...
2021-09-02 15:09:46 -07:00
Paolo Valente
2d52c58b9c block, bfq: honor already-setup queue merges
The function bfq_setup_merge prepares the merging between two
bfq_queues, say bfqq and new_bfqq. To this goal, it assigns
bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives,
the process that generated that I/O is disassociated from bfqq and
associated with new_bfqq (merging is actually a redirection). In this
respect, bfq_setup_merge increases new_bfqq->ref in advance, adding
the number of processes that are expected to be associated with
new_bfqq.

Unfortunately, the stable-merging mechanism interferes with this
setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and
before all the expected processes have been associated with
bfqq->new_bfqq, bfqq may happen to be stably merged with a different
queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq
gets changed. So, some of the processes that have been already
accounted for in the ref counter of the previous new_bfqq will not be
associated with that queue.  This creates an unbalance, because those
references will never be decremented.

This commit fixes this issue by reestablishing the previous, natural
behaviour: once bfqq->new_bfqq has been set, it will not be changed
until all expected redirections have occurred.

Signed-off-by: Davide Zini <davidezini2@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02 06:36:58 -06:00