Commit Graph

5296 Commits

Author SHA1 Message Date
Damien Le Moal
0fe37724f8 block: fix bd_size_lock use
Some block device drivers, e.g. the skd driver, call set_capacity() with
IRQ disabled. This results in lockdep ito complain about inconsistent
lock states ("inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage")
because set_capacity takes a block device bd_size_lock using the
functions spin_lock() and spin_unlock(). Ensure a consistent locking
state by replacing these calls with spin_lock_irqsave() and
spin_lock_irqrestore(). The same applies to bdev_set_nr_sectors().
With this fix, all lockdep complaints are resolved.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-28 07:31:50 -07:00
Baolin Wang
6c635caef4 blk-cgroup: Use cond_resched() when destroy blkgs
On !PREEMPT kernel, we can get below softlockup when doing stress
testing with creating and destroying block cgroup repeatly. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), or the system can accumulate a
huge number of blkgs in pathological cases. We can add a need_resched()
check on each loop and release locks and do cond_resched() if true
to avoid this issue, since the blkcg_destroy_blkgs() is not called
from atomic contexts.

[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701]  cgwb_release_workfn+0x104/0x158
[ 4757.010702]  process_one_work+0x1bc/0x3f0
[ 4757.010704]  worker_thread+0x164/0x468
[ 4757.010705]  kthread+0x108/0x138

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-28 07:31:48 -07:00
Dinghao Liu
ef49d40b61 block: Fix an error handling in add_partition
Once we have called device_initialize(), we should use put_device() to
give up the reference on error, just like what we have done on failure
of device_add().

Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 21:35:58 -07:00
Ming Lei
2569063c71 blk-mq: test QUEUE_FLAG_HCTX_ACTIVE for sbitmap_shared in hctx_may_queue
In case of blk_mq_is_sbitmap_shared(), we should test QUEUE_FLAG_HCTX_ACTIVE against
q->queue_flags instead of BLK_MQ_S_TAG_ACTIVE.

So fix it.

Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Fixes: f1b49fdc1c ("blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 21:25:17 -07:00
John Garry
02f938e9fe blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
Showing the hctx flags for when BLK_MQ_F_TAG_HCTX_SHARED is set gives
something like:

root@debian:/home/john# more /sys/kernel/debug/block/sda/hctx0/flags
alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED|3

Add the decoding for that flag.

Fixes: 32bc15afed ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-08 08:20:27 -07:00
Ming Lei
aebf5db917 block: fix use-after-free in disk_part_iter_next
Make sure that bdgrab() is done on the 'block_device' instance before
referring to it for avoiding use-after-free.

Cc: <stable@vger.kernel.org>
Reported-by: syzbot+825f0f9657d4e528046e@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05 11:35:17 -07:00
Jan Kara
6d4d273588 bfq: Fix computation of shallow depth
BFQ computes number of tags it allows to be allocated for each request type
based on tag bitmap. However it uses 1 << bitmap.shift as number of
available tags which is wrong. 'shift' is just an internal bitmap value
containing logarithm of how many bits bitmap uses in each bitmap word.
Thus number of tags allowed for some request types can be far to low.
Use proper bitmap.depth which has the number of tags instead.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05 11:33:50 -07:00
Tejun Heo
d16baa3f14 blk-iocost: fix NULL iocg deref from racing against initialization
When initializing iocost for a queue, its rqos should be registered before
the blkcg policy is activated to allow policy data initiailization to lookup
the associated ioc. This unfortunately means that the rqos methods can be
called on bios before iocgs are attached to all existing blkgs.

While the race is theoretically possible on ioc_rqos_throttle(), it mostly
happened in ioc_rqos_merge() due to the difference in how they lookup ioc.
The former determines it from the passed in @rqos and then bails before
dereferencing iocg if the looked up ioc is disabled, which most likely is
the case if initialization is still in progress. The latter looked up ioc by
dereferencing the possibly NULL iocg making it a lot more prone to actually
triggering the bug.

* Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look
  up ioc for consistency.

* Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before
  dereferencing it.

* Explain the danger of NULL iocgs in blk_iocost_init().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jonathan Lemon <bsd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05 11:33:32 -07:00
Andres Freund
dc30432605 block: add debugfs stanza for QUEUE_FLAG_NOWAIT
This was missed in 021a24460d. Leads to the numeric value of
QUEUE_FLAG_NOWAIT (i.e. 29) showing up in
/sys/kernel/debug/block/*/state.

Fixes: 021a24460d
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-29 16:47:46 -07:00
Christoph Hellwig
7b51e703a8 block: update some copyrights
Update copyrights for files that have gotten some major rewrites lately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-22 08:43:06 -07:00
Sebastian Andrzej Siewior
71425189b2 blk-mq: Don't complete on a remote CPU in force threaded mode
With force threaded interrupts enabled, raising softirq from an SMP
function call will always result in waking the ksoftirqd thread. This is
not optimal given that the thread runs at SCHED_OTHER priority.

Completing the request in hard IRQ-context on PREEMPT_RT (which enforces
the force threaded mode) is bad because the completion handler may
acquire sleeping locks which violate the locking context.

Disable request completing on a remote CPU in force threaded mode.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17 13:41:30 -07:00
Baolin Wang
76efc1c770 blk-iocost: Add iocg idle state tracepoint
It will be helpful to trace the iocg's whole state, including active and
idle state. And we can easily expand the original iocost_iocg_activate
trace event to support a state trace class, including active and idle
state tracing.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17 07:55:44 -07:00
Daniel Wagner
e6582cb5da blk-mq: Remove 'running from the wrong CPU' warning
It's guaranteed that no request is in flight when a hctx is going
offline. This warning is only triggered when the wq's CPU is hot
plugged and the blk-mq is not synced up yet.

As this state is temporary and the request is still processed
correctly, better remove the warning as this is the fast path.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-16 14:55:25 -07:00
Linus Torvalds
60f7c503d9 SCSI misc on 20201216
This series consists of the usual driver updates (ufs, qla2xxx,
 smartpqi, target, zfcp, fnic, mpt3sas, ibmvfc) plus a load of
 cleanups, a major power management rework and a load of assorted minor
 updates.  There are a few core updates (formatting fixes being the big
 one) but nothing major this cycle.
 
 Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
 -----BEGIN PGP SIGNATURE-----
 
 iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCX9o0KSYcamFtZXMuYm90
 dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishbOZAP9D5NTN
 J7dJUo2MIMy84YBu+d9ag7yLlNiRWVY2yw5vHwD/Z7JjAVLwz/tzmyjU9//o2J6w
 hwhOv6Uto89gLCWSEz8=
 =KUPT
 -----END PGP SIGNATURE-----

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
 "This consists of the usual driver updates (ufs, qla2xxx, smartpqi,
  target, zfcp, fnic, mpt3sas, ibmvfc) plus a load of cleanups, a major
  power management rework and a load of assorted minor updates.

  There are a few core updates (formatting fixes being the big one) but
  nothing major this cycle"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits)
  scsi: mpt3sas: Update driver version to 36.100.00.00
  scsi: mpt3sas: Handle trigger page after firmware update
  scsi: mpt3sas: Add persistent MPI trigger page
  scsi: mpt3sas: Add persistent SCSI sense trigger page
  scsi: mpt3sas: Add persistent Event trigger page
  scsi: mpt3sas: Add persistent Master trigger page
  scsi: mpt3sas: Add persistent trigger pages support
  scsi: mpt3sas: Sync time periodically between driver and firmware
  scsi: qla2xxx: Update version to 10.02.00.104-k
  scsi: qla2xxx: Fix device loss on 4G and older HBAs
  scsi: qla2xxx: If fcport is undergoing deletion complete I/O with retry
  scsi: qla2xxx: Fix the call trace for flush workqueue
  scsi: qla2xxx: Fix flash update in 28XX adapters on big endian machines
  scsi: qla2xxx: Handle aborts correctly for port undergoing deletion
  scsi: qla2xxx: Fix N2N and NVMe connect retry failure
  scsi: qla2xxx: Fix FW initialization error on big endian machines
  scsi: qla2xxx: Fix crash during driver load on big endian machines
  scsi: qla2xxx: Fix compilation issue in PPC systems
  scsi: qla2xxx: Don't check for fw_started while posting NVMe command
  scsi: qla2xxx: Tear down session if FW say it is down
  ...
2020-12-16 13:34:31 -08:00
Linus Torvalds
69f637c335 for-5.11/drivers-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XgdYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjTBD/4me2TNvGOogbcL0b1leAotndJ7spI/IcFM
 NUMNy3pOGuRBcRjwle85xq44puAjlNkZE2LLatem5sT7ZvS+8lPNnOIoTYgfaCjt
 PhKx2sKlLumVm3BwymYAPcPtke4fikGG15Mwu5nX1oOehmyGrjObGAr3Lo6gexCT
 tQoCOczVqaTsV+iTXrLlmgEgs07J9Tm93uh2cNR8Jgroxb8ivuWeUq4YgbV4kWk+
 Y8XvOyVE/yba0vQf5/hHtWuVoC6RdELnqZ6NCkcP/EicdBecwk1GMJAej1S3zPS1
 0BT7GSFTpm3YUHcygD6LRmRg4I/BmWDTDtMi84+jLat6VvSG1HwIm//qHiCJh3ku
 SlvFZENIWAv5LP92x2vlR5Lt7uE3GK2V/5Pxt2fekyzCth6mzu+hLH4CBPQ3xgyd
 E1JqIQ/ilbXstp+EYoivV5x8yltZQnKEZRopws0EOqj1LsmDPj9XT1wzE9RnB0o+
 PWu/DNhQFhhcmP7Z8uLgPiKIVpyGs+vjxiJLlTtGDFTCy6M5JbcgzGkEkSmnybxH
 7lSanjpLt1dWj85FBMc6fNtJkv2rBPfb4+j0d1kZ45Dzcr4umirGIh7wtCHcgc83
 brmXSt29hlKHseSHMMuNWK8haXcgAE7gq9tD8GZ/kzM7+vkmLLxHJa22Qhq5rp4w
 URPeaBaQJw==
 =ayp2
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Nothing major in here:

   - NVMe pull request from Christoph:
        - nvmet passthrough improvements (Chaitanya Kulkarni)
        - fcloop error injection support (James Smart)
        - read-only support for zoned namespaces without Zone Append
          (Javier González)
        - improve some error message (Minwoo Im)
        - reject I/O to offline fabrics namespaces (Victor Gladkov)
        - PCI queue allocation cleanups (Niklas Schnelle)
        - remove an unused allocation in nvmet (Amit Engel)
        - a Kconfig spelling fix (Colin Ian King)
        - nvme_req_qid simplication (Baolin Wang)

   - MD pull request from Song:
        - Fix race condition in md_ioctl() (Dae R. Jeong)
        - Initialize read_slot properly for raid10 (Kevin Vigor)
        - Code cleanup (Pankaj Gupta)
        - md-cluster resync/reshape fix (Zhao Heming)

   - Move null_blk into its own directory (Damien Le Moal)

   - null_blk zone and discard improvements (Damien Le Moal)

   - bcache race fix (Dongsheng Yang)

   - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
     Lutz Pogrell, Md Haris Iqbal)

   - lightnvm NULL pointer deref fix (tangzhenhao)

   - sr in_interrupt() removal (Sebastian Andrzej Siewior)

   - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
     Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
     as it made it easier for them to funnel the feature through the
     block driver tree.

   - Follow up fixes (Colin Ian King)"

* tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
  block: drop dead assignments in loop_init()
  sr: Remove in_interrupt() usage in sr_init_command().
  sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
  cdrom: Reset sector_size back it is not 2048.
  drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
  null_blk: Move driver into its own directory
  null_blk: Allow controlling max_hw_sectors limit
  null_blk: discard zones on reset
  null_blk: cleanup discard handling
  null_blk: Improve implicit zone close
  null_blk: improve zone locking
  block: Align max_hw_sectors to logical blocksize
  null_blk: Fail zone append to conventional zones
  null_blk: Fix zone size initialization
  bcache: fix race between setting bdev state to none and new write request direct to backing
  block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
  block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
  block/rnbd: call kobject_put in the failure path
  Documentation/ABI/rnbd-srv: add document for force_close
  block/rnbd-srv: close a mapped device from server side.
  ...
2020-12-16 13:09:32 -08:00
Linus Torvalds
ac7ac4618c for-5.11/block-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK
 2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb
 wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV
 N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3
 geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq
 e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF
 fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku
 IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY
 Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A
 Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/
 GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg
 Q1Xqs6xwww==
 =zo4w
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Another series of killing more code than what is being added, again
  thanks to Christoph's relentless cleanups and tech debt tackling.

  This contains:

   - blk-iocost improvements (Baolin Wang)

   - part0 iostat fix (Jeffle Xu)

   - Disable iopoll for split bios (Jeffle Xu)

   - block tracepoint cleanups (Christoph Hellwig)

   - Merging of struct block_device and hd_struct (Christoph Hellwig)

   - Rework/cleanup of how block device sizes are updated (Christoph
     Hellwig)

   - Simplification of gendisk lookup and removal of block device
     aliasing (Christoph Hellwig)

   - Block device ioctl cleanups (Christoph Hellwig)

   - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)

   - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)

   - sbitmap improvements (Pavel Begunkov)

   - Hybrid polling fix (Pavel Begunkov)

   - bvec iteration improvements (Pavel Begunkov)

   - Zone revalidation fixes (Damien Le Moal)

   - blk-throttle limit fix (Yu Kuai)

   - Various little fixes"

* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
  blk-mq: fix msec comment from micro to milli seconds
  blk-mq: update arg in comment of blk_mq_map_queue
  blk-mq: add helper allocating tagset->tags
  Revert "block: Fix a lockdep complaint triggered by request queue flushing"
  nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
  blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
  block: disable iopoll for split bio
  block: Improve blk_revalidate_disk_zones() checks
  sbitmap: simplify wrap check
  sbitmap: replace CAS with atomic and
  sbitmap: remove swap_lock
  sbitmap: optimise sbitmap_deferred_clear()
  blk-mq: skip hybrid polling if iopoll doesn't spin
  blk-iocost: Factor out the base vrate change into a separate function
  blk-iocost: Factor out the active iocgs' state check into a separate function
  blk-iocost: Move the usage ratio calculation to the correct place
  blk-iocost: Remove unnecessary advance declaration
  blk-iocost: Fix some typos in comments
  blktrace: fix up a kerneldoc comment
  block: remove the request_queue to argument request based tracepoints
  ...
2020-12-16 12:57:51 -08:00
Linus Torvalds
adb35e8dc9 Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and
    is now a prerequisite for the new preemptible kmap_local() API which aims
    to replace kmap_atomic().
 
  - A fair amount of topology and NUMA related improvements
 
  - Improvements for the frequency invariant calculations
 
  - Enhanced robustness for the global CPU priority tracking and decision
    making
 
  - The usual small fixes and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
 /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
 sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
 dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
 u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
 Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
 CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
 MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
 e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
 Gsq0rJW5iiu/OQ==
 =Oz1V
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - migrate_disable/enable() support which originates from the RT tree
   and is now a prerequisite for the new preemptible kmap_local() API
   which aims to replace kmap_atomic().

 - A fair amount of topology and NUMA related improvements

 - Improvements for the frequency invariant calculations

 - Enhanced robustness for the global CPU priority tracking and decision
   making

 - The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  sched/fair: Trivial correction of the newidle_balance() comment
  sched/fair: Clear SMT siblings after determining the core is not idle
  sched: Fix kernel-doc markup
  x86: Print ratio freq_max/freq_base used in frequency invariance calculations
  x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
  x86, sched: Calculate frequency invariance for AMD systems
  irq_work: Optimize irq_work_single()
  smp: Cleanup smp_call_function*()
  irq_work: Cleanup
  sched: Limit the amount of NUMA imbalance that can exist at fork time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Rename nr_running and break out the magic number
  sched: Make migrate_disable/enable() independent of RT
  sched/topology: Condition EAS enablement on FIE support
  arm64: Rebuild sched domains on invariance status changes
  sched/topology,schedutil: Wrap sched domains rebuild
  sched/uclamp: Allow to reset a task uclamp constraint value
  sched/core: Fix typos in comments
  Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
  ...
2020-12-14 18:29:11 -08:00
Minwoo Im
fa94ba8a7b blk-mq: fix msec comment from micro to milli seconds
Delay to wait for queue running is milli second unit which is passed to
delayed work via msecs_to_jiffies() which is to convert milliseconds to
jiffies.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12 11:13:41 -07:00
Minwoo Im
d220a21410 blk-mq: update arg in comment of blk_mq_map_queue
Update mis-named argument description of blk_mq_map_queue().  This patch
also updates description that argument to software queue percpu context.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12 11:13:41 -07:00
Minwoo Im
91cdf265b7 blk-mq: add helper allocating tagset->tags
tagset->set is allocated from blk_mq_alloc_tag_set() rather than being
reallocated.  This patch added a helper to make its meaning explicitly
which is to allocate rather than to reallocate.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12 11:13:41 -07:00
Ming Lei
7aa390ec2d Revert "block: Fix a lockdep complaint triggered by request queue flushing"
This reverts commit b3c6a59975.

Now we can avoid nvme-loop lockdep warning of 'lockdep possible recursive locking'
by nvme-loop's lock class, no need to apply dynamically allocated lock class key,
so revert commit b3c6a5997541("block: Fix a lockdep complaint triggered by request
queue flushing").

This way fixes horrible SCSI probe delay issue on megaraid_sas, and it is reported
the whole probe may take more than half an hour.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 20:30:19 -07:00
Ming Lei
fb01a2932e blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
flush_end_io() may be called recursively from some driver, such as
nvme-loop, so lockdep may complain 'possible recursive locking'.
Commit b3c6a5997541("block: Fix a lockdep complaint triggered by
request queue flushing") tried to address this issue by assigning
dynamically allocated per-flush-queue lock class. This solution
adds synchronize_rcu() for each hctx's release handler, and causes
horrible SCSI MQ probe delay(more than half an hour on megaraid sas).

Add new API of blk_mq_hctx_set_fq_lock_class() for these drivers, so
we just need to use driver specific lock class for avoiding the
lockdep warning of 'possible recursive locking'.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reported-by: Qian Cai <cai@redhat.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 20:30:19 -07:00
Jeffle Xu
cc29e1bf0d block: disable iopoll for split bio
iopoll is initially for small size, latency sensitive IO. It doesn't
work well for big IO, especially when it needs to be split to multiple
bios. In this case, the returned cookie of __submit_bio_noacct_mq() is
indeed the cookie of the last split bio. The completion of *this* last
split bio done by iopoll doesn't mean the whole original bio has
completed. Callers of iopoll still need to wait for completion of other
split bios.

Besides bio splitting may cause more trouble for iopoll which isn't
supposed to be used in case of big IO.

iopoll for split bio may cause potential race if CPU migration happens
during bio submission. Since the returned cookie is that of the last
split bio, polling on the corresponding hardware queue doesn't help
complete other split bios, if these split bios are enqueued into
different hardware queues. Since interrupts are disabled for polling
queues, the completion of these other split bios depends on timeout
mechanism, thus causing a potential hang.

iopoll for split bio may also cause hang for sync polling. Currently
both the blkdev and iomap-based fs (ext4/xfs, etc) support sync polling
in direct IO routine. These routines will submit bio without REQ_NOWAIT
flag set, and then start sync polling in current process context. The
process may hang in blk_mq_get_tag() if the submitted bio has to be
split into multiple bios and can rapidly exhaust the queue depth. The
process are waiting for the completion of the previously allocated
requests, which should be reaped by the following polling, and thus
causing a deadlock.

To avoid these subtle trouble described above, just disable iopoll for
split bio and return BLK_QC_T_NONE in this case. The side effect is that
non-HIPRI IO also returns BLK_QC_T_NONE now. It should be acceptable
since the returned cookie is never used for non-HIPRI IO.

Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 20:29:15 -07:00
Damien Le Moal
817046ecdd block: Align max_hw_sectors to logical blocksize
Block device drivers do not have to call blk_queue_max_hw_sectors() to
set a limit on request size if the default limit BLK_SAFE_MAX_SECTORS
is acceptable. However, this limit (255 sectors) may not be aligned
to the device logical block size which cannot be used as is for a
request maximum size. This is the case for the null_blk device driver.

Modify blk_queue_max_hw_sectors() to make sure that the request size
limits specified by the max_hw_sectors and max_sectors queue limits
are always aligned to the device logical block size. Additionally, to
avoid introducing a dependence on the execution order of this function
with blk_queue_logical_block_size(), also modify
blk_queue_logical_block_size() to perform the same alignment when the
logical block size is set after max_hw_sectors.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 17:36:03 -07:00
Damien Le Moal
2afdeb23e4 block: Improve blk_revalidate_disk_zones() checks
Improves the checks on the zones of a zoned block device done in
blk_revalidate_disk_zones() by making sure that the device report_zones
method did report at least one zone and that the zones reported exactly
cover the entire disk capacity, that is, that there are no missing zones
at the end of the disk sector range.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 17:34:21 -07:00
Pavel Begunkov
f6f371f7db blk-mq: skip hybrid polling if iopoll doesn't spin
If blk_poll() is not going to spin (i.e. @spin=false), it also must not
sleep in hybrid polling, otherwise it might be pretty suprising for
users trying to do a quick check and expecting no-wait behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 16:49:04 -07:00
Baolin Wang
926f75f6a9 blk-iocost: Factor out the base vrate change into a separate function
Factor out the base vrate change code into a separate function
to fimplify the ioc_timer_fn().

No functional change.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 13:20:31 -07:00
Baolin Wang
2474787a75 blk-iocost: Factor out the active iocgs' state check into a separate function
Factor out the iocgs' state check into a separate function to
simplify the ioc_timer_fn().

No functional change.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 13:20:31 -07:00
Baolin Wang
c09245f61c blk-iocost: Move the usage ratio calculation to the correct place
We only use the hweight based usage ratio to calculate the new
hweight_inuse of the iocg to decide if this iocg can donate some
surplus vtime.

Thus move the usage ratio calculation to the correct place to
avoid unnecessary calculation for some vtime shortage iocgs.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 13:20:31 -07:00
Baolin Wang
647c9f03b2 blk-iocost: Remove unnecessary advance declaration
Remove unnecessary advance declaration of struct ioc_gq.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 13:20:31 -07:00
Baolin Wang
5ba1add216 blk-iocost: Fix some typos in comments
Fix some typos in comments.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-07 13:20:31 -07:00
Linus Torvalds
be1515bad7 block-5.10-2020-12-05
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/L/n0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgptxfD/9Dvasr1LIF9jTL5nU3iwwCAKVneI0VXsvn
 x9xLv+3AlvyhKJSpDxjyYrrsNwo2r2Hay/3Yix079wMyDNcjuUeaDVscQ2Ed6WXI
 slOhefyd4sgZf0kpB7TkwSrU1idoSEhOgcbZSiPLnlb071rYbJm5M6W15j9knUW0
 HfomlryBgTX6b0et+YHvboTQ+31ZWTpZdMk9XEvQhCVynPOwU858rRKnFZUQSJ04
 LGKXcA5xchxD8HVImwGv6QFTtFSzoyT3g68fEXEM9rPoEM8T0utVFPfF7o8L4Oui
 nLxpBThwN18xwcc2iei8Fko2H6RgAmaD4cR4rgZaesa4QavPtT8N+JebMX47Tvhy
 BBUbD/gRMPt1nO9ufPuLDyYBda2Ne5Z1DkBSIuYgZrreiOBdYGaJDbGIJ9uGwi+j
 u4QvooSMoXb/7XGojxaJCjPDlNyhOb+fbZaOJDU49xQR3k01B6vteDnxGBlvbYHn
 xuC82LTpTkWQT65ciaFAsW0L2lE6zlgkWWP3ahzH/qAg0HvODT9LDPNxvUpgZqUT
 zoUHua3jAY2JBudUMLlacD+/ZS/T7krMl2XLSvyKaOIAyAlaluxK/uykQ+Z3sWgF
 XGYQ/yv/4mQ1rTozqRFKndRPN++FCRzIwcHk8iBQCRlh8yLB/gVBr/Q6SgJrdnEj
 WWwc1tRmwA==
 =o4GN
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Single fix for an issue with chunk_sectors and stacked devices"

* tag 'block-5.10-2020-12-05' of git://git.kernel.dk/linux-block:
  block: use gcd() to fix chunk_sectors limit stacking
2020-12-05 14:45:30 -08:00
Linus Torvalds
b3298500b2 - Fix DM's bio splitting changes that were made during v5.9.
Restores splitting in terms of varied per-target ti->max_io_len
   rather than use block core's single stacked 'chunk_sectors' limit.
 
 - Like DM crypt, update DM integrity to not use crypto drivers that
   have CRYPTO_ALG_ALLOCATES_MEMORY set.
 
 - Fix DM writecache target's argument parsing and status display.
 
 - Remove needless BUG() from dm writecache's persistent_memory_claim()
 
 - Remove old gcc workaround in DM cache target's block_div() for ARM
   link errors now that gcc >= 4.9 is required.
 
 - Fix RCU locking in dm_blk_report_zones and dm_dax_zero_page_range.
 
 - Remove old, and now frowned upon, BUG_ON(in_interrupt()) in
   dm_table_event().
 
 - Remove invalid sparse annotations from dm_prepare_ioctl() and
   dm_unprepare_ioctl().
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAl/KonYTHHNuaXR6ZXJA
 cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWgSJCACNYEndubrROJZL+FOUQixzZQiOfphw
 9Brb/XbdXWXIv7F+JV85E6olOqz7JjTGrO91uD5kwHEtVhDx5zT/GCm+5FoBrLa/
 FuTphPRWNimZSU1umJe2AG9hOiDpPJJUe/wwj3QkBH2TeEHwBHblB8BkRFzxnP+p
 0dGybrQBMtrH3GO65YG7qaASeBPl1+G3mVHfzViyhk1uoZL1y9pKbzPK60TkHcsa
 VCGTPke5Ri3hvd85hmpDcXmyxjxZfCA8Jc/DrQ+DDEwakHoJFwlSzP7fqwHnpKHT
 RDL4iOID54SViSGqzNcxlGtr/EHyN9Mom2d4Nnb0cgsRG4woCeJWJZMM
 =+o1m
 -----END PGP SIGNATURE-----

Merge tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM's bio splitting changes that were made during v5.9. This
   restores splitting in terms of varied per-target ti->max_io_len
   rather than use block core's single stacked 'chunk_sectors' limit.

 - Like DM crypt, update DM integrity to not use crypto drivers that
   have CRYPTO_ALG_ALLOCATES_MEMORY set.

 - Fix DM writecache target's argument parsing and status display.

 - Remove needless BUG() from dm writecache's persistent_memory_claim()

 - Remove old gcc workaround in DM cache target's block_div() for ARM
   link errors now that gcc >= 4.9 is required.

 - Fix RCU locking in dm_blk_report_zones and dm_dax_zero_page_range.

 - Remove old, and now frowned upon, BUG_ON(in_interrupt()) in
   dm_table_event().

 - Remove invalid sparse annotations from dm_prepare_ioctl() and
   dm_unprepare_ioctl().

* tag 'for-5.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: remove invalid sparse __acquires and __releases annotations
  dm: fix double RCU unlock in dm_dax_zero_page_range() error path
  dm: fix IO splitting
  dm writecache: remove BUG() and fail gracefully instead
  dm table: Remove BUG_ON(in_interrupt())
  dm: fix bug with RCU locking in dm_blk_report_zones
  Revert "dm cache: fix arm link errors with inline"
  dm writecache: fix the maximum number of arguments
  dm writecache: advance the number of arguments when reporting max_age
  dm integrity: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
2020-12-04 13:28:39 -08:00
Mike Snitzer
3ee16db390 dm: fix IO splitting
Commit 882ec4e609 ("dm table: stack 'chunk_sectors' limit to account
for target-specific splitting") caused a couple regressions:
1) Using lcm_not_zero() when stacking chunk_sectors was a bug because
   chunk_sectors must reflect the most limited of all devices in the
   IO stack.
2) DM targets that set max_io_len but that do _not_ provide an
   .iterate_devices method no longer had there IO split properly.

And commit 5091cdec56 ("dm: change max_io_len() to use
blk_max_size_offset()") also caused a regression where DM no longer
supported varied (per target) IO splitting. The implication being the
potential for severely reduced performance for IO stacks that use a DM
target like dm-cache to hide performance limitations of a slower
device (e.g. one that requires 4K IO splitting).

Coming full circle: Fix all these issues by discontinuing stacking
chunk_sectors up using ti->max_io_len in dm_calculate_queue_limits(),
add optional chunk_sectors override argument to blk_max_size_offset()
and update DM's max_io_len() to pass ti->max_io_len to its
blk_max_size_offset() call.

Passing in an optional chunk_sectors override to blk_max_size_offset()
allows for code reuse of block's centralized calculation for max IO
size based on provided offset and split boundary.

Fixes: 882ec4e609 ("dm table: stack 'chunk_sectors' limit to account for target-specific splitting")
Fixes: 5091cdec56 ("dm: change max_io_len() to use blk_max_size_offset()")
Cc: stable@vger.kernel.org
Reported-by: John Dorminy <jdorminy@redhat.com>
Reported-by: Bruce Johnston <bjohnsto@redhat.com>
Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: John Dorminy <jdorminy@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
2020-12-04 14:53:15 -05:00
Christoph Hellwig
a54895fa05 block: remove the request_queue to argument request based tracepoints
The request_queue can trivially be derived from the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-04 09:42:00 -07:00
Christoph Hellwig
1c02fca620 block: remove the request_queue argument to the block_bio_remap tracepoint
The request_queue can trivially be derived from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-04 09:42:00 -07:00
Christoph Hellwig
eb6f7f7cd3 block: remove the request_queue argument to the block_split tracepoint
The request_queue can trivially be derived from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-04 09:42:00 -07:00
Christoph Hellwig
e8a676d61c block: simplify and extend the block_bio_merge tracepoint class
The block_bio_merge tracepoint class can be reused for most bio-based
tracepoints.  For that it just needs to lose the superfluous q and rq
parameters.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-04 09:42:00 -07:00
Yu Kuai
acaf523a7b blk-throttle: don't check whether or not lower limit is valid if CONFIG_BLK_DEV_THROTTLING_LOW is off
blk_throtl_update_limit_valid() will search for descendants to see if
'LIMIT_LOW' of bps/iops and READ/WRITE is nonzero. However, they're always
zero if CONFIG_BLK_DEV_THROTTLING_LOW is not set, furthermore, a lot of
time will be wasted to iterate descendants.

Thus do nothing in blk_throtl_update_limit_valid() in such situation.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02 12:44:20 -07:00
Jeffle Xu
b0d97557eb block: fix inflight statistics of part0
The inflight of partition 0 doesn't include inflight IOs to all
sub-partitions, since currently mq calculates inflight of specific
partition by simply camparing the value of the partition pointer.

Thus the following case is possible:

$ cat /sys/block/vda/inflight
       0        0
$ cat /sys/block/vda/vda1/inflight
       0      128

While single queue device (on a previous version, e.g. v3.10) has no
this issue:

$cat /sys/block/sda/sda3/inflight
       0       33
$cat /sys/block/sda/inflight
       0       33

Partition 0 should be specially handled since it represents the whole
disk. This issue is introduced since commit bf0ddaba65 ("blk-mq: fix
sysfs inflight counter").

Besides, this patch can also fix the inflight statistics of part 0 in
/proc/diskstats. Before this patch, the inflight statistics of part 0
doesn't include that of sub partitions. (I have marked the 'inflight'
field with asterisk.)

$cat /proc/diskstats
 259       0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
 259       2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0

This is introduced since commit f299b7c7a9 ("blk-mq: provide internal
in-flight variant").

Fixes: bf0ddaba65 ("blk-mq: fix sysfs inflight counter")
Fixes: f299b7c7a9 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[axboe: adapt for 5.11 partition change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02 12:43:02 -07:00
Pavel Begunkov
22b56c2964 bio: optimise bvec iteration
__bio_for_each_bvec(), __bio_for_each_segment() and bio_copy_data_iter()
fall under conditions of bvec_iter_advance_single(), which is a faster
and slimmer version of bvec_iter_advance(). Add
bio_advance_iter_single() and convert them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-02 09:46:55 -07:00
Christoph Hellwig
977115c0f6 block: stop using bdget_disk for partition 0
We can just dereference the point in struct gendisk instead.  Also
remove the now unused export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
0d02129e76 block: merge struct block_device and struct hd_struct
Instead of having two structures that represent each block device with
different life time rules, merge them into a single one.  This also
greatly simplifies the reference counting rules, as we can use the inode
reference count as the main reference count for the new struct
block_device, with the device model reference front ending it for device
model interaction.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
ad1eaa5344 block: switch disk_part_iter_* to use a struct block_device
Switch the partition iter infrastructure to iterate over block_device
references instead of hd_struct ones mostly used to get at the
block_device.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
71773cf797 block: pass a block_device to invalidate_partition
Pass the block_device actually needed instead of looking it up using
bdget_disk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
9fc995a6e0 block: pass a block_device to blk_alloc_devt
Pass the block_device actually needed instead of the hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
41e5c81984 block: remove the partno field from struct hd_struct
Just use the bd_partno field in struct block_device everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
8446fe9255 block: switch partition lookup to use struct block_device
Use struct block_device to lookup partitions on a disk.  This removes
all usage of struct hd_struct from the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
cb8432d650 block: allocate struct hd_struct as part of struct bdev_inode
Allocate hd_struct together with struct block_device to pre-load
the lifetime rule changes in preparation of merging the two structures.

Note that part0 was previously embedded into struct gendisk, but is
a separate allocation now, and already points to the block_device instead
of the hd_struct.  The lifetime of struct gendisk is still controlled by
the struct device embedded in the part0 hd_struct.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
83950d3590 block: move the policy field to struct block_device
Move the policy field to struct block_device and rename it to the
more descriptive bd_read_only.  Also turn the field into a bool as it
is used as such.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00