The insert_requests and dispatch_request elevator operations are
mandatory for the correct execution of an elevator, and all implemented
elevators (bfq, kyber and mq-deadline) implement them. As a result,
there is no need to check for these operations before calling them when
a queue has an elevator set. This simplifies the code in
__blk_mq_sched_dispatch_requests() and blk_mq_sched_insert_request().
To avoid out-of-tree elevators to crash the kernel in case of bad
implementation, add a check in elv_register() to verify that these
operations are implemented.
A small, probably not significant, IOPS improvement of 0.1% is observed
with this patch applied (4.117 MIOPS to 4.123 MIOPS, average of 20 fio
runs doing 4K random direct reads with psync and 32 jobs).
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210618015922.713999-1-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer provides emulation of zone management operations
targeting all zones of a zoned block device only for the zone reset
operation (REQ_OP_ZONE_RESET). In order to correctly implement
exporting of zoned block devices with NVMeOF, emulating zone management
operations targeting all zones of a device is also necessary for the
open, close and finish zone operations (REQ_OP_ZONE_OPEN,
REQ_OP_ZONE_CLOSE and REQ_OP_ZONE_FINISH).
Instead of duplicating the code, export the existing helper from block
layer so we can use a bio chaining pattern that is present in the block
layer for REQ_OP_ZONE RESET all emulation in the NVMeOF zoned block
device backend.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Yi reported several kernel panics on:
[16687.001777] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
...
[16687.163549] pc : __rq_qos_track+0x38/0x60
or
[ 997.690455] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
...
[ 997.850347] pc : __rq_qos_done+0x2c/0x50
Turns out it is caused by race between adding rq qos(wbt) and normal IO
because rq_qos_add can be run when IO is being submitted, fix this issue
by freezing queue before adding/deleting rq qos to queue.
rq_qos_exit() needn't to freeze queue because it is called after queue
has been frozen.
iolatency calls rq_qos_add() during allocating queue, so freezing won't
add delay because queue usage refcount works at atomic mode at that
time.
iocost calls rq_qos_add() when writing cgroup attribute file, that is
fine to freeze queue at that time since we usually freeze queue when
storing to queue sysfs attribute, meantime iocost only exists on the
root cgroup.
wbt_init calls it in blk_register_queue() and queue sysfs attribute
store(queue_wb_lat_store() when write it 1st time in case of !BLK_WBT_MQ),
the following patch will speedup the queue freezing in wbt_init.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Link: https://lore.kernel.org/r/20210609015822.103433-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
tried to fix a problem that a process could be sleeping in rq_qos_wait()
without anyone to wake it up. However the fix is not complete and the
following can still happen:
CPU1 (waiter1) CPU2 (waiter2) CPU3 (waker)
rq_qos_wait() rq_qos_wait()
acquire_inflight_cb() -> fails
acquire_inflight_cb() -> fails
completes IOs, inflight
decreased
prepare_to_wait_exclusive()
prepare_to_wait_exclusive()
has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
has_sleeper = !wq_has_single_sleeper() -> true
io_schedule() io_schedule()
Deadlock as now there's nobody to wakeup the two waiters. The logic
automatically blocking when there are already sleepers is really subtle
and the only way to make it work reliably is that we check whether there
are some waiters in the queue when adding ourselves there. That way, we
are guaranteed that at least the first process to enter the wait queue
will recheck the waiting condition before going to sleep and thus
guarantee forward progress.
Fixes: 545fbd0775 ("rq-qos: fix missed wake-ups in rq_qos_throttle")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210607112613.25344-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
SCSI, ZNS and null_blk zoned devices support resetting all zones using
a single command (REQ_OP_ZONE_RESET_ALL), as indicated using the device
request queue flag QUEUE_FLAG_ZONE_RESETALL. This flag is not set for
device mapper targets creating zoned devices. In this case, a user
request for resetting all zones of a device is processed in
blkdev_zone_mgmt() by issuing a REQ_OP_ZONE_RESET operation for each
zone of the device. This leads to different behaviors of the
BLKRESETZONE ioctl() depending on the target device support for the
reset all operation. E.g.
blkzone reset /dev/sdX
will reset all zones of a SCSI device using a single command that will
ignore conventional, read-only or offline zones.
But a dm-linear device including conventional, read-only or offline
zones cannot be reset in the same manner as some of the single zone
reset operations issued by blkdev_zone_mgmt() will fail. E.g.:
blkzone reset /dev/dm-Y
blkzone: /dev/dm-0: BLKRESETZONE ioctl failed: Remote I/O error
To simplify applications and tools development, unify the behavior of
the all-zone reset operation by modifying blkdev_zone_mgmt() to not
issue a zone reset operation for conventional, read-only and offline
zones, thus mimicking what an actual reset-all device command does on a
device supporting REQ_OP_ZONE_RESET_ALL. This emulation is done using
the new function blkdev_zone_reset_all_emulated(). The zones needing a
reset are identified using a bitmap that is initialized using a zone
report. Since empty zones do not need a reset, also ignore these zones.
The function blkdev_zone_reset_all() is introduced for block devices
natively supporting reset all operations. blkdev_zone_mgmt() is modified
to call either function to execute an all zone reset request.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
[hch: split into multiple functions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Although the original intent was to use blk_update_request() in stacking
block drivers only, it is used much more widely today. Reflect this in the
documentation block above this function. See also:
* commit 32fab448e5 ("block: add request update interface").
* commit 2e60e02297 ("block: clean up request completion API").
* commit ed6565e734 ("block: handle partial completions for special
payload requests").
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210519175226.8853-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Provided the device driver does not implement dispatch budget accounting
(which only SCSI does) the loop in __blk_mq_do_dispatch_sched() pulls
requests from the IO scheduler as long as it is willing to give out any.
That defeats scheduling heuristics inside the scheduler by creating
false impression that the device can take more IO when it in fact
cannot.
For example with BFQ IO scheduler on top of virtio-blk device setting
blkio cgroup weight has barely any impact on observed throughput of
async IO because __blk_mq_do_dispatch_sched() always sucks out all the
IO queued in BFQ. BFQ first submits IO from higher weight cgroups but
when that is all dispatched, it will give out IO of lower weight cgroups
as well. And then we have to wait for all this IO to be dispatched to
the disk (which means lot of it actually has to complete) before the
IO scheduler is queried again for dispatching more requests. This
completely destroys any service differentiation.
So grab request tag for a request pulled out of the IO scheduler already
in __blk_mq_do_dispatch_sched() and do not pull any more requests if we
cannot get it because we are unlikely to be able to dispatch it. That
way only single request is going to wait in the dispatch list for some
tag to free.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a flag to indicate that __device_add_disk did grab a queue reference
so that disk_release only drops it if we actually had it. This sort
out one of the major pitfals with partially initialized gendisk that
a lot of drivers did get wrong or still do.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace the check for DRIVER_SENSE with a check for
scsi_status_is_check_condition().
Audit all callsites to ensure the SAM status is set correctly. For
backwards compability move the DRIVER_SENSE definition to sg.h, and update
sg, bsg, and scsi_ioctl to set the DRIVER_SENSE driver_status whenever
SAM_STAT_CHECK_CONDITION is present.
[mkp: fix zeroday srp warning]
Link: https://lore.kernel.org/r/20210427083046.31620-10-hare@suse.de
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
fix
The tags used for an IO scheduler are currently per hctx.
As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.
This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.
Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b
In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.
Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).
For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with the possibly with swapping out a new sbitmap for old if
we need to grow.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
refcount_inc_not_zero() in bt_tags_iter() still may read one freed
request.
Fix the issue by the following approach:
1) hold a per-tags spinlock when reading ->rqs[tag] and calling
refcount_inc_not_zero in bt_tags_iter()
2) clearing stale request referred via ->rqs[tag] before freeing
request pool, the per-tags spinlock is held for clearing stale
->rq[tag]
So after we cleared stale requests, bt_tags_iter() won't observe
freed request any more, also the clearing will wait for pending
request reference.
The idea of clearing ->rqs[] is borrowed from John Garry's previous
patch and one recent David's patch.
Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter(), and
this way will prevent the request from being re-used when ->fn is
running. The approach is same as what we do during handling timeout.
Fix request use-after-free(UAF) related with completion race or queue
releasing:
- If one rq is referred before rq->q is frozen, then queue won't be
frozen before the request is released during iteration.
- If one rq is referred after rq->q is frozen, refcount_inc_not_zero()
will return false, and we won't iterate over this request.
However, still one request UAF not covered: refcount_inc_not_zero() may
read one freed request, and it will be handled in next patch.
Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For flush request, rq->end_io() may be called two times, one is from
timeout handling(blk_mq_check_expired()), another is from normal
completion(__blk_mq_end_request()).
Move blk_account_io_flush() after flush_rq->ref drops to zero, so
io accounting can be done just once for flush request.
Fixes: b686631865 ("block: add iostat counters for flush requests")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210511152236.763464-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg has always rejected to attach if any of the member tasks has shared
io_context. The rationale was that io_contexts can be shared across
different cgroups making it impossible to define what the appropriate
control behavior should be. However, this check causes more problems than it
solves:
* The check prevents controller enable and migrations but not CLONE_IO
itself, which can lead to surprises as the outcome changes depending on
the order of operations.
* Sharing within a cgroup is fine but the check can't distinguish that. This
leads to unnecessary conflicts with the recent CLONE_IO usage in io_uring.
io_context sharing doesn't make any difference for rq_qos based controllers
and the way it's used is safe as long as tasks aren't migrated dynamically
which is the vast majority of use cases. While we can try to make the check
more precise to avoid false positives, the added complexity doesn't seem
worthwhile. Let's just drop blkcg_can_attach().
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/YJrTvHbrRDbJjw+S@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have already delete block_dump feature in mark_inode_dirty() because
it can be replaced by tracepoints, now we also remove the part in
submit_bio() for the same reason. The part of block dump feature in
submit_bio() dump the write process, write region and sectors on the
target disk into kernel message. it can be replaced by
block_bio_queue tracepoint in submit_bio_checks(), so we do not need
block_dump anymore, remove the whole block_dump feature.
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
- Fix BLKRRPART and deletion race (Gulam, Christoph)
- NVMe pull request (Christoph):
- nvme-tcp corruption and timeout fixes (Sagi Grimberg, Keith
Busch)
- nvme-fc teardown fix (James Smart)
- nvmet/nvme-loop memory leak fixes (Wu Bo)"
* tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block:
block: fix a race between del_gendisk and BLKRRPART
block: prevent block device lookups at the beginning of del_gendisk
nvme-fc: clear q_live at beginning of association teardown
nvme-tcp: rerun io_work if req_list is not empty
nvme-tcp: fix possible use-after-completion
nvme-loop: fix memory leak in nvme_loop_create_ctrl()
nvmet: fix memory leak in nvmet_alloc_ctrl()
As an artifact of how gendisk lookup used to work in earlier kernels,
GENHD_FL_UP is only cleared very late in del_gendisk, and a global lock
is used to prevent opens from succeeding while del_gendisk is tearing
down the gendisk. Switch to clearing the flag early and under bd_mutex
so that callers can use bd_mutex to stabilize the flag, which removes
the need for the global mutex.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514131842.1600568-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A simplification of get_unaligned() clashes with callers that pass
in a character pointer, causing a harmless warning like:
block/partitions/msdos.c: In function 'msdos_partition':
include/asm-generic/unaligned.h:13:22: warning: 'packed' attribute ignored for field of type 'u8' {aka 'unsigned char'} [-Wattributes]
Remove the SYS_IND() macro with the get_unaligned() call
and just use the ->ind field directly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Pull block fixes from Jens Axboe:
- Fix for shared tag set exit (Bart)
- Correct ioctl range for zoned ioctls (Damien)
- Removed dead/unused function (Lin)
- Fix perf regression for shared tags (Ming)
- Fix out-of-bounds issue with kyber and preemption (Omar)
- BFQ merge fix (Paolo)
- Two error handling fixes for nbd (Sun)
- Fix weight update in blk-iocost (Tejun)
- NVMe pull request (Christoph):
- correct the check for using the inline bio in nvmet (Chaitanya
Kulkarni)
- demote unsupported command warnings (Chaitanya Kulkarni)
- fix corruption due to double initializing ANA state (me, Hou Pu)
- reset ns->file when open fails (Daniel Wagner)
- fix a NULL deref when SEND is completed with error in nvmet-rdma
(Michal Kalderon)
- Fix kernel-doc warning (Bart)
* tag 'block-5.13-2021-05-14' of git://git.kernel.dk/linux-block:
block/partitions/efi.c: Fix the efi_partition() kernel-doc header
blk-mq: Swap two calls in blk_mq_exit_queue()
blk-mq: plug request for shared sbitmap
nvmet: use new ana_log_size instead the old one
nvmet: seset ns->file when open fails
nbd: share nbd_put and return by goto put_nbd
nbd: Fix NULL pointer in flush_workqueue
blkdev.h: remove unused codes blk_account_rq
block, bfq: avoid circular stable merges
blk-iocost: fix weight updates of inner active iocgs
nvmet: demote fabrics cmd parse err msg to debug
nvmet: use helper to remove the duplicate code
nvmet: demote discovery cmd parse err msg to debug
nvmet-rdma: Fix NULL deref when SEND is completed with error
nvmet: fix inline bio check for passthru
nvmet: fix inline bio check for bdev-ns
nvme-multipath: fix double initialization of ANA state
kyber: fix out of bounds access when preempted
block: uapi: fix comment about block device ioctl
If a tag set is shared across request queues (e.g. SCSI LUNs) then the
block layer core keeps track of the number of active request queues in
tags->active_queues. blk_mq_tag_busy() and blk_mq_tag_idle() update that
atomic counter if the hctx flag BLK_MQ_F_TAG_QUEUE_SHARED is set. Make
sure that blk_mq_exit_queue() calls blk_mq_tag_idle() before that flag is
cleared by blk_mq_del_queue_tag_set().
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
Fixes: 0d2602ca30 ("blk-mq: improve support for shared tags maps")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210513171529.7977-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of shared sbitmap, request won't be held in plug list any more
sine commit 32bc15afed ("blk-mq: Facilitate a shared sbitmap per
tagset"), this way makes request merge from flush plug list & batching
submission not possible, so cause performance regression.
Yanhui reports performance regression when running sequential IO
test(libaio, 16 jobs, 8 depth for each job) in VM, and the VM disk
is emulated with image stored on xfs/megaraid_sas.
Fix the issue by recovering original behavior to allow to hold request
in plug list.
Cc: Yanhui Ma <yama@redhat.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: kashyap.desai@broadcom.com
Fixes: 32bc15afed ("blk-mq: Facilitate a shared sbitmap per tagset")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210514022052.1047665-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
BFQ may merge a new bfq_queue, stably, with the last bfq_queue
created. In particular, BFQ first waits a little bit for some I/O to
flow inside the new queue, say Q2, if this is needed to understand
whether it is better or worse to merge Q2 with the last queue created,
say Q1. This delayed stable merge is performed by assigning
bic->stable_merge_bfqq = Q1, for the bic associated with Q1.
Yet, while waiting for some I/O to flow in Q2, a non-stable queue
merge of Q2 with Q1 may happen, causing the bic previously associated
with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that,
Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to
be recycled as a non-shared bfq_queue. In that case, Q1 may then
happen to undergo a stable merge with the bfq_queue pointed by
bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to
Q1. So Q1 would be merged with itself.
This commit fixes this error by intercepting this situation, and
canceling the schedule of the stable merge.
Fixes: 430a67f9d6 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the weight of an active iocg is updated, weight_updated() is called
which in turn calls __propagate_weights() to update the active and inuse
weights so that the effective hierarchical weights are update accordingly.
The current implementation is incorrect for inner active nodes. For an
active leaf iocg, inuse can be any value between 1 and active and the
difference represents how much the iocg is donating. When weight is updated,
as long as inuse is clamped between 1 and the new weight, we're alright and
this is what __propagate_weights() currently implements.
However, that's not how an active inner node's inuse is set. An inner node's
inuse is solely determined by the ratio between the sums of inuse's and
active's of its children - ie. they're results of propagating the leaves'
active and inuse weights upwards. __propagate_weights() incorrectly applies
the same clamping as for a leaf when an active inner node's weight is
updated. Consider a hierarchy which looks like the following with saturating
workloads in AA and BB.
R
/ \
A B
| |
AA BB
1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
2. echo 200 > A/io.weight
3. __propagate_weights() update A's active to 200 and leave inuse at 100 as
it's already between 1 and the new active, making A:active=200,
A:inuse=100. As R's active_sum is updated along with A's active,
A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
hwi's remain unchanged at 0.5.
4. The weight of A is now twice that of B but AA and BB still have the same
hwi of 0.5 and thus are doing the same amount of IOs.
Fix it by making __propgate_weights() always calculate the inuse of an
active inner iocg based on the ratio of child_inuse_sum to child_active_sum.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@fb.com>
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Cc: stable@vger.kernel.org # v5.4+
Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>