Merges have the same problem that forced-bios had which is fixed by
the previous patch. The cost of a merge is calculated at the time of
issue and force-advances vtime into the future. Until global vtime
catches up, how the cgroup's hweight changes in the meantime doesn't
matter and it often leads to situations where the cost is calculated
at one hweight and paid at a very different one. See the previous
patch for more details.
Fix it by never advancing vtime into the future for merges. If budget
is available, vtime is advanced. Otherwise, the cost is charged as
debt.
This brings merge cost handling in line with issue cost handling in
ioc_rqos_throttle().
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, when a bio needs to be force-charged and there isn't enough
budget, vtime is simply pushed into the future. This means that the
cost of the whole bio is scaled using the current hweight and then
charged immediately. Until the global vtime advances beyond this
future vtime, the cgroup won't be allowed to issue normal IOs.
This is incorrect and can lead to, for example, exploding vrate or
extended stalls if vrate range is constrained. Consider the following
scenario.
1. A cgroup with a very low hweight runs out of budget.
2. A storm of swap-out happens on it. All of them are scaled
according to the current low hweight and charged to vtime pushing
it to a far future.
3. All other cgroups go idle and now the above cgroup has access to
the whole device. However, because vtime is already wound using
the past low hweight, what its current hweight is doesn't matter
until global vtime catches up to the local vtime.
4. As a result, either vrate gets ramped up extremely or the IOs stall
while the underlying device is idle.
This is because the hweight the overage is calculated at is different
from the hweight that it's being paid at.
Fix it by remembering the overage in absoulte vtime and continuously
paying with the actual budget according to the current hweight at each
period.
Note that non-forced bios which wait already remembers the cost in
absolute vtime. This brings forced-bio accounting in line.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ioc_pd_free() first cancels the hrtimers and then deactivates the
iocg. However, the iocg timer can run inbetween and reschedule the
hrtimers which will end up running after the iocg is freed leading to
crashes like the following.
general protection fault: 0000 [#1] SMP
...
RIP: 0010:iocg_kick_delay+0xbe/0x1b0
RSP: 0018:ffffc90003598ea0 EFLAGS: 00010046
RAX: 1cee00fd69512b54 RBX: ffff8881bba48400 RCX: 00000000000003e8
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881bba48400
RBP: 0000000000004e20 R08: 0000000000000002 R09: 00000000000003e8
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90003598ef0
R13: 00979f3810ad461f R14: ffff8881bba4b400 R15: 25439f950d26e1d1
FS: 0000000000000000(0000) GS:ffff88885f800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f64328c7e40 CR3: 0000000002409005 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
iocg_delay_timer_fn+0x3d/0x60
__hrtimer_run_queues+0xfe/0x270
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x5e/0x120
apic_timer_interrupt+0xf/0x20
</IRQ>
Fix it by canceling hrtimers after deactivating the iocg.
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_iocost_init() forgot to free its percpu stat on the error path.
Fix it.
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Reported-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a script which can be used to generate device-specific iocost
linear model coefficients.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patchset implements IO cost model based work-conserving
proportional controller.
While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others. In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.
One challenge of controlling IO resources is the lack of trivially
observable cost metric. The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern. However, the cost isn't a complete mystery. Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.
The function which determines the cost of a given IO is the IO cost
model for the device. This controller distributes IO capacity based
on the costs estimated by such model. The more accurate the cost
model the better but the controller adapts based on IO completion
latency and as long as the relative costs across differents IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.
Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well. All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.
Please see the top comment in blk-iocost.c and documentation for
more details.
v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
for a divide-by-zero bug in current_hweight() triggered by zero
inuse_sum.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>