block, bfq: unconditionally plug I/O in asymmetric scenarios

bfq detects the creation of multiple bfq_queues shortly after each
other, namely a burst of queue creations in the terminology used in
the code. If the burst is large, then no queue in the burst is granted
- either I/O-dispatch plugging when the queue remains temporarily
  idle while in service;
- or weight raising, because it causes even longer plugging.

In fact, such a plugging tends to lower throughput, while these bursts
are typically due to applications or services that spawn multiple
processes, to reach a common goal as soon as possible. Examples are a
"git grep" or the booting of a system.

Unfortunately, disabling plugging may cause a loss of service
guarantees in asymmetric scenarios, i.e., if queue weights are
differentiated or if more than one group is active.

This commit addresses this issue by no longer disabling I/O-dispatch
plugging for queues in large bursts.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-29 12:06:32 +01:00 · 2019-01-29 12:06:32 +01:00 · 530c4cbb3c
commit 530c4cbb3c
parent ac8b0cb415
1 changed file with 165 additions and 181 deletions
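The behavioral change in this commit can be summarized with a small standalone sketch. It is not kernel code: the `sketch_*` structs and functions are hypothetical stand-ins, the `symmetric` field replaces the result of bfq_symmetric_scenario(), `tot_busy_queues` replaces bfq_tot_busy_queues(), and `in_large_burst` replaces bfq_bfqq_in_large_burst(). Only the shape of the two boolean expressions is taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct bfq_data. */
struct sketch_bfqd {
	int wr_busy_queues;	/* busy queues that are weight-raised */
	int tot_busy_queues;	/* all busy queues */
	bool symmetric;		/* stand-in for bfq_symmetric_scenario() */
};

/* Hypothetical, simplified stand-in for struct bfq_queue. */
struct sketch_bfqq {
	int wr_coeff;		/* > 1 means the queue is weight-raised */
	bool in_large_burst;	/* stand-in for bfq_bfqq_in_large_burst() */
};

/* Asymmetry test shared by the old and the new version of the function. */
static bool sketch_asymmetric(const struct sketch_bfqd *bfqd,
			      const struct sketch_bfqq *bfqq)
{
	return (bfqq->wr_coeff > 1 &&
		bfqd->wr_busy_queues < bfqd->tot_busy_queues) ||
		!bfqd->symmetric;
}

/* Before the patch: queues in a large burst are never plugged. */
static bool sketch_idling_needed_old(const struct sketch_bfqd *bfqd,
				     const struct sketch_bfqq *bfqq)
{
	return sketch_asymmetric(bfqd, bfqq) && !bfqq->in_large_burst;
}

/* After the patch: asymmetry alone decides, bursts are ignored. */
static bool sketch_idling_needed_new(const struct sketch_bfqd *bfqd,
				     const struct sketch_bfqq *bfqq)
{
	return sketch_asymmetric(bfqd, bfqq);
}
```

For a non-weight-raised queue that belongs to a large burst in an asymmetric scenario (`symmetric = false`, `in_large_burst = true`), the old predicate returns false and the new one returns true, which is the case the commit message describes.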
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -3479,191 +3479,175 @@ static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
 		bfqd->wr_busy_queues == 0;
 }
+/*
+ * There is a case where idling must be performed not for
+ * throughput concerns, but to preserve service guarantees.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, entails loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happen
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired
+ * device-throughput distribution only in a completely
+ * symmetric scenario where:
+ * (i)  each of these processes must get the same throughput as
+ *      the others;
+ * (ii) the I/O of each process has the same properties, in
+ *      terms of locality (sequential or random), direction
+ *      (reads or writes), request sizes, greediness
+ *      (from I/O-bound to sporadic), and so on.
+ * In fact, in such a scenario, the drive tends to treat
+ * the requests of each of these processes in about the same
+ * way as the requests of the others, and thus to provide
+ * each of these processes with about the same throughput
+ * (which is exactly the desired throughput distribution). In
+ * contrast, in any asymmetric scenario, device idling is
+ * certainly needed to guarantee that bfqq receives its
+ * assigned fraction of the device throughput (see [1] for
+ * details).
+ * The problem is that idling may significantly reduce
+ * throughput with certain combinations of types of I/O and
+ * devices. An important example is sync random I/O, on flash
+ * storage with command queueing. So, unless bfqq falls in the
+ * above cases where idling also boosts throughput, it would
+ * be important to check conditions (i) and (ii) accurately,
+ * so as to avoid idling when not strictly needed for service
+ * guarantees.
+ *
+ * Unfortunately, it is extremely difficult to thoroughly
+ * check condition (ii). And, in case there are active groups,
+ * it becomes very difficult to check condition (i) too. In
+ * fact, if there are active groups, then, for condition (i)
+ * to become false, it is enough that an active group contains
+ * more active processes or sub-groups than some other active
+ * group. More precisely, for condition (i) to hold because of
+ * such a group, it is not even necessary that the group is
+ * (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still
+ * have some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed
+ * their share of the throughput even after being
+ * dispatched. In this respect, it is easy to show that, if a
+ * group frequently becomes inactive while still having
+ * in-flight requests, and if, when this happens, the group is
+ * not considered in the calculation of whether the scenario
+ * is asymmetric, then the group may fail to be guaranteed its
+ * fair share of the throughput (basically because idling may
+ * not be performed for the descendant processes of the group,
+ * but it had to be).  We address this issue with the
+ * following bi-modal behavior, implemented in the function
+ * bfq_symmetric_scenario().
+ *
+ * If there are groups with requests waiting for completion
+ * (as commented above, some of these groups may even be
+ * already inactive), then the scenario is tagged as
+ * asymmetric, conservatively, without checking any of the
+ * conditions (i) and (ii). So the device is idled for bfqq.
+ * This behavior matches also the fact that groups are created
+ * exactly if controlling I/O is a primary concern (to
+ * preserve bandwidth and latency guarantees).
+ *
+ * On the opposite end, if there are no groups with requests
+ * waiting for completion, then only condition (i) is actually
+ * controlled, i.e., provided that condition (i) holds, idling
+ * is not performed, regardless of whether condition (ii)
+ * holds. In other words, only if condition (i) does not hold,
+ * then idling is allowed, and the device tends to be
+ * prevented from queueing many requests, possibly of several
+ * processes. Since there are no groups with requests waiting
+ * for completion, then, to control condition (i) it is enough
+ * to check just whether all the queues with requests waiting
+ * for completion also have the same weight.
+ *
+ * Not checking condition (ii) evidently exposes bfqq to the
+ * risk of getting less throughput than its fair share.
+ * However, for queues with the same weight, a further
+ * mechanism, preemption, mitigates or even eliminates this
+ * problem. And it does so without consequences on overall
+ * throughput. This mechanism and its benefits are explained
+ * in the next three paragraphs.
+ *
+ * Even if a queue, say Q, is expired when it remains idle, Q
+ * can still preempt the new in-service queue if the next
+ * request of Q arrives soon (see the comments on
+ * bfq_bfqq_update_budg_for_activation). If all queues and
+ * groups have the same weight, this form of preemption,
+ * combined with the hole-recovery heuristic described in the
+ * comments on function bfq_bfqq_update_budg_for_activation,
+ * are enough to preserve a correct bandwidth distribution in
+ * the mid term, even without idling. In fact, even if not
+ * idling allows the internal queues of the device to contain
+ * many requests, and thus to reorder requests, we can rather
+ * safely assume that the internal scheduler still preserves a
+ * minimum of mid-term fairness.
+ *
+ * More precisely, this preemption-based, idleless approach
+ * provides fairness in terms of IOPS, and not sectors per
+ * second. This can be seen with a simple example. Suppose
+ * that there are two queues with the same weight, but that
+ * the first queue receives requests of 8 sectors, while the
+ * second queue receives requests of 1024 sectors. In
+ * addition, suppose that each of the two queues contains at
+ * most one request at a time, which implies that each queue
+ * always remains idle after it is served. Finally, after
+ * remaining idle, each queue receives very quickly a new
+ * request. It follows that the two queues are served
+ * alternatively, preempting each other if needed. This
+ * implies that, although both queues have the same weight,
+ * the queue with large requests receives a service that is
+ * 1024/8 times as high as the service received by the other
+ * queue.
+ *
+ * The motivation for using preemption instead of idling (for
+ * queues with the same weight) is that, by not idling,
+ * service guarantees are preserved (completely or at least in
+ * part) without minimally sacrificing throughput. And, if
+ * there is no active group, then the primary expectation for
+ * this device is probably a high throughput.
+ *
+ * We are now left only with explaining the additional
+ * compound condition that is checked below for deciding
+ * whether the scenario is asymmetric. To explain this
+ * compound condition, we need to add that the function
+ * bfq_symmetric_scenario checks the weights of only
+ * non-weight-raised queues, for efficiency reasons (see
+ * comments on bfq_weights_tree_add()). Then the fact that
+ * bfqq is weight-raised is checked explicitly here. More
+ * precisely, the compound condition below takes into account
+ * also the fact that, even if bfqq is being weight-raised,
+ * the scenario is still symmetric if all queues with requests
+ * waiting for completion happen to be
+ * weight-raised. Actually, we should be even more precise
+ * here, and differentiate between interactive weight raising
+ * and soft real-time weight raising.
+ *
+ * As a side note, it is worth considering that the above
+ * device-idling countermeasures may however fail in the
+ * following unlucky scenario: if idling is (correctly)
+ * disabled in a time period during which all symmetry
+ * sub-conditions hold, and hence the device is allowed to
+ * enqueue many requests, but at some later point in time some
+ * sub-condition stops to hold, then it may become impossible
+ * to let requests be served in the desired order until all
+ * the requests already queued in the device have been served.
+ */
 static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
 						 struct bfq_queue *bfqq)
 {
-	/*
-	 * There is a case where idling must be performed not for
-	 * throughput concerns, but to preserve service guarantees.
-	 *
-	 * To introduce this case, we can note that allowing the drive
-	 * to enqueue more than one request at a time, and thereby
-	 * delegating de facto final scheduling decisions to the
-	 * drive's internal scheduler, entails loss of control on the
-	 * actual request service order. In particular, the critical
-	 * situation is when requests from different processes happen
-	 * to be present, at the same time, in the internal queue(s)
-	 * of the drive. In such a situation, the drive, by deciding
-	 * the service order of the internally-queued requests, does
-	 * determine also the actual throughput distribution among
-	 * these processes. But the drive typically has no notion or
-	 * concern about per-process throughput distribution, and
-	 * makes its decisions only on a per-request basis. Therefore,
-	 * the service distribution enforced by the drive's internal
-	 * scheduler is likely to coincide with the desired
-	 * device-throughput distribution only in a completely
-	 * symmetric scenario where:
-	 * (i)  each of these processes must get the same throughput as
-	 *      the others;
-	 * (ii) the I/O of each process has the same properties, in
-	 *      terms of locality (sequential or random), direction
-	 *      (reads or writes), request sizes, greediness
-	 *      (from I/O-bound to sporadic), and so on.
-	 * In fact, in such a scenario, the drive tends to treat
-	 * the requests of each of these processes in about the same
-	 * way as the requests of the others, and thus to provide
-	 * each of these processes with about the same throughput
-	 * (which is exactly the desired throughput distribution). In
-	 * contrast, in any asymmetric scenario, device idling is
-	 * certainly needed to guarantee that bfqq receives its
-	 * assigned fraction of the device throughput (see [1] for
-	 * details).
-	 * The problem is that idling may significantly reduce
-	 * throughput with certain combinations of types of I/O and
-	 * devices. An important example is sync random I/O, on flash
-	 * storage with command queueing. So, unless bfqq falls in the
-	 * above cases where idling also boosts throughput, it would
-	 * be important to check conditions (i) and (ii) accurately,
-	 * so as to avoid idling when not strictly needed for service
-	 * guarantees.
-	 *
-	 * Unfortunately, it is extremely difficult to thoroughly
-	 * check condition (ii). And, in case there are active groups,
-	 * it becomes very difficult to check condition (i) too. In
-	 * fact, if there are active groups, then, for condition (i)
-	 * to become false, it is enough that an active group contains
-	 * more active processes or sub-groups than some other active
-	 * group. More precisely, for condition (i) to hold because of
-	 * such a group, it is not even necessary that the group is
-	 * (still) active: it is sufficient that, even if the group
-	 * has become inactive, some of its descendant processes still
-	 * have some request already dispatched but still waiting for
-	 * completion. In fact, requests have still to be guaranteed
-	 * their share of the throughput even after being
-	 * dispatched. In this respect, it is easy to show that, if a
-	 * group frequently becomes inactive while still having
-	 * in-flight requests, and if, when this happens, the group is
-	 * not considered in the calculation of whether the scenario
-	 * is asymmetric, then the group may fail to be guaranteed its
-	 * fair share of the throughput (basically because idling may
-	 * not be performed for the descendant processes of the group,
-	 * but it had to be).  We address this issue with the
-	 * following bi-modal behavior, implemented in the function
-	 * bfq_symmetric_scenario().
-	 *
-	 * If there are groups with requests waiting for completion
-	 * (as commented above, some of these groups may even be
-	 * already inactive), then the scenario is tagged as
-	 * asymmetric, conservatively, without checking any of the
-	 * conditions (i) and (ii). So the device is idled for bfqq.
-	 * This behavior matches also the fact that groups are created
-	 * exactly if controlling I/O is a primary concern (to
-	 * preserve bandwidth and latency guarantees).
-	 *
-	 * On the opposite end, if there are no groups with requests
-	 * waiting for completion, then only condition (i) is actually
-	 * controlled, i.e., provided that condition (i) holds, idling
-	 * is not performed, regardless of whether condition (ii)
-	 * holds. In other words, only if condition (i) does not hold,
-	 * then idling is allowed, and the device tends to be
-	 * prevented from queueing many requests, possibly of several
-	 * processes. Since there are no groups with requests waiting
-	 * for completion, then, to control condition (i) it is enough
-	 * to check just whether all the queues with requests waiting
-	 * for completion also have the same weight.
-	 *
-	 * Not checking condition (ii) evidently exposes bfqq to the
-	 * risk of getting less throughput than its fair share.
-	 * However, for queues with the same weight, a further
-	 * mechanism, preemption, mitigates or even eliminates this
-	 * problem. And it does so without consequences on overall
-	 * throughput. This mechanism and its benefits are explained
-	 * in the next three paragraphs.
-	 *
-	 * Even if a queue, say Q, is expired when it remains idle, Q
-	 * can still preempt the new in-service queue if the next
-	 * request of Q arrives soon (see the comments on
-	 * bfq_bfqq_update_budg_for_activation). If all queues and
-	 * groups have the same weight, this form of preemption,
-	 * combined with the hole-recovery heuristic described in the
-	 * comments on function bfq_bfqq_update_budg_for_activation,
-	 * are enough to preserve a correct bandwidth distribution in
-	 * the mid term, even without idling. In fact, even if not
-	 * idling allows the internal queues of the device to contain
-	 * many requests, and thus to reorder requests, we can rather
-	 * safely assume that the internal scheduler still preserves a
-	 * minimum of mid-term fairness.
-	 *
-	 * More precisely, this preemption-based, idleless approach
-	 * provides fairness in terms of IOPS, and not sectors per
-	 * second. This can be seen with a simple example. Suppose
-	 * that there are two queues with the same weight, but that
-	 * the first queue receives requests of 8 sectors, while the
-	 * second queue receives requests of 1024 sectors. In
-	 * addition, suppose that each of the two queues contains at
-	 * most one request at a time, which implies that each queue
-	 * always remains idle after it is served. Finally, after
-	 * remaining idle, each queue receives very quickly a new
-	 * request. It follows that the two queues are served
-	 * alternatively, preempting each other if needed. This
-	 * implies that, although both queues have the same weight,
-	 * the queue with large requests receives a service that is
-	 * 1024/8 times as high as the service received by the other
-	 * queue.
-	 *
-	 * The motivation for using preemption instead of idling (for
-	 * queues with the same weight) is that, by not idling,
-	 * service guarantees are preserved (completely or at least in
-	 * part) without minimally sacrificing throughput. And, if
-	 * there is no active group, then the primary expectation for
-	 * this device is probably a high throughput.
-	 *
-	 * We are now left only with explaining the additional
-	 * compound condition that is checked below for deciding
-	 * whether the scenario is asymmetric. To explain this
-	 * compound condition, we need to add that the function
-	 * bfq_symmetric_scenario checks the weights of only
-	 * non-weight-raised queues, for efficiency reasons (see
-	 * comments on bfq_weights_tree_add()). Then the fact that
-	 * bfqq is weight-raised is checked explicitly here. More
-	 * precisely, the compound condition below takes into account
-	 * also the fact that, even if bfqq is being weight-raised,
-	 * the scenario is still symmetric if all queues with requests
-	 * waiting for completion happen to be
-	 * weight-raised. Actually, we should be even more precise
-	 * here, and differentiate between interactive weight raising
-	 * and soft real-time weight raising.
-	 *
-	 * As a side note, it is worth considering that the above
-	 * device-idling countermeasures may however fail in the
-	 * following unlucky scenario: if idling is (correctly)
-	 * disabled in a time period during which all symmetry
-	 * sub-conditions hold, and hence the device is allowed to
-	 * enqueue many requests, but at some later point in time some
-	 * sub-condition stops to hold, then it may become impossible
-	 * to let requests be served in the desired order until all
-	 * the requests already queued in the device have been served.
-	 */
-	bool asymmetric_scenario = (bfqq->wr_coeff > 1 &&
-				    bfqd->wr_busy_queues <
-				    bfq_tot_busy_queues(bfqd)) ||
+	return (bfqq->wr_coeff > 1 &&
+		bfqd->wr_busy_queues <
+		bfq_tot_busy_queues(bfqd)) ||
 		!bfq_symmetric_scenario(bfqd);
-	/*
-	 * Finally, there is a case where maximizing throughput is the
-	 * best choice even if it may cause unfairness toward
-	 * bfqq. Such a case is when bfqq became active in a burst of
-	 * queue activations. Queues that became active during a large
-	 * burst benefit only from throughput, as discussed in the
-	 * comments on bfq_handle_burst. Thus, if bfqq became active
-	 * in a burst and not idling the device maximizes throughput,
-	 * then the device must no be idled, because not idling the
-	 * device provides bfqq and all other queues in the burst with
-	 * maximum benefit. Combining this and the above case, we can
-	 * now establish when idling is actually needed to preserve
-	 * service guarantees.
-	 */
-	return asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq);
 }
 /*