linux/block
Jan Kara 142bbdfccc cfq: Disable writeback throttling by default
Writeback throttling does not play well with CFQ since CFQ also tries
to throttle async writes. As a result, async writeback can get starved in
the presence of readers. As an example, take a benchmark simulating a
PostgreSQL database running over a standard rotating SATA drive. There
are 16 processes doing random reads from a huge file (2*machine memory),
1 process doing random writes to the huge file and calling fsync once
per 50000 writes and 1 process doing sequential 8k writes to a
relatively small file wrapping around at the end of the file and calling
fsync every 5 writes. Under this load read latency easily exceeds the
target latency of 75 ms (just because there are so many reads happening
against a relatively slow disk) and thus writeback is throttled to a
point where only 1 write request is allowed at a time. Blktrace data
then looks like:

  8,0    1        0     8.347751764     0  m   N cfq workload slice:40000000
  8,0    1        0     8.347755256     0  m   N cfq293A  / set_active wl_class: 0 wl_type:0
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1     3814     8.347763916  5839 UT   N [kworker/u9:2] 1
  8,0    0        0     8.347777605     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    1        0     8.347784100     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3     1596     8.354364057     0  C   R 156109528 + 8 (6906954) [0]
  8,0    3        0     8.354383193     0  m   N cfq6196SN / complete rqnoidle 0
  8,0    3        0     8.354386476     0  m   N cfq schedule dispatch
  8,0    3        0     8.354399397     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354404705     0  m   N cfq293A  / dispatch_insert
  8,0    3        0     8.354409454     0  m   N cfq293A  / dispatched a request
  8,0    3        0     8.354412527     0  m   N cfq293A  / activate rq, drv=1
  8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
  8,0    3        0     8.354484184     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    3        0     8.354487536     0  m   N cfq293A  / slice expired t=0
  8,0    3        0     8.354498013     0  m   N / served: vt=5888102466265088 min_vt=5888074869387264
  8,0    3        0     8.354502692     0  m   N cfq293A  / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
  8,0    3        0     8.354505695     0  m   N cfq293A  / del_from_rr
...
  8,0    0     1810     8.354728768     0  C   W 145961400 + 24 (314076) [0]
  8,0    0        0     8.354746927     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
  8,0    1     3830     8.389888127  5839  P   N [kworker/u9:2]
  8,0    1     3831     8.389908102  5839  A   W 145978336 + 24 <- (8,4) 44000
  8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
  8,0    1     3833     8.389914248  5839  I   W 145962968 + 24 (28146) [kworker/u9:2]
  8,0    1        0     8.389919137     0  m   N cfq293A  / insert_request
  8,0    1        0     8.389924305     0  m   N cfq293A  / add_to_rr
  8,0    1     3834     8.389933175  5839 UT   N [kworker/u9:2] 1
...
  8,0    0        0     9.455290997     0  m   N cfq workload slice:40000000
  8,0    0        0     9.455294769     0  m   N cfq293A  / set_active wl_class:0 wl_type:0
  8,0    0        0     9.455303499     0  m   N cfq293A  / fifo=ffff880003166090
  8,0    0        0     9.455306851     0  m   N cfq293A  / dispatch_insert
  8,0    0        0     9.455311251     0  m   N cfq293A  / dispatched a request
  8,0    0        0     9.455314324     0  m   N cfq293A  / activate rq, drv=1
  8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
  8,0    0        0     9.455392407     0  m   N cfq293A  / Not idling.  st->count:1
  8,0    0        0     9.455395969     0  m   N cfq293A  / slice expired t=0
  8,0    0        0     9.455404210     0  m   N / served: vt=5888958194597888 min_vt=5888941810597888
  8,0    0        0     9.455410077     0  m   N cfq293A  / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
  8,0    0        0     9.455416851     0  m   N cfq293A  / del_from_rr
...
  8,0    0     2045     9.455648515     0  C   W 145962968 + 24 (332305) [0]
  8,0    0        0     9.455668350     0  m   N cfq293A  / complete rqnoidle 0
...
  8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
  8,0    1     4372     9.455712350  5839  P   N [kworker/u9:2]
  8,0    1     4373     9.455730159  5839  A   W 145986616 + 24 <- (8,4) 52280
  8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
  8,0    1     4375     9.455737563  5839  I   W 145978336 + 24 (27448) [kworker/u9:2]
  8,0    1        0     9.455742871     0  m   N cfq293A  / insert_request
  8,0    1        0     9.455747550     0  m   N cfq293A  / add_to_rr
  8,0    1     4376     9.455756629  5839 UT   N [kworker/u9:2] 1

So we can see a Q event for a write request, then IO is blocked by
writeback throttling and G and I events for the request happen only once
other writeback IO is completed. Thus CFQ always sees only one write
request. When it sees it, it queues the async queue behind all the read
queues and the async queue gets scheduled after about one second. When
it is scheduled, that one request gets dispatched and the async queue
expires as it has no more requests to submit. Overall we submit about
one write request per second.
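
The roughly one-second cadence can be checked against the trace excerpt
itself. The following is a minimal sketch, not part of the patch: it
assumes the default blkparse text layout shown above (device, cpu,
sequence, timestamp, pid, action, rwbs, sector) and uses only six lines
copied from the excerpt. The helper names parse() and write_stats() are
made up for illustration.

```python
# Write events copied verbatim from the blktrace excerpt above.
TRACE = """\
  8,0    3     1597     8.354414692     0  D   W 145961400 + 24 (6718452) [swapper/0]
  8,0    1     3829     8.389886102  5839  G   W 145962968 + 24 [kworker/u9:2]
  8,0    1     3832     8.389910477  5839  Q   W 145978336 + 24 [kworker/u9:2]
  8,0    0     2043     9.455316210  6204  D   W 145962968 + 24 (1065401962) [pgioperf]
  8,0    1     4371     9.455710115  5839  G   W 145978336 + 24 [kworker/u9:2]
  8,0    1     4374     9.455732674  5839  Q   W 145986616 + 24 [kworker/u9:2]
"""


def parse(trace):
    """Yield (timestamp, action, sector) for write events that carry a sector."""
    for line in trace.splitlines():
        fields = line.split()
        # Skip message lines and anything that is not a plain write request.
        if len(fields) < 8 or fields[6] != "W" or not fields[7].isdigit():
            continue
        yield float(fields[3]), fields[5], int(fields[7])


def write_stats(trace):
    """Return (Q-to-G delays, gaps between consecutive D events) in seconds."""
    queued = {}          # sector -> time of the Q event
    throttle_waits = []  # time a queued bio waited before getting a request (G)
    dispatches = []      # times of D events
    for t, action, sector in parse(trace):
        if action == "Q":
            queued[sector] = t
        elif action == "G" and sector in queued:
            throttle_waits.append(t - queued.pop(sector))
        elif action == "D":
            dispatches.append(t)
    gaps = [later - earlier for earlier, later in zip(dispatches, dispatches[1:])]
    return throttle_waits, gaps


waits, gaps = write_stats(TRACE)
print("Q->G waits (s):", [round(w, 3) for w in waits])
print("gaps between write dispatches (s):", [round(g, 3) for g in gaps])
```

On this excerpt the write queued at sector 145978336 waits about 1.07 s
between its Q and G events, and the two write dispatches are about 1.10 s
apart, which is consistent with the one-write-per-second estimate.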

Although this scheduling is beneficial for read latency, writes are
heavily starved and this causes large delays all over the system (due to
processes blocking on page lock, transaction starts, etc.). When
writeback throttling is disabled, write throughput is about one fifth of
the read throughput, which roughly matches the readers/writers ratio, and
overall the system stalls are much shorter.

Mixing writeback throttling logic with CFQ throttling logic is always a
recipe for surprises, as CFQ assumes it sees most of the picture, which
is not necessarily true when writeback throttling is blocking requests.
So disable writeback throttling logic by default when CFQ is used as an
IO scheduler.
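
On a kernel without this change, a similar effect can be approximated
from userspace. The sketch below is an illustration under assumptions,
not what the patch does internally: it relies on the documented
queue/scheduler attribute (active scheduler shown in brackets) and the
queue/wbt_lat_usec attribute (writing 0 disables writeback throttling).
The function names and the sysfs_root parameter are hypothetical, the
latter added only so the helper can be tried against a scratch
directory.

```python
import os
import re


def active_scheduler(queue_dir):
    """Return the scheduler name shown in brackets in queue/scheduler."""
    with open(os.path.join(queue_dir, "scheduler")) as f:
        match = re.search(r"\[([^\]]+)\]", f.read())
    return match.group(1) if match else None


def disable_wbt_if_cfq(dev, sysfs_root="/sys/block"):
    """Write 0 to wbt_lat_usec for `dev` when CFQ is the active scheduler.

    Returns True if the attribute was written, False otherwise.
    Assumes the kernel exposes queue/wbt_lat_usec for this device.
    """
    queue_dir = os.path.join(sysfs_root, dev, "queue")
    if active_scheduler(queue_dir) != "cfq":
        return False
    with open(os.path.join(queue_dir, "wbt_lat_usec"), "w") as f:
        f.write("0\n")
    return True
```

Calling disable_wbt_if_cfq("sda") as root would act on the real device;
the device name is only an example.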

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 08:15:08 -06:00
partitions partitions/efi: Fix integer overflow in GPT size calculation 2017-01-17 09:02:31 -07:00
badblocks.c badblocks: badblocks_set/clear update unacked_exist 2016-10-21 15:45:47 -06:00
bio-integrity.c block: remove bio_is_rw 2016-10-28 08:45:17 -06:00
bio.c blk-throttle: add a simple idle detection 2017-03-28 08:02:20 -06:00
blk-cgroup.c blkcg: allocate struct blkcg_gq outside request queue spinlock 2017-03-29 11:27:19 -06:00
blk-core.c block: fix inheriting request priority from bio 2017-04-04 15:39:47 -06:00
blk-exec.c block: introduce blk_rq_is_passthrough 2017-01-31 14:00:34 -07:00
blk-flush.c block: remove outdated part of blkdev_issue_flush() comment 2017-03-24 15:41:30 -06:00
blk-integrity.c block: constify struct blk_integrity_profile 2017-03-24 20:34:39 -06:00
blk-ioc.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2017-03-03 10:53:35 -08:00
blk-lib.c block: correct documentation for blkdev_issue_discard() flags 2017-03-24 15:41:28 -06:00
blk-map.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h> 2017-03-02 08:42:36 +01:00
blk-merge.c block: optionally merge discontiguous discard bios into a single request 2017-02-08 13:43:08 -07:00
blk-mq-cpumap.c blk-mq: export blk_mq_map_queues 2016-11-08 17:30:00 -05:00
blk-mq-debugfs.c blk-stat: convert to callback-based statistics reporting 2017-03-21 10:03:11 -06:00
blk-mq-pci.c blk-mq-pci: Fix two spelling mistakes 2017-03-29 11:09:51 -06:00
blk-mq-sched.c blk-mq: move update of tags->rqs to __blk_mq_alloc_request() 2017-03-02 08:56:04 -07:00
blk-mq-sched.h blk-mq-sched: separate mark hctx and queue restart operations 2017-02-23 11:55:47 -07:00
blk-mq-sysfs.c blk-mq: free hctx->cpumask in release handler of hctx's kobject 2017-03-08 09:56:12 -07:00
blk-mq-tag.c blk-mq: Fix tagset reinit in the presence of cpu hot-unplug 2017-03-13 08:14:23 -06:00
blk-mq-tag.h blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset 2017-03-02 08:56:04 -07:00
blk-mq-virtio.c blk-mq: provide a default queue mapping for virtio device 2017-02-27 20:54:05 +02:00
blk-mq.c blk-mq: fix schedule-under-preempt for blocking drivers 2017-03-30 12:30:39 -06:00
blk-mq.h blk-stat: convert to callback-based statistics reporting 2017-03-21 10:03:11 -06:00
blk-settings.c block: optionally merge discontiguous discard bios into a single request 2017-02-08 13:43:08 -07:00
blk-softirq.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h> 2017-03-02 08:42:26 +01:00
blk-stat.c blk-throttle: add a mechanism to estimate IO latency 2017-03-28 08:02:20 -06:00
blk-stat.h blk-throttle: add a mechanism to estimate IO latency 2017-03-28 08:02:20 -06:00
blk-sysfs.c block: fix leak of q->rq_wb 2017-03-29 08:09:08 -06:00
blk-tag.c blk-mq-sched: add framework for MQ capable IO schedulers 2017-01-17 10:04:20 -07:00
blk-throttle.c blk-throttle: add latency target support 2017-03-28 08:02:20 -06:00
blk-timeout.c block: remove REQ_NO_TIMEOUT flag 2015-12-22 09:38:34 -07:00
blk-wbt.c blk-stat: convert to callback-based statistics reporting 2017-03-21 10:03:11 -06:00
blk-wbt.h block: track request size in blk_issue_stat 2017-03-28 08:02:20 -06:00
blk-zoned.c block: Rename blk_queue_zone_size and bdev_zone_size 2017-01-12 07:58:32 -07:00
blk.h blk-throttle: add a mechanism to estimate IO latency 2017-03-28 08:02:20 -06:00
bounce.c Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2015-09-19 18:57:09 -07:00
bsg-lib.c block: split scsi_request out of struct request 2017-01-27 15:08:35 -07:00
bsg.c lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
cfq-iosched.c cfq: Disable writeback throttling by default 2017-04-05 08:15:08 -06:00
cmdline-parser.c block: remove unrelated header files and export symbol 2014-01-21 20:18:26 -08:00
compat_ioctl.c block: Get rid of blk_get_backing_dev_info() 2017-02-02 08:21:32 -07:00
deadline-iosched.c block: enumify ELEVATOR_*_MERGE 2017-02-08 13:43:06 -07:00
elevator.c block: don't call ioc_exit_icq() with the queue lock held for blk-mq 2017-03-02 13:59:08 -07:00
genhd.c block: Fix oops scsi_disk_get() 2017-03-22 20:11:37 -06:00
ioctl.c block: Get rid of blk_get_backing_dev_info() 2017-02-02 08:21:32 -07:00
ioprio.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
Kconfig blk-throttle: add configure option for new .low interface 2017-03-28 08:02:20 -06:00
Kconfig.iosched block: get rid of blk-mq default scheduler choice Kconfig entries 2017-02-22 13:19:45 -07:00
Makefile virtio, vhost: optimizations, fixes 2017-03-02 13:53:13 -08:00
mq-deadline.c block: enumify ELEVATOR_*_MERGE 2017-02-08 13:43:06 -07:00
noop-iosched.c block: move existing elevator ops to union 2017-01-17 10:03:33 -07:00
opal_proto.h block/sed-opal: allocate struct opal_dev dynamically 2017-02-17 12:41:47 -07:00
partition-generic.c block: Rename blk_queue_zone_size and bdev_zone_size 2017-01-12 07:58:32 -07:00
scsi_ioctl.c block: fold cmd_type into the REQ_OP_ space 2017-01-31 14:00:44 -07:00
sed-opal.c block/sed-opal: fix spelling mistake: "Lifcycle" -> "Lifecycle" 2017-03-30 09:22:53 -06:00
t10-pi.c block: constify struct blk_integrity_profile 2017-03-24 20:34:39 -06:00