3857 Commits

Author SHA1 Message Date
Davide Sapienza
f6c3ca0e58 block, bfq: prevent soft_rt_next_start from being stuck at infinity
BFQ can deem a bfq_queue as soft real-time only if the queue
- periodically becomes completely idle, i.e., empty and with
  no still-outstanding I/O request;
- after becoming idle, gets new I/O only after a special reference
  time soft_rt_next_start.

In this respect, after commit "block, bfq: consider also past I/O in
soft real-time detection", the value of soft_rt_next_start can never
decrease. This causes a problem with the following special updating
case for soft_rt_next_start: to prevent queues that are not completely
idle from being wrongly detected as soft real-time (when they become
non-empty again), soft_rt_next_start is temporarily set to infinity
for empty queues with still outstanding I/O requests. But, if such an
update is actually performed, then, because of the above commit,
soft_rt_next_start will be stuck at infinity forever, and the queue
will have no more chance to be considered soft real-time.

On slow systems, this problem does cause actual soft real-time
applications occasionally not to be detected as such.

This commit addresses this issue by eliminating the pushing of
soft_rt_next_start to infinity, and by changing the way non-empty
queues are prevented from being wrongly detected as soft
real-time. Simply, a queue that becomes non-empty again can now be
detected as soft real-time only if it has no outstanding I/O request.

Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:54:41 -06:00
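The detection rule described in f6c3ca0e58 can be sketched as follows. This is only an illustrative userspace model, not the BFQ code: the struct, its field names, and both helper functions are hypothetical, and time is modeled as a plain 64-bit counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the relevant bfq_queue state. */
struct queue_sketch {
	unsigned int outstanding;	/* dispatched, not yet completed requests */
	uint64_t soft_rt_next_start;	/* reference time for soft real-time detection */
};

/*
 * The reference time never decreases (as the commit message states), so
 * it must never be pushed to "infinity": it could not come back down.
 */
static void update_next_start(struct queue_sketch *q, uint64_t candidate)
{
	assert(q);
	if (candidate > q->soft_rt_next_start)
		q->soft_rt_next_start = candidate;
}

/*
 * A queue that becomes non-empty again may be deemed soft real-time only
 * if it has no outstanding I/O and the new I/O arrives no earlier than
 * the reference time.
 */
static bool may_be_soft_rt(const struct queue_sketch *q, uint64_t now)
{
	assert(q);
	return q->outstanding == 0 && now >= q->soft_rt_next_start;
}
```

In this model the outstanding-request check replaces the "infinity" marker, so a queue that later becomes completely idle keeps a finite reference time and can still qualify.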
Davide Sapienza
d450542e3c block, bfq: increase weight-raising duration for interactive apps
The maximum possible duration of the weight-raising period for
interactive applications is limited to 13 seconds, as this is the time
needed to load the largest application that we considered when tuning
weight raising. Unfortunately, in such an evaluation, we did not
consider the case of very slow virtual machines.

For example, on a QEMU/KVM virtual machine
- running in a slow PC;
- with a virtual disk stacked on a slow low-end 5400rpm HDD;
- serving a heavy I/O workload, such as the sequential reading of
several files;
mplayer takes 23 seconds to start, if constantly weight-raised.

To address this issue, this commit conservatively sets the upper limit
for weight-raising duration to 25 seconds.

Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:54:40 -06:00
Paolo Valente
e24f1c245f block, bfq: remove slow-system class
BFQ computes the duration of weight raising for interactive
applications automatically, using some reference parameters. In
particular, BFQ uses the best durations (see comments in the code for
how these durations have been assessed) for two classes of systems:
slow and fast ones. Examples of slow systems are old phones or systems
using micro HDDs. Fast systems are all the remaining ones. Using these
parameters, BFQ computes the actual duration of the weight raising,
for the system at hand, as a function of the relative speed of the
system w.r.t. the speed of a reference system, belonging to the same
class of systems as the system at hand.

This slow vs fast differentiation proved to be useful in the past, but
happens to have little meaning with current hardware. Even worse, it
does cause problems in virtual systems, where the speed of the system
can vary frequently, and so widely as to just confuse the class-detection
mechanism, and, as we have verified experimentally, to cause BFQ to
compute nonsensical weight-raising durations.

This commit addresses this issue by removing the slow class and the
class-detection mechanism.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:54:38 -06:00
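With a single reference system left after e24f1c245f, the duration computation can be pictured roughly as below. This is a sketch under assumptions, not the BFQ implementation: the inverse-proportional scaling, the function and parameter names, and the 3-second lower bound are all assumed for illustration; only the 25-second upper limit comes from the commit messages above.

```c
#include <assert.h>
#include <stdint.h>

#define WR_MIN_MS 3000u		/* assumed lower bound, hypothetical */
#define WR_MAX_MS 25000u	/* upper limit set by d450542e3c */

/*
 * Scale the reference duration by the speed of the reference system
 * relative to the measured peak rate of the system at hand (assumed to
 * be inversely proportional), then clamp the result.
 */
static uint64_t wr_duration_ms(uint64_t ref_duration_ms, uint64_t ref_rate,
			       uint64_t peak_rate)
{
	uint64_t dur;

	assert(peak_rate > 0);
	dur = ref_duration_ms * ref_rate / peak_rate;
	if (dur < WR_MIN_MS)
		return WR_MIN_MS;
	if (dur > WR_MAX_MS)
		return WR_MAX_MS;
	return dur;
}
```

For example, a system measured at half the reference rate would get twice the reference duration, as long as that stays within the clamp.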
Paolo Valente
4029eef1be block, bfq: add description of weight-raising heuristics
A description of how weight raising works is missing in BFQ
sources. In addition, the code for handling weight raising is
scattered across a few functions. This makes it rather hard to
understand the mechanism and its rationale. This commit adds such a
description at the beginning of the main source file.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:54:36 -06:00
Filippo Muzzini
ac857e0d54 block, bfq: remove the removal of 'next' rq in bfq_requests_merged
Since bfq_finish_request() is always called on the request 'next',
after bfq_requests_merged() is finished, and bfq_finish_request()
removes 'next' from its bfq_queue if needed, it isn't necessary to do
such a removal in advance in bfq_requests_merged().

This commit removes such a useless 'next' removal.

Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:48:32 -06:00
Paolo Valente
8abfa4d6fd block, bfq: remove wrong check in bfq_requests_merged
The request rq passed to the function bfq_requests_merged is always in
a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the
beginning of bfq_requests_merged always succeeds, and the control
flow systematically skips to the end of the function.  This implies
that the body of the function is never executed, i.e., the
repositioning of rq is never performed.

Conversely, a check is missing in the body of the function:
'next' must be removed only if it is inside a bfq_queue.

This commit removes the wrong check on rq, and adds the missing check
on 'next'. In addition, this commit adds comments on
bfq_requests_merged.

Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:48:05 -06:00
Filippo Muzzini
a12bffebc0 block, bfq: remove wrong lock in bfq_requests_merged
In bfq_requests_merged(), there is a deadlock because the lock on
bfqq->bfqd->lock is held by the calling function, but the code of
this function tries to grab the lock again.

This deadlock is currently hidden by another bug (fixed by the next commit
for this source file), which causes the body of bfq_requests_merged()
never to be executed.

This commit removes the deadlock by removing the lock/unlock pair.

Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31 08:42:27 -06:00
Jens Axboe
04c4950d5b block: fixup bioset_integrity_create() call
Missed converting the bioset_integrity_create() bounce bio set
call.

Fixes: 338aa96d56 ("block: convert bounce, q->bio_split to bioset_init()/mempool_init()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 18:51:21 -06:00
Kent Overstreet
dad0852752 block: Drop bioset_create()
All users have been converted to bioset_init(), kill off the
old API.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Kent Overstreet
338aa96d56 block: convert bounce, q->bio_split to bioset_init()/mempool_init()
Convert the core block functionality to embedded bio sets.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 15:33:32 -06:00
Chengguang Xu
0b6bad7d66 blk-throttle: return proper bool type to caller instead of 0/1
Change to return true/false only for bool type return code.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 12:48:22 -06:00
Christoph Hellwig
d250bf4e77 blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter
We already check for started commands in all callbacks, but we should
also protect against already completed commands.  Do this by moving
the checks into common code.

Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 11:31:34 -06:00
Liu Bo
2ab74cd296 blk-throttle: fix potential NULL pointer dereference in throtl_select_dispatch
In throtl_select_dispatch, tg is used first and only checked afterwards.
Since tg may be NULL, this is a potential NULL pointer dereference. So fix
it.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 10:54:33 -06:00
Jianchao Wang
a6088845c2 block: kyber: make kyber more friendly with merging
Currently, kyber is very unfriendly to merging. kyber depends
on the ctx rq_list to do merging; however, most of the time, it does not
leave any requests in the ctx rq_list. This is because even if the tokens
of one domain are used up, kyber will try to dispatch requests
from other domains and flush the rq_list there.

To improve this, we set up kyber_ctx_queue (kcq), which is similar
to ctx but has an rq_list per domain, and build the same
mapping between kcq and khd as between ctx and hctx. Then we can merge,
insert and dispatch for different domains separately. At the same
time, we only flush the rq_list of a kcq when a domain token is obtained
successfully. Then, if the tokens of one domain are used up, the requests
can be left in the rq_list of that domain and may be merged with following io.

Following is my test result on a machine with 8 cores and an NVMe card,
INTEL SSDPEKKR128G7:

fio size=256m ioengine=libaio iodepth=64 direct=1 numjobs=8
(each cell reports seq/random)
+--------+----------+-----------+------------+-----------------+---------+
| patch? | bw(MB/s) |   iops    | slat(usec) |   clat(usec)    |  merge  |
+--------+----------+-----------+------------+-----------------+---------+
| w/o    |  606/612 | 151k/153k |  6.89/7.03 | 3349.21/3305.40 |   0/0   |
+--------+----------+-----------+------------+-----------------+---------+
| w/     | 1083/616 | 277k/154k |  4.93/6.95 | 1830.62/3279.95 | 223k/3k |
+--------+----------+-----------+------------+-----------------+---------+
With numjobs set to 16, the bw and iops could reach 1662MB/s and 425k
on my platform.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 10:47:40 -06:00
Jens Axboe
9c55873464 blk-mq: abstract out blk-mq-sched rq list iteration bio merge helper
No functional changes in this patch, just a prep patch for utilizing
this in an IO scheduler.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2018-05-30 10:43:58 -06:00
Christoph Hellwig
5de815a7ee block: remove parent device reference from struct bsg_class_device
Bsg holding a reference to the parent device may result in a crash if a
bsg file handle is closed after the parent device driver has unloaded.

Holding a reference is not really needed: the parent device must exist
between bsg_register_queue and bsg_unregister_queue.  Before the device
goes away the caller does blk_cleanup_queue so that all in-flight
requests to the device are gone and all new requests cannot pass beyond
the queue.  The queue itself is a refcounted object and it will stay
alive with a bsg file.

Based on analysis, previous patch and changelog from Anatoliy Glagolev.

Reported-by: Anatoliy Glagolev <glagolig@gmail.com>
Reviewed-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 13:00:25 -06:00
Christoph Hellwig
5afb78356c block: don't print a message when the device went away
The information about a size change in this case just creates confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
d1210d5afb blk-mq: simplify blk_mq_rq_timed_out
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
f6e7d48a78 block: remove BLK_EH_HANDLED
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Christoph Hellwig
6600593cbd block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE
The name BLK_EH_NOT_HANDLED implies that nothing happened, but very often
that is not the case: the driver has already completed the
command.  Fix the symbolic name to reflect that a little better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
Keith Busch
12f5b93145 blk-mq: Remove generation sequence
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional synchronization with the block layer.

This also removes any need for generation sequence numbers, since the
request cannot be reallocated as a new sequence
while timeout handling is operating on it.

To enable this, a refcount is added to struct request so that request
users can be sure they're operating on the same request without it
changing while they're processing it.  The request's tag won't be
released for reuse until both the timeout handler and the completion
are done with it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg
 for completions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:59:21 -06:00
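The reference-counting idea in 12f5b93145 can be modeled in userspace with C11 atomics. This is a hypothetical sketch, not the blk-mq code: the struct, the helper names, and the tag_released flag are invented for illustration. It only shows that the tag becomes reusable after the last holder drops its reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant parts of struct request. */
struct request_sketch {
	atomic_int ref;
	bool tag_released;	/* set once the tag may be reused */
};

static void request_init(struct request_sketch *rq)
{
	atomic_init(&rq->ref, 1);	/* reference owned by the normal I/O path */
	rq->tag_released = false;
}

/* Timeout side: take a reference only if the request is still alive. */
static bool request_get(struct request_sketch *rq)
{
	int old = atomic_load(&rq->ref);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&rq->ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last one to go releases the tag. */
static void request_put(struct request_sketch *rq)
{
	int old = atomic_fetch_sub(&rq->ref, 1);

	assert(old > 0);
	if (old == 1)
		rq->tag_released = true;
}
```

A timeout handler that wins request_get() can inspect the request while the driver completes it concurrently; whichever side calls request_put() last releases the tag.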
Keith Busch
ad103e7983 blk-mq: Fix timeout and state order
The block layer had been setting the state to in-flight prior to updating
the timer. This is the wrong order since the timeout handler could observe
the in-flight state with the older timeout, believing the request had
expired when in fact it is just getting started.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-29 08:47:40 -06:00
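The ordering fixed by ad103e7983 can be illustrated with a small userspace model. Everything here is an assumption made for the sketch (the struct, the state values, the function names, and the use of C11 release/acquire ordering); it is not the blk-mq code, and it ignores timer wraparound.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { RQ_IDLE = 0, RQ_IN_FLIGHT = 1 };	/* hypothetical state values */

struct rq_sketch {
	_Atomic unsigned long deadline;
	atomic_int state;
};

/*
 * Publish the new deadline first, then the in-flight state. An observer
 * that sees RQ_IN_FLIGHT (acquire) is guaranteed to see the deadline
 * written before it (release), never the stale one.
 */
static void rq_start(struct rq_sketch *rq, unsigned long now,
		     unsigned long timeout)
{
	assert(rq);
	atomic_store_explicit(&rq->deadline, now + timeout,
			      memory_order_relaxed);
	atomic_store_explicit(&rq->state, RQ_IN_FLIGHT, memory_order_release);
}

/* Timeout side: only in-flight requests can be considered expired. */
static bool rq_expired(struct rq_sketch *rq, unsigned long now)
{
	if (atomic_load_explicit(&rq->state, memory_order_acquire) !=
	    RQ_IN_FLIGHT)
		return false;
	return now >= atomic_load_explicit(&rq->deadline,
					   memory_order_relaxed);
}
```

With the two stores in the opposite order, rq_expired() could observe the in-flight state together with an old deadline and report a freshly started request as timed out.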
Joe Perches
5657a819a8 block drivers/block: Use octal not symbolic permissions
Convert the S_<FOO> symbolic permissions to their octal equivalents as
using octal and not symbolic permissions is preferred by many as more
readable.

see: https://lkml.org/lkml/2016/8/2/1945

Done with automated conversion via:
$ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...>

Miscellanea:

o Wrapped modified multi-line calls to a single line where appropriate
o Realign modified multi-line calls to open parenthesis

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-24 13:38:59 -06:00
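To make the conversion in 5657a819a8 concrete, the userspace sketch below checks that a symbolic combination equals its octal form. SKETCH_S_IRUGO is a local stand-in for the kernel-only S_IRUGO macro, and the function name is hypothetical; the S_IRUSR-style constants come from POSIX <sys/stat.h>.

```c
#include <assert.h>
#include <sys/stat.h>

/* Local stand-in for the kernel's S_IRUGO (read for user, group, other). */
#define SKETCH_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/*
 * Symbolic form of "readable by everyone, writable by the owner".
 * The octal equivalent, 0644, states the same thing in one token.
 */
static unsigned int mode_read_all_write_owner(void)
{
	unsigned int mode = SKETCH_S_IRUGO | S_IWUSR;

	assert(mode == 0644);
	return mode;
}
```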
Ming Lei
e6fc464987 blk-mq: avoid starving tag allocation after allocating process migrates
When the allocating process is scheduled back and its mapped hw queue has
changed, fake one extra wakeup on the previous queue to compensate for the
missed wakeup, so that other allocations on the previous queue won't be starved.

This patch fixes a request allocation hang, which can be
triggered easily with a very low nr_requests.

The race is as follows:

1) there are 2 hw queues, nr_requests is 2, and wake_batch is one

2) there are 3 waiters on hw queue 0

3) the two in-flight requests in hw queue 0 are completed, and only two
   of the 3 waiters are woken up because of wake_batch, but both of these
   waiters can be scheduled to another CPU and so switch to hw
   queue 1

4) then the 3rd waiter will wait forever, since no in-flight request
   is in hw queue 0 any more.

5) this patch fixes it with a fake wakeup when a waiter is scheduled to
   another hw queue

Cc: <stable@vger.kernel.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Modified commit message to make it clearer, and make it apply on
top of the 4.18 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-24 11:00:39 -06:00
Bart Van Assche
327ea4adcf blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffers
Avoid complaints such as the following in the kernel log
when the number of zones is sufficiently large:

  fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
  Call Trace:
  dump_stack+0x63/0x88
  warn_alloc+0xf5/0x190
  __alloc_pages_slowpath+0x8f0/0xb0d
  __alloc_pages_nodemask+0x242/0x260
  alloc_pages_current+0x6a/0xb0
  kmalloc_order+0x18/0x50
  kmalloc_order_trace+0x26/0xb0
  __kmalloc+0x20e/0x220
  blkdev_report_zones_ioctl+0xa5/0x1a0
  blkdev_ioctl+0x1ba/0x930
  block_ioctl+0x41/0x50
  do_vfs_ioctl+0xaa/0x610
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x79/0x1b0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 3ed05a987e ("blk-zoned: implement ioctls")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Shaun Tancheff <shaun.tancheff@seagate.com>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-22 11:58:07 -06:00
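The "order:9" in the trace quoted in 327ea4adcf refers to a request for 2^9 physically contiguous pages. The sketch below shows that arithmetic in userspace; the 4 KiB page size and the function name are assumptions for illustration, and this is not the kernel's own helper.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/*
 * Smallest order such that (1 << order) pages cover 'bytes'. A physically
 * contiguous allocation of this order becomes harder to satisfy as memory
 * fragments, which is why a virtually contiguous buffer suits large
 * zone reports better.
 */
static unsigned int alloc_order(size_t bytes)
{
	size_t pages;
	unsigned int order = 0;

	assert(bytes > 0);
	pages = (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```

Under the 4 KiB assumption, any buffer larger than 1 MiB and up to 2 MiB lands on order 9, i.e. 512 contiguous pages.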
huhai
b4f6f38d9f blk-mq: remove wrong 'unlikely' check
When dispatch_rq_from_ctx is called, in the vast majority of cases
the ctx->rq_list is not empty.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-22 08:38:04 -06:00
huhai
d416c92c5d blk-mq: clear hctx->dispatch_from when mappings change
When the number of hardware queues is changed, the drivers will call
blk_mq_update_nr_hw_queues() to remap hardware queues. This changes
the ctx mappings, but the current code doesn't clear the
->dispatch_from hint. This can result in dispatch_from pointing to
a ctx that isn't mapped to the hctx anymore.

Fixes: b347689ffb ("blk-mq-sched: improve dispatching from sw queue")
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>

Moved the placement of the clearing to where we clear other items
pertaining to the existing mapping, added Fixes line, and reworded
the commit message.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-18 08:35:38 -06:00
huhai
8fa9f55645 blk-mq: remove redundant insert case in blk_mq_make_request()
We can use blk_mq_sched_insert_request() even if we don't have
an IO scheduler attached, since that case will end up being
exactly the same as what blk_mq_queue_io() does now.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-16 08:22:47 -06:00
Kent Overstreet
6fcefbe578 block: Add sysfs entry for fua support
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:17 -06:00
Kent Overstreet
1900fcc461 block: Export bio check/set pages_dirty
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:15 -06:00
Kent Overstreet
0ba99ca483 block: Add warning for bi_next not NULL in bio_endio()
Recently found a bug where a driver left bi_next not NULL and then
called bio_endio(), and then the submitter of the bio used
bio_copy_data() which was treating src and dst as lists of bios.

Fixed that bug by splitting out bio_list_copy_data(), but in case other
things are depending on bi_next in weird ways, add a warning to help
avoid more bugs like that in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:13 -06:00
Kent Overstreet
6e6e811d74 block: Add missing flush_dcache_page() call
Since a bio can point to userspace pages (e.g. direct IO), this is
generally necessary.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:11 -06:00
Kent Overstreet
45db54d58d block: Split out bio_list_copy_data()
Found a bug (with ASAN) where we were passing a bio to bio_copy_data()
with bi_next not NULL, when it should have been - a driver had left
bi_next set to something after calling bio_endio().

Since the normal case is only copying single bios, split out
bio_list_copy_data() to avoid more bugs like this in the future.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:10 -06:00
Kent Overstreet
38a72dac48 block: Add bio_copy_data_iter(), zero_fill_bio_iter()
Add versions that take bvec_iter args instead of using bio->bi_iter - to
be used by bcachefs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:08 -06:00
Kent Overstreet
f4f8154a08 block: Use bioset_init() for fs_bio_set
Minor optimization - remove a pointer indirection when using fs_bio_set.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:06 -06:00
Kent Overstreet
917a38c71a block: Add bioset_init()/bioset_exit()
Similarly to mempool_init()/mempool_exit(), take a pointer indirection
out of allocation/freeing by allowing biosets to be embedded in other
structs.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:04 -06:00
Kent Overstreet
8aa6ba2f6e block: Convert bio_set to mempool_init()
Minor performance improvement by getting rid of pointer indirections
from allocation/freeing fastpaths.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 13:16:03 -06:00
Christoph Hellwig
0eb0b63c1d block: consistently use GFP_NOIO instead of __GFP_NORECLAIM
Same numerical value (for now at least), but a much better documentation
of intent.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:18 -06:00
Christoph Hellwig
c3036021c7 block: use GFP_NOIO instead of __GFP_DIRECT_RECLAIM
We just can't do I/O when doing block layer requests allocations,
so use GFP_NOIO instead of the even more limited __GFP_DIRECT_RECLAIM.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:16 -06:00
Christoph Hellwig
4accf5fc79 block: pass an explicit gfp_t to get_request
blk_old_get_request already has it at hand, and in blk_queue_bio, which
is the fast path, it is constant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:14 -06:00
Christoph Hellwig
ff005a0662 block: sanitize blk_get_request calling conventions
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:12 -06:00
Christoph Hellwig
a9a14d3671 block: fix __get_request documentation
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-14 08:55:11 -06:00
Jens Axboe
2882064076 kyber-iosched: update shallow depth when setting up hardware queue
We don't expect the async depth to be smaller than the wake batch
count for sbitmap, but just in case, inform sbitmap of what shallow
depth kyber may use.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:46 -06:00
Jens Axboe
483b7bf2e4 bfq-iosched: update shallow depth to smallest one used
If our shallow depth is smaller than the wake batching of sbitmap,
we can introduce hangs. Ensure that sbitmap knows how low we'll go.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:41 -06:00
Jens Axboe
bd7d4ef6a4 bfq-iosched: remove unused variable
bfqd->sb_shift was an attempt at caching the sbitmap queue
shift, but we don't need it, as the shift never changes. Kill it with fire.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:31 -06:00
Jens Axboe
f0635b8a41 bfq: calculate shallow depths at init time
It doesn't change, so don't put it in the per-IO hot path.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:29 -06:00
Jens Axboe
55141366de bfq-iosched: don't worry about reserved tags in limit_depth
Reserved tags are used for error handling, we don't need to
care about them for regular IO. The core won't call us for these
anyway.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:17 -06:00
Jens Axboe
17a5119932 blk-mq: don't call into depth limiting for reserved tags
It's not useful, they are internal and/or error handling recovery
commands.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 11:27:12 -06:00
Paolo Valente
18e5a57d79 block, bfq: postpone rq preparation to insert or merge
When invoked for an I/O request rq, the prepare_request hook of bfq
increments reference counters in the destination bfq_queue for rq. In
this respect, after this hook has been invoked, rq may still be
transformed into a request with no icq attached, i.e., for bfq, a
request not associated with any bfq_queue. No further hook is invoked
to signal this transformation to bfq (in general, to the destination
elevator for rq). This leads bfq into an inconsistent state, because
bfq has no chance to correctly lower these counters back. This
inconsistency may in its turn cause incorrect scheduling and hangs. It
certainly causes memory leaks, by making it impossible for bfq to free
the involved bfq_queue.

On the bright side, no transformation can still happen for rq after rq
has been inserted into bfq, or merged with another, already inserted,
request. Exploiting this fact, this commit addresses the above issue
by delaying the preparation of an I/O request to when the request is
inserted or merged.

This change also gives a performance bonus: a lock-contention point
gets removed. To prepare a request, bfq needs to hold its scheduler
lock. After postponing request preparation to insertion or merging, no
lock needs to be grabbed any longer in the prepare_request hook, while
the lock already taken to perform insertion or merging is used to
prepare the request as well.

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Tested-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10 10:16:29 -06:00
Omar Sandoval
522a777566 block: consolidate struct request timestamp fields
Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
  used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09 08:33:09 -06:00
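The saving described in 522a777566 can be pictured by comparing two structs. The layouts and field names below are illustrative only, not the real struct request, and the exact difference depends on the platform's type sizes and alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: four timestamps (illustrative names). */
struct rq_times_before {
	unsigned long start_time;		/* jiffies, for iostats */
	uint64_t io_start_time_ns;		/* for blk-stats */
	uint64_t sched_start_time_ns;		/* scheduler copy (cfq/bfq) */
	uint64_t sched_io_start_time_ns;	/* scheduler copy (cfq/bfq) */
};

/* After: one start time and one I/O start time, both in nanoseconds. */
struct rq_times_after {
	uint64_t start_time_ns;
	uint64_t io_start_time_ns;
};

/* Bytes saved per request by the consolidation in this model. */
static size_t bytes_saved(void)
{
	assert(sizeof(struct rq_times_before) > sizeof(struct rq_times_after));
	return sizeof(struct rq_times_before) - sizeof(struct rq_times_after);
}
```

On a typical 64-bit build, where unsigned long is 8 bytes, this model saves 16 bytes, matching the "up to 16 bytes" in the commit message.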