Many users have reported the lack of an HOWTO for properly configuring
bfq as a function of the goal one wants to achieve (max
responsiveness, max throughput, ...). In fact, all needed details are
already provided in the documentation file bfq-iosched.txt. Yet the
document lacks guidance on which parameter descriptions to look
at. This commit adds some simple direction.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Jeremy Hickman <jeremywh7@gmail.com>
Reviewed-by: Laurentiu Nicola <lnicola@dend.ro>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In addition to containing some typos and stale sentences, the file
bfq-iosched.txt still mentioned a set of sysfs parameters that have
been removed from this version of bfq. This commit fixes all these
issues.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Jeremy Hickman <jeremywh7@gmail.com>
Reviewed-by: Laurentiu Nicola <lnicola@dend.ro>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently all integrity prep hooks are open-coded, and if prepare fails
we ignore it's code and fail bio with EIO. Let's return real error to
upper layer, so later caller may react accordingly.
In fact no one want to use bio_integrity_prep() w/o bio_integrity_enabled,
so it is reasonable to fold it in to one function.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[hch: merged with the latest block tree,
return bool from bio_integrity_prep]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().
So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The introduction of the BFQ and Kyber I/O schedulers has triggered a
new wave of I/O benchmarks. Unfortunately, comments and discussions on
these benchmarks confirm that there is still little awareness that it
is very hard to achieve, at the same time, a low latency and a high
throughput. In particular, virtually all benchmarks measure
throughput, or throughput-related figures of merit, but, for BFQ, they
use the scheduler in its default configuration. This configuration is
geared, instead, toward a low latency. This is evidently a sign that
BFQ documentation is still too unclear on this important aspect. This
commit addresses this issue by stressing how BFQ configuration must be
(easily) changed if the only goal is maximum throughput.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch introduces a simple heuristic to load applications quickly,
and to perform the I/O requested by interactive applications just as
quickly. To this purpose, both a newly-created queue and a queue
associated with an interactive application (we explain in a moment how
BFQ decides whether the associated application is interactive),
receive the following two special treatments:
1) The weight of the queue is raised.
2) The queue unconditionally enjoys device idling when it empties; in
fact, if the requests of a queue are sync, then performing device
idling for the queue is a necessary condition to guarantee that the
queue receives a fraction of the throughput proportional to its weight
(see [1] for details).
For brevity, we call just weight-raising the combination of these
two preferential treatments. For a newly-created queue,
weight-raising starts immediately and lasts for a time interval that:
1) depends on the device speed and type (rotational or
non-rotational), and 2) is equal to the time needed to load (start up)
a large-size application on that device, with cold caches and with no
additional workload.
Finally, as for guaranteeing a fast execution to interactive,
I/O-related tasks (such as opening a file), consider that any
interactive application blocks and waits for user input both after
starting up and after executing some task. After a while, the user may
trigger new operations, after which the application stops again, and
so on. Accordingly, the low-latency heuristic weight-raises again a
queue in case it becomes backlogged after being idle for a
sufficiently long (configurable) time. The weight-raising then lasts
for the same time as for a just-created queue.
According to our experiments, the combination of this low-latency
heuristic and of the improvements described in the previous patch
allows BFQ to guarantee a high application responsiveness.
[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
Scheduler", Proceedings of the First Workshop on Mobile System
Technologies (MST-2015), May 2015.
http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.
Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.
Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We tag as v0 the version of BFQ containing only BFQ's engine plus
hierarchical support. BFQ's engine is introduced by this commit, while
hierarchical support is added by next commit. We use the v0 tag to
distinguish this minimal version of BFQ from the versions containing
also the features and the improvements added by next commits. BFQ-v0
coincides with the version of BFQ submitted a few years ago [1], apart
from the introduction of preemption, described below.
BFQ is a proportional-share I/O scheduler, whose general structure,
plus a lot of code, are borrowed from CFQ.
- Each process doing I/O on a device is associated with a weight and a
(bfq_)queue.
- BFQ grants exclusive access to the device, for a while, to one queue
(process) at a time, and implements this service model by
associating every queue with a budget, measured in number of
sectors.
- After a queue is granted access to the device, the budget of the
queue is decremented, on each request dispatch, by the size of the
request.
- The in-service queue is expired, i.e., its service is suspended,
only if one of the following events occurs: 1) the queue finishes
its budget, 2) the queue empties, 3) a "budget timeout" fires.
- The budget timeout prevents processes doing random I/O from
holding the device for too long and dramatically reducing
throughput.
- Actually, as in CFQ, a queue associated with a process issuing
sync requests may not be expired immediately when it empties. In
contrast, BFQ may idle the device for a short time interval,
giving the process the chance to go on being served if it issues
a new request in time. Device idling typically boosts the
throughput on rotational devices, if processes do synchronous
and sequential I/O. In addition, under BFQ, device idling is
also instrumental in guaranteeing the desired throughput
fraction to processes issuing sync requests (see [2] for
details).
- With respect to idling for service guarantees, if several
processes are competing for the device at the same time, but
all processes (and groups, after the following commit) have
the same weight, then BFQ guarantees the expected throughput
distribution without ever idling the device. Throughput is
thus as high as possible in this common scenario.
- Queues are scheduled according to a variant of WF2Q+, named
B-WF2Q+, and implemented using an augmented rb-tree to preserve an
O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
also ready for hierarchical scheduling. However, for a cleaner
logical breakdown, the code that enables and completes
hierarchical support is provided in the next commit, which focuses
exactly on this feature.
- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
perfectly fair, and smooth service. In particular, B-WF2Q+
guarantees that each queue receives a fraction of the device
throughput proportional to its weight, even if the throughput
fluctuates, and regardless of: the device parameters, the current
workload and the budgets assigned to the queue.
- The last, budget-independence, property (although probably
counterintuitive in the first place) is definitely beneficial, for
the following reasons:
- First, with any proportional-share scheduler, the maximum
deviation with respect to an ideal service is proportional to
the maximum budget (slice) assigned to queues. As a consequence,
BFQ can keep this deviation tight not only because of the
accurate service of B-WF2Q+, but also because BFQ *does not*
need to assign a larger budget to a queue to let the queue
receive a higher fraction of the device throughput.
- Second, BFQ is free to choose, for every process (queue), the
budget that best fits the needs of the process, or best
leverages the I/O pattern of the process. In particular, BFQ
updates queue budgets with a simple feedback-loop algorithm that
allows a high throughput to be achieved, while still providing
tight latency guarantees to time-sensitive applications. When
the in-service queue expires, this algorithm computes the next
budget of the queue so as to:
- Let large budgets be eventually assigned to the queues
associated with I/O-bound applications performing sequential
I/O: in fact, the longer these applications are served once
got access to the device, the higher the throughput is.
- Let small budgets be eventually assigned to the queues
associated with time-sensitive applications (which typically
perform sporadic and short I/O), because, the smaller the
budget assigned to a queue waiting for service is, the sooner
B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
- Weights can be assigned to processes only indirectly, through I/O
priorities, and according to the relation:
weight = 10 * (IOPRIO_BE_NR - ioprio).
The next patch provides, instead, a cgroups interface through which
weights can be assigned explicitly.
- If several processes are competing for the device at the same time,
but all processes and groups have the same weight, then BFQ
guarantees the expected throughput distribution without ever idling
the device. It uses preemption instead. Throughput is then much
higher in this common scenario.
- ioprio classes are served in strict priority order, i.e.,
lower-priority queues are not served as long as there are
higher-priority queues. Among queues in the same class, the
bandwidth is distributed in proportion to the weight of each
queue. A very thin extra bandwidth is however guaranteed to the Idle
class, to prevent it from starving.
- If the strict_guarantees parameter is set (default: unset), then BFQ
- always performs idling when the in-service queue becomes empty;
- forces the device to serve one I/O request at a time, by
dispatching a new request only if there is no outstanding
request.
In the presence of differentiated weights or I/O-request sizes,
both the above conditions are needed to guarantee that every
queue receives its allotted share of the bandwidth (see
Documentation/block/bfq-iosched.txt for more details). Setting
strict_guarantees may evidently affect throughput.
[1] https://lkml.org/lkml/2008/4/1/234https://lkml.org/lkml/2008/11/11/148
[2] P. Valente and M. Andreolini, "Improving Application
Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
the 5th Annual International Systems and Storage Conference
(SYSTOR '12), June 2012.
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
scale to multiple queues. Users configure only two knobs, the target
read and synchronous write latencies, and the scheduler tunes itself to
achieve that latency goal.
The implementation is based on "tokens", built on top of the scalable
bitmap library. Tokens serve as a mechanism for limiting requests. There
are two tiers of tokens: queueing tokens and dispatch tokens.
A queueing token is required to allocate a request. In fact, these
tokens are actually the blk-mq internal scheduler tags, but the
scheduler manages the allocation directly in order to implement its
policy.
Dispatch tokens are device-wide and split up into two scheduling
domains: reads vs. writes. Each hardware queue dispatches batches
round-robin between the scheduling domains as long as tokens are
available for that domain.
These tokens can be used as the mechanism to enable various policies.
The policy Kyber uses is inspired by active queue management techniques
for network routing, similar to blk-wbt. The scheduler monitors
latencies and scales the number of dispatch tokens accordingly. Queueing
tokens are used to prevent starvation of synchronous requests by
asynchronous requests.
Various extensions are possible, including better heuristics and ionice
support. The new scheduler isn't set as the default yet.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
throtl_slice is important for blk-throttling. It's called slice
internally but it really is a time window blk-throttling samples data.
blk-throttling will make decision based on the samplings. An example is
bandwidth measurement. A cgroup's bandwidth is measured in the time
interval of throtl_slice.
A small throtl_slice meanse cgroups have smoother throughput but burn
more CPUs. It has 100ms default value, which is not appropriate for all
disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
it tunable.
Since throtl_slice isn't a time slice, the sysfs name
'throttle_sample_time' reflects its character better.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Three more DocBook template files have been converted to RST; only 21 to
go. There are various build improvements and the usual array of
documentation improvements and fixes.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJYriFXAAoJEI3ONVYwIuV6iTMP/iV7ownq9IK1f8askcXKM76i
NoRdj4/JywAPQ73vLhOSDVELGdVJNRBjdyOdBRzxPgsqAhFmm79lVYV2eLIffQ2k
7LcVbEQR77I+4z9SwqIVbIWNCBry7Hu8aWh7moDL3I6yeuay408yr5YW2lIlsqHZ
V/LZgkTWDe+iQPeXNA4Djzylx0lcRlAy4yMSLjN1+gb9/uBnXb9J0eGJzgfZfrL8
fiIhymg3bv8vB99l6LMR5vT343QLWXf1yS31A7rPQvwkDo6zFehUJA0XNfIsl2dw
VQYsvl9vp9wy3e6Y0qKXPn1XhAhCrm64P3crBxK31MMvcKZVCfeRSZ78wrvpvewy
MVLlXdqop1bHPHowtRfA5jwxr1NqcYp+Jg0+YGX3iXpPi1Jfk36DNUy9iWvtvIzr
lWgQcIKsdCwwYUcvPR8Kt8T/3q/AHbYlI6mimWlkmbZwncQcgCrH5xSG+c2BIPfV
fn3W6eLHBn8RyVsxlaXlA0Y9TNtI/Cm85b3Ri10pFvhl868ppWfJxXHi7UtcbU58
sQzahISCTXOH/NQwkkh7kFMtczbB43rAcChvF7EUYpazVBpJ4P4HxKFg3eIzIdc6
VlBSaMu1hxUGoYxNNYuKr/nYstuczLOKzK7q4j/JOExY3RgTWP+T3bF02wgubvoa
D/9WfScewkgCJRoA7i17
=C5nd
-----END PGP SIGNATURE-----
Merge tag 'docs-4.11' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A slightly quieter cycle for documentation this time around.
Three more DocBook template files have been converted to RST; only 21
to go. There are various build improvements and the usual array of
documentation improvements and fixes"
* tag 'docs-4.11' of git://git.lwn.net/linux: (44 commits)
docs / driver-api: Fix structure references in device_link.rst
PM / docs: Fix structure references in device.rst
Add a target to check broken external links in the Documentation
Documentation: Fix linux-api list typo
Documentation: DocBook/Makefile comment typo
Improve sparse documentation
Documentation: make Makefile.sphinx no-ops quieter
Documentation: DMA-ISA-LPC.txt
Documentation: input: fix path to input code definitions
docs: Remove the copyright year from conf.py
docs: Fix a warning in the Korean HOWTO.rst translation
PM / sleep / docs: Convert PM notifiers document to reST
PM / core / docs: Convert sleep states API document to reST
PM / core: Update kerneldoc comments in pm.h
doc-rst: Fix recursive make invocation from macros
doc-rst: Delete output of failed dot-SVG conversion
doc-rst: Break shell command sequences on failure
Documentation/sphinx: make targets independent of Sphinx work for HAVE_SPHINX=0
doc-rst: fixed cleandoc target when used with O=dir
Documentation/sphinx: prevent generation of .pyc files in the source tree
...
This patch fix some double words found in Documentation.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Allow a write of '-1' to reset the default latency target for
a given device. This removes knowledge of the different default
settings for rotational vs non-rotational from user space.
Signed-off-by: Jens Axboe <axboe@fb.com>
This was documented in the original commit, 64f1c21e86, but it
never made it into the proper location for queue sysfs files.
Signed-off-by: Jens Axboe <axboe@fb.com>
If CONFIG_NVM is disabled, loading null_block module with use_lightnvm=1
fails. But there are no messages and documents related to the failure.
Add the appropriate error message.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Massaged the text a bit.
Signed-off-by: Jens Axboe <axboe@fb.com>
Enable throttling of buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at the time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
The algorithm for when to throttle takes its inspiration in the
CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
the minimum latencies of requests over a window of time. In that
window of time, if the minimum latency of any request exceeds a
given target, then a scale count is incremented and the queue depth
is shrunk. The next monitoring window is shrunk accordingly. Unlike
CoDel, if we hit a window that exhibits good behavior, then we
simply increment the scale count and re-calculate the limits for that
scale value. This prevents us from oscillating between a
close-to-ideal value and max all the time, instead remaining in the
windows where we get good behavior.
Unlike CoDel, blk-wb allows the scale count to to negative. This
happens if we primarily have writes going on. Unlike positive
scale counts, this doesn't change the size of the monitoring window.
When the heavy writers finish, blk-bw quickly snaps back to it's
stable state of a zero scale count.
The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
target to me met. It defaults to 2 msec for non-rotational storage, and
75 msec for rotational storage. Setting this value to '0' disables
blk-wb. Generally, a user would not have to touch this setting.
We don't enable WBT on devices that are managed with CFQ, and have
a non-root block cgroup attached. If we have a proportional share setup
on this particular disk, then the wbt throttling will interfere with
that. We don't have a strong need for wbt for that case, since we will
rely on CFQ doing that for us.
Signed-off-by: Jens Axboe <axboe@fb.com>
Noidle should be the default for writes as seen by all the compounds
definitions in fs.h using it. In fact only direct I/O really should
be using NODILE, so turn the whole flag around to get the defaults
right, which will make our life much easier especially onces the
WRITE_* defines go away.
This assumes all the existing "raw" users of REQ_SYNC for writes
want noidle behavior, which seems to be spot on from a quick audit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.
In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.
Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.
This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
commit e1defc4ff0
"block: Do away with the notion of hardsect_size"
removed the notion of "hardware sector size" from
the kernel in favor of logical block size, but
references remain in comments and documentation.
Update the remaining sites mentioning hardsect.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add descriptions for dax, io_poll, and write_same_max_bytes files.
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull vfs updates from Al Viro:
"Assorted cleanups and fixes.
Probably the most interesting part long-term is ->d_init() - that will
have a bunch of followups in (at least) ceph and lustre, but we'll
need to sort the barrier-related rules before it can get used for
really non-trivial stuff.
Another fun thing is the merge of ->d_iput() callers (dentry_iput()
and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
except the one in __d_lookup_lru())"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
fs/dcache.c: avoid soft-lockup in dput()
vfs: new d_init method
vfs: Update lookup_dcache() comment
bdev: get rid of ->bd_inodes
Remove last traces of ->sync_page
new helper: d_same_name()
dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
vfs: clean up documentation
vfs: document ->d_real()
vfs: merge .d_select_inode() into .d_real()
unify dentry_iput() and dentry_unlink_inode()
binfmt_misc: ->s_root is not going anywhere
drop redundant ->owner initializations
ufs: get rid of redundant checks
orangefs: constify inode_operations
missed comment updates from ->direct_IO() prototype change
file_inode(f)->i_mapping is f->f_mapping
trim fsnotify hooks a bit
9p: new helper - v9fs_parent_fid()
debugfs: ->d_parent is never NULL or negative
...
Commit 7eaceaccab removed ->sync_page, but a few mentions of it still
existed in documentation and comments,
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
infrastructural work to allow documents to be written using restructured
text. Maybe someday, in a galaxy far far away, we'll be able to eliminate
the DocBook dependency and have a much better integrated set of kernel
docs. Someday.
Beyond that, there's a new document on security hardening from Kees, the
movement of some sample code over to samples/, a number of improvements to
the serial docs from Geert, and the usual collection of corrections, typo
fixes, etc.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPf/VAAoJEI3ONVYwIuV60pkP/3brq+CavbwptWppESoyZaf7
mpVSH7sOKicMcfHYYIXHmmg0K5gM4e22ATl39+izUCRZRwRnObXvroH++G5mARLs
MUDxLvkc/QxDDuCZnUBq5E2gPtuyYpgj1q9fMGB+70ucc/EXYp5cxUhDmbNVrpSG
KBMoZqKaW/Cf8/4fvRQG/glSR0iwyaQuvvoFAWLHgf8uWN/JPM2Cnv9V2zGQCtzP
4B4Jzayu2BGKowBd65WUYdpGnccc7OAJFSJDY/Z9x7kVxKyD+VTn7VgxGnXxs88v
uNmUEMENUpswzuoYEnDHoR0Y2o7jUi2doFKv+eacSmPaMLWL5EMDzcooZ+Vi7HWH
mvp6GtAZ5qs96OGjsi+gFIw4kY8HGdnpzs7qk/uEdAndfAif5v24YLSQRG2rUCJM
LxomnAWOJEIWGKJtuJnl16aZkgOcn6soecXw3PJmpxzhwd8BnQzwyZIdaZ98kwjA
7Enq2Mmw5NBQwGIV2ODUxzoQ3Axj7aJJsDra2n6lPGTGXONGdgNFzk/hGmtQSuIp
Aeatiy66FF0qKomzs2+EACOFP+eH/IId0yvW83Pj0o9nV25YZiPsw0Z1Tae5n3+g
zgTFycalaowIwE3YzyH6BwvnMrluiPpUTjSLsmEaviJxE7/o+zrjOvMvallUIVUn
YkJcia/DtSuc7u7LYkWe
=2O+a
-----END PGP SIGNATURE-----
Merge tag 'docs-for-linus' of git://git.lwn.net/linux
Pull Documentation updates from Jon Corbet:
"A bit busier this time around.
The most interesting thing (IMO) this time around is some beginning
infrastructural work to allow documents to be written using
restructured text. Maybe someday, in a galaxy far far away, we'll be
able to eliminate the DocBook dependency and have a much better
integrated set of kernel docs. Someday.
Beyond that, there's a new document on security hardening from Kees,
the movement of some sample code over to samples/, a number of
improvements to the serial docs from Geert, and the usual collection
of corrections, typo fixes, etc"
* tag 'docs-for-linus' of git://git.lwn.net/linux: (55 commits)
doc: self-protection: provide initial details
serial: doc: Use port->state instead of info
serial: doc: Always refer to tty_port->mutex
Documentation: vm: Spelling s/paltform/platform/g
Documentation/memcg: update kmem limit doc as codes behavior
docproc: print a comment about autogeneration for rst output
docproc: add support for reStructuredText format via --rst option
docproc: abstract terminating lines at first space
docproc: abstract docproc directive detection
docproc: reduce unnecessary indentation
docproc: add variables for subcommand and filename
kernel-doc: use rst C domain directives and references for types
kernel-doc: produce RestructuredText output
kernel-doc: rewrite usage description, remove duplicated comments
Doc: correct the location of sysrq.c
Documentation: fix common spelling mistakes
samples: v4l: from Documentation to samples directory
samples: connector: from Documentation to samples directory
Documentation: xillybus: fix spelling mistake
Documentation: x86: fix spelling mistakes
...
We don't have any drivers left using it, so kill it off. Update
documentation to use the newer blk_queue_write_cache().
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Add an internal helper and flag for setting whether a queue has
write back caching, or write through (or none). Add a sysfs file
to show this as well, and make it changeable from user space.
This will replace the (awkward) blk_queue_flush() interface that
drivers currently use to inform the block layer of write cache state
and capabilities.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
fixes and various document tweaks.
One patch reaches out of the documentation subtree to fix a comment in
init/do_mounts_rd.c. There didn't seem to be anybody more appropriate to
take that one, so I accepted it.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmmJ8AAoJEI3ONVYwIuV6uqwP/0mnqdxVWo47ohaYJP7q0Soh
ovJAbfttxKnkmOdGbWcNIJtTiw+MpdF805CYR+2treE0zvEEDodg7BhkDnmKZJ9n
F1r53JrIj769E1c5ETmWTHcBt3jjtKyQIbBmDr4YTgX91dlKF28o1bMmyDECWIcT
PktTlPUidDtffKMn3klh6baPCMrTpLJ8aLshBzUrQhrQY8lxcZKAU+98vtFzYofG
LXCSulMYXumb7XBxErTLQZhmJslD4gaDMh2xkov6ALS8XNHnfoUIFRbArAllNfTf
LQGJ6Q5qnn58UWi9F/vgDqx7+d1KIPUjBxJR9wfa0w9ggQhA9ly2BSN/fllbiSbp
yIi1JS4hwBe8H/h577BNC3xjmgVN7mazZsXlS+fg3G16gpv4JdWeRY4efjosFIzQ
EIJxB8qAovUNqw4s1mzRIJ5B9L7PEK27O6z8N27Fiw4EigtMTFAOC2/GD3ELx4iJ
p1doiSr+wjfDcFd8kdIUiDKGrTSTXwNy3hUfrhzQyaEjDTJnx3+1+ono1orSazPO
Fr2RSsC5VzX4IYSuxTMvFSKjN1Iiu8xqwq3IdclHXrBhRvwOF2wpjjQ5Guf0lHBJ
FLBahSjZqt01kmwFykxoHps+VeSwpoEen6rClBQolfmtYVDTvgRNN46AxK9jZ8T4
jZmCNNs/mYzrqo/RTnmw
=u38W
-----END PGP SIGNATURE-----
Merge tag 'docs-4.5' of git://git.lwn.net/linux
Pull documentation updates from Jon Corbet:
"A relatively boring cycle in the docs tree. There's a few kernel-doc
fixes and various document tweaks.
One patch reaches out of the documentation subtree to fix a comment in
init/do_mounts_rd.c. There didn't seem to be anybody more appropriate
to take that one, so I accepted it"
* tag 'docs-4.5' of git://git.lwn.net/linux: (29 commits)
thermal: add description for integral_cutoff unit
Documentation: update libhugetlbfs site url
Documentation: Explain pci=conf1,conf2 more verbosely
DMA-API: fix confusing sentence in Documentation/DMA-API.txt
Documentation: translations: update linux cross reference link
Documentation: fix typo in CodingStyle
init, Documentation: Remove ramdisk_blocksize mentions
Documentation-getdelays: Apply a recommendation from "checkpatch.pl" in main()
Documentation: HOWTO: update versions from 3.x to 4.x
Documentation: remove outdated references from translations
Doc: treewide: Fix grammar "a" to "an"
Documentation: cpu-hotplug: Fix sysfs mount instructions
can-doc: Add hint about getting timestamps
Fix CFQ I/O scheduler parameter name in documentation
Documentation: arm: remove dead links from Marvell Berlin docs
Documentation: HOWTO: update code cross reference link
Doc: Docbook/iio: Fix typo in iio.tmpl
DocBook: make index.html generation less verbose by default
DocBook: Cleanup: remove an unused $(call) line
DocBook: Add a help message for DOCBOOKS env var
...
As seen in block/cfq-iosched.c, the parameter name is low_latency.
Signed-off-by: Libor Pechacek <lpechacek@suse.cz>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Add support for registering as a LightNVM device. This allows us to
evaluate the performance of the LightNVM subsystem.
In /drivers/Makefile, LightNVM is moved above block device drivers
to make sure that the LightNVM media managers have been initialized
before drivers under /drivers/block are initialized.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Fix by Jens Axboe to remove unneeded slab cache and the following
memory leak.
Signed-off-by: Jens Axboe <axboe@fb.com>
This commits adds a driver API and ioctls for controlling Persistent
Reservations s/genericly/generically/ at the block layer. Persistent
Reservations are supported by SCSI and NVMe and allow controlling who gets
access to a device in a shared storage setup.
Note that we add a pr_ops structure to struct block_device_operations
instead of adding the members directly to avoid bloating all instances
of devices that will never support Persistent Reservations.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Update block/biovecs.txt so that it includes a note on what kind of
effects arbitrarily sized bios would bring to the block layer.
Also fix a trivial typo, bio_iter_iovec.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Lots of devices support huge discard sizes these days. Depending
on how the device handles them internally, huge discards can
introduce massive latencies (hundreds of msec) on the device side.
We have a sysfs file, discard_max_bytes, that advertises the max
hardware supported discard size. Make this writeable, and split
the settings into a soft and hard limit. This can be set from
'discard_granularity' and up to the hardware limit.
Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw
set limit.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove mentioning of block barriers since they were removed.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leonid V. Fedorenchik <leonidsbox@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Pull block driver core update from Jens Axboe:
"This is the pull request for the core block IO changes for 3.19. Not
a huge round this time, mostly lots of little good fixes:
- Fix a bug in sysfs blktrace interface causing a NULL pointer
dereference, when enabled/disabled through that API. From Arianna
Avanzini.
- Various updates/fixes/improvements for blk-mq:
- A set of updates from Bart, mostly fixing buts in the tag
handling.
- Cleanup/code consolidation from Christoph.
- Extend queue_rq API to be able to handle batching issues of IO
requests. NVMe will utilize this shortly. From me.
- A few tag and request handling updates from me.
- Cleanup of the preempt handling for running queues from Paolo.
- Prevent running of unmapped hardware queues from Ming Lei.
- Move the kdump memory limiting check to be in the correct
location, from Shaohua.
- Initialize all software queues at init time from Takashi. This
prevents a kobject warning when CPUs are brought online that
weren't online when a queue was registered.
- Single writeback fix for I_DIRTY clearing from Tejun. Queued with
the core IO changes, since it's just a single fix.
- Version X of the __bio_add_page() segment addition retry from
Maurizio. Hope the Xth time is the charm.
- Documentation fixup for IO scheduler merging from Jan.
- Introduce (and use) generic IO stat accounting helpers for non-rq
drivers, from Gu Zheng.
- Kill off artificial limiting of max sectors in a request from
Christoph"
* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
bio: modify __bio_add_page() to accept pages that don't start a new segment
blk-mq: Fix uninitialized kobject at CPU hotplugging
blktrace: don't let the sysfs interface remove trace from running list
blk-mq: Use all available hardware queues
blk-mq: Micro-optimize bt_get()
blk-mq: Fix a race between bt_clear_tag() and bt_get()
blk-mq: Avoid that __bt_get_word() wraps multiple times
blk-mq: Fix a use-after-free
blk-mq: prevent unmapped hw queue from being scheduled
blk-mq: re-check for available tags after running the hardware queue
blk-mq: fix hang in bt_get()
blk-mq: move the kdump check to blk_mq_alloc_tag_set
blk-mq: cleanup tag free handling
blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
blk: introduce generic io stat accounting help function
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
genhd: check for int overflow in disk_expand_part_tbl()
blk-mq: add blk_mq_free_hctx_request()
blk-mq: export blk_mq_free_request()
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
...
Currently scsi piggy backs on the block layer to define the concept
of a tagged command. But we want to be able to have block-level host-wide
tags assigned even for untagged commands like the initial INQUIRY, so add
a new SCSI-level flag for commands that are tagged at the scsi level, so
that even commands without that set can have tags assigned to them. Note
that this alredy is the case for the blk-mq code path, and this just lets
the old path catch up with it.
We also set this flag based upon sdev->simple_tags instead of the block
queue flag, so that it is entirely independent of the block layer tagging,
and thus always correct even if a driver doesn't use block level tagging
yet.
Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and
removing it forces them to look for the proper replacement.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Explain that two requests can be merged without
elevator_allow_merge_fn() being called.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull core block layer changes from Jens Axboe:
"This is the core block IO pull request for 3.18. Apart from the new
and improved flush machinery for blk-mq, this is all mostly bug fixes
and cleanups.
- blk-mq timeout updates and fixes from Christoph.
- Removal of REQ_END, also from Christoph. We pass it through the
->queue_rq() hook for blk-mq instead, freeing up one of the request
bits. The space was overly tight on 32-bit, so Martin also killed
REQ_KERNEL since it's no longer used.
- blk integrity updates and fixes from Martin and Gu Zheng.
- Update to the flush machinery for blk-mq from Ming Lei. Now we
have a per hardware context flush request, which both cleans up the
code should scale better for flush intensive workloads on blk-mq.
- Improve the error printing, from Rob Elliott.
- Backing device improvements and cleanups from Tejun.
- Fixup of a misplaced rq_complete() tracepoint from Hannes.
- Make blk_get_request() return error pointers, fixing up issues
where we NULL deref when a device goes bad or missing. From Joe
Lawrence.
- Prep work for drastically reducing the memory consumption of dm
devices from Junichi Nomura. This allows creating clone bio sets
without preallocating a lot of memory.
- Fix a blk-mq hang on certain combinations of queue depths and
hardware queues from me.
- Limit memory consumption for blk-mq devices for crash dump
scenarios and drivers that use crazy high depths (certain SCSI
shared tag setups). We now just use a single queue and limited
depth for that"
* 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
block: Remove REQ_KERNEL
blk-mq: allocate cpumask on the home node
bio-integrity: remove the needless fail handle of bip_slab creating
block: include func name in __get_request prints
block: make blk_update_request print prefix match ratelimited prefix
blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
block: fix alignment_offset math that assumes io_min is a power-of-2
blk-mq: Make bt_clear_tag() easier to read
blk-mq: fix potential hang if rolling wakeup depth is too high
block: add bioset_create_nobvec()
block: use bio_clone_fast() in blk_rq_prep_clone()
block: misplaced rq_complete tracepoint
sd: Honor block layer integrity handling flags
block: Replace strnicmp with strncasecmp
block: Add T10 Protection Information functions
block: Don't merge requests if integrity flags differ
block: Integrity checksum flag
block: Relocate bio integrity flags
block: Add a disk flag to block integrity profile
block: Add prefix to block integrity profile flags
...
None of the filesystems appear interested in using the integrity tagging
feature. Potentially because very few storage devices actually permit
using the application tag space.
Remove the tagging functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For commands like REQ_COPY we need a way to pass extra information along
with each bio. Like integrity metadata this information must be
available at the bottom of the stack so bi_private does not suffice.
Rename the existing bi_integrity field to bi_special and make it a union
so we can have different bio extensions for each class of command.
We previously used bi_integrity != NULL as a way to identify whether a
bio had integrity metadata or not. Introduce a REQ_INTEGRITY to be the
indicator now that bi_special can contain different things.
In addition, bio_integrity(bio) will now return a pointer to the
integrity payload (when applicable).
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
bdev_integrity_enabled() is only used by bio_integrity_enabled().
Combine these two functions.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patches fixes a typo, and for consistency use
"IO" in upper case in the block/queue-sysfs.txt documentation.
Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>