Commit Graph

692262 Commits

Author SHA1 Message Date
Markus Elfring
427fd2bee0 drbd: A single dot should be put into a sequence.
Thus use the corresponding function "seq_putc".

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:45 -06:00
Lars Ellenberg
3f1a1b7cbb drbd: fix rmmod cleanup, remove _all_ debugfs entries
If there are still resources defined, but "empty", no more volumes
or connections configured, they don't hold module reference counts,
so rmmod is possible.

To avoid DRBD leftovers in debugfs, we need to call our global
drbd_debugfs_cleanup() only after all resources have been cleaned up.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:45 -06:00
Geliang Tang
be7445a381 drbd: Use setup_timer() instead of init_timer() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:45 -06:00
Lars Ellenberg
7c752ed325 drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
Race:

drbd_adm_attach()               | async drbd_md_endio()
                                |
device->ldev is still NULL.     |
                                |
drbd_md_read(                   |
 .endio = drbd_md_endio;        |
 submit;                        |
 ....                           |
 wait for done == 1;            |       done = 1;
);                              |       wake_up();
.. lot of other stuff,          |
.. including taking and         |
...giving up locks,             |
.. doing further IO,            |
.. stuff that takes "some time" |
                                | while in this context,
                                | this is the next statement.
                                | which means this context was scheduled
.. only then, finally,          | away for "some time".
device->ldev = nbc;             |
                                |       if (device->ldev)
                                |               put_ldev()

Unlikely, but possible. I was able to provoke it "reliably"
by adding an mdelay(500); after the wake_up().
Fixed by moving the if (!NULL) put_ldev() before done = 1;

Impact of the bug was that the resulting refcount imbalance
could lead to premature destruction of the object, potentially
causing a NULL pointer dereference during a subsequent detach.
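
The ordering problem described above can be replayed in a deterministic, single-threaded userspace model. This is only a sketch: `struct model_dev`, `put_ldev_if_set()` and `run_attach()` are hypothetical names, the interleaving is replayed by hand instead of by a scheduler, and the refcount value of 1 after attach is an assumption of the model, not a statement about the DRBD code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DRBD device: only what the race needs. */
struct model_dev {
	void *ldev;   /* NULL until attach has finished */
	int refcnt;   /* references held on ldev (model assumption) */
	bool done;    /* completion flag the submitter waits on */
};

/* Conditional put, modelled on "if (device->ldev) put_ldev()". */
static void put_ldev_if_set(struct model_dev *d)
{
	if (d->ldev)
		d->refcnt--;
}

/*
 * Replay the interleaving from the commit message.
 * fixed == false: done = 1 first, attach continues and sets ldev,
 *                 then the completion handler runs its conditional put.
 * fixed == true:  conditional put first, then done = 1.
 * Returns the refcount after attach; 1 is the balanced value here.
 */
int run_attach(bool fixed)
{
	static int backing;              /* dummy object for ldev to point at */
	struct model_dev d = { NULL, 0, false };

	if (fixed)
		put_ldev_if_set(&d);     /* ldev is still NULL: no-op */
	d.done = true;                   /* wakes the attach path */

	assert(d.done);                  /* attach resumes only after done */
	d.ldev = &backing;               /* models "device->ldev = nbc;" */
	d.refcnt = 1;                    /* attach holds one reference */

	if (!fixed)
		put_ldev_if_set(&d);     /* handler was scheduled away: stray put */

	return d.refcnt;
}
```

In this model `run_attach(false)` ends with a refcount of 0 (one put too many), while `run_attach(true)` keeps the balanced value of 1.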

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:45 -06:00
Lars Ellenberg
9de7e14a1a drbd: new disk-option disable-write-same
Some backend devices claim to support write-same,
but would fail actual write-same requests.

Allow setting (or toggling) whether or not DRBD tries to support write-same.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:44 -06:00
Philipp Reisner
c200d98687 drbd: Fix resource role for newly created resources in events2
The conn_highest_role() (a terribly misnamed function) returns
the role of the resource. It returned R_UNKNOWN as long as the
resource has not a single device.

Resources without devices are short living objects.

But it matters for the NOTIFY_CREATE netlink message. It makes
a lot more sense to report R_SECONDARY for the newly created
resource than R_UNKNOWN.

I reviewed all call sites of conn_highest_role(); that change
does not matter for the other call sites.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:44 -06:00
Baoyou Xie
1ffa7bfab4 drbd: mark symbols static where possible
We get a few warnings when building kernel with W=1:
drbd/drbd_receiver.c:1224:6: warning: no previous prototype for 'one_flush_endio' [-Wmissing-prototypes]
drbd/drbd_req.c:1450:6: warning: no previous prototype for 'send_and_submit_pending' [-Wmissing-prototypes]
drbd/drbd_main.c:924:6: warning: no previous prototype for 'assign_p_sizes_qlim' [-Wmissing-prototypes]
....

In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
So this patch marks these functions with 'static'.

Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:44 -06:00
Lars Ellenberg
e1fbc4ca9d drbd: Send P_NEG_ACK upon write error in protocol != C
In protocol != C, we forgot to send the P_NEG_ACK for failing writes.

Once we no longer submit to local disk, because we already "detached",
due to the typical "on-io-error detach;" config setting,
we already send the neg acks right away.

Only those requests that have been submitted,
and have been error-completed by the local disk,
would forget to send the neg-ack,
and only in asynchronous replication (protocol != C).
Unless this happened during resync,
where we already always send acks, regardless of protocol.

The primary side needs the P_NEG_ACK in order to mark
the affected block(s) for resync in its out-of-sync bitmap.

If the blocks in question are not re-written again,
we may fail to resync them later, causing data inconsistencies.

This patch will always send the neg-acks, and also at least try to
persist the out-of-sync status on the local node already.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:44 -06:00
Lars Ellenberg
de6978be44 drbd: add explicit plugging when submitting batches
When submitting batches of requests which had been queued on the
submitter thread, typically because they needed to wait for an
activity log transaction, use explicit plugging to help potential
merging of requests in the backend io-scheduler.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:44 -06:00
Lars Ellenberg
9da10e8da3 drbd: change list_for_each_safe to while(list_first_entry_or_null)
Two instances of list_for_each_safe can drop their tmp element, they
really just peel off each element in turn from the start of the list.
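
The "peel off the first element until the list is empty" pattern can be sketched in userspace as follows. The list helpers below are a minimal reimplementation written for this example, using the same names as the kernel's `<linux/list.h>` but not copied from it; `struct req`, `drain()` and `drain_demo()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* First entry of the list, or NULL if the list is empty. */
#define list_first_entry_or_null(head, type, member) \
	((head)->next != (head) ? container_of((head)->next, type, member) : NULL)

struct req { int id; struct list_head node; };

/* Drain the list from the front; no "tmp" cursor is needed. */
static int drain(struct list_head *head)
{
	struct req *r;
	int sum = 0;

	while ((r = list_first_entry_or_null(head, struct req, node)) != NULL) {
		list_del(&r->node);
		sum += r->id;
	}
	return sum;
}

/* Build a three-element list, drain it, and return the sum of the ids. */
int drain_demo(void)
{
	struct list_head head;
	struct req reqs[3] = { { .id = 1 }, { .id = 2 }, { .id = 3 } };
	int i, sum;

	list_init(&head);
	for (i = 0; i < 3; i++)
		list_add_tail(&reqs[i].node, &head);

	sum = drain(&head);
	assert(head.next == &head);   /* list is empty afterwards */
	return sum;
}
```

Because every iteration removes the element it just looked up, the loop never needs to remember a "next" pointer, which is the extra variable that `list_for_each_safe` carries.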

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:44 -06:00
Lars Ellenberg
c51a0ef374 drbd: introduce drbd_recv_header_maybe_unplug
Recently, drbd_recv_header() was changed to potentially
implicitly "unplug" the backend device(s), in case there
is currently nothing to receive.

Be more explicit about it: re-introduce the original drbd_recv_header(),
and introduce a new drbd_recv_header_maybe_unplug() for use by the
receiver "main loop".

Using explicit plugging via blk_start_plug(); blk_finish_plug();
really helps the io-scheduler of the backend with merging requests.

Wrap the receiver "main loop" with such a plug.
Also catch unplug events on the Primary,
and try to propagate.

This is performance relevant.  Without this, if the receiving side does
not merge requests, number of IOPS on the peer can be significantly
higher than IOPS on the Primary, and can easily become the bottleneck.

Together, both changes should help to reduce the number of IOPS
as seen on the backend of the receiving side, by increasing
the chance of merging mergeable requests, without trading latency
for more throughput.
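
Why holding requests back briefly can reduce backend IOPS may be easier to see in a toy model. The sketch below is not the block layer's implementation of blk_start_plug()/blk_finish_plug(); `struct plug`, `plug_submit()` and friends are hypothetical, and the only merge rule modelled is a back-merge of a request that starts exactly where the pending one ends.

```c
#include <assert.h>
#include <stddef.h>

struct io { unsigned long sector; unsigned int nr; };

/* Hypothetical plug: holds one pending request as a back-merge candidate. */
struct plug {
	struct io pending;
	int has_pending;
	int dispatched;   /* requests that reached the modelled backend */
};

static void plug_start(struct plug *p)
{
	p->has_pending = 0;
	p->dispatched = 0;
}

static void plug_submit(struct plug *p, struct io io)
{
	assert(io.nr > 0);
	if (p->has_pending &&
	    p->pending.sector + p->pending.nr == io.sector) {
		p->pending.nr += io.nr;   /* contiguous: merge into pending */
		return;
	}
	if (p->has_pending)
		p->dispatched++;          /* not mergeable: send pending down */
	p->pending = io;
	p->has_pending = 1;
}

/* "Unplug": send whatever is still held back, return the dispatch count. */
static int plug_finish(struct plug *p)
{
	if (p->has_pending) {
		p->dispatched++;
		p->has_pending = 0;
	}
	return p->dispatched;
}

/* Four submitted requests, three of them contiguous. */
int plug_demo(void)
{
	const struct io ios[] = { {0, 8}, {8, 8}, {16, 8}, {100, 8} };
	struct plug p;
	size_t i;

	plug_start(&p);
	for (i = 0; i < sizeof(ios) / sizeof(ios[0]); i++)
		plug_submit(&p, ios[i]);
	return plug_finish(&p);
}
```

In this model the four submitted requests reach the backend as two, because the three contiguous ones are merged while the plug is held.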

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 15:34:43 -06:00
Christoph Hellwig
c529594f93 bsg: remove #if 0'ed code
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 10:50:30 -06:00
Ben Hutchings
7de967e76f mq-deadline: Enable auto-loading when built as module
The block core requests modules with the "-iosched" name suffix, but
mq-deadline does not have that suffix.  Add an alias.

Fixes: 945ffb60c1 ("mq-deadline: add blk-mq adaptation of the deadline ...")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 10:47:23 -06:00
Ben Hutchings
26b4cf2497 bfq: Re-enable auto-loading when built as a module
The block core requests modules with the "-iosched" name suffix, but
bfq no longer has that suffix.  Add an alias.

Fixes: ea25da4808 ("block, bfq: split bfq-iosched.c into multiple ...")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 10:47:23 -06:00
Damien Le Moal
5034435c84 block: Make blk_dequeue_request() static
The only caller of this function is blk_start_request() in the same
file. Fix blk_start_request() description accordingly.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 09:49:31 -06:00
Bart Van Assche
6fd5b91dab skd: Let the block layer core choose .nr_requests
Since blk_mq_init_queue() initializes .nr_requests to the tag set
size and since that value is a good default for the skd driver, do
not overwrite the value set by blk_mq_init_queue(). This change
doubles the default value of .nr_requests.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 09:43:06 -06:00
Bart Van Assche
bf231981be skd: Remove blk_queue_bounce_limit() call
Since sTec s1120 devices support 64-bit DMA it is not necessary
to request data buffer bouncing. Hence remove the
blk_queue_bounce_limit() call.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-29 09:43:03 -06:00
Bhumika Goyal
dfbde55249 nbd: make device_attribute const
Make this const as it is only passed as an argument to the
function device_create_file and device_remove_file and the corresponding
arguments are of type const.
Done using Coccinelle

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-28 15:21:27 -06:00
Jens Axboe
b3c3051220 null_blk: use available 'dev' in nullb_device_power_store()
We already have this pointer, no need to use to_nullb_device()
again.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-28 15:06:31 -06:00
Shaohua Li
060fd198a3 block/nullb: delete unnecessary memory free
Commit 2984c86(nullb: factor disk parameters) has a typo. The
nullb_device allocation/free is done outside of null_add_dev. The commit
accidentally frees the nullb_device in error code path.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-28 15:06:17 -06:00
David Jeffery
e9a823fb34 block: fix warning when I/O elevator is changed as request_queue is being removed
There is a race between changing I/O elevator and request_queue removal
which can trigger the warning in kobject_add_internal.  A program can
use sysfs to request a change of elevator at the same time another task
is unregistering the request_queue the elevator would be attached to.
The elevator's kobject will then attempt to be connected to the
request_queue in the object tree when the request_queue has just been
removed from sysfs.  This triggers the warning in kobject_add_internal
as the request_queue no longer has a sysfs directory:

kobject_add_internal failed for iosched (error: -2 parent: queue)
------------[ cut here ]------------
WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 kobject_add_internal+0x103/0x2d0

To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
changing the elevator and use the request_queue's sysfs_lock to
serialize between clearing the flag and the elevator testing the flag.
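
The serialization idea can be sketched in userspace with a mutex and a boolean. Everything here is a model: `struct model_queue`, `queue_unregister()` and `elevator_change()` are hypothetical names, the `registered` field stands in for QUEUE_FLAG_REGISTERED, and returning -ENOENT is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

struct model_queue {
	pthread_mutex_t sysfs_lock;
	bool registered;        /* models QUEUE_FLAG_REGISTERED */
	const char *elevator;   /* name of the current scheduler */
};

/* Removal path: clear the flag while holding the lock. */
static void queue_unregister(struct model_queue *q)
{
	pthread_mutex_lock(&q->sysfs_lock);
	q->registered = false;
	pthread_mutex_unlock(&q->sysfs_lock);
}

/* Elevator switch: test the flag under the same lock before touching q. */
static int elevator_change(struct model_queue *q, const char *name)
{
	int ret = 0;

	assert(name != NULL);
	pthread_mutex_lock(&q->sysfs_lock);
	if (!q->registered)
		ret = -ENOENT;   /* assumed error code: queue is gone from sysfs */
	else
		q->elevator = name;
	pthread_mutex_unlock(&q->sysfs_lock);
	return ret;
}

/* Switch once while registered, once after unregistering. */
int elevator_demo(void)
{
	struct model_queue q = { PTHREAD_MUTEX_INITIALIZER, true, "none" };
	int ret;

	ret = elevator_change(&q, "mq-deadline");
	assert(ret == 0);

	queue_unregister(&q);
	ret = elevator_change(&q, "bfq");
	assert(strcmp(q.elevator, "mq-deadline") == 0);   /* unchanged */
	return ret;
}
```

Because both paths take the same lock, the switch either completes before the flag is cleared or sees the cleared flag and backs out; it can no longer attach to a queue that has just left sysfs.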

Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-28 10:52:44 -06:00
weiping zhang
235f8da119 block, scheduler: convert xxx_var_store to void
The last parameter "count" is never used in xxx_var_store,
convert these functions to void.

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-28 10:01:08 -06:00
Bart Van Assche
f5cb2d5152 skd: Remove SKD_ID_INCR
The SKD_ID_INCR flag in skd_request_context.id duplicates information
that is already available otherwise, e.g. through the block layer
request state and through skd_request_context.state. Hence remove
the code that manipulates this flag and also the flag itself.
Since skd_isr_completion_posted() only uses the lower bits of
skd_request_context.id as hardware tag, this patch does not change
the behavior of the skd driver. I'm referring to the following code:

    tag = req_id & SKD_ID_SLOT_AND_TABLE_MASK;
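
A small userspace sketch of why the masked tag is unaffected by the removed flag. The two constant values below are assumptions chosen for illustration (check skd_main.c for the real definitions); `skd_tag()` is a hypothetical helper wrapping the quoted expression.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only, not taken from the driver source. */
#define SKD_ID_SLOT_AND_TABLE_MASK 0x03FFu
#define SKD_ID_INCR                0x0400u

/* Extract the hardware tag from a request id, as in the quoted line. */
uint32_t skd_tag(uint32_t req_id)
{
	/* The increment must live entirely outside the masked bits. */
	assert((SKD_ID_INCR & SKD_ID_SLOT_AND_TABLE_MASK) == 0);
	return req_id & SKD_ID_SLOT_AND_TABLE_MASK;
}
```

Under these assumed values, adding any multiple of `SKD_ID_INCR` to an id leaves `skd_tag()` unchanged, which is the property the commit message relies on.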

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25 15:29:45 -06:00
Bart Van Assche
4633504c1a skd: Make it easier for static analyzers to analyze skd_free_disk()
Although it is easy to see that skdev->disk != NULL if skdev->queue
!= NULL, add a test for skdev->disk to avoid that smatch reports the
following warning:

drivers/block/skd_main.c:3080 skd_free_disk()
         error: we previously assumed 'disk' could be null (see line 3074)

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25 15:29:43 -06:00
Bart Van Assche
795bc1b542 skd: Inline skd_end_request()
It is not worth keeping the debug statements in skd_end_request().
Without debug statements that function only consists of two
statements. Hence inline skd_end_request().

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25 15:29:42 -06:00
Bart Van Assche
296cb94c9d skd: Rename skd_softirq_done() into skd_complete_rq()
The latter name follows more closely the function names used in
other blk-mq drivers.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25 15:29:40 -06:00
Shaohua Li
0d06a42f79 block/nullb: fix NULL dereference
Dan reported this:

The patch 2984c8684f: "nullb: factor disk parameters" from Aug 14,
2017, leads to the following Smatch complaint:

drivers/block/null_blk.c:1759 null_init_tag_set()
	 error: we previously assumed 'nullb' could be null (see line
1750)

  1755		set->cmd_size	= sizeof(struct nullb_cmd);
  1756		set->flags = BLK_MQ_F_SHOULD_MERGE;
  1757		set->driver_data = NULL;
  1758
  1759		if (nullb->dev->blocking)
                    ^^^^^^^^^^^^^^^^^^^^
And an unchecked dereference.

nullb could be NULL here.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25 14:52:01 -06:00
weiping zhang
4c18c9e962 blkcg: avoid free blkcg_root when failed to alloc blkcg policy
This patch fixes two errors: first, avoid kfree()ing blkcg_root; second, do not
free blkcg if its allocation failed (blkcg == NULL), just unlock the mutex.

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25 13:51:07 -06:00
Jens Axboe
231b3db18d null_blk: update email address
Update to a working one, the fusionio address hasn't been valid
in 4 years.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25 12:53:15 -06:00
Omar Sandoval
3140c3cfae block: update comments to reflect REQ_FLUSH -> REQ_PREFLUSH rename
Normally I wouldn't bother with this, but in my opinion the comments are
the most important part of this whole file since without them no one
would have any clue how this insanity works.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-25 10:36:54 -06:00
Bart Van Assche
6a934bb814 compat_hdio_ioctl: Fix a declaration
This patch avoids that sparse reports the following warning messages:

block/compat_ioctl.c:85:11: warning: incorrect type in assignment (different address spaces)
block/compat_ioctl.c:85:11:    expected unsigned long *[noderef] <asn:1>p
block/compat_ioctl.c:85:11:    got void [noderef] <asn:1>*
block/compat_ioctl.c:91:21: warning: incorrect type in argument 1 (different address spaces)
block/compat_ioctl.c:91:21:    expected void const volatile [noderef] <asn:1>*<noident>
block/compat_ioctl.c:91:21:    got unsigned long *[noderef] <asn:1>p
block/compat_ioctl.c:87:53: warning: dereference of noderef expression
block/compat_ioctl.c:91:21: warning: dereference of noderef expression

Fixes: commit d597580d37 ("generic ...copy_..._user primitives")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-24 08:40:17 -06:00
weiping zhang
47570848f0 block: remove blk_free_devt in add_partition
put_device(pdev) will call pdev->type->release finally, and blk_free_devt
has been called in part_release(), so remove it.

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-24 08:26:57 -06:00
Milan Broz
97e05463e0 bio-integrity: Fix regression if profile verify_fn is NULL
In dm-integrity target we register integrity profile that have
both generate_fn and verify_fn callbacks set to NULL.

This is used if dm-integrity is stacked under a dm-crypt device
for authenticated encryption (integrity payload contains authentication
tag and IV seed).

In this case the verification is done through own crypto API
processing inside dm-crypt; integrity profile is only holder
of these data. (And memory is owned by dm-crypt as well.)

After the commit (and previous changes)
  Commit 7c20f11680
  Author: Christoph Hellwig <hch@lst.de>
  Date:   Mon Jul 3 16:58:43 2017 -0600

    bio-integrity: stop abusing bi_end_io

we get this crash:

: BUG: unable to handle kernel NULL pointer dereference at   (null)
: IP:   (null)
: *pde = 00000000
...
:
: Workqueue: kintegrityd bio_integrity_verify_fn
: task: f48ae180 task.stack: f4b5c000
: EIP:   (null)
: EFLAGS: 00210286 CPU: 0
: EAX: f4b5debc EBX: 00001000 ECX: 00000001 EDX: 00000000
: ESI: 00001000 EDI: ed25f000 EBP: f4b5dee8 ESP: f4b5dea4
:  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
: CR0: 80050033 CR2: 00000000 CR3: 32823000 CR4: 001406d0
: Call Trace:
:  ? bio_integrity_process+0xe3/0x1e0
:  bio_integrity_verify_fn+0xea/0x150
:  process_one_work+0x1c7/0x5c0
:  worker_thread+0x39/0x380
:  kthread+0xd6/0x110
:  ? process_one_work+0x5c0/0x5c0
:  ? kthread_worker_fn+0x100/0x100
:  ? kthread_worker_fn+0x100/0x100
:  ret_from_fork+0x19/0x24
: Code:  Bad EIP value.
: EIP:   (null) SS:ESP: 0068:f4b5dea4
: CR2: 0000000000000000

This patch just skips the whole verify workqueue if verify_fn is set to NULL.
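
The shape of the fix, checking the callback before scheduling any verification, can be sketched in userspace. This is a model under stated assumptions: `struct model_profile`, `complete_read()` and `count_verify()` are hypothetical, and the direct call stands in for queueing work on a workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*integrity_fn)(const void *data, size_t len);

/* Hypothetical profile: either callback may be NULL. */
struct model_profile {
	integrity_fn generate_fn;
	integrity_fn verify_fn;
};

static int verify_calls;

/* Dummy verifier that only counts how often it runs. */
static int count_verify(const void *data, size_t len)
{
	(void)data;
	(void)len;
	verify_calls++;
	return 0;
}

/* Returns true if verification ran, false if it was skipped. */
static bool complete_read(const struct model_profile *p,
			  const void *data, size_t len)
{
	assert(p != NULL);
	if (!p->verify_fn)
		return false;     /* nothing to verify: complete directly */
	p->verify_fn(data, len);  /* stands in for the deferred verify work */
	return true;
}

/* One read with a verifier, one without; returns the verifier call count. */
int verify_demo(void)
{
	const struct model_profile with_fn = { NULL, count_verify };
	const struct model_profile without_fn = { NULL, NULL };
	char buf[16] = { 0 };

	verify_calls = 0;
	complete_read(&without_fn, buf, sizeof(buf));   /* skipped */
	complete_read(&with_fn, buf, sizeof(buf));      /* runs the verifier */
	return verify_calls;
}
```

Without the NULL check, the first call in `verify_demo()` would jump through a NULL function pointer, which matches the "EIP: (null)" in the trace above.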

Fixes: 7c20f116 ("bio-integrity: stop abusing bi_end_io")
Signed-off-by: Milan Broz <gmazyland@gmail.com>
[hch: trivial whitespace fix]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-24 08:16:48 -06:00
weiping zhang
37dcd6570f block, bfq: fix error handle in bfq_init
If elv_register() fails, bfq_pool should be freed.

Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 15:35:54 -06:00
Christoph Hellwig
74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Christoph Hellwig
c2ee070fb0 block: cache the partition index in struct block_device
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:53 -06:00
Christoph Hellwig
807d4af2f6 block: add a __disk_get_part helper
This helper allows looking up a partition under RCU protection without
grabbing a reference to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:52 -06:00
Christoph Hellwig
de65b01232 block: reject attempts to allocate more than DISK_MAX_PARTS partitions
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:50 -06:00
Christoph Hellwig
10433d04b8 raid5: remove a call to get_start_sect
The block layer always remaps partitions before calling into the
->make_request methods of drivers.  Thus the call to get_start_sect in
in_chunk_boundary will always return 0 and can be removed.

Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:49 -06:00
Christoph Hellwig
f8f84b2dfd btrfs: index check-integrity state hash by a dev_t
We won't have the struct block_device available in the bio soon, so switch
to the numerical dev_t instead of the block_device pointer for looking up
the check-integrity state.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:47 -06:00
Bart Van Assche
744353b695 skd: Change default interrupt mode to MSI-X
Since MSI support on some motherboards is unreliable, change the
default interrupt mode from MSI to MSI-X. This patch avoids that
the following message appears sporadically in the kernel logs of
my test setup:

do_IRQ: 3.193 No irq handler for vector

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:02:37 -06:00
Bart Van Assche
f2fe445986 skd: Avoid double completions in case of a timeout
Avoid that normal request completion and the timeout handler can
run concurrently by calling blk_mq_complete_request() instead of
blk_mq_end_request() from skd_end_request(). Avoid that the block
layer can reuse a request while the firmware is still processing
it. Convert skd_softirq_done() to blk-mq. Pass the pointer to
skd_softirq_done() to the block layer core through
blk_mq_ops.complete instead of by calling blk_queue_softirq_done().
Pass the pointer to skd_timed_out() to the block layer core
through blk_mq_ops.timeout instead of by calling
blk_queue_timed_out(). The timeout handler has been tested as
follows:

    echo 1 > /sys/block/skd0/io-timeout-fail &&
    (cd /sys/kernel/debug/fail_io_timeout &&
      echo 100 > probability &&
      echo N > task-filter &&
      echo 1 > times)

Fixes: commit a74d5b76fa ("skd: Switch to block layer timeout mechanism")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:02:34 -06:00
Bart Van Assche
c39c6c773d skd: Inline skd_process_request()
This patch does not change any functionality but makes the skd
driver code more similar to that of other blk-mq kernel drivers.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:02:33 -06:00
Bart Van Assche
49f16e2f20 skd: Report completion mismatches once
This patch removes one debug statement but otherwise does not change
any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:02:32 -06:00
Bart Van Assche
130d733a61 block: Warn if blk_queue_rq_timed_out() is called for a blk-mq queue
The timeout handler set by blk_queue_rq_timed_out() is only used
in single queue mode. Calling this function for blk-mq drivers is
wrong. Hence issue a warning if this function is called by a blk-mq
driver.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:02:30 -06:00
Shaohua Li
2f54a613c9 nullb: badblocks support
Sometimes a disk can have broken tracks whose data is inaccessible,
but data in other parts can be accessed in the normal way. MD RAID supports
such disks. But we don't have a good way to test it, because we can't
control which part of a physical disk is bad. For a virtual disk, this
can be easily controlled.

This patch adds a new 'badblock' attribute. Configure it in this way:
echo "+1-100" > xxx/badblock, this will mark sectors [1-100] as bad
blocks.
echo "-20-30" > xxx/badblock, this will mark sectors [20-30] as good.

If bad blocks are accessed, the nullb disk will return an IO error. Other
parts of the disk can be accessed in the normal way.
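
The attribute syntax described above ("+start-end" to mark, "-start-end" to clear) can be parsed with a few lines of C. This is a userspace sketch, not the driver's store function: `badblock_store()`, `sector_is_bad()`, the fixed `NR_SECTORS` size and the plain bool array are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_SECTORS 128   /* hypothetical tiny disk */

static bool bad[NR_SECTORS];

/*
 * Parse "+start-end" (mark bad) or "-start-end" (mark good), inclusive.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int badblock_store(const char *buf)
{
	unsigned long long start, end, s;
	char op;

	assert(buf != NULL);
	if (sscanf(buf, "%c%llu-%llu", &op, &start, &end) != 3)
		return -1;
	if ((op != '+' && op != '-') || start > end || end >= NR_SECTORS)
		return -1;

	for (s = start; s <= end; s++)
		bad[s] = (op == '+');
	return 0;
}

/* An I/O touching a bad sector would be failed with an error. */
bool sector_is_bad(unsigned long long s)
{
	return s < NR_SECTORS && bad[s];
}
```

Replaying the two commands from the commit message, `badblock_store("+1-100")` followed by `badblock_store("-20-30")`, leaves sectors 1-19 and 31-100 marked bad.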

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 08:54:12 -06:00
Shaohua Li
deb78b419d nullb: emulate cache
Software must flush disk cache to guarantee data safety. To check if
software correctly does disk cache flush, we must know the behavior of
disk. But physical disk behavior is uncontrollable. Even if software
doesn't do the flush, the disk probably does it anyway. This patch tries
to emulate a cache in the test disk.

All writes go to a cache first; when the cache is full, we then
flush some data to disk storage. A flush request will flush all data of
the cache to disk storage. A FUA write will write to memory store
directly and revalidate data in cache. If there is a power failure (by
writing to power attribute, 'echo 0 > disk_name/power'), we discard all
data in the cache, but preserve the data in disk storage. Later we can
power on the disk again as usual (write 1 to 'power' attribute), then we
can check data integrity and verify if software does everything correctly.
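
The cache semantics described above can be modelled in a few lines of userspace C. This is a sketch under simplifying assumptions: `struct model_disk` and its helpers are hypothetical, there is no size limit (so the "cache is full" eviction path is not modelled), and a FUA write is modelled as writing the store and dropping any cached copy of that block.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_BLOCKS 8

/* Hypothetical disk: a persistent store plus a volatile write cache. */
struct model_disk {
	int store[NR_BLOCKS];
	int cache[NR_BLOCKS];
	bool cached[NR_BLOCKS];
};

static void disk_write(struct model_disk *d, int blk, int val, bool fua)
{
	assert(blk >= 0 && blk < NR_BLOCKS);
	if (fua) {
		d->store[blk] = val;      /* FUA: straight to the store */
		d->cached[blk] = false;   /* drop any stale cached copy */
	} else {
		d->cache[blk] = val;      /* normal write: cache only */
		d->cached[blk] = true;
	}
}

/* Flush request: move every cached block to the store. */
static void disk_flush(struct model_disk *d)
{
	int i;

	for (i = 0; i < NR_BLOCKS; i++) {
		if (d->cached[i]) {
			d->store[i] = d->cache[i];
			d->cached[i] = false;
		}
	}
}

/* Power failure: the cache is lost, the store survives. */
static void disk_power_off(struct model_disk *d)
{
	memset(d->cached, 0, sizeof(d->cached));
}

static int disk_read(const struct model_disk *d, int blk)
{
	assert(blk >= 0 && blk < NR_BLOCKS);
	return d->cached[blk] ? d->cache[blk] : d->store[blk];
}

/* Write 1 and flush, write 2 without flushing, lose power, read back. */
int cache_demo(void)
{
	struct model_disk d;

	memset(&d, 0, sizeof(d));
	disk_write(&d, 0, 1, false);
	disk_flush(&d);
	disk_write(&d, 0, 2, false);   /* never flushed */
	disk_power_off(&d);
	return disk_read(&d, 0);
}
```

The unflushed second write is lost across the simulated power failure, which is exactly the behavior a test needs in order to catch software that forgets to flush.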

A new attribute 'cache_size' (in MB) is added to configure cache size.

Based on original patch from Kyungchan Koh

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 08:54:11 -06:00
Shaohua Li
eff2c4f108 nullb: bandwidth control
In testing, we usually want a controllable disk speed. For example, in a
raid array, we'd like some disks to be fast and some to be slow. MD RAID
actually has a feature for this. To test the feature, we'd like to make
the disk run at a specific speed.

block throttling probably can be used for this purpose, but it requires
cgroup setup. Here we just implement a simple throttling mechanism in
the driver. There is slight fluctuation in the mechanism, but it's good
enough for test.

To configure the bandwidth cap, user sets the 'mbps' attribute. mbps is
MB/s.
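
One simple way to implement such a cap is a byte budget that a periodic timer refills. The sketch below is a generic illustration of that idea, not the driver's code: `struct throttle` and its helpers are hypothetical, 1 MB is assumed to be 2^20 bytes, and the budget is simply reset on every tick instead of carrying debt over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct throttle {
	uint64_t bytes_per_tick;   /* budget granted on every timer tick */
	int64_t budget;            /* bytes still allowed in this tick */
};

static void throttle_init(struct throttle *t, unsigned int mbps,
			  unsigned int ticks_per_sec)
{
	assert(mbps > 0 && ticks_per_sec > 0);
	/* Assumption: 1 MB = 2^20 bytes. */
	t->bytes_per_tick = (uint64_t)mbps * 1024 * 1024 / ticks_per_sec;
	t->budget = (int64_t)t->bytes_per_tick;
}

/* Timer callback: start a new tick with a fresh budget. */
static void throttle_tick(struct throttle *t)
{
	t->budget = (int64_t)t->bytes_per_tick;
}

/*
 * Returns true if the request may proceed now, false if the caller has to
 * wait for the next tick. The last request of a tick may overshoot the
 * budget, which is one source of fluctuation in this kind of scheme.
 */
static bool throttle_charge(struct throttle *t, uint32_t bytes)
{
	if (t->budget <= 0)
		return false;
	t->budget -= bytes;
	return true;
}

/* 4 MB/s with 1024 ticks per second gives 4096 bytes per tick. */
int throttle_demo(void)
{
	struct throttle t;
	int admitted = 0;
	int i;

	throttle_init(&t, 4, 1024);
	for (i = 0; i < 3; i++)
		if (throttle_charge(&t, 4096))
			admitted++;        /* only the first one fits */

	throttle_tick(&t);
	if (throttle_charge(&t, 4096))
		admitted++;                /* fits again after the refill */
	return admitted;
}
```

With these numbers, one 4096-byte request is admitted per tick, so `throttle_demo()` admits two of its four attempts.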

Based on original patch from Kyungchan Koh

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 08:54:09 -06:00
Shaohua Li
306eb6b4ad nullb: support discard
Discard makes sense for a memory backed disk. It is also useful for testing
whether the upper layer supports discard correctly.

The user configures the 'discard' attribute to enable/disable discard support.

Based on original patch from Kyungchan Koh

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 08:54:08 -06:00
Shaohua Li
5bcd0e0c79 nullb: support memory backed store
This adds memory backed store in nullb.

The user configures the 'memory_backed' attribute for this. By default, nullb
disk doesn't use memory backed store.

Based on original patch from Kyungchan Koh

Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 08:54:06 -06:00