While performing a resync/recovery, raid1 divides the
array space into three regions:
- before the resync
- at or shortly after the resync point
- much further ahead of the resync point.
Write requests to the first or third do not need to wait. Write
requests to the middle region do need to wait if resync requests are
pending.
If there are any active write requests in the middle region, resync
will wait for them.
Due to an accounting error, there is a small range of addresses,
between conf->next_resync and conf->start_next_window, where write
requests will *not* be blocked, but *will* be counted in the middle
region. This can effectively block resync indefinitely if filesystem
writes happen repeatedly to this region.
As ->next_window_requests is incremented when the sector is after
conf->start_next_window + NEXT_NORMALIO_DISTANCE
the same boundary should be used for determining when write requests
should wait.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
As we don't wait for writes to complete in bitmap_daemon_work, they
could still be in-flight when bitmap_unplug writes again. Or when
bitmap_daemon_work tries to write again.
This can be confusing and could risk the wrong data being written last.
So make sure we wait for old writes to complete before new writes start.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When writing to an array with a bitmap enabled, the writes are grouped
in batches which are preceded by an update to the bitmap.
It is quite likely if that a drive develops a problem which is not
media related, that the bitmap write will be the first to report an
error and cause the device to be marked faulty (as the bitmap write is
at the start of a batch).
In this case, there is point submiting the subsequent writes to the
failed device - that just wastes times.
So re-check the Faulty state of a device before submitting a
delayed write.
This requires that we keep the 'rdev', rather than the 'bdev' in the
bio, then swap in the bdev just before final submission.
Reported-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When writing to an array with a bitmap enabled, the writes are grouped
in batches which are preceded by an update to the bitmap.
It is quite likely if that a drive develops a problem which is not
media related, that the bitmap write will be the first to report an
error and cause the device to be marked faulty (as the bitmap write is
at the start of a batch).
In this case, there is point submiting the subsequent writes to the
failed device - that just wastes times.
So re-check the Faulty state of a device before submitting a
delayed write.
This requires that we keep the 'rdev', rather than the 'bdev' in the
bio, then swap in the bdev just before final submission.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When adding devices to, or removing device from, an array we need to
update the metadata. However we don't need to do it synchronously as
data integrity doesn't depend on these changes being recorded
instantly. So avoid the synchronous call to md_update_sb and just set
a flag so that the thread will do it.
This can reduce the number of updates performed when lots of devices
are being added or removed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We can calculate this offset by using ctx->meta_total_blocks,
without passing in from the function
Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
This makes md/raid0 much less verbose as the messages about
the array geometry are now pr_debug()
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Also remove all messages about memory allocation failure.
page_alloc() reports those.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Follow err/warn distinction introduced in md.c
Join multi-part strings into single string.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
1/ using pr_debug() for a number of messages reduces the noise of
md, but still allows them to be enabled when needed.
2/ try to be consistent in the usage of pr_err() and pr_warn(), and
document the intention
3/ When strings have been split onto multiple lines, rejoin into
a single string.
The cost of having lines > 80 chars is less than the cost of not
being able to easily search for a particular message.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
1/ don't print a warning if allocation fails.
page_alloc() does that already.
2/ always check return status for error.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
It is possible that bitmap_storage_alloc could return -ENOMEM,
and some member inside store could be allocated such as filemap.
To avoid memory leak, we need to call bitmap_file_unmap to free
those members in the bitmap_resize.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Revert commit 11367799f3 ("md: Prevent IO hold during accessing to faulty
raid5 array") as it doesn't comply with commit c3cce6cda1 ("md/raid5:
ensure device failure recorded before write request returns."). That change
is not required anymore as the problem is resolved by commit 16f889499a
("md: report 'write_pending' state when array in sync") - read request is
stuck as array state is not reported correctly via sysfs attribute.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When raid1/raid10 array fails to write to one of the drives, the request
is added to bio_end_io_list and finished by personality thread. The
thread doesn't handle it as long as MD_CHANGE_PENDING flag is set. In
case of external metadata this flag is cleared, however the thread is
not woken up. It causes request to be blocked for few seconds (until
another action on the array wakes up the thread) or to get stuck
indefinitely.
Wake up personality thread once MD_CHANGE_PENDING has been cleared.
Moving 'restart_array' call after the flag is cleared it not a solution
because in read-write mode the call doesn't wake up the thread.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If external metadata handler supports bad blocks and unacknowledged bad
blocks are present, don't report disk via sysfs as faulty. Such
situation can be still handled so disk just has to be blocked for a
moment. It makes it consistent with kernel state as corresponding rdev
flag is also not set.
When the disk in being unblocked there are few cases:
1. Disk has been in blocked and faulty state, it is being unblocked but
it still remains in faulty state. Metadata handler will remove it from
array in the next call.
2. There is no bad block support in external metadata handler and bad
blocks are present - put the disk in blocked and faulty state (see
case 1).
3. There is bad block support in external metadata handler and all bad
blocks are acknowledged - clear all flags, continue.
4. There is bad block support in external metadata handler but there are
still unacknowledged bad blocks - clear all flags, continue. It is fine
to clear Blocked flag because it was probably not set anyway (if it was
it is case 1). BlockedBadBlocks flag can also be cleared because the
request waiting for it will set it again when it finds out that some bad
block is still not acknowledged. Recovery is not necessary but there are
no problems if the flag is set. Sysfs rdev state is still reported as
blocked (due to unacknowledged bad blocks) so metadata handler will
process remaining bad blocks and unblock disk again.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Add new rdev flag which external metadata handler can use to switch
on/off bad block support. If new bad block is encountered, notify it via
rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been
cleared, notify update to rdev 'bad_blocks' sysfs file.
When bad blocks support is being removed, just clear rdev flag. It is
not necessary to reset badblocks->shift field. If there are bad blocks
cleared or added at the same time, it is ok for those changes to be
applied to the structure. The array is in blocked state and the drive
which cannot handle bad blocks any more will be removed from the array
before it is unlocked.
Simplify state_show function by adding a separator at the end of each
string and overwrite last separator with new line.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Pull MD fixes from Shaohua Li:
"There are several bug fixes queued:
- fix raid5-cache recovery bugs
- fix discard IO error handling for raid1/10
- fix array sync writes bogus position to superblock
- fix IO error handling for raid array with external metadata"
* tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md: be careful not lot leak internal curr_resync value into metadata. -- (all)
raid1: handle read error also in readonly mode
raid5-cache: correct condition for empty metadata write
md: report 'write_pending' state when array in sync
md/raid5: write an empty meta-block when creating log super-block
md/raid5: initialize next_checkpoint field before use
RAID10: ignore discard error
RAID1: ignore discard error
Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request()
calls have stopped before setting the "queue stopped" flag. This
allows to remove the "queue stopped" test from dm_mq_queue_rq() and
dm_mq_requeue_request(). This patch fixes a race condition because
dm_mq_queue_rq() is called without holding the queue lock and hence
BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is
in progress. This patch prevents that the following hang occurs
sporadically when using dm-mq:
INFO: task systemd-udevd:10111 blocked for more than 480 seconds.
Call Trace:
[<ffffffff8161f397>] schedule+0x37/0x90
[<ffffffff816239ef>] schedule_timeout+0x27f/0x470
[<ffffffff8161e76f>] io_schedule_timeout+0x9f/0x110
[<ffffffff8161fb36>] bit_wait_io+0x16/0x60
[<ffffffff8161f929>] __wait_on_bit_lock+0x49/0xa0
[<ffffffff8114fe69>] __lock_page+0xb9/0xc0
[<ffffffff81165d90>] truncate_inode_pages_range+0x3e0/0x760
[<ffffffff81166120>] truncate_inode_pages+0x10/0x20
[<ffffffff81212a20>] kill_bdev+0x30/0x40
[<ffffffff81213d41>] __blkdev_put+0x71/0x360
[<ffffffff81214079>] blkdev_put+0x49/0x170
[<ffffffff812141c0>] blkdev_close+0x20/0x30
[<ffffffff811d48e8>] __fput+0xe8/0x1f0
[<ffffffff811d4a29>] ____fput+0x9/0x10
[<ffffffff810842d3>] task_work_run+0x83/0xb0
[<ffffffff8106606e>] do_exit+0x3ee/0xc40
[<ffffffff8106694b>] do_group_exit+0x4b/0xc0
[<ffffffff81073d9a>] get_signal+0x2ca/0x940
[<ffffffff8101bf43>] do_signal+0x23/0x660
[<ffffffff810022b3>] exit_to_usermode_loop+0x73/0xb0
[<ffffffff81002cb0>] syscall_return_slowpath+0xb0/0xc0
[<ffffffff81624e33>] entry_SYSCALL_64_fastpath+0xa6/0xa8
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
in the dm start and stop queue functions, only manipulate the latter
flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since blk_mq_requeue_work() no longer restarts stopped queues
canceling requeue work is no longer needed to prevent that a
stopped queue would be restarted. Hence remove this function.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since blk_mq_requeue_work() starts stopped queues and since
execution of this function can be scheduled after a queue has
been stopped it is not possible to stop queues without using
an additional state variable to track whether or not the queue
has been stopped. Hence modify blk_mq_requeue_work() such that it
does not start stopped queues. My conclusion after a review of
the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
callers is as follows:
* In the dm driver starting and stopping queues should only happen
if __dm_suspend() or __dm_resume() is called and not if the
requeue list is processed.
* In the SCSI core queue stopping and starting should only be
performed by the scsi_internal_device_block() and
scsi_internal_device_unblock() functions but not by any other
function. Although the blk_mq_stop_hw_queue() call in
scsi_queue_rq() may help to reduce CPU load if a LLD queue is
full, figuring out whether or not a queue should be restarted
when requeueing a command would require to introduce additional
locking in scsi_mq_requeue_cmd() to avoid a race with
scsi_internal_device_block(). Avoid this complexity by removing
the blk_mq_stop_hw_queue() call from scsi_queue_rq().
* In the NVMe core only the functions that call
blk_mq_start_stopped_hw_queues() explicitly should start stopped
queues.
* A blk_mq_start_stopped_hwqueues() call must be added in the
xen-blkfront driver in its blkif_recover() function.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
(and remove one layer of masking for the op_is_write call next to it).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
mddev->curr_resync usually records where the current resync is up to,
but during the starting phase it has some "magic" values.
1 - means that the array is trying to start a resync, but has yielded
to another array which shares physical devices, and also needs to
start a resync
2 - means the array is trying to start resync, but has found another
array which shares physical devices and has already started resync.
3 - means that resync has commensed, but it is possible that nothing
has actually been resynced yet.
It is important that this value not be visible to user-space and
particularly that it doesn't get written to the metadata, as the
resync or recovery checkpoint. In part, this is because it may be
slightly higher than the correct value, though this is very rare.
In part, because it is not a multiple of 4K, and some devices only
support 4K aligned accesses.
There are two places where this value is propagates into either
->curr_resync_completed or ->recovery_cp or ->recovery_offset.
These currently avoid the propagation of values 1 and 3, but will
allow 3 to leak through.
Change them to only propagate the value if it is > 3.
As this can cause an array to fail, the patch is suitable for -stable.
Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Viswesh <viswesh.vichu@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If write is the first operation on a disk and it happens not to be
aligned to page size, block layer sends read request first. If read
operation fails, the disk is set as failed as no attempt to fix the
error is made because array is in auto-readonly mode. Similarily, the
disk is set as failed for read-only array.
Take the same approach as in raid10. Don't fail the disk if array is in
readonly or auto-readonly mode. Try to redirect the request first and if
unsuccessful, return a read error.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
As long as we recover one metadata block, we should write the empty metadata
write. The original code could make recovery corrupted if only one meta is
valid.
Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
- A couple .request_fn request-based DM NULL pointer fixes
- A fix for a DM target reference count leak, on target load error, that
prevented associated DM target kernel module(s) from being removed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJYEo+lAAoJEMUj8QotnQNaGfkH/jGqr4bj4l2Ty3QgV95fYW7+
lqp4Flkevm35HotEGKuuizvqbbVrj57BCGLE+dV48/X2cv5QbUFht6QBu9iJTrk6
Q7VqyBOvDDnOZHIof5CfKBeLZ2gd8YHZwUpYvzJcThSWS1+LjeVqg8a33LMZroMQ
rghVxFCIKy6LqCryIiTHk1t+OfmuBz3S2LXcQXFY7XAPpWq/f+V66gthTZUpm86+
Gu1xOHQlvnmf5xnDUxCpPVbQNY334D/aSbU73i2cdvfL1pkxBFNcI+LbPcu+sNP9
ugGjPj4etbIRsVysuW3fLhn2kKqaXXVuD1rLTQ+C3ytciI+RQJvG892gWhAABRQ=
=apHk
-----END PGP SIGNATURE-----
Merge tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- a couple DM raid and DM mirror fixes
- a couple .request_fn request-based DM NULL pointer fixes
- a fix for a DM target reference count leak, on target load error,
that prevented associated DM target kernel module(s) from being
removed
* tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm table: fix missing dm_put_target_type() in dm_table_add_target()
dm rq: clear kworker_task if kthread_run() returned an error
dm: free io_barrier after blk_cleanup_queue call
dm raid: fix activation of existing raid4/10 devices
dm mirror: use all available legs on multiple failures
dm mirror: fix read error on recovery after default leg failure
dm raid: fix compat_features validation
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.
In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.
Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.
This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If there is a bad block on a disk and there is a recovery performed from
this disk, the same bad block is reported for a new disk. It involves
setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external
metadata this flag is not being cleared as array state is reported as
'clean'. The read request to bad block in RAID5 array gets stuck as it
is waiting for a flag to be cleared - as per commit c3cce6cda1
("md/raid5: ensure device failure recorded before write request
returns.").
The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been
clarified in commit 070dc6dd71 ("md: resolve confusion of
MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in
personality error handlers since and it doesn't fully comply with
initial purpose. It was supposed to notify that write request is about
to start, however now it is also used to request metadata update.
Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has
been set and in_sync has been set to 0 at the same time. Error handlers
just set the flag without modifying in_sync value. Sysfs array state is
a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is
set and in_sync is set to 1. Userspace has no idea it is expected to
take some action.
Swap the order that array state is checked so 'write_pending' is
reported ahead of 'clean' ('write_pending' is a misleading name but it
is too late to rename it now).
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If superblock points to an invalid meta block, r5l_load_log will set
create_super with true and create an new superblock, this runtime path
would always happen if we do no writing I/O to this array since it was
created. Writing an empty meta block could avoid this unnecessary
action at the first time we created log superblock.
Another reason is for the corretness of log recovery. Currently we have
bellow code to guarantee log revocery to be correct.
if (ctx.seq > log->last_cp_seq + 1) {
int ret;
ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
if (ret)
return ret;
log->seq = ctx.seq + 11;
log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
r5l_write_super(log, ctx.pos);
} else {
log->log_start = ctx.pos;
log->seq = ctx.seq;
}
If we just created a array with a journal device, log->log_start and
log->last_checkpoint should all be 0, then we write three meta block
which are valid except mid one and supposed crash happened. The ctx.seq
would equal to log->last_cp_seq + 1 and log->log_start would be set to
position of mid invalid meta block after we did a recovery, this will
lead to problems which could be avoided with this patch.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
No initial operation was done to this field when we
load/recovery the log, it got assignment only when IO
to raid disk was finished. So r5l_quiesce may use wrong
next_checkpoint to reclaim log space, that would make
reclaimable space calculation confused.
Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
This is the counterpart of raid10 fix. If a write error occurs, raid10
will try to rewrite the bio in small chunk size. If the rewrite fails,
raid10 will record the error in bad block. narrow_write_error will
always use WRITE for the bio, but actually it could be a discard. Since
discard bio hasn't payload, write the bio will cause different issues.
But discard error isn't fatal, we can safely ignore it. This is what
this patch does.
This issue should exist since discard is added, but only exposed with
recent arbitrary bio size feature.
Cc: Sitsofe Wheeler <sitsofe@gmail.com>
Cc: stable@vger.kernel.org (v3.6)
Signed-off-by: Shaohua Li <shli@fb.com>
If a write error occurs, raid1 will try to rewrite the bio in small
chunk size. If the rewrite fails, raid1 will record the error in bad
block. narrow_write_error will always use WRITE for the bio, but
actually it could be a discard. Since discard bio hasn't payload, write
the bio will cause different issues. But discard error isn't fatal, we
can safely ignore it. This is what this patch does.
This issue should exist since discard is added, but only exposed with
recent arbitrary bio size feature.
Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com>
Cc: stable@vger.kernel.org (v3.6)
Signed-off-by: Shaohua Li <shli@fb.com>
dm_get_target_type() was previously called so any error returned from
dm_table_add_target() must first call dm_put_target_type(). Otherwise
the DM target module's reference count will leak and the associated
kernel module will be unable to be removed.
Also, leverage the fact that r is already -EINVAL and remove an extra
newline.
Fixes: 36a0456 ("dm table: add immutable feature")
Fixes: cc6cbe1 ("dm table: add always writeable feature")
Fixes: 3791e2f ("dm table: add singleton feature")
Cc: stable@vger.kernel.org # 3.2+
Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
cleanup_mapped_device() calls kthread_stop() if kworker_task is
non-NULL. Currently the assigned value could be a valid task struct or
an error code (e.g -ENOMEM). Reset md->kworker_task to NULL if
kthread_run() returned an erorr.
Fixes: 7193a9defc ("dm rq: check kthread_run return for .request_fn request-based DM")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_old_request_fn() has paths that access md->io_barrier. The party
destroying io_barrier should ensure that no future execution of
dm_old_request_fn() is possible. Move io_barrier destruction to below
blk_cleanup_queue() to ensure this and avoid a NULL pointer crash during
request-based DM device shutdown.
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm-raid 1.9.0 fails to activate existing RAID4/10 devices that have the
old superblock format (which does not have takeover/reshaping support
that was added via commit 33e53f0685).
Fix validation path for old superblocks by reverting to the old raid4
layout and basing checks on mddev->new_{level,layout,...} members in
super_init_validation().
Cc: stable@vger.kernel.org # 4.8
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When any leg(s) have failed, any read will cause a new operational
default leg to be selected and the read is resubmitted to it. If that
new default leg fails the read too, no other still accessible legs are
used to resubmit the read again -- thus failing the io.
Fix by allowing the read to get resubmitted until all operational legs
have been exhausted. Also, remove any details.bi_dev use as a flag.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If a default leg has failed, any read will cause a new operational
default leg to be selected and the read is resubmitted. But until now
the read will return failure even though it was successful due to
resubmission. The reason for this is bio->bi_error was not being
cleared before resubmitting the bio.
Fix by clearing bio->bi_error before resubmission.
Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A good practice is to prefix the names of functions by the name
of the subsystem.
The kthread worker API is a mix of classic kthreads and workqueues. Each
worker has a dedicated kthread. It runs a generic function that process
queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:
__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.
Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:
+ "init" in the function names stands for the verb "initialize"
aka "initialize worker". While "INIT" in the macro names
stands for the noun "INITIALIZER" aka "worker initializer".
+ INIT() macros are used only in DEFINE() macros
+ init() functions are used close to the other kthread()
functions. It looks much better if all the functions
use the same scheme.
+ There will be also kthread_destroy_worker() that will
be used close to kthread_cancel_work(). It is related
to the init() function. Again it looks better if all
functions use the same naming scheme.
+ there are several precedents for such init() function
names, e.g. amd_iommu_init_device(), free_area_init_node(),
jump_label_init_type(), regmap_init_mmio_clk(),
+ It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ecbfb9f118 ("dm raid: add raid level takeover support") a new
compatible feature flag was added. Validation for these compat_features
was added but this only passes for new raid mappings with this feature
flag. This causes previously created raid mappings to be failed at
import.
Check compat_features for the only valid combination.
Fixes: ecbfb9f118 ("dm raid: add raid level takeover support")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"
* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event
. add support for delaying the requeue of requests; used by DM multipath
when all paths have failed and 'queue_if_no_path' is enabled
. DM cache improvements to speedup the loading metadata and the writing
of the hint array
. fix potential for a dm-crypt crash on device teardown
. remove dm_bufio_cond_resched() and just using cond_resched()
. change DM multipath to return a reservation conflict error
immediately; rather than failing the path and retrying (potentially
indefinitely)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJX7n9KAAoJEMUj8QotnQNab74IANm+rW2uYdpLNCxWUmcaih0d
BK8dLS/Mz35S0TRSekvynuBcPx18VP2Zueulc+aHTWcT4sj79l6KnVYT9g6c98rL
zzcv10QTteqhiiWwFmPHsZgv5dW8Y5wiRdt+SqcQ5sAHMFci6C05gzp9caNu7VTs
fbcLUdyYm40y3j84Lx/+ABXgnBhq+40OTtdnYSkEmLtdscPLzwpHgPmMctkrEl7e
7mqGC1KbDDzartqOZOeGP2P2qOCNN21qA+8ctMw9Xyze33uwvj7Vx6cro6e28wMm
ZClY9XNGlfuW9dCNtFR9o6NXS6NIK30UJbKqyZPPsK+70JrOgzh6GzQnwSXdyNs=
=7SkG
-----END PGP SIGNATURE-----
Merge tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- various fixes and cleanups for request-based DM core
- add support for delaying the requeue of requests; used by DM
multipath when all paths have failed and 'queue_if_no_path' is
enabled
- DM cache improvements to speedup the loading metadata and the writing
of the hint array
- fix potential for a dm-crypt crash on device teardown
- remove dm_bufio_cond_resched() and just using cond_resched()
- change DM multipath to return a reservation conflict error
immediately; rather than failing the path and retrying (potentially
indefinitely)
* tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
dm mpath: always return reservation conflict without failing over
dm bufio: remove dm_bufio_cond_resched()
dm crypt: fix crash on exit
dm cache metadata: switch to using the new cursor api for loading metadata
dm array: introduce cursor api
dm btree: introduce cursor api
dm cache policy smq: distribute entries to random levels when switching to smq
dm cache: speed up writing of the hint array
dm array: add dm_array_new()
dm mpath: delay the requeue of blk-mq requests while all paths down
dm mpath: use dm_mq_kick_requeue_list()
dm rq: introduce dm_mq_kick_requeue_list()
dm rq: reduce arguments passed to map_request() and dm_requeue_original_request()
dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests
dm: convert wait loops to use autoremove_wake_function()
dm: use signal_pending_state() in dm_wait_for_completion()
dm: rename task state function arguments
dm: add two lockdep_assert_held() statements
dm rq: simplify dm_old_stop_queue()
dm mpath: check if path's request_queue is dying in activate_path()
...
Pull block layer updates from Jens Axboe:
"This is the main pull request for block layer changes in 4.9.
As mentioned at the last merge window, I've changed things up and now
do just one branch for core block layer changes, and driver changes.
This avoids dependencies between the two branches. Outside of this
main pull request, there are two topical branches coming as well.
This pull request contains:
- A set of fixes, and a conversion to blk-mq, of nbd. From Josef.
- Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
Followup dependency fix from Geert.
- General fixes from Bart, Baoyou, Guoqing, and Linus W.
- CFQ async write starvation fix from Glauber.
- Add supprot for delayed kick of the requeue list, from Mike.
- Pull out the scalable bitmap code from blk-mq-tag.c and make it
generally available under the name of sbitmap. Only blk-mq-tag uses
it for now, but the blk-mq scheduling bits will use it as well.
From Omar.
- bdev thaw error progagation from Pierre.
- Improve the blk polling statistics, and allow the user to clear
them. From Stephen.
- Set of minor cleanups from Christoph in block/blk-mq.
- Set of cleanups and optimizations from me for block/blk-mq.
- Various nvme/nvmet/nvmeof fixes from the various folks"
* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
fs/block_dev.c: return the right error in thaw_bdev()
nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
nvme/scsi: Remove power management support
nvmet: Make dsm number of ranges zero based
nvmet: Use direct IO for writes
admin-cmd: Added smart-log command support.
nvme-fabrics: Add host_traddr options field to host infrastructure
nvme-fabrics: revise host transport option descriptions
nvme-fabrics: rework nvmf_get_address() for variable options
nbd: use BLK_MQ_F_BLOCKING
blkcg: Annotate blkg_hint correctly
cfq: fix starvation of asynchronous writes
blk-mq: add flag for drivers wanting blocking ->queue_rq()
blk-mq: remove non-blocking pass in blk_mq_map_request
blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
block: export bio_free_pages to other modules
lightnvm: propagate device_add() error code
lightnvm: expose device geometry through sysfs
lightnvm: control life of nvm_dev in driver
blk-mq: register device instead of disk
...
Pull MD updates from Shaohua Li:
"This update includes:
- new AVX512 instruction based raid6 gen/recovery algorithm
- a couple of md-cluster related bug fixes
- fix a potential deadlock
- set nonrotational bit for raid array with SSD
- set correct max_hw_sectors for raid5/6, which hopefuly can improve
performance a little bit
- other minor fixes"
* tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md: set rotational bit
raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
raid5: handle register_shrinker failure
raid5: fix to detect failure of register_shrinker
md: fix a potential deadlock
md/bitmap: fix wrong cleanup
raid5: allow arbitrary max_hw_sectors
lib/raid6: Add AVX512 optimized xor_syndrome functions
lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
lib/raid6: Add AVX512 optimized recovery functions
lib/raid6: Add AVX512 optimized gen_syndrome functions
md-cluster: make resync lock also could be interruptted
md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
md-cluster: convert the completion to wait queue
md-cluster: protect md_find_rdev_nr_rcu with rcu lock
md-cluster: clean related infos of cluster
md: changes for MD_STILL_CLOSED flag
md-cluster: remove some unnecessary dlm_unlock_sync
md-cluster: use FORCEUNLOCK in lockres_free
md-cluster: call md_kick_rdev_from_array once ack failed
Pull CPU hotplug updates from Thomas Gleixner:
"Yet another batch of cpu hotplug core updates and conversions:
- Provide core infrastructure for multi instance drivers so the
drivers do not have to keep custom lists.
- Convert custom lists to the new infrastructure. The block-mq custom
list conversion comes through the block tree and makes the diffstat
tip over to more lines removed than added.
- Handle unbalanced hotplug enable/disable calls more gracefully.
- Remove the obsolete CPU_STARTING/DYING notifier support.
- Convert another batch of notifier users.
The relayfs changes which conflicted with the conversion have been
shipped to me by Andrew.
The remaining lot is targeted for 4.10 so that we finally can remove
the rest of the notifiers"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
cpufreq: Fix up conversion to hotplug state machine
blk/mq: Reserve hotplug states for block multiqueue
x86/apic/uv: Convert to hotplug state machine
s390/mm/pfault: Convert to hotplug state machine
mips/loongson/smp: Convert to hotplug state machine
mips/octeon/smp: Convert to hotplug state machine
fault-injection/cpu: Convert to hotplug state machine
padata: Convert to hotplug state machine
cpufreq: Convert to hotplug state machine
ACPI/processor: Convert to hotplug state machine
virtio scsi: Convert to hotplug state machine
oprofile/timer: Convert to hotplug state machine
block/softirq: Convert to hotplug state machine
lib/irq_poll: Convert to hotplug state machine
x86/microcode: Convert to hotplug state machine
sh/SH-X3 SMP: Convert to hotplug state machine
ia64/mca: Convert to hotplug state machine
ARM/OMAP/wakeupgen: Convert to hotplug state machine
ARM/shmobile: Convert to hotplug state machine
arm64/FP/SIMD: Convert to hotplug state machine
...
if all disks in an array are non-rotational, set the array
non-rotational.
This only works for array with all disks populated at startup. Support
for disk hotadd/hotremove could be added later if necessary.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
If dm-mpath encounters an reservation conflict it should not fail the
path (as communication with the target is not affected) but should
rather retry on another path. However, in doing so we might be inducing
a ping-pong between paths, with no guarantee of any forward progress.
And arguably a reservation conflict is an unexpected error, so we should
be passing it upwards to allow the application to take appropriate
steps.
This change resolves a show-stopper problem seen with the pNFS SCSI
layout because it is trivial to hit reservation conflict based failover
loops without it.
Doubts were raised about the implications of this change relative to
products like IBM's SVC. But there is little point withholding a fix
for Linux because a proprietary product may or may not have some issues
in its implementation of how it interfaces with Linux. In the future,
if there is glaring evidence that this change is certainly problematic
we can revisit it.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
Use cond_resched() like everybody else.
Mikulas explained why dm_bufio_cond_resched() was introduced to begin
with (hopefully cond_resched can be improved accordingly) here:
https://www.redhat.com/archives/dm-devel/2016-September/msg00112.html
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com> # added last comment in header
As the documentation for kthread_stop() says, "if threadfn() may call
do_exit() itself, the caller must ensure task_struct can't go away".
dm-crypt does not ensure this and therefore crashes when crypt_dtr()
calls kthread_stop(). The crash is trivially reproducible by adding a
delay before the call to kthread_stop() and just opening and closing a
dm-crypt device.
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7
task: ffff88003bd0df40 task.stack: ffff8800375b4000
RIP: 0010: kthread_stop+0x52/0x300
Call Trace:
crypt_dtr+0x77/0x120
dm_table_destroy+0x6f/0x120
__dm_destroy+0x130/0x250
dm_destroy+0x13/0x20
dev_remove+0xe6/0x120
? dev_suspend+0x250/0x250
ctl_ioctl+0x1fc/0x530
? __lock_acquire+0x24f/0x1b10
dm_ctl_ioctl+0x13/0x20
do_vfs_ioctl+0x91/0x6a0
? ____fput+0xe/0x10
? entry_SYSCALL_64_fastpath+0x5/0xbd
? trace_hardirqs_on_caller+0x151/0x1e0
SyS_ioctl+0x41/0x70
entry_SYSCALL_64_fastpath+0x1f/0xbd
This problem was introduced by bcbd94ff48 ("dm crypt: fix a possible
hang due to race condition on exit").
Looking at the description of that patch (excerpted below), it seems
like the problem it addresses can be solved by just using
set_current_state instead of __set_current_state, since we obviously
need the memory barrier.
| dm crypt: fix a possible hang due to race condition on exit
|
| A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
| __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
| It is possible that the processor reorders memory accesses so that
| kthread_should_stop() is executed before __set_current_state(). If
| such reordering happens, there is a possible race on thread
| termination: [...]
So this patch just reverts the aforementioned patch and changes the
__set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...). This
fixes the crash and should also fix the potential hang.
Fixes: bcbd94ff48 ("dm crypt: fix a possible hang due to race condition on exit")
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This change offers a pretty significant performance improvement.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
More efficient way to iterate an array due to prefetching (makes use of
the new dm_btree_cursor_* api).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This uses prefetching to speed up iteration through a btree.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
For smq the 32 bit 'hint' stores the multiqueue level that the entry
should be stored in. If a different policy has been used previously,
and then switched to smq, the hints will be invalid. In which case we
used to put all entries in the bottom level of the multiqueue, and then
redistribute. Redistribution is faster if we put entries with invalid
hints in random levels initially.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
It's far quicker to always delete the hint array and recreate with
dm_array_new() because we avoid the copying caused by mutation.
Also simplifies the policy interface, replacing the walk_hints() with
the simpler get_hint().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_array_new() creates a new, populated array more efficiently than
starting with an empty one and resizing.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
bio_free_pages is introduced in commit 1dfa0f68c0
("block: add a helper to free bio bounce buffer pages"),
we can reuse the func in other modules after it was
imported.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
register_shrinker() now can fail. When it happens, shrinker.nr_deferred is
null. We use it to determine if unregister_shrinker is required.
Signed-off-by: Shaohua Li <shli@fb.com>
register_shrinker can fail after commit 1d3d4437ea ("vmscan: per-node
deferred work"), we should detect the failure of it, otherwise we may
fail to register shrinker after raid5 configuration was setup successfully.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
if bitmap_create fails, the bitmap is already cleaned up and the returned value
is an error number. We can't do the cleanup again.
Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Shaohua Li <shli@fb.com>
raid5 will split bio to proper size internally, there is no point to use
underlayer disk's max_hw_sectors. In my qemu system, without the change,
the raid5 only receives 128k size bio, which reduces the chance of bio
merge sending to underlayer disks.
Signed-off-by: Shaohua Li <shli@fb.com>
When one node is perform resync or recovery, other nodes
can't get resync lock and could block for a while before
it holds the lock, so we can't stop array immediately for
this scenario.
To make array could be stop quickly, we check MD_CLOSING
in dlm_lock_sync_interruptible to make us can interrupt
the lock request.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When some node leaves cluster, then it's bitmap need to be
synced by another node, so "md*_recover" thread is triggered
for the purpose. However, with below steps. we can find tasks
hang happened either in B or C.
1. Node A create a resyncing cluster raid1, assemble it in
other two nodes (B and C).
2. stop array in B and C.
3. stop array in A.
linux44:~ # ps aux|grep md|grep D
root 5938 0.0 0.1 19852 1964 pts/0 D+ 14:52 0:00 mdadm -S md0
root 5939 0.0 0.0 0 0 ? D 14:52 0:00 [md0_recover]
linux44:~ # cat /proc/5939/stack
[<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster]
[<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster]
[<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod]
[<ffffffff8107ad94>] kthread+0xb4/0xc0
[<ffffffff8152a518>] ret_from_fork+0x58/0x90
linux44:~ # cat /proc/5938/stack
[<ffffffff8107afde>] kthread_stop+0x6e/0x120
[<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod]
[<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster]
[<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod]
[<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod]
[<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod]
[<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod]
[<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0
[<ffffffff811dd3dd>] block_ioctl+0x3d/0x40
[<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0
[<ffffffff811b9538>] SyS_ioctl+0x88/0xa0
[<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b
The problem is caused by recover_bitmaps can't reliably abort
when the thread is unregistered. So dlm_lock_sync_interruptible
is introduced to detect the thread's situation to fix the problem.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Previously, we used completion to sync between require dlm lock
and sync_ast, however we will have to expose completion.wait
and completion.done in dlm_lock_sync_interruptible (introduced
later), it is not a common usage for completion, so convert
related things to wait queue.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
We need to use rcu_read_lock/unlock to avoid potential
race.
Reported-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
cluster_info and bitmap_info.nodes also need to be
cleared when array is stopped.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When stop clustered raid while it is pending on resync,
MD_STILL_CLOSED flag could be cleared since udev rule
is triggered to open the mddev. So obviously array can't
be stopped soon and returns EBUSY.
mdadm -Ss md-raid-arrays.rules
set MD_STILL_CLOSED md_open()
... ... ... clear MD_STILL_CLOSED
do_md_stop
We make below changes to resolve this issue:
1. rename MD_STILL_CLOSED to MD_CLOSING since it is set
when stop array and it means we are stopping array.
2. let md_open returns early if CLOSING is set, so no
other threads will open array if one thread is trying
to close it.
3. no need to clear CLOSING bit in md_open because 1 has
ensure the bit is cleared, then we also don't need to
test CLOSING bit in do_md_stop.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Since DLM_LKF_FORCEUNLOCK is used in lockres_free,
we don't need to call dlm_unlock_sync before free
lock resource.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
For dlm_unlock, we need to pass flag to dlm_unlock as the
third parameter instead of set res->flags.
Also, DLM_LKF_FORCEUNLOCK is more suitable for dlm_unlock
since it works even the lock is on waiting or convert queue.
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The new_disk_ack could return failure if WAITING_FOR_NEWDISK
is not set, so we need to kick the dev from array in case
failure happened.
And we missed to check err before call new_disk_ack othwise
we could kick a rdev which isn't in array, thanks for the
reminder from Shaohua.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Enable devices without a gendisk instance to register itself with blk-mq
and expose the associated multi-queue sysfs entries.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Return DM_MAPIO_DELAY_REQUEUE from .clone_and_map_rq. Also, return
false from .busy, if all paths are down, so that blk-mq requests get
mapped via .clone_and_map_rq -- which results in DM_MAPIO_DELAY_REQUEUE
being returned to dm-rq.
This change allows for a noticeable reduction in cpu utilization
(reduced kworker load) while all paths are down, e.g.:
system CPU idleness (as measured by fio's --idle-prof=system):
before: system: 86.58%
after: system: 98.60%
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
When reinstating a path the blk-mq request_queue's requeue_list should
get kicked. It makes sense to kick the requeue_list as part of the
existing hook (previously only used by bio-based support).
Rename process_queued_bios_list to process_queued_io_list.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Make it possible for a request-based target to kick the DM device's
blk-mq request_queue's requeue_list.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
All drivers use the default, so provide an inline version of it. If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.
This provides better code generation, and better debugability as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Otherwise blk-mq will immediately dispatch requests that are requeued
via a BLK_MQ_RQ_QUEUE_BUSY return from blk_mq_ops .queue_rq.
Delayed requeue is implemented using blk_mq_delay_kick_requeue_list()
with a delay of 5 secs. In the context of DM multipath (all paths down)
it doesn't make any sense to requeue more quickly.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use autoremove_wake_function() instead of default_wake_function()
to make the dm wait loops more similar to other wait loops in the
kernel. This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use signal_pending_state() instead of open-coding it. This patch does
not change any functionality but makes it possible to pass TASK_KILLABLE
as the second argument of dm_wait_for_completion(). See also commit
16882c1e96 ("sched: fix TASK_WAKEKILL vs SIGKILL race").
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Rename 'interruptible' into 'task_state' to make it clear that this
argument is a task state instead of a boolean. Also, change type from
int to long.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Document the locking assumptions for the __bind() and __dm_suspend()
functions.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If pg_init_retries is set and a request is queued against a multipath
device with all underlying block device request_queues in the "dying"
state then an infinite loop is triggered because activate_path() never
succeeds and hence never calls pg_init_done().
This change avoids that device removal triggers an infinite loop by
failing the activate_path() which causes the "dying" path to be failed.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Every call of queue_flag_clear_unlocked() after block device
initialization has finished is wrong if blk_cleanup_queue() can be
called concurrently. Convert queue_flag_clear_unlocked() into
queue_flag_clear() and protect it by the block layer queue lock.
Also, factor out dm_mq_start_queue().
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Also, check that the blk-mq request_queue isn't already stopped.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
This avoids that new requests are queued while __dm_destroy() is in
progress.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
dm_resume() will return success (0) rather than -EINVAL if
!dm_suspended_md() upon retry within dm_resume().
Reset the error code at the start of dm_resume()'s retry loop.
Also, remove a useless assignment at the end of dm_resume().
Fixes: ffcc393641 ("dm: enhance internal suspend and resume interface")
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Introduce the bio_flags() macro. Ensure that the second argument of
bio_set_op_attrs() only contains flags and no operation. This patch
does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Chris Mason <clm@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Josef Bacik <jbacik@fb.com> (maintainer:BTRFS FILE SYSTEM)
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Damien Le Moal <damien.lemoal@hgst.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull MD fixes from Shaohua Li:
"A few bug fixes for MD:
- Guoqing fixed a bug compiling md-cluster in kernel
- I fixed a potential deadlock in raid5-cache superblock write, a
hang in raid5 reshape resume and a race condition introduced in
rc4"
* tag 'md/4.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
raid5: fix a small race condition
md-cluster: make md-cluster also can work when compiled into kernel
raid5: guarantee enough stripes to avoid reshape hang
raid5-cache: fix a deadlock in superblock write
commit 5f9d1fde7d54a5(raid5: fix memory leak of bio integrity data)
moves bio_reset to bio_endio. But it introduces a small race condition.
It does bio_reset after raid5_release_stripe, which could make the
stripe reusable and hence reuse the bio just before bio_reset. Moving
bio_reset before raid5_release_stripe is called should fix the race.
Reported-and-tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Signed-off-by: Shaohua Li <shli@fb.com>
The md-cluster is compiled as module by default,
if it is compiled by built-in way, then we can't
make md-cluster works.
[64782.630008] md/raid1:md127: active with 2 out of 2 mirrors
[64782.630528] md-cluster module not found.
[64782.630530] md127: Could not setup cluster service (-2)
Fixes: edb39c9 ("Introduce md_cluster_operations to handle cluster functions")
Cc: stable@vger.kernel.org (v4.1+)
Reported-by: Marc Smith <marc.smith@mcc.edu>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
generated by bcache)
- 2 other stable fixes for DM log-writes
- a stable fix for a DM crypt bug that could result in freeing pointers
from uninitialized memory in the tfm allocation error path
- a DM bufio cleanup to discontinue using create_singlethread_workqueue()
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXybwpAAoJEMUj8QotnQNaVjIIALIS2erGyUquUcFALyGzK0So
f3GUA3+o/1ttkzkHvDwdgPO0CscVsAp71hMN+3+GrPtXJZRoqlE/w2QfGLYHvV++
xZR4+kBYuKrlo7+ldvjEi4KI2YtZ541QyaRez7Vy8XKDBo54cFe9oUnGznOYIC+2
+oH0d2w933rrFgsUa3RFa+8Qyv2ch6SAhDhn6oy0vk7HhH8MIGQKMDQEHVRbgfJ9
kG45wakb4rDDzmxqT+ZyA8rNk4sV+WanNVfj/7mww/NZe4HW+O7xMJTVgUqczADu
Sny4VhQOk6w4rpooDeJ2djWHUi8THtX1W616Owu701fmQ9ttALEw0xiZXEOYzBA=
=v6+u
-----END PGP SIGNATURE-----
Merge tag 'dm-4.8-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- a stable fix in both DM crypt and DM log-writes for too large bios
(as generated by bcache)
- two other stable fixes for DM log-writes
- a stable fix for a DM crypt bug that could result in freeing pointers
from uninitialized memory in the tfm allocation error path
- a DM bufio cleanup to discontinue using create_singlethread_workqueue()
* tag 'dm-4.8-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: remove use of deprecated create_singlethread_workqueue()
dm crypt: fix free of bad values after tfm allocation failure
dm crypt: fix error with too large bios
dm log writes: fix check of kthread_run() return value
dm log writes: fix bug with too large bios
dm log writes: move IO accounting earlier to fix error path
If there aren't enough stripes, reshape will hang. We have a check for
this in new reshape, but miss it for reshape resume, hence we could see
hang in reshape resume. This patch forces enough stripes existed if
reshape resumes.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
There is a potential deadlock in superblock write. Discard could zero data, so
before discard we must make sure superblock is updated to new log tail.
Updating superblock (either directly call md_update_sb() or depend on md
thread) must hold reconfig mutex. On the other hand, raid5_quiesce is called
with reconfig_mutex hold. The first step of raid5_quiesce() is waitting for all
IO finish, hence waitting for reclaim thread, while reclaim thread is calling
this function and waitting for reconfig mutex. So there is a deadlock. We
workaround this issue with a trylock. The downside of the solution is we could
miss discard if we can't take reconfig mutex. But this should happen rarely
(mainly in raid array stop), so miss discard shouldn't be a big problem.
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The workqueue "dm_bufio_wq" queues a single work item &dm_bufio_work so
it doesn't require execution ordering. Hence, alloc_workqueue() has
been used to replace the deprecated create_singlethread_workqueue().
The WQ_MEM_RECLAIM flag has been set since DM requires forward progress
under memory pressure.
Since there are fixed number of work items, explicit concurrency
limit is unnecessary here.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
If crypt_alloc_tfms() had to allocate multiple tfms and it failed before
the last allocation, then it would call crypt_free_tfms() and could free
pointers from uninitialized memory -- due to the crypt_free_tfms() check
for non-zero cc->tfms[i]. Fix by allocating zeroed memory.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
When dm-crypt processes writes, it allocates a new bio in
crypt_alloc_buffer(). The bio is allocated from a bio set and it can
have at most BIO_MAX_PAGES vector entries, however the incoming bio can be
larger (e.g. if it was allocated by bcache). If the incoming bio is
larger, bio_alloc_bioset() fails and an error is returned.
To avoid the error, we test for a too large bio in the function
crypt_map() and use dm_accept_partial_bio() to split the bio.
dm_accept_partial_bio() trims the current bio to the desired size and
asks DM core to send another bio with the rest of the data.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.16+
The kthread_run() function returns either a valid task_struct or
ERR_PTR() value, check for NULL is invalid. This change fixes potential
for oops, e.g. in OOM situation.
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
bio_alloc() can allocate a bio with at most BIO_MAX_PAGES (256) vector
entries. However, the incoming bio may have more vector entries if it
was allocated by other means. For example, bcache submits bios with
more than BIO_MAX_PAGES entries. This results in bio_alloc() failure.
To avoid the failure, change the code so that it allocates bio with at
most BIO_MAX_PAGES entries. If the incoming bio has more entries,
bio_add_page() will fail and a new bio will be allocated - the code that
handles bio_add_page() failure already exists in the dm-log-writes
target.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Josef Bacik <jbacik@fb,com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v4.1+
Move log_one_block()'s atomic_inc(&lc->io_blocks) before bio_alloc() to
fix a bug that the target hangs if bio_alloc() fails. The error path
does put_io_block(lc), so atomic_inc(&lc->io_blocks) must occur before
invoking the error path to avoid underflow of lc->io_blocks.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Josef Bacik <jbacik@fb,com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Pull MD fixes from Shaohua Li:
"This includes several bug fixes:
- Alexey Obitotskiy fixed a hang for faulty raid5 array with external
management
- Song Liu fixed two raid5 journal related bugs
- Tomasz Majchrzak fixed a bad block recording issue and an
accounting issue for raid10
- ZhengYuan Liu fixed an accounting issue for raid5
- I fixed a potential race condition and memory leak with DIF/DIX
enabled
- other trival fixes"
* tag 'md/4.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
raid5: avoid unnecessary bio data set
raid5: fix memory leak of bio integrity data
raid10: record correct address of bad block
md-cluster: fix error return code in join()
r5cache: set MD_JOURNAL_CLEAN correctly
md: don't print the same repeated messages about delayed sync operation
md: remove obsolete ret in md_start_sync
md: do not count journal as spare in GET_ARRAY_INFO
md: Prevent IO hold during accessing to faulty raid5 array
MD: hold mddev lock to change bitmap location
raid5: fix incorrectly counter of conf->empty_inactive_list_nr
raid10: increment write counter after bio is split
didn't factor in expected 'drop_writes' behavior for read IO).
- A dm-log bio operation flags fix for the broader block changes that
were merged during the 4.8 merge window.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXwHX2AAoJEMUj8QotnQNaMdQIAJuCHedIKQxlsCH4BG20thwM
7+kPh68ZWOB5VYpVlm2sn0aJG0t2c2IsM2+AcQrwwcVsTjVkqu4s5XeqhBhkhvBE
xrRHdJU21K6ho3IFiMhscZYfhMGvptwddevOxnRLfCgBALTjWpCWCEeQWLe17QCt
klR0bvGckLp7dJavYmb/8MO7VqIQQufYCDjYqEdq4IQT+lKVf940X1bNx5+RpzAD
OCgFwmWFb1OWYsVKWnVqxL+QzQcIA84YpBMV+FKQSTDNTLYgDM1mPTxMOxVMCNLO
neCUh2WNetvoE9s69T/NmPkjzB3hNAmVhbuFT2SBJ7Bnf/lfxT4Zc6WYOeqqWKY=
=XAfD
-----END PGP SIGNATURE-----
Merge tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- another stable fix for DM flakey (that tweaks the previous fix that
didn't factor in expected 'drop_writes' behavior for read IO).
- a dm-log bio operation flags fix for the broader block changes that
were merged during the 4.8 merge window.
* tag 'dm-4.8-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm log: fix unitialized bio operation flags
dm flakey: fix reads to be issued if drop_writes configured
Pull block fixes from Jens Axboe:
"Here's a set of block fixes for the current 4.8-rc release. This
contains:
- a fix for a secure erase regression, from Adrian.
- a fix for an mmc use-after-free bug regression, also from Adrian.
- potential zero pointer deference in bdev freezing, from Andrey.
- a race fix for blk_set_queue_dying() from Bart.
- a set of xen blkfront fixes from Bob Liu.
- three small fixes for bcache, from Eric and Kent.
- a fix for a potential invalid NVMe state transition, from Gabriel.
- blk-mq CPU offline fix, preventing us from issuing and completing a
request on the wrong queue. From me.
- revert two previous floppy changes, since they caused a user
visibile regression. A better fix is in the works.
- ensure that we don't send down bios that have more than 256
elements in them. Fixes a crash with bcache, for example. From
Ming.
- a fix for deferencing an error pointer with cgroup writeback.
Fixes a regression. From Vegard"
* 'for-linus' of git://git.kernel.dk/linux-block:
mmc: fix use-after-free of struct request
Revert "floppy: refactor open() flags handling"
Revert "floppy: fix open(O_ACCMODE) for ioctl-only open"
fs/block_dev: fix potential NULL ptr deref in freeze_bdev()
blk-mq: improve warning for running a queue on the wrong CPU
blk-mq: don't overwrite rq->mq_ctx
block: make sure a big bio is split into at most 256 bvecs
nvme: Fix nvme_get/set_features() with a NULL result pointer
bdev: fix NULL pointer dereference
xen-blkfront: free resources if xlvbd_alloc_gendisk fails
xen-blkfront: introduce blkif_set_queue_limits()
xen-blkfront: fix places not updated after introducing 64KB page granularity
bcache: pr_err: more meaningful error message when nr_stripes is invalid
bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.
bcache: register_bcache(): call blkdev_put() when cache_alloc() fails
block: Fix race triggered by blk_set_queue_dying()
block: Fix secure erase
nvme: Prevent controller state invalid transition
Commit e6047149db ("dm: use bio op accessors") switched DM over to
using bio_set_op_attrs() but didn't take care to initialize
lc->io_req.bi_op_flags in dm-log.c:rw_header(). This caused
rw_header()'s call to dm_io() to make bio->bi_op_flags be uninitialized
in dm-io.c:do_region(), which ultimately resulted in a SCSI BUG() in
sd_init_command().
Also, adjust rw_header() and its callers to use REQ_OP_{READ|WRITE}.
Fixes: e6047149db ("dm: use bio op accessors")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
v4.8-rc3 commit 99f3c90d0d ("dm flakey: error READ bios during the
down_interval") overlooked the 'drop_writes' feature, which is meant to
allow reads to be issued rather than errored, during the down_interval.
Fixes: 99f3c90d0d ("dm flakey: error READ bios during the down_interval")
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to
set them every time. bi_private will be set before the bio is
dispatched.
Signed-off-by: Shaohua Li <shli@fb.com>
Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
doesn't alloc/free bio, instead it reuses bios. There are two issues in
current code:
1. the code calls bio_init (from
init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
The bio is reused, so likely there is integrity data attached. bio_init
will clear a pointer to integrity data and makes bio_reset can't release
the data
2. bio_reset is called before dispatching bio. After bio is finished,
it's possible we don't free bio's integrity data (eg, we don't call
bio_reset again)
Both issues will cause memory leak. The patch moves bio_init to stripe
creation and bio_reset to bio end io. This will fix the two issues.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
For failed write request record block address on a device, not block
address in an array.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Fix to return error code -ENOMEM from the lockres_init() error
handling case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.
With this patch, the MD_JOURNAL_CLEAN is only set when the journal
device presents.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The original error was thought to be corruption, but was actually caused by:
make-bcache --data-offset N
where N was in bytes and should have been in sectors. While userspace
tools should be updated to check --data-offset beyond end of volume,
hopefully this will help others that might not have noticed the units.
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
This patch fixes a cachedev registration-time allocation deadlock.
This can deadlock on boot if your initrd auto-registeres bcache devices:
Allocator thread:
[ 720.727614] INFO: task bcache_allocato:3833 blocked for more than 120 seconds.
[ 720.732361] [<ffffffff816eeac7>] schedule+0x37/0x90
[ 720.732963] [<ffffffffa05192b8>] bch_bucket_alloc+0x188/0x360 [bcache]
[ 720.733538] [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0
[ 720.734137] [<ffffffffa05302bd>] bch_prio_write+0x19d/0x340 [bcache]
[ 720.734715] [<ffffffffa05190bf>] bch_allocator_thread+0x3ff/0x470 [bcache]
[ 720.735311] [<ffffffff816ee41c>] ? __schedule+0x2dc/0x950
[ 720.735884] [<ffffffffa0518cc0>] ? invalidate_buckets+0x980/0x980 [bcache]
Registration thread:
[ 720.710403] INFO: task bash:3531 blocked for more than 120 seconds.
[ 720.715226] [<ffffffff816eeac7>] schedule+0x37/0x90
[ 720.715805] [<ffffffffa05235cd>] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[ 720.716409] [<ffffffffa0522d30>] ? bch_btree_insert_check_key+0x1c0/0x1c0 [bcache]
[ 720.717008] [<ffffffffa05236e4>] bch_btree_insert+0xf4/0x170 [bcache]
[ 720.717586] [<ffffffff810e6950>] ? prepare_to_wait_event+0xf0/0xf0
[ 720.718191] [<ffffffffa0527d9a>] bch_journal_replay+0x14a/0x290 [bcache]
[ 720.718766] [<ffffffff810cc90d>] ? ttwu_do_activate.constprop.94+0x5d/0x70
[ 720.719369] [<ffffffff810cf684>] ? try_to_wake_up+0x1d4/0x350
[ 720.719968] [<ffffffffa05317d0>] run_cache_set+0x580/0x8e0 [bcache]
[ 720.720553] [<ffffffffa053302e>] register_bcache+0xe2e/0x13b0 [bcache]
[ 720.721153] [<ffffffff81354cef>] kobj_attr_store+0xf/0x20
[ 720.721730] [<ffffffff812a2dad>] sysfs_kf_write+0x3d/0x50
[ 720.722327] [<ffffffff812a225a>] kernfs_fop_write+0x12a/0x180
[ 720.722904] [<ffffffff81225177>] __vfs_write+0x37/0x110
[ 720.723503] [<ffffffff81228048>] ? __sb_start_write+0x58/0x110
[ 720.724100] [<ffffffff812cedb3>] ? security_file_permission+0x23/0xa0
[ 720.724675] [<ffffffff812258a9>] vfs_write+0xa9/0x1b0
[ 720.725275] [<ffffffff8102479c>] ? do_audit_syscall_entry+0x6c/0x70
[ 720.725849] [<ffffffff81226755>] SyS_write+0x55/0xd0
[ 720.726451] [<ffffffff8106a390>] ? do_page_fault+0x30/0x80
[ 720.727045] [<ffffffff816f2cae>] system_call_fastpath+0x12/0x71
The fifo code in upstream bcache can't use the last element in the buffer,
which was the cause of the bug: if you asked for a power of two size,
it'd give you a fifo that could hold one less than what you asked for
rather than allocating a buffer twice as big.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: stable@vger.kernel.org
register_cache() is supposed to return an error string on error so that
register_bcache() will will blkdev_put and cleanup other user counters,
but it does not set 'char *err' when cache_alloc() fails (eg, due to
memory pressure) and thus register_bcache() performs no cleanup.
register_bcache() <----------\ <- no jump to err_close, no blkdev_put()
| |
+->register_cache() | <- fails to set char *err
| |
+->cache_alloc() ---/ <- returns error
This patch sets `char *err` for this failure case so that register_cache()
will cause register_bcache() to correctly jump to err_close and do
cleanup. This was tested under OOM conditions that triggered the bug.
Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: stable@vger.kernel.org
This fixes a long-standing bug that caused a flood of messages like:
"md: delaying data-check of md1 until md2 has finished (they share one
or more physical units)"
It can be reproduced like this:
1. Create at least 3 raid1 arrays on a pair of disks, each on different
partitions.
2. Request a sync operation like 'check' or 'repair' on 2 arrays by
writing to their md/sync_action attribute files. One operation should
start and one should be delayed and a message like the above will be
printed.
3. Issue a write to the third array. Each write will cause 2 copies of
the message to be printed.
This happens when wake_up(&resync_wait) is called, usually by
md_check_recovery(). Then the delayed sync thread again prints the
message and is put to sleep. This patch adds a check in md_do_sync() to
prevent printing this message more than once for the same pair of
devices.
Reported-by: Sven Koehler <sven.koehler@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=151801
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The ret is not needed anymore since we have already
move resync_start into md_do_sync in commit 41a9a0d.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The raid0 MD personality does not start a raid0 array with any of its
data devices missing.
dm-raid was removing data/metadata device pairs unconditionally if it
failed to read a superblock off the respective metadata device of such
pair, resulting in failure to start arrays with the raid0 personality.
Avoid removing any data/metadata device pairs in case of raid0
(e.g. lvm2 segment type 'raid0_meta') thus allowing MD to start the
array.
Also, avoid region size validation for raid0.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
GET_ARRAY_INFO counts journal as spare (spare_disks), which is not
accurate. This patch fixes this.
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
attempt_restore_of_faulty_devices() is limited to 64 when it should support
the new maximum of 253 when identifying any failed devices. It clears any
revivable devices via an MD personality hot remove and add cylce to allow
for their recovery.
Address by using existing functions to retrieve and update all failed
devices' bitfield members in the dm raid superblocks on all RAID devices
and check for any devices to clear in it.
Whilst on it, don't call attempt_restore_of_faulty_devices() for any MD
personality not providing disk hot add/remove methods (i.e. raid0 now),
because such personalities don't support reviving of failed disks.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
'lvchange --refresh RaidLV' causes a mapped device suspend/resume cycle
aiming at device restore and resync after transient device failures. This
failed because flag RT_FLAG_RS_RESUMED was always cleared in the suspend path,
thus the device restore wasn't performed in the resume path.
Solve by removing RT_FLAG_RS_RESUMED from the suspend path and resume
unconditionally. Also, remove superfluous comment from raid_resume().
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
On LVM2 conversions via lvconvert(8), the target keeps mapped devices in
frozen state when requesting RAID devices be resynchronized. This
applies to e.g. adding legs to a raid1 device or taking over from raid0
to raid4 when the rebuild flag's set on the new raid1 legs or the added
dedicated parity stripe.
Also, fix frozen recovery for reshaping as well.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Increase mempool size from 16 to 64 entries. This increase improves
swap on dm-crypt performance.
When swapping to dm-crypt, all available memory is temporarily exhausted
and dm-crypt can only use the mempool reserve.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use local_irq_save() to disable preemption before calling
this_cpu_ptr().
Reported-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Fixes: b0b477c7e0 ("dm round robin: use percpu 'repeat_count' and 'current_path'")
Cc: stable@vger.kernel.org # 4.6+
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
After array enters in faulty state (e.g. number of failed drives
becomes more then accepted for raid5 level) it sets error flags
(one of this flags is MD_CHANGE_PENDING). For internal metadata
arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for
external metadata arrays. MD_CHANGE_PENDING flag set prevents to
finish all new or non-finished IOs to array and hold them in
pending state. In some cases this can leads to deadlock situation.
For example, we have faulty array (2 of 4 drives failed) and
udev handle array state changes and blkid started (or other
userspace application that used array to read/write) but unable
to finish reads due to IO hold. At the same time we unable to get
exclusive access to array (to stop array in our case) because
another external application still use this array.
Fix makes possible to return IO with errors immediately.
So external application can finish working with array and
give exclusive access to other applications to perform
required management actions with array.
Signed-off-by: Alexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Changing the location changes a lot of things. Holding the lock to avoid race.
This makes the .quiesce called with mddev lock hold too.
Acked-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
During a resynchronization, device status char 'a' is output on the raid
status line for every device of a RAID set. It changes from 'a' to 'A'
(unless device failure) when the resynchronization completes.
Interrupting and restarting a resynchronization, by reloading the DM
table, erroneously lead to status char 'A'.
Fix this by avoiding setting the MD_RECOVERY_REQUESTED flag in
raid_preresume().
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When lvm2 userspace requests a RaidLV repair, it sets the rebuild
constructor flag on the new replacement DataLVs but does not clear the
respective MetaLVs. Hence the superblock that is loaded from such new
MetaLVs may have a non-zero incompat_features member and the constructor
will fail with false-positive on incompat_features.
Solve by initializing the incompat_features member properly.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
__CTR_FLAG_MIN_RECOVERY_RATE was used instead of __CTR_FLAG_MAX_RECOVERY_RATE
thus causing max_recovery_rate to be rejected in case min_recovery_rate
was already set.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Otherwise, there is potential for both DMF_SUSPENDED* and
DMF_NOFLUSH_SUSPENDING to not be set during dm_suspend() -- which is
definitely _not_ a valid state.
This fix, in conjuction with "dm rq: fix the starting and stopping of
blk-mq queues", addresses the potential for request-based DM multipath's
__multipath_map() to see !dm_noflush_suspending() during suspend.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Improve dm_stop_queue() to cancel any requeue_work. Also, have
dm_start_queue() and dm_stop_queue() clear/set the QUEUE_FLAG_STOPPED
for the blk-mq request_queue.
On suspend dm_stop_queue() handles stopping the blk-mq request_queue
BUT: even though the hw_queues are marked BLK_MQ_S_STOPPED at that point
there is still a race that is allowing block/blk-mq.c to call ->queue_rq
against a hctx that it really shouldn't. Add a check to
dm_mq_queue_rq() that guards against this rarity (albeit _not_
race-free).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # must patch dm.c on < 4.8 kernels
Multiple flags were being tested without locking. Protect against
non-atomic bit changes in m->flags by holding m->lock (while testing or
setting the queue_if_no_path related flags).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
When the corrupt_bio_byte feature was introduced it caused READ bios to
no longer be errored with -EIO during the down_interval. This had to do
with the complexity of needing to submit READs if the corrupt_bio_byte
feature was used.
Fix it so READ bios are properly errored with -EIO; doing so early in
flakey_map() as long as there isn't a match for the corrupt_bio_byte
feature.
Fixes: a3998799fb ("dm flakey: add corrupt_bio_byte feature")
Reported-by: Akira Hayakawa <ruby.wktk@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
The counter conf->empty_inactive_list_nr is only used for determine if the
raid5 is congested which is deal with in function raid5_congested().
It was increased in get_free_stripe() when conf->inactive_list got to be
empty and decreased in release_inactive_stripe_list() when splice
temp_inactive_list to conf->inactive_list. However, this may have a
problem when raid5_get_active_stripe or stripe_add_to_batch_list was called,
because these two functions may call list_del_init(&sh->lru) to delete sh from
"conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to
be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be
done at these two point and increase empty_inactive_list_nr accordingly.
Otherwise the counter may get to be negative number which would influence
async readahead from VFS.
Signed-off-by: ZhengYuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
md pending write counter must be incremented after bio is split,
otherwise it gets decremented too many times in end bio callback and
becomes negative.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Pull MD updates from Shaohua Li:
- A bunch of patches from Neil Brown to fix RCU usage
- Two performance improvement patches from Tomasz Majchrzak
- Alexey Obitotskiy fixes module refcount issue
- Arnd Bergmann fixes time granularity
- Cong Wang fixes a list corruption issue
- Guoqing Jiang fixes a deadlock in md-cluster
- A null pointer deference fix from me
- Song Liu fixes misuse of raid6 rmw
- Other trival/cleanup fixes from Guoqing Jiang and Xiao Ni
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (28 commits)
MD: fix null pointer deference
raid10: improve random reads performance
md: add missing sysfs_notify on array_state update
Fix kernel module refcount handling
md: use seconds granularity for error logging
md: reduce the number of synchronize_rcu() calls when multiple devices fail.
md: be extra careful not to take a reference to a Faulty device.
md/multipath: add rcu protection to rdev access in multipath_status.
md/raid5: add rcu protection to rdev accesses in raid5_status.
md/raid5: add rcu protection to rdev accesses in want_replace
md/raid5: add rcu protection to rdev accesses in handle_failed_sync.
md/raid1: add rcu protection to rdev in fix_read_error
md/raid1: small code cleanup in end_sync_write
md/raid1: small cleanup in raid1_end_read/write_request
md/raid10: simplify print_conf a little.
md/raid10: minor code improvement in fix_read_error()
md/raid10: add rcu protection to rdev access during reshape.
md/raid10: add rcu protection to rdev access in raid10_sync_request.
md/raid10: add rcu protection in raid10_status.
md/raid10: fix refounct imbalance when resyncing an array with a replacement device.
...
1/ Replace pcommit with ADR / directed-flushing:
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement either
ADR, or provide one or more flush addresses per nvdimm. ADR
(Asynchronous DRAM Refresh) flushes data in posted write buffers to the
memory controller on a power-fail event. Flush addresses are defined in
ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
"Flush Hint Address Structure". A flush hint is an mmio address that
when written and fenced assures that all previous posted writes
targeting a given dimm have been flushed to media.
2/ On-demand ARS (address range scrub):
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the media
to refresh the bad block list, userspace can also request a re-scrub at
any time.
3/ Support for the Microsoft DSM (device specific method) command format.
4/ Support for EDK2/OVMF virtual disk device memory ranges.
5/ Various fixes and cleanups across the subsystem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
cN3edFVIdxvZeMgM5Ubq
=xCBG
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
- Replace pcommit with ADR / directed-flushing.
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement
either ADR, or provide one or more flush addresses per nvdimm.
ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
to the memory controller on a power-fail event.
Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
A flush hint is an mmio address that when written and fenced assures
that all previous posted writes targeting a given dimm have been
flushed to media.
- On-demand ARS (address range scrub).
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the
media to refresh the bad block list, userspace can also request a
re-scrub at any time.
- Support for the Microsoft DSM (device specific method) command
format.
- Support for EDK2/OVMF virtual disk device memory ranges.
- Various fixes and cleanups across the subsystem.
* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
nfit: do an ARS scrub on hitting a latent media error
nfit: move to nfit/ sub-directory
nfit, libnvdimm: allow an ARS scrub to be triggered on demand
libnvdimm: register nvdimm_bus devices with an nd_bus driver
pmem: clarify a debug print in pmem_clear_poison
x86/insn: remove pcommit
Revert "KVM: x86: add pcommit support"
nfit, tools/testing/nvdimm/: unify shutdown paths
libnvdimm: move ->module to struct nvdimm_bus_descriptor
nfit: cleanup acpi_nfit_init calling convention
nfit: fix _FIT evaluation memory leak + use after free
tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
tools/testing/nvdimm: add virtual ramdisk range
acpi, nfit: treat virtual ramdisk SPA as pmem region
pmem: kill __pmem address space
pmem: kill wmb_pmem()
libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
fs/dax: remove wmb_pmem()
libnvdimm, pmem: flush posted-write queues on shutdown
...
The md device might not have personality (for example, ddf raid array). The
issue is introduced by 8430e7e0af9a15(md: disconnect device from personality
before trying to remove it)
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX commits
that DM depends on to provide its DAX support
- clean up the bio-based vs request-based DM core code by moving the
request-based DM core code out to dm-rq.[hc]
- reinstate bio-based support in the DM multipath target (done with the
idea that fast storage like NVMe over Fabrics could benefit) -- while
preserving support for request_fn and blk-mq request-based DM mpath
- SCSI and DM multipath persistent reservation fixes that were
coordinated with Martin Petersen.
- the DM raid target saw the most extensive change this cycle; it now
provides reshape and takeover support (by layering ontop of the
corresponding MD capabilities)
- DAX support for DM core and the linear, stripe and error targets
- A DM thin-provisioning block discard vs allocation race fix that
addresses potential for corruption
- A stable fix for DM verity-fec's block calculation during decode
- A few cleanups and fixes to DM core and various targets
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXkRZmAAoJEMUj8QotnQNat2wH/i4LpkoGI5tI6UhyKWxRkzJp
vKaJ0zuZ2Ez73DucJujNuvaiyHq1IjHD5pfr8JQO3E8ygDkRC2KjF2O8EXp0Has6
U1uLahQej72MAs0ZJTpvfE+JiY6qyIl4K+xxuPmYm2f2S5TWTIgOetYjJQmcMlQo
Y8zFfcDYn4Dv5rMdvDT4+1ePETxq74wcBwTxyW3OAbHE1f0JjsUGdMKzXB1iTWcM
VjLjWI//ETfFdIlDO0w2Qbd90aLUjmTR2k67RGnbPj5kNUNikv/X6iiY32KERR/0
vMiiJ7JS+a44P7FJqCMoAVM/oBYFiSNpS4LYevOgHb0G0ikF8kaSeqBPC6sMYvg=
=uYt9
-----END PGP SIGNATURE-----
Merge tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- initially based on Jens' 'for-4.8/core' (given all the flag churn)
and later merged with 'for-4.8/core' to pickup the QUEUE_FLAG_DAX
commits that DM depends on to provide its DAX support
- clean up the bio-based vs request-based DM core code by moving the
request-based DM core code out to dm-rq.[hc]
- reinstate bio-based support in the DM multipath target (done with the
idea that fast storage like NVMe over Fabrics could benefit) -- while
preserving support for request_fn and blk-mq request-based DM mpath
- SCSI and DM multipath persistent reservation fixes that were
coordinated with Martin Petersen.
- the DM raid target saw the most extensive change this cycle; it now
provides reshape and takeover support (by layering ontop of the
corresponding MD capabilities)
- DAX support for DM core and the linear, stripe and error targets
- a DM thin-provisioning block discard vs allocation race fix that
addresses potential for corruption
- a stable fix for DM verity-fec's block calculation during decode
- a few cleanups and fixes to DM core and various targets
* tag 'dm-4.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (73 commits)
dm: allow bio-based table to be upgraded to bio-based with DAX support
dm snap: add fake origin_direct_access
dm stripe: add DAX support
dm error: add DAX support
dm linear: add DAX support
dm: add infrastructure for DAX support
dm thin: fix a race condition between discarding and provisioning a block
dm btree: fix a bug in dm_btree_find_next_single()
dm raid: fix random optimal_io_size for raid0
dm raid: address checkpatch.pl complaints
dm: call PR reserve/unreserve on each underlying device
sd: don't use the ALL_TG_PT bit for reservations
dm: fix second blk_delay_queue() parameter to be in msec units not jiffies
dm raid: change logical functions to actually return bool
dm raid: use rdev_for_each in status
dm raid: use rs->raid_disks to avoid memory leaks on free
dm raid: support delta_disks for raid1, fix table output
dm raid: enhance reshape check and factor out reshape setup
dm raid: allow resize during recovery
dm raid: fix rs_is_recovering() to allow for lvextend
...
Allow table type DM_TYPE_BIO_BASED to extend with DM_TYPE_DAX_BIO_BASED
since DM_TYPE_DAX_BIO_BASED supports bio-based requests.
This is needed to allow a snapshot of an LV with DAX support to be
removed. One of the intermediate table reloads that lvm2 does switches
from DM_TYPE_BIO_BASED to DM_TYPE_DAX_BIO_BASED. No known reason to
disallow this so...
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dax-capable mapped-device is marked as DM_TYPE_DAX_BIO_BASED,
which supports both dax and bio-based operations. dm-snap
needs to work with dax-capable device when bio-based operation
is used.
Add fake origin_direct_access() to origin device so that its
origin device is also marked as DM_TYPE_DAX_BIO_BASED for
dax-capable device. This allows to extend target's DM table.
dm-snap works normally when bio-based operation is used.
dm-snap does not support dax operation, and mount with dax
option to a target device or snapshot device fails.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Change dm-stripe to implement direct_access function,
stripe_direct_access(), which maps bdev and sector and
calls direct_access function of its physical target device.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Change dm-linear to implement direct_access function,
linear_direct_access(), which maps sector and calls direct_access
function of its physical target device.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Change mapped device to implement direct_access function,
dm_blk_direct_access(), which calls a target direct_access function.
'struct target_type' is extended to have target direct_access interface.
This function limits direct accessible size to the dm_target's limit
with max_io_len().
Add dm_table_supports_dax() to iterate all targets and associated block
devices to check for DAX support. To add DAX support to a DM target the
target must only implement the direct_access function.
Add a new dm type, DM_TYPE_DAX_BIO_BASED, which indicates that mapped
device supports DAX and is bio based. This new type is used to assure
that all target devices have DAX support and remain that way after
QUEUE_FLAG_DAX is set in mapped device.
At initial table load, QUEUE_FLAG_DAX is set to mapped device when setting
DM_TYPE_DAX_BIO_BASED to the type. Any subsequent table load to the
mapped device must have the same type, or else it fails per the check in
table_load().
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Instead of a flag and an index just make sure an index of 0 means
no need to free the bvec array. Also move the constants related
to the bvec pools together and use a consistent naming scheme for
them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The discard passdown was being issued after the block was unmapped,
which meant the block could be reprovisioned whilst the passdown discard
was still in flight.
We can only identify unshared blocks (safe to do a passdown a discard
to) once they're unmapped and their ref count hits zero. Block ref
counts are now used to guard against concurrent allocation of these
blocks that are being discarded. So now we unmap the block, issue
passdown discards, and the immediately increment ref counts for regions
that have been discarded via passed down (this is safe because
allocation occurs within the same thread). We then decrement ref counts
once the passdown discard IO is complete -- signaling these blocks may
now be allocated.
This fixes the potential for corruption that was reported here:
https://www.redhat.com/archives/dm-devel/2016-June/msg00311.html
Reported-by: Dennis Yang <dennisyang@qnap.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
dm_btree_find_next_single() can short-circuit the search for a block
with a return of -ENODATA if all entries are higher than the search key
passed to lower_bound().
This hasn't been a problem because of the way the btree has been used by
DM thinp. But it must be fixed now in preparation for fixing the race
in DM thinp's handling of simultaneous block discard vs allocation.
Otherwise, once that fix is in place, some of the blocks in a discard
would not be unmapped as expected.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
RAID10 random read performance is lower than expected due to excessive spinlock
utilisation which is required mostly for rebuild/resync. Simplify allow_barrier
as it's in IO path and encounters a lot of unnecessary congestion.
As lower_barrier just takes a lock in order to decrement a counter, convert
counter (nr_pending) into atomic variable and remove the spin lock. There is
also a congestion for wake_up (it uses lock internally) so call it only when
it's really needed. As wake_up is not called constantly anymore, ensure process
waiting to raise a barrier is notified when there are no more waiting IOs.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Changeset 6791875e2e has added early return from a function so there is no
sysfs notification for 'active' and 'clean' state change.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
md loads raidX modules and increments module refcount each time level
has changed but does not decrement it. You are unable to unload raid0
module after reshape because raid0 reshape changes level to raid4
and back to raid0.
Signed-off-by: Aleksey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The md code stores the exact time of the last error in the
last_read_error variable using a timespec structure. It only
ever uses the seconds portion of that though, so we can
use a scalar for it.
There won't be an overflow in 2038 here, because it already
used monotonic time and 32-bit is enough for that, but I've
decided to use time64_t for consistency in the conversion.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Shaohua Li <shli@fb.com>
raid_io_hints() was retrieving the number of data stripes used for the
calculation of io_opt from struct r5conf, which is not defined for raid0
mappings.
Base the calculation on the in-core raid_set structure instead.
Also, adjust to use to_bytes() for the sector -> bytes conversion
throughout.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Use 'unsigned int' where appropriate.
Return negative errors.
Correct an indentation.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
So far we tried to rely on the SCSI 'all target ports' bit to register
all path, but for many setups this didn't work properly as the different
paths are seen as separate initiators to the target instead of multiple
ports of the same initiator. Because of that we'll stop setting the
'all target ports' bit in SCSI, and let device mapper handle iterating
over the device for each path and register them manually.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Commit d548b34b06 ("dm: reduce the queue delay used in dm_request_fn
from 100ms to 10ms") always intended the value to be 10 msecs -- it
just expressed it in jiffies because earlier commit 7eaceaccab ("block:
remove per-queue plugging") did.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Fixes: d548b34b06 ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms")
Cc: stable@vger.kernel.org # 4.1+ -- stable@ backports must be applied to drivers/md/dm.c
Add "delta_disks" constructor argument support to raid1 to allow for
consistent userspace disk addition/removal handling.
Fix raid_status() to report all raid disks with status and table output
on disk adding reshapes, not just the ones listed on the mddev; optimize
its rebuild and writemostly output.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Enhance rs_reshape_requested() check function to be more transparent and
fix its raid10 check.
Streamline the constructor by factoring out reshaping preparation into
fucntion rs_prepare_reshape().
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Resizing a RAID set during recovery can be allowed, because the MD
resynchronization thread will either stop any ongoing recovery in case
of shrinking below the current recovery position or carry on recovery
to the new size if the set is growing.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add function rs_setup_recovery() to allow for defined setup of RAID set
recovery in the constructor.
Will be called with dev_sectors={0, rdev->sectors, MaxSectors} to
recover a new or enforced sync, grown or not to be synhronized RAID set
respectively.
Prevents recovery on raid0, which doesn't support it.
Enforces recovery on raid6 to ensure properly defined Syndromes
mandatory for that MD personality are being created.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Check return value of kthread_run() in dm_old_init_request_queue().
Reported-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
We have assigned sb->block_size before the switch,
so remove the redundant one.
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Eric Wheeler <bcache@lists.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no return in continue_at(), update the documentation.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Cache_sb is not used in cache_alloc, and we have copied
sb info to cache->sb already, remove it.
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
do_div was replaced with div64_u64 at some point, causing a bug with
block calculation due to incompatible semantics of the two functions.
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Fixes: a739ff3f54 ("dm verity: add support for forward error correction")
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Merge the two DM_PARAMS_[KV]MALLOC flags into a single flag.
Doing so avoids the crashes seen with previous attempts to consolidate
buffer management to use kvfree() without first flagging that memory had
actually been allocated.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Avoid that sparse complains about assigning a __le64 value to a u64
variable. Remove the (u64) casts since these are superfluous. This
patch does not change the behavior of the source code.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
A newly introduced function has 'const int' as the return type,
but as "make W=1" reports, that has no meaning:
drivers/md/dm-raid.c:510:18: error: type qualifiers ignored on function return type [-Werror=ignored-qualifiers]
This changes the return type to plain 'int'.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 33e53f0685 ("dm raid: introduce extended superblock and new raid types to support takeover/reshaping")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Superblock updates where bogus causing some takovers/reshapes to fail.
Introduce new runtime flag (RT_FLAG_KEEP_RS_FROZEN) to keep a raid set
frozen when a layout change was requested. Userpace will immediately
reload the table w/o the flags requesting such change once they made it
to the superblocks and any change of recovery/reshape offsets has to be
avoided until after read.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add bool functions rs_is_recovering and rs_is_reshaping()
to test for ongoing recovery/reshaping respectively in order
to reject respective requests on ongoing ones.
Remove ctr array size check, because ti->len and array
sectors will differ during disk addition/removal reshape.
Use __is_raid10_near() rather than type string compare.
Introduce rs_check_reshape() and rs_start_reshape(),
use the former in the ctr to reject bogus rehsape requests
and the latter in preresume to actually start a reshape.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add rs_is_reshapable(), rs_data_stripes(), rs_reshape_requested(),
rs_set_dev_and_array_sectors() and rs_adjust_data_offsets()
Remove superfluous check for reshape message
Correct runtime bit definitions to be incremental
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
It is more intuitive to manage each raid level's features in terms of
what is supported rather than what isn't supported.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Renamed functions and variables with leading single underscore to have a
double underscore. Renamed some functions to have better names. Folded
functions that were split out without reason.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Also update module description to "raid0/1/10/4/5/6 target"
Reported by Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
No idea what Heinz was doing with the versioning but upstream commit
4c9971ca6a ("dm raid: make sure no feature flags are set in metadata")
bumped to 1.8.0 already.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
There ti_error_* wrappers added very little. No other DM target has
ever gone to such lengths to wrap setting ti->error.
Also fixes some NULL derefences via rs->ti->error.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
The target's status interface has to provide the new 'data_offset' value
to allow userspace to retrieve the kernels offset to the data on each
raid device of a raid set. This is the base for out-of-place reshaping
required to not write over any data during reshaping (e.g. change
raid6_zr -> raid6_nc):
- add rs_set_cur() to be able to start up existing array in case of no
takeover; use in ctr on takeover check
- enhance raid_status()
- add supporting functions to get resync/reshape progress and raid
device status chars
- fixup rebuild table line output race, which does miss to emit
'rebuild N' on fully synced/rebuild devices, because it is relying on
the transient 'In_sync' raid device flag
- add new status line output for 'data_offset', which'll later be used
for out-of-place reshaping
- fixup takeover not working for all levels
- fixup raid0 message interface oops caused by missing checks
for the md threads, which don't exist in case of raid0
- remove ALL_FREEZE_FLAGS not needed for takeover
- adjust comments
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add raid level takeover support allowing arbitrary takeovers between
raid levels supported by md personalities (i.e. raid0, raid1/10 and
raid4/5/6):
- add rs_config_{backup|restore} function to allow for temporary
storing ctr requested layout changes and restore them for takeover
conersion decision after the superblocks got loaded and analyzed
- add members to store layout to 'struct raid_set' (not mandatory
for takeover but needed for reshape in later patch)
- add rebuild_disks bitfield to 'struct raid_set' and set bits in ctr
to use in setting up takeover (base to address a 'rebuild' related
raid_status() table line bug and needed as well for reshape in future
patch)
- add runtime flags and respective manipulation functions to be able to
control e.g. wrting of superlocks to the preresume function on
takeover and (later) reshape
- add functions to detect takeover, check it's valid (mandatory here to
avoid failing on md_run()), setup for it and use in the ctr; those
will be likely moved out once reshaping gets added to simplify the
ctr
- start raid set readonly in ctr and switch to readwrite, optionally
updating superblocks, in preresume in order to allow suspend to
quiesce any active table before (which involves superblock updates);
this ensures the proper sequence of writing the current and any new
takeover(/reshape) metadata
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add transferring the new takeover/reshape related superblock
members introduced to the super_sync() function:
- add/move supporting functions
- add failed devices bitfield transfer functions to retrieve the
bitfield from superblock format or update it in the superblock
- add code to transfer all new members
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Support the follwoing arguments in the ctr parameter parser:
- add 'delta_disks', 'data_offset' taking int and sector respectively
- 'raid10_use_near_sets' bool argument to optionally select
near sets with supporting raid10 mappings
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add new members to the dm-raid superblock and new raid types to support
takeover/reshape.
Add all necessary members needed to support takeover and reshape in one
go -- aiming to limit the amount of changes to the superblock layout.
This is a larger patch due to the new superblock members, their related
flags, validation of both and involved API additions/changes:
- add additional members to keep track of:
- state about forward/backward reshaping
- reshape position
- new level, layout, stripe size and delta disks
- data offset to current and new data for out-of-place reshapes
- failed devices bitfield extensions to keep track of max raid devices
- adjust super_validate() to cope with new superblock members
- adjust super_init_validation() to cope with new superblock members
- add definitions for ctr flags supporting delta disks etc.
- add new raid types (raid6_n_6 etc.)
- add new raid10 supporting function API (_is_raid10_*())
- adjust to changed raid10 supporting function API
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Every time a device is removed with ->hot_remove_disk() a synchronize_rcu() call is made
which can delay several milliseconds in some case.
If lots of devices fail at once - as could happen with a large RAID10 where one set
of devices are removed all at once - these delays can add up to be very inconcenient.
As failure is not reversible we can check for that first, setting a
separate flag if it is found, and then all synchronize_rcu() once for
all the flagged devices. Then ->hot_remove_disk() function can skip the
synchronize_rcu() step if the flag is set.
fix build error(Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
It is important that we never increment rdev->nr_pending on a Faulty
device as ->hot_remove_disk() assumes that once the Faulty flag is visible
no code will take a new reference.
Some places take a new reference after only check In_sync. This should
be safe as the two are changed together. However to make the code more
obviously safe, add checks for 'Faulty' as well.
Note: the actual rule is:
Never increment nr_pending if Faulty is set and Blocked is clear,
never clear Faulty, and never set Blocked without holding a reference
through nr_pending.
fix build error (Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Being in the middle of resync is no longer protection against failed
rdevs disappearing. So add rcu protection.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
The rdev could be freed while handle_failed_sync is running, so
rcu protection is needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Since remove_and_add_spares() was added to hot_remove_disk() it has
been possible for an rdev to be hot-removed while fix_read_error()
was running, so we need to be more careful, and take a reference to
the rdev while performing IO.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
'mirror' is only used to find 'rdev', several times.
So just find 'rdev' once, and use it instead.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Both functions use conf->mirrors[mirror].rdev several times, so
improve readability by storing this in a local variable.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
'tmp' is only ever used to extract 'tmp->rdev', so just use 'rdev' directly.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
rdev already holds conf->mirrors[d].rdev, so no need to load it again.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
mirrors[].rdev can become NULL at any point unless:
- a counted reference is held
- ->reconfig_mutex is held, or
- rcu_read_lock() is held
Reshape isn't always suitably careful as in the past rdev couldn't be
removed during reshape. It can now, so add protection.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
mirrors[].rdev can become NULL at any point unless:
- a counted reference is held
- ->reconfig_mutex is held, or
- rcu_read_lock() is held
Previously they could not become NULL during a resync/recovery/reshape either.
However when remove_and_add_spares() was added to hot_remove_disk(), that
changed.
So raid10_sync_request didn't previously need to protect rdev access,
but now it does.
Fix missed check(Shaohua)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
mirrors[].rdev can become NULL at any point unless:
- a counted reference is held
- ->reconfig_mutex is held, or
- rcu_read_lock() is held
raid10_status holds none of these. So add rcu_read_lock()
protection.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
If you have a raid10 with a replacement device that is resyncing -
e.g. after a crash before the replacement was complete - the write to
the replacement will increment nr_pending on the wrong device, which
will lead to strangeness.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Re-checking the faulty flag here brings no value.
The comment about "risk" refers to the risk that the device could
be in the process of being removed by ->hot_remove_disk().
However providing that the ->nr_pending count is incremented inside
an rcu_read_locked() region, there is no risk of that happening.
This is because the rdev pointer (in the personalities array) is set
to NULL before synchronize_rcu(), and ->nr_pending is tested
afterwards. If the rcu_read_locked region happens before the
synchronize_rcu(), the test will see that nr_pending has been incremented.
If it happens afterwards, the rdev pointer will be NULL so there is nothing
to increment.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
When the HOT_REMOVE_DISK ioctl is used to remove a device, we
call remove_and_add_spares() which will remove it from the personality
if possible. This improves the chances that the removal will succeed.
When writing "remove" to dev-XX/state, we don't. So that can fail more easily.
So add the remove_and_add_spares() into "remove" handling.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
A performance drop of mkfs has been observed on RAID10 during resync
since commit 09314799e4 ("md: remove 'go_faster' option from
->sync_request()"). Resync sends so many IOs it slows down non-resync
IOs significantly (few times). Add a short delay to a resync. The
previous long sleep (1s) has proven unnecessary, even very short delay
brings performance right.
The change also applied to raid1. The problem has not been observed on
raid1, however it shares barriers code with raid10 so it might be an
issue for some setup too.
Suggested-by: NeilBrown <neilb@suse.com>
Link: http://lkml.kernel.org/r/20160609134555.GA9104@proton.igk.intel.com
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
This is a simple check before updating the superblock. It should update
the superblock when update_size return 0.
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Make use if raid type rt_is_*() bool functions for simplification and
consistency reasons.
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
- add _test_flags() function
- use it to simplify rs_check_for_invalid_flags()
- use _test_flag() throughout
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reject invalid flag combinations to avoid potential data corruption or
failing raid set construction:
- add definitions for constructor flag combinations and invalid flags
per level
- add bool test functions for the various raid types
(also will be used by future reshaping enhancements)
- introduce rs_check_for_invalid_flags() and _invalid_flags()
to perform the validity checks
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Provide necessary infrastructure to handle ctr flags and their names
and cleanup setting ti->error:
- comment constructor flags
- introduce constructor flag manipulation
- introduce ti_error_*() functions to simplify
setting the error message (use in other targets?)
- introduce array to hold ctr flag <-> flag name mapping
- introduce argument name by flag functions for that array
- use those functions throughout the ctr call path
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
- use dm_arg_set API in ctr and its callees parse_raid_params() and dev_parms()
- introduce _in_range() function to check a value is in a [ min, max ] range;
this is to support more callers in parsing parameters etc. in the future
- correct comment on MAX_RAID_DEVICES
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
alloc_workqueue replaces deprecated create_workqueue().
Dedicated workqueues have been used since bcache_wq and moving_gc_wq
are workqueues for writes and are being used on a memory reclaim path.
WQ_MEM_RECLAIM has been set to ensure forward progress under memory
pressure.
Since there are only a fixed number of work items, explicit concurrency
limit is unnecessary here.
Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allow a user to specify an optional feature 'queue_mode <mode>' where
<mode> may be "bio", "rq" or "mq" -- which corresponds to bio-based,
request_fn rq-based, and blk-mq rq-based respectively.
If the queue_mode feature isn't specified the default for the
"multipath" target is still "rq" but if dm_mod.use_blk_mq is set to Y
it'll default to mode "mq".
This new queue_mode feature introduces the ability for each multipath
device to have its own queue_mode (whereas before this feature all
multipath devices effectively had to have the same queue_mode).
This commit also goes a long way to eliminate the awkward (ab)use of
DM_TYPE_*, the associated filter_md_type() and other relatively fragile
and difficult to maintain code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add "multipath-bio" target that offers a bio-based multipath target as
an alternative to the request-based "multipath" target -- but in a
following commit "multipath-bio" will immediately be replaced by a new
"queue_mode" feature for the "multipath" target which will allow
bio-based mode to be selected.
When DM multipath was originally converted from bio-based to
request-based the motivation for the change was better dynamic load
balancing (by leveraging block core's request-based IO schedulers, for
merging and sorting, _before_ DM multipath would make the decision on
where to steer the IO -- based on path load and/or availability).
More background is available in this "Request-based Device-mapper
multipath and Dynamic load balancing" paper:
https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf
But we've now come full circle where significantly faster storage
devices no longer need IOs to be made larger to drive optimal IO
performance. And even if they do there have been changes to the block
and filesystem layers that help ensure upper layers are constructing
larger IOs. In addition, SCSI's differentiated IO errors will propagate
through to bio-based IO completion hooks -- so that eliminates another
historic justiciation for request-based DM multipath. Lastly, the block
layer's immutable biovec changes have made bio cloning cheaper than it
has ever been; whereas request cloning is still relatively expensive
(both on a CPU usage and memory footprint level).
As such, bio-based DM multipath offers the promise of a more efficient
IO path for high IOPs devices that are, or will be, emerging.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Add some seperation between bio-based and request-based DM core code.
'struct mapped_device' and other DM core only structures and functions
have been moved to dm-core.h and all relevant DM core .c files have been
updated to include dm-core.h rather than dm.h
DM targets should _never_ include dm-core.h!
[block core merge conflict resolution from Stephen Rothwell]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>