The pre-existing sysfs interfaces which take explicit namespace
argument are weird in that they place the optional @ns in front of
@name which is contrary to the established convention. For example,
we end up forcing vast majority of sysfs_get_dirent() users to do
sysfs_get_dirent(parent, NULL, name), which is silly and error-prone
especially as @ns and @name may be interchanged without causing
compilation warning.
This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the
positions of @name and @ns, and sysfs_get_dirent() is now a wrapper
around sysfs_get_dirent_ns(). This makes confusions a lot less
likely.
There are other interfaces which take @ns before @name. They'll be
updated by following patches.
This patch doesn't introduce any functional changes.
v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol
error on module builds. Reported by build test robot. Fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the last process closes /dev/mdX sync_blockdev will be called so
that all buffers get flushed.
So if it is then opened for the STOP_ARRAY ioctl to be sent there will
be nothing to flush.
However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
moments before some other process which was writing closes their file
descriptor, then there won't be a 'last close' and the buffers might
not get flushed.
So do_md_stop() calls sync_blockdev(). However at this point it is
holding ->reconfig_mutex. So if the array is currently 'clean' then
the writes from sync_blockdev() will not complete until the array
can be marked dirty and that won't happen until some other thread
can get ->reconfig_mutex. So we deadlock.
We need to move the sync_blockdev() call to before we take
->reconfig_mutex.
However then some other thread could open /dev/mdX and write to it
after we call sync_blockdev() and before we actually stop the array.
This can leave dirty data in the page cache which is awkward.
So introduce new flag MD_STILL_CLOSED. Set it before calling
sync_blockdev(), clear it if anyone does open the file, and abort the
STOP_ARRAY attempt if it gets set before we lock against further
opens.
It is still possible to get problems if you open /dev/mdX, write to
it, then issue the STOP_ARRAY ioctl. Just don't do that.
Signed-off-by: NeilBrown <neilb@suse.de>
mddev->flags is mostly used to record if an update of the
metadata is needed. Sometimes the whole field is tested
instead of just the important bits. This makes it difficult
to introduce more state bits.
So replace all bare tests of mddev->flags with tests for the bits
that actually need testing.
Signed-off-by: NeilBrown <neilb@suse.de>
Setting a variable to itself probably wasn't the intention here.
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Whe we set the safe_mode_timeout to a smaller value we trigger a timeout
immediately - otherwise the small value might not be honoured.
However if the previous timeout was 0 meaning "no timeout", we didn't.
This would mean that no timeout happens until the next write completes,
which could be a long time.
Signed-off-by: NeilBrown <neilb@suse.de>
There is no really need as GFP_NOIO is very likely sufficient,
and failure is not catastrophic.
Calling md_allow_write here will convert a read-auto array to
read/write which could be confusing when you are just performing
a read operation.
Signed-off-by: NeilBrown <neilb@suse.de>
commit 7ceb17e87b
md: Allow devices to be re-added to a read-only array.
allowed a bit more than just that. It also allows devices to be added
to a read-write array and to end up skipping recovery.
This patch removes the offending piece of code pending a rewrite for a
subsequent release.
More specifically:
If the array has a bitmap, then the device will still need a bitmap
based resync ('saved_raid_disk' is set under different conditions
is a bitmap is present).
If the array doesn't have a bitmap, then this is correct as long as
nothing has been written to the array since the metadata was checked
by ->validate_super. However there is no locking to ensure that there
was no write.
Bug was introduced in 3.10 and causes data corruption so
patch is suitable for 3.10-stable.
Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD: Remember the last sync operation that was performed
This patch adds a field to the mddev structure to track the last
sync operation that was performed. This is especially useful when
it comes to what is recorded in mismatch_cnt in sysfs. If the
last operation was "data-check", then it reports the number of
descrepancies found by the user-initiated check. If it was a
"repair" operation, then it is reporting the number of
descrepancies repaired. etc.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The usage of strict_strtoul() is not preferred, because
strict_strtoul() is obsolete. Thus, kstrtoul() should be
used.
Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When a device has failed, it needs to be removed from the personality
module before it can be removed from the array as a whole.
The first step is performed by md_check_recovery() which is called
from the raid management thread.
So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
to have run. This increases the chance that the ioctl will succeed.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Neil Brown <nfbrown@suse.de>
Some tagged for -stable.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUbl1mznsnt1WYoG5AQKGlQ//eixdawF+DUK5hadqZ9EDni+BAVzb7m69
+zU6ilQ7UOh7bxtAoJqrgFVykK+LG8wvYsEBwMjB9oRDLA96/YDXXiBzXHvd6mGh
g271lwMTQ9h+O8L6psLUX6qsrH3i7SJmF8ySPKi6Fe5ruT8ToOB8Ii8XQebEZdXo
VOzRz2VgSTcBdrTyKPDsBJByDQX36hsK8Gs5YSl5F3nvyV4dvGWMlyoTF1TRRt9K
YCCZ8pSk3kTXaSdl0syrJxI17pEUC8mtcA01S6JD/GV49CGO8LYAckVJ4ijWw7VV
IGGlH0DsYSMgJ7yyuLz4ifaqRnsWsAGW0WyiZYYKvjtNUiyBuBBbo2cQ1lNkR5p4
jnLhpJJVh0hLCPn6wcCWIBIdT/mFaBpXkvZPd3ks5kefGXsfpVPm0fK8r0fzkzgy
tJCZtZFZHeK1qsgaDsiS76S2ZNcFh0HQVIa84Q200/XUDgh8dYlD0+7oIsVu0UBZ
72Aop+Ak9+k4vKTvB9/hpcY+Rt0MI7zKewXBDSDK1sXhIHLQqv8rCEeNYiuPPqr/
ghRukn+C/Wtr7JYBsX+jMjxtmSzYtwBOihwLoZCH9pp3C5jTvyQk9s8n1j13V2RK
sAFtfpCVoQ8tTa7IITKRMfftzHn1WiPlPsj6VbigJ6A4N98csgv7x2rF7FyqcF0X
aoj69nQ3i/4=
=8iy3
-----END PGP SIGNATURE-----
Merge tag 'md-3.10-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.
However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward. This can particularly cause problems or dm-raid.
So change __md_stop_writes() to always freeze recovery. This is safe
and more predicatable.
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Tested-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
The value passed is 0 in all but "it can never happen" cases (and those
only in a couple of drivers) *and* it would've been lost on the way
out anyway, even if something tried to pass something meaningful.
Just don't bother.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Maintenance of a bad-block-list currently defaults to 'enabled'
and is then disabled when it cannot be supported.
This is backwards and causes problem for dm-raid which didn't know
to disable it.
So fix the defaults, and only enabled for v1.x metadata which
explicitly has bad blocks enabled.
The problem with dm-raid has been present since badblock support was
added in v3.1, so this patch is suitable for any -stable from 3.1
onwards.
Cc: stable@vger.kernel.org (3.1+)
Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD: Export 'md_reap_sync_thread' function
Make 'md_reap_sync_thread' available to other files, specifically dm-raid.c.
- rename reap_sync_thread to md_reap_sync_thread
- move the fn after md_check_recovery to match md.h declaration placement
- export md_reap_sync_thread
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
read-only arrays should stay that way as much as possible.
Updating the metadata - which could be triggered by a re-add
while assembling the array metadata - should be avoided.
Signed-off-by: NeilBrown <neilb@suse.de>
When assembling an array incrementally we might want to make
it device available when "enough" devices are present, but maybe
not "all" devices are present.
If the remaining devices appear before the array is actually used,
they should be added transparently.
We do this by using the "read-auto" mode where the array acts like
it is read-only until a write request arrives.
Current an add-device request switches a read-auto array to active.
This means that only one device can be added after the array is first
made read-auto. This isn't a problem for RAID5, but is not ideal for
RAID6 or RAID10.
Also we don't really want to switch the array to read-auto at all
when re-adding a device as this doesn't really imply any change.
So:
- remove the "md_update_sb()" call from add_new_disk(). This isn't
really needed as just adding a disk doesn't require a metadata
update. Instead, just set MD_CHANGE_DEVS. This will effect a
metadata update soon enough, once the array is not read-only.
- Allow the ADD_NEW_DISK ioctl to succeed without activating a
read-auto array, providing the MD_DISK_SYNC flag is set.
In this case, the device will be rejected if it cannot be added
with the correct device number, or has an incorrect event count.
- Teach remove_and_add_spares() to be careful about adding spares
when the array is read-only (or read-mostly) - only add devices
that are thought to be in-sync, and only do it if the array is
in-sync itself.
- In md_check_recovery, use remove_and_add_spares in the read-only
case, rather than open coding just the 'remove' part of it.
Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
If a fail device or a spare is removed from an array, there is
not need to make the array 'active'. If/when the array does become
active for some other reason the metadata will be update to reflect
the removal.
If that never happens and the array is stopped while still read-auto,
then there is no loss in forgetting the that the device had 'failed'.
A read-only array will leave failed devices attached to
the array personality, so we need to explicitly call
remove_and_add_spares() to free it (clearing Blocked just
like we do in store_slot()).
Signed-off-by: NeilBrown <neilb@suse.de>
slot_store and remove_and_add_spares both call ->hot_remove_disk(),
but with slightly different tests and consequences, which is
at least untidy and might be buggy.
So modify remove_and_add_spaces() so that it can be asked
to remove a specific device, and call it from slot_store().
We also clear the Blocked flag to ensure that doesn't prevent
removal. The purpose of Blocked is to prevent automatic removal
by the kernel before an error is acknowledged.
If the array is read/write then user-space would have not reason
to remove a device unless it was known to be 'spare' or 'faulty' in
which it would have already cleared the Blocked flag.
If the array is read-only, the flag might still be blocked, but
there is no harm in clearing the flag for read-only arrays.
Signed-off-by: NeilBrown <neilb@suse.de>
Normally we don't even try to update the metadata if
the array is read-only. However future patches
will increase the number of things that can happen on a read-only
array, so it is safest to explicitly disable this.
Every time that mddev->ro is set to 0, either
- md_update_sb will be called again (at least if MD_CHANGE_DEVS
is set) or
- the mddev->thread is scheduled, which will also run
md_update_sb if needed.
So this is safe: if the array ever become read-write the
metadata will be updated.
Signed-off-by: NeilBrown <neilb@suse.de>
Tejun writes:
-----
This is the pull request for the earlier patchset[1] with the same
name. It's only three patches (the first one was committed to
workqueue tree) but the merge strategy is a bit involved due to the
dependencies.
* Because the conversion needs features from wq/for-3.10,
block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts
with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those
workqueue conflicts from flaring up in block tree.
* Resolving the issue that Jan and Dave raised about debugging
requires arch-wide changes. The patchset is being worked on[2] but
it'll have to go through -mm after these changes show up in -next,
and not included in this pull request.
The three commits are located in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue
Pulling it into block/for-3.10/core produces a conflict in
drivers/md/raid5.c between the following two commits.
e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available")
2f6db2a707 ("raid5: use bio_reset()")
The conflict is trivial - one removes an "if ()" conditional while the
other removes "rbi->bi_next = NULL" right above it. We just need to
remove both. The merged branch is available at
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge
so that you can use it for verification. The test merge commit has
proper merge description.
While these changes are a bit of pain to route, they make code simpler
and even have, while minute, measureable performance gain[3] even on a
workload which isn't particularly favorable to showing the benefits of
this conversion.
----
Fixed up the conflict.
Conflicts:
drivers/md/raid5.c
Signed-off-by: Jens Axboe <axboe@kernel.dk>
MD: Prevent sysfs operations on uninitialized kobjects
Device-mapper does not use sysfs; but when device-mapper is leveraging
MD's RAID personalities, MD sometimes attempts to update sysfs. This
patch adds checks for 'mddev-kobj.sd' in sysfs_[un]link_rdev to ensure
it is about to operate on something valid. This patch also checks for
'mddev->kobj.sd' before calling 'sysfs_notify' in 'remove_and_add_spares'.
Although 'sysfs_notify' already makes this check, doing so in
'remove_and_add_spares' prevents an additional mutex operation.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If something has failed while the array was read-auto,
then when we switch to 'active' we need to update the metadata.
This will happen anyway but it is good to expedite it, and
also to ensure any failed device has been released by the
underlying device before we try to action the ioctl which
caused us to switch to 'active' mode.
Reported-by: Joe Lawrence <Joe.Lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
You cannot resize a RAID0 array (in terms of making the devices
bigger), but the code doesn't entirely stop you.
So:
disable setting of the available size on each device for
RAID0 and Linear devices. This must not change as doing so
can change the effective layout of data.
Make sure that the size that raid0_size() reports is accurate,
but rounding devices sizes to chunk sizes. As the device sizes
cannot change now, this isn't so important, but it is best to be
safe.
Without this change:
mdadm --grow /dev/md0 -z max
mdadm --grow /dev/md0 -Z max
then read to the end of the array
can cause a BUG in a RAID0 array.
These bugs have been present ever since it became possible
to resize any device, which is a long time. So the fix is
suitable for any -stable kerenl.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
If an fsync occurs on a read-only array, we need to send a
completion for the IO and may not increment the active IO count.
Otherwise, we hit a bug trace and can't stop the MD array anymore.
By advice of Christoph Hellwig we return success upon a flush
request but we return -EROFS for other writes.
We detect flush requests by checking if the bio has zero sectors.
This patch is suitable to any -stable kernel to which it applies.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Paul Menzel <paulepanter@users.sourceforge.net>
Signed-off-by: NeilBrown <neilb@suse.de>
Mostly just little fixes. Probably biggest part is
AVX accelerated RAID6 calculations.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUM/w2Dnsnt1WYoG5AQKXlg/9F5juv4CjRkRRFLqZgOPBLmn/s/2Vspgh
2Kv8Jcyixd8jUQNbobZv0ahlJH/iSU61kpOE8QjLbKi5Y42vAbM0ZU2aHJ6nqGZy
HiTI8K+7kTvCK3ZXLcUQ+4oPPBNTcoTZbLWaEOmIqB1ruLddoIR7M9fG3PspVeG0
jijnXR8IfL6mr4YDXnJkEhFrneTysVik05RkKYZKyM/9r3stAoMJ9o0/EFy3OFxb
lO6mLEtvjVArXcnuf1RMCw2YKgki9Y4r73HCplgQsVFvcxcpsya4gFF+lRR5j7cO
/eMYbSQ89iWEYKh1dJ9u1nofc8fX5ia71QQyO1fkO4GXRHXPVIyBgKSbe7SaL6iG
JUMm7idUV2rZGeq3ln3k8Yor4QqHvN1n7pRKKUF+ZdsPoQ1B/TABu+qpsAdo5ZhP
fxDsULsHrzEaxgetd4V8F2Uptca9ni43sMI8mwsvVlA0p6SOzMIyoJLC9xAZpx11
b3H3+7Oje/fasmszBoq5B9uAlSt9XXVN4DDn2q6cX+S96JSX6jcsN1c6cJBO+ZxB
OU6a6P5mnU6HuxU02rspe7G8BeU+ybaonErOW+GdyC4r7M/cImC0dSp0NGHK2211
oqu0xBx/Q/ddTFwKQqa4HzR2ws09+LhKbjdqYIhCEKttIbLIAjf73ARZ19XPSRRX
pDR/ey2CB6E=
=uK52
-----END PGP SIGNATURE-----
Merge tag 'md-3.8' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly just little fixes. Probably biggest part is AVX accelerated
RAID6 calculations."
* tag 'md-3.8' of git://neil.brown.name/md:
md/raid5: add blktrace calls
md/raid5: use async_tx_quiesce() instead of open-coding it.
md: Use ->curr_resync as last completed request when cleanly aborting resync.
lib/raid6: build proper files on corresponding arch
lib/raid6: Add AVX2 optimized gen_syndrome functions
lib/raid6: Add AVX2 optimized recovery functions
md: Update checkpoint of resync/recovery based on time.
md:Add place to update ->recovery_cp.
md.c: re-indent various 'switch' statements.
md: close race between removing and adding a device.
md: removed unused variable in calc_sb_1_csm.
Pull block driver update from Jens Axboe:
"Now that the core bits are in, here are the driver bits for 3.8. The
branch contains:
- A huge pile of drbd bits that were dumped from the 3.7 merge
window. Following that, it was both made perfectly clear that
there is going to be no more over-the-wall pulls and how the
situation on individual pulls can be improved.
- A few cleanups from Akinobu Mita for drbd and cciss.
- Queue improvement for loop from Lukas. This grew into adding a
generic interface for waiting/checking an even with a specific
lock, allowing this to be pulled out of md and now loop and drbd is
also using it.
- A few fixes for xen back/front block driver from Roger Pau Monne.
- Partition improvements from Stephen Warren, allowing partiion UUID
to be used as an identifier."
* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
drbd: update Kconfig to match current dependencies
drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
drbd: close race between drbd_set_role and drbd_connect
drbd: respect no-md-barriers setting also when changed online via disk-options
drbd: Remove obsolete check
drbd: fixup after wait_even_lock_irq() addition to generic code
loop: Limit the number of requests in the bio list
wait: add wait_event_lock_irq() interface
xen-blkfront: free allocated page
xen-blkback: move free persistent grants code
block: partition: msdos: provide UUIDs for partitions
init: reduce PARTUUID min length to 1 from 36
block: store partition_meta_info.uuid as a string
cciss: use check_signature()
cciss: cleanup bitops usage
drbd: use copy_highpage
drbd: if the replication link breaks during handshake, keep retrying
drbd: check return of kmalloc in receive_uuids
drbd: Broadcast sync progress no more often than once per second
drbd: don't try to clear bits once the disk has failed
...
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
If a resync is aborted cleanly, ->curr_resync is a reliable
record of where we got up to.
If there was an error it is less reliable but we always know that
->curr_resync_completed is safe.
So add a flag MD_RECOVERY_ERROR to differentiate between these cases
and set recovery_cp accordingly.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
md will current only only checkpoint recovery or resync ever 1/16th
of the device size. As devices get larger this can become a long time
an so a lot of work that might need to be duplicated after a shutdown.
So add a time-based checkpoint. Every 5 minutes limits the amount of
duplicated effort to at most 5 minutes, and has almost zero impact on
performance.
[changelog entry re-written by NeilBrown]
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In resyncing, recovery_cp only updated when resync aborted or completed.
But in md drives,many place used it to judge.So add a place to update.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When we remove a device from an md array, the final removal of
the "dev-XX" sys entry is run asynchronously.
If we then re-add that device immediately before the worker thread
gets to run, we can end up trying to add the "dev-XX" sysfs entry back
before it has been removed.
So in both places where we add a device, call
flush_workqueue(md_misc_wq);
before taking the md lock (as holding the md lock can prevent removal
to complete).
Signed-off-by: NeilBrown <neilb@suse.de>
New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
moves the private wait_event_lock_irq() macro from MD to regular wait
includes, introduces new macro wait_event_lock_irq_cmd() instead of using
the old method with omitting cmd parameter which is ugly and makes a use
of new macros in the MD. It also introduces the _interruptible_ variant.
The use of new interface is when one have a special lock to protect data
structures used in the condition, or one also needs to invoke "cmd"
before putting it to sleep.
All new macros are expected to be called with the lock taken. The lock
is released before sleep and is reacquired afterwards. We will leave the
macro with the lock held.
Note to DM: IMO this should also fix theoretical race on waitqueue while
using simultaneously wait_event_lock_irq() and wait_event() because of
lack of locking around current state setting and wait queue removal.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
md_stop() would stop an array, but not free various attached
data structures.
For internal arrays, these are freed later in do_md_stop() or
mddev_put(), but they don't apply for dm-raid arrays.
So get md_stop() to free them, and only all it from dm-raid.
For internal arrays we now call __md_stop.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If read_seqretry returned true and bbp was changed, it will write
invalid address which can cause some serious problem.
This bug was introduced by commit v3.0-rc7-130-g2699b67.
So fix is suitable for 3.0.y thru 3.6.y.
Reported-by: zhuwenfeng@kedacom.com
Tested-by: zhuwenfeng@kedacom.com
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This bug was introduced by commit(v3.0-rc7-126-g2230dfe).
So fix is suitable for 3.0.y thru 3.6.y.
Cc: stable@vger.kernel.org
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
"discard" support, some dm-raid improvements and other assorted
bits and pieces.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUAUHk6Rjnsnt1WYoG5AQKovQ//Ym0ROo5a6uekb2USLyFSdQH3TC7z0v0+
+kujrgoc4nHZU/vj5yfMvPVomEUsAhHEwTkvvCiXFFHn6cxPzC8ezm8d40xEeISX
qp6i2bPlvGURhsW1tYeD+THtY82/oyzQ4Wa/vaE1sjVLQ+caa2q7kVVgAL9Bj/Kz
aESIZjAuPxQNE1674/KR0EmMFcbpd0z1WDV+ydKlRV5jHCHGYf8OmxOenJFf+V/b
/f9p2u+NUq5BN5WLhThcysO8lPX1Y7GG8IYay3DlSt/crU24R2a2j0qh/BDoK8+t
/DceoHipbIiGxXLVjM7y+1RwPpCh75HJSZQHltPype2Z3iwtwEth9uTkEE3M2h/W
tOQEbOZku0kcgsrys7JBmpkBwkR9oZqq1kDd4YBzqW4PiGVP6z0JRH8QpjjB+mjN
47ODYIZcaEYZ+0Jj8kcVxo3gv4Xj4DWH+auSNZihTVmjQPVqrcy3CAt3CkuDzTkY
34fZVuCDiCetLGCGQKrwfMDnySVy5xOmtC6iWsEY5rExAeb0E+BCzcBvbAXzt+ef
MPDsrxWbo/ZkvpuwXOwLFTccBuRtAsFi7CM4jcow53W6XMnPpdubphNw5nylaEm1
DEzfID58mv8VHWRuW15vr7SbtROjYJkEFCIaEK3oprrRUYftZntIABcknqvcIYR+
/ULNzkRU1w4=
=XRmL
-----END PGP SIGNATURE-----
Merge tag 'md-3.7' of git://neil.brown.name/md
Pull md updates from NeilBrown:
- "discard" support, some dm-raid improvements and other assorted bits
and pieces.
* tag 'md-3.7' of git://neil.brown.name/md: (29 commits)
md: refine reporting of resync/reshape delays.
md/raid5: be careful not to resize_stripes too big.
md: make sure manual changes to recovery checkpoint are saved.
md/raid10: use correct limit variable
md: writing to sync_action should clear the read-auto state.
Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races
md/raid5: make sure to_read and to_write never go negative.
md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.
md/raid5: protect debug message against NULL derefernce.
md/raid5: add some missing locking in handle_failed_stripe.
MD: raid5 avoid unnecessary zero page for trim
MD: raid5 trim support
md/bitmap:Don't use IS_ERR to judge alloc_page().
md/raid1: Don't release reference to device while handling read error.
raid: replace list_for_each_continue_rcu with new interface
add further __init annotations to crypto/xor.c
DM RAID: Fix for "sync" directive ineffectiveness
DM RAID: Fix comparison of index and quantity for "rebuild" parameter
DM RAID: Add rebuild capability for RAID10
DM RAID: Move 'rebuild' checking code to its own function
...
If 'resync_max' is set to 0 (as is often done when starting a
reshape, so the mdadm can remain in control during a sensitive
period), and if the reshape request is initially delayed because
another array using the same array is resyncing or reshaping etc,
when user-space cannot easily tell when the delay changes from being
due to a conflicting reshape, to being due to resync_max = 0.
So introduce a new state: (curr_resync == 3) to reflect this, make
sure it is visible both via /proc/mdstat and via the "sync_completed"
sysfs attribute, and ensure that the event transition from one delay
state to the other is properly notified.
Signed-off-by: NeilBrown <neilb@suse.de>
If you make an array bigger but suppress resync of the new region with
mdadm --grow /dev/mdX --size=max --assume-clean
then stop the array before anything is written to it, the effect of
the "--assume-clean" is lost and the array will resync the new space
when restarted.
So ensure that we update the metadata in the case.
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In some cases array are started in 'read-auto' state where in
nothing gets written to any device until the array is written
to. The purpose of this is to make accidental auto-assembly
of the wrong arrays less of a risk, and to allow arrays to be
started to read suspend-to-disk images without actually changing
anything (as might happen if the array were dirty and a
resync seemed necessary).
Explicitly writing the 'sync_action' for a read-auto array currently
doesn't clear the read-auto state, so the sync action doesn't
happen, which can be confusing.
So allow any successful write to sync_action to clear any read-auto
state.
Reported-by: Alexander Kühn <alexander.kuehn@nagilum.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID10: Fix a couple potential kernel panics if RAID10 is used by dm-raid
When device-mapper uses the RAID10 personality through dm-raid.c, there is no
'gendisk' structure in mddev and some sysfs information is also not populated.
This patch avoids touching those non-existent structures.
Signed-off-by: Jonathan Brassow <jbrassow@rehdat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Some ioctls don't need to take the mutex and doing so can cause
a delay as it is held during super-block update.
So move those ioctls out of the mutex and rely on rcu locking
to ensure we don't access stale data.
Signed-off-by: NeilBrown <neilb@suse.de>
Change the thread parameter, so the thread can carry extra info. Next patch
will use it.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Pull block IO update from Jens Axboe:
"Core block IO bits for 3.7. Not a huge round this time, it contains:
- First series from Kent cleaning up and generalizing bio allocation
and freeing.
- WRITE_SAME support from Martin.
- Mikulas patches to prevent O_DIRECT crashes when someone changes
the block size of a device.
- Make bio_split() work on data-less bio's (like trim/discards).
- A few other minor fixups."
Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
Morton. It is due to the VM no longer using a prio-tree (see commit
6b2dbba8b6: "mm: replace vma prio_tree with an interval tree").
So make set_blocksize() use mapping_mapped() instead of open-coding the
internal VM knowledge that has changed.
* 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
block: makes bio_split support bio without data
scatterlist: refactor the sg_nents
scatterlist: add sg_nents
fs: fix include/percpu-rwsem.h export error
percpu-rw-semaphore: fix documentation typos
fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
blockdev: turn a rw semaphore into a percpu rw semaphore
Fix a crash when block device is read and block size is changed at the same time
block: fix request_queue->flags initialization
block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
block: ioctl to zero block ranges
block: Make blkdev_issue_zeroout use WRITE SAME
block: Implement support for WRITE SAME
block: Consolidate command flag and queue limit checks for merges
block: Clean up special command handling logic
block/blk-tag.c: Remove useless kfree
block: remove the duplicated setting for congestion_threshold
block: reject invalid queue attribute values
block: Add bio_clone_bioset(), bio_clone_kmalloc()
block: Consolidate bio_alloc_bioset(), bio_kmalloc()
...
It isn't always necessary to update the metadata when spares are
removed as the presence-or-not of a spare isn't really important to
the integrity of an array.
Also activating a spare doesn't always require updating the metadata
as the update on 'recovery-completed' is usually sufficient.
However the introduction of 'replacement' devices have made these
transitions sometimes more important. For example the 'Replacement'
flag isn't cleared until the original device is removed, so we need
to ensure a metadata update after that 'spare' is removed.
So set MD_CHANGE_DEVS whenever a spare is activated or removed, to
complement the current situation where it is set when a spare is added
or a device is failed (or a number of other less common situations).
This is suitable for -stable as out-of-data metadata could lead
to data corruption.
This is only relevant for 3.3 and later 9when 'replacement' as
introduced.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().
This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch adedd.
This will also help in a later patch changing how bio cloning works.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that bios keep track of where they were allocated from,
bio_integrity_alloc_bioset() becomes redundant.
Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
related functions and make them use bio->bi_pool.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.
This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.
v6: Explain the temporary if statement in bio_put
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Nicholas Bellinger <nab@linux-iscsi.org>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit 27a7b260f7
md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.
changed 0.90 metadata handling to truncated size to 4TB as that is
all that 0.90 can record.
However for RAID0 and Linear, 0.90 doesn't need to record the size, so
this truncation is not needed and causes working arrays to become too small.
So avoid the truncation for RAID0 and Linear
This bug was introduced in 3.1 and is suitable for any stable kernels
from then onwards.
As the offending commit was tagged for 'stable', any stable kernel
that it was applied to should also get this patch. That includes
at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for
providing that list).
Cc: stable@vger.kernel.org
Signed-off-by: Neil Brown <neilb@suse.de>
Pull block driver changes from Jens Axboe:
- Making the plugging support for drivers a bit more sane from Neil.
This supersedes the plugging change from Shaohua as well.
- The usual round of drbd updates.
- Using a tail add instead of a head add in the request completion for
ndb, making us find the most completed request more quickly.
- A few floppy changes, getting rid of a duplicated flag and also
running the floppy init async (since it takes forever in boot terms)
from Andi.
* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
floppy: remove duplicated flag FD_RAW_NEED_DISK
blk: pass from_schedule to non-request unplug functions.
block: stack unplug
blk: centralize non-request unplug handling.
md: remove plug_cnt feature of plugging.
block/nbd: micro-optimization in nbd request completion
drbd: announce FLUSH/FUA capability to upper layers
drbd: fix max_bio_size to be unsigned
drbd: flush drbd work queue before invalidate/invalidate remote
drbd: fix potential access after free
drbd: call local-io-error handler early
drbd: do not reset rs_pending_cnt too early
drbd: reset congestion information before reporting it in /proc/drbd
drbd: report congestion if we are waiting for some userland callback
drbd: differentiate between normal and forced detach
drbd: cleanup, remove two unused global flags
floppy: Run floppy initialization asynchronous
This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.
So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.
This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.
Removing this will make it easier to centralise the unplug code, and
will clear the other for other unplug enhancements which have a
measurable effect.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
do_md_stop tests mddev->openers while holding ->open_mutex,
and fails if this count is too high.
So callers do not need to check mddev->openers and doing so isn't
very meaningful as they don't hold ->open_mutex so the number could
change.
So remove the unnecessary tests on mddev->openers.
These are not called often enough for there to be any gain in
an early test on ->open_mutex to avoid the need for a slightly more
costly mutex_lock call.
Signed-off-by: NeilBrown <neilb@suse.de>
md will refuse to stop an array if any other fd (or mounted fs) is
using it.
When any fs is unmounted of when the last open fd is closed all
pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
so there will be no pending IO to worry about when the array is
stopped.
However in order to send the STOP_ARRAY ioctl to stop the array one
must first get and open fd on the block device.
If some fd is being used to write to the block device and it is closed
after mdadm open the block device, but before mdadm issues the
STOP_ARRAY ioctl, then there will be no last-close on the md device so
__blkdev_put will not call sync_blockdev.
If this happens, then IO can still be in-flight while md tears down
the array and bad things can happen (use-after-free and subsequent
havoc).
So in the case where do_md_stop is being called from an open file
descriptor, call sync_block after taking the mutex to ensure there
will be no new openers.
This is needed when setting a read-write device to read-only too.
Cc: stable@vger.kernel.org
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit c6563a8c38
md: add possibility to change data-offset for devices.
introduced a 'new_data_offset' attribute which should normally
be the same as 'data_offset', but can be explicitly set to a different
value to allow a reshape operation to move the data.
Unfortunately when the 'data_offset' is explicitly set through
sysfs, the new_data_offset is not also set, so the two would become
out-of-sync incorrectly.
One result of this is that trying to set the 'size' after the
'data_offset' would fail because it is not permitted to set the size
when the 'data_offset' and 'new_data_offset' are different - as that
can be confusing.
Consequently when mdadm tried to do this while assembling an IMSM
array it would fail.
This bug was introduced in 3.5-rc1.
Reported-by: Brian Downing <bdowning@lavos.net>
Bisected-by: Brian Downing <bdowning@lavos.net>
Tested-by: Brian Downing <bdowning@lavos.net>
Signed-off-by: NeilBrown <neilb@suse.de>
We currently only allow a device to be re-added if it appear to be
in-sync. This is overly restrictive as it may be desirable to re-add
a device that is in the middle of recovery.
So remove the test for "InSync" - the test on rdev->raid_disk is
sufficient to ensure that the re-add will succeed.
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Having the 'name' arg optional and defaulting to the current
personality name is no necessary and leads to errors, as when
changing the level of an array we can end up using the
name of the old level instead of the new one.
So make it non-optional and always explicitly pass the name
of the level that the array will be.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Check the return of mddev_find(), since it may fail due to out of
memeory or out of usable minor number.
The reason I chose -ENODEV instead of -ENOMEM or something else is
md_alloc() function chose that ;)
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Properly initialize MD recovery flags when resuming device-mapper devices.
When a device-mapper device is suspended, all I/O must stop. This is done by
calling 'md_stop_writes' and 'mddev_suspend'. These calls in-turn manipulate
the recovery flags - including setting 'MD_RECOVERY_FROZEN'. The DM device
may have been suspended while recovery was not yet complete, so the process
needs to pick-up where it left off. Since 'mddev_resume' does not unset
'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves.
'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN'
must be set outside of 'mddev_resume' due to how MD handles RAID reshaping.
(e.g. It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully
setting 'MD_RECOVERY_FROZEN'. Clearing it in 'mddev_resume' would override the
desired behavior.)
Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)'
there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'.
Also clean up where level_store calls mddev_resume() - it current
duplicates some of the funcitons of that call. - NB
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Now that bitmaps can be resized, we can allow an array to be resized
while the bitmap is present.
This only covers resizing that involves changing the effective size
of member devices, not resizing that changes the number of devices.
Signed-off-by: NeilBrown <neilb@suse.de>
This new 'struct bitmap_storage' reflects the external storage of the
bitmap.
Having this clearly defined will make it easier to change the storage
used while the array is active.
Signed-off-by: NeilBrown <neilb@suse.de>
An md bitmap comprises two parts
- internal counting of active writes per 'chunk'.
- external storage of whether there are any active writes on
each chunk
The second requires the first, but the first doesn't require the
second.
Not having backing storage means that the bitmap cannot expedite
resync after a crash, but it still allows us to expedite the recovery
of a recently-removed device.
So: allow a bitmap to exist even if there is no backing device.
In that case we default to 128M chunks.
A particular value of this is that we can remove and re-add a bitmap
(possibly of a different granularity) on a degraded array, and not
lose the information needed to fast-recover the missing device.
We don't actually activate these bitmaps yet - that will come
in a later patch.
Signed-off-by: NeilBrown <neilb@suse.de>
If we are to allow bitmaps to be resized when the array is resized,
we need to know how much space there is.
So create an attribute to store this information and set appropriate
defaults.
It can be set more precisely via sysfs, or future metadata extensions
may allow it to be recorded.
Signed-off-by: NeilBrown <neilb@suse.de>
This ensures that it is always freed - there were case where
we failed to free the page.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
dm-raid currently open-codes the freeing of some members of
and rdev. It is more maintainable to have it call common code
from md.c which does this for all call-sites.
So remove free_disk_sb to md_rdev_clear, export it, and use it in
dm-raid.c
Signed-off-by: NeilBrown <neilb@suse.de>
Some resync type operations need to act on the address space of the
device, others on the address space of the array.
This only affects RAID10, so it sets resync_max_sectors to the array
size (it defaults to the device size), and that is currently used for
resync only. However reshape of a RAID10 must be done against the
array size, not device size, so change code to use resync_max_sectors
for both the resync and the reshape cases.
This does not affect RAID5 or RAID1, just RAID10.
Signed-off-by: NeilBrown <neilb@suse.de>
Some code in raid1 and raid10 use sync_page_io to
read/write pages when responding to read errors.
As we will shortly support changing data_offset for
raid10, this function must understand new_data_offset.
So add that understanding.
Signed-off-by: NeilBrown <neilb@suse.de>
When reshaping we can avoid costly intermediate backup by
changing the 'start' address of the array on the device
(if there is enough room).
So as a first step, allow such a change to be requested
through sysfs, and recorded in v1.x metadata.
(As we didn't previous check that all 'pad' fields were zero,
we need a new FEATURE flag for this.
A (belatedly) check that all remaining 'pad' fields are
zero to avoid a repeat of this)
The new data offset must be requested separately for each device.
This allows each to have a different change in the data offset.
This is not likely to be used often but as data_offset can be
set per-device, new_data_offset should be too.
This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
it is never used and never will be. At the same time we add a new
arg ('in_new') which is currently always zero but will be used more
soon.
When a reshape finishes we will need to update the data_offset
and rdev->sectors. So provide an exported function to do that.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently a reshape operation always progresses from the start
of the array to the end unless the number of devices is being
reduced, in which case it progressed in the opposite direction.
To reverse a partial reshape which changes the number of devices
you can stop the array and re-assemble with the raid-disks numbers
reversed and it will undo.
However for a reshape that does not change the number of devices
it is not possible to reverse the reshape in the middle - you have to
wait until it completes.
So add a 'reshape_direction' attribute with is either 'forwards' or
'backwards' and can be explicitly set when delta_disks is zero.
This will become more important when we allow the data_offset to
change in a reshape. Then the explicit statement of what direction is
being used will be more useful.
This can be enabled in raid5 trivially as it already supports
reverse reshape and just needs to use a different trigger to request it.
Signed-off-by: NeilBrown <neilb@suse.de>
A flush request is usually issued in transaction commit code path, so
using GFP_KERNEL to allocate memory for flush request bio falls into
the classic deadlock issue.
This is suitable for any -stable kernel to which it applies as it
avoids a possible deadlock.
Cc: stable@vger.kernel.org
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Use del_timer_sync to remove timer before mddev_suspend finishes.
We don't want a timer going off after an mddev_suspend is called. This is
especially true with device-mapper, since it can call the destructor function
immediately following a suspend. This results in the removal (kfree) of the
structures upon which the timer depends - resulting in a very ugly panic.
Therefore, we add a del_timer_sync to mddev_suspend to prevent this.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
commit c744a65c1e
md: don't set md arrays to readonly on shutdown.
removed the possibility of a 'BUG' when data is written to an array
that has just been switched to read-only, but also introduced the
possibility that the array metadata could be corrupted.
If, when md_notify_reboot gets the mddev lock, the array is
in a state where it is assembled but hasn't been started (as can
happen if the personality module is not available, or in other unusual
situations), then incorrect metadata will be written out making it
impossible to re-assemble the array.
So only call __md_stop_writes() if the array has actually been
activated.
This patch is needed for any stable kernel which has had the above
commit applied.
Cc: stable@vger.kernel.org
Reported-by: Christoph Nelles <evilazrael@evilazrael.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 7bfec5f35c
md/raid5: If there is a spare and a want_replacement device, start replacement.
cause md_check_recovery to call ->add_disk much more often.
Instead of only when the array is degraded, it is now called whenever
md_check_recovery finds anything useful to do, which includes
updating the metadata for clean<->dirty transition.
This causes unnecessary work, and causes info messages from ->add_disk
to be reported much too often.
So refine md_check_recovery to only do any actual recovery checking
(including ->add_disk) if MD_RECOVERY_NEEDED is set.
This fix is suitable for 3.3.y:
Cc: stable@vger.kernel.org
Reported-by: Jan Ceuleers <jan.ceuleers@computer.org>
Signed-off-by: NeilBrown <neilb@suse.de>
If there are no unacked bad blocks, then there is no point searching
for them to acknowledge them.
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In super_1_sync (the first hunk) we need to clear 'changed' before
checking read_seqretry(), otherwise we might race with other code
adding a bad block and so won't retry later.
In md_update_sb (the second hunk), in the case where there is no
metadata (neither persistent nor external), we treat any bad blocks as
an error. However we need to clear the 'changed' flag before calling
md_ack_all_badblocks, else it won't do anything.
This patch is suitable for -stable release 3.0 and later.
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
The part of /proc/mdstat which describes the bitmap should really
be generated by code in bitmap.c. So move it there.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently we don't honour merge_bvec_fn in member devices so if there
is one, we force all requests to be single-page at most.
This is not ideal.
So enhance the raid10 merge_bvec_fn to check that function in children
as well.
This introduces a small problem. There is no locking around calls
the ->merge_bvec_fn and subsequent calls to ->make_request. So a
device added between these could end up getting a request which
violates its merge_bvec_fn.
Currently the best we can do is synchronize_sched(). This will work
providing no preemption happens. If there is preemption, we just
have to hope that new devices are largely consistent with old devices.
Signed-off-by: NeilBrown <neilb@suse.de>
md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
mddev. However it uses the 'safe' version of list_for_each_entry,
and so requires the extra variable, but doesn't include 'safe' in the
name, which is useful documentation.
Consequently some places use this safe version without needing it, and
many use an explicity list_for_each entry.
So:
- rename rdev_for_each to rdev_for_each_safe
- create a new rdev_for_each which uses the plain
list_for_each_entry,
- use the 'safe' version only where needed, and convert all other
list_for_each_entry calls to use rdev_for_each.
Signed-off-by: NeilBrown <neilb@suse.de>
It seems that with recent kernel, writeback can still be happening
while shutdown is happening, and consequently data can be written
after the md reboot notifier switches all arrays to read-only.
This causes a BUG.
So don't switch them to read-only - just mark them clean and
set 'safemode' to '2' which mean that immediately after any
write the array will be switch back to 'clean'.
This could result in the shutdown happening when array is marked
dirty, thus forcing a resync on reboot. However if you reboot
without performing a "sync" first, you get to keep both halves.
This is suitable for any stable kernel (though there might be some
conflicts with obvious fixes in earlier kernels).
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
1/ two small fixes to ensure we handle an interrupted resync properly.
2/ avoid loading the bitmap multiple times in dm-raid
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUATzMdiTnsnt1WYoG5AQKICw/9H3Xf/3crCCVRQ+yzSdZ1ZJH24Rps9O6W
8dLFN4/Ng/qxymWUMrgHAMq5MEEz2M3i7W+j23lFv6Oce06y8GJ4PpoYY5xlXCgO
SIU1BaO1JFHxQn89EQtP3iOn4AOiZvX0GUObR0P8KO1mMnLmN7cg8J1kBfmQiBKu
aXcUqqNvcywoix6ve4O/xgnZjd4IExxqG3W8U7CaIwExUDwaLY4NckxJcIJbIYy9
iapOGMUdcyr6xm819V/xE2DyAtfFCtvAk1hfW/dM4QQctran3MzQIRFn9RW+CwHU
ComEnv5ti/7g//JPXQArUPk4xgRHrMhqFcmmD8rozJ6FJDi8vw2e0BXaRLVqa0mK
1qSZkr0Ot3nwAdILzgSbNXQ0Y5OJgc9OLX5GGlVibTW2VTJYFgA7jAsnqq8PAJC5
sU5h2K3jrSy2unGy6BxleL5D/wvREE5OBnW35TEB5TYbxjp1FLgn+BWp8FfFUYWT
Eb2cIyAj6cBFJ3ma1K0RH0dmS9cbNjuG+CLiApJOnEEsXzrp/4KnqOwg4672ewW3
m1Ue2Qv+0avaK3sVyT+qzuemc6b0ps/dix0gMXw2pYqXQWHquW5NdUJcgD2DKFSn
BB734nUP6KlPg0IFh1eehRHyVRLIAot/uBlUJ3bMx9xeYCkKa+twX90u6EmjTopP
JjLxNsf6c2I=
=k0Xz
-----END PGP SIGNATURE-----
Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Some simple md-related fixes.
1/ two small fixes to ensure we handle an interrupted resync properly.
2/ avoid loading the bitmap multiple times in dm-raid
* tag 'md-3.3-fixes' of git://neil.brown.name/md:
md: two small fixes to handling interrupt resync.
Prevent DM RAID from loading bitmap twice.
1/ If a resync is aborted we should record how far we got
(recovery_cp) the last request that we know has completed
(->curr_resync_completed) rather than the last request that was
submitted (->curr_resync).
2/ When a resync aborts we still want to update the metadata with
any changes, so set MD_CHANGE_DEVS even if we 'skip'.
Signed-off-by: NeilBrown <neilb@suse.de>
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
Revert "block: recursive merge requests"
block: Stop using macro stubs for the bio data integrity calls
blockdev: convert some macros to static inlines
fs: remove unneeded plug in mpage_readpages()
block: Add BLKROTATIONAL ioctl
block: Introduce blk_set_stacking_limits function
block: remove WARN_ON_ONCE() in exit_io_context()
block: an exiting task should be allowed to create io_context
block: ioc_cgroup_changed() needs to be exported
block: recursive merge requests
block, cfq: fix empty queue crash caused by request merge
block, cfq: move icq creation and rq->elv.icq association to block core
block, cfq: restructure io_cq creation path for io_context interface cleanup
block, cfq: move io_cq exit/release to blk-ioc.c
block, cfq: move icq cache management to block core
block, cfq: move io_cq lookup to blk-ioc.c
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
block: remove elevator_queue->ops
block: reorder elevator switch sequence
...
Fix up conflicts in:
- block/blk-cgroup.c
Switch from can_attach_task to can_attach
- block/cfq-iosched.c
conflict with now removed cic index changes (we now use q->id instead)
One is a recently introduced regression that affects an unusual
configuration with a guaranteed BUG_ON. Has been tagged for -stable.
The other is minor missing functionality.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUATwy6Sznsnt1WYoG5AQL+5A//TbTgElZaJ7IMY4q658afuRNtuWfevqTs
4EoSUvarwyZN20JxUd4dFTzLQ3nu3XVmwZsDBbpRs7+Dt2m7Efp4qytqrTxHb6SR
4gOr1KFXZi2rQFNpIg8T5+eyb+2VkbHGYffOtwS9TZnJqZZ4upffJi1EpJSfB1Bo
ilkO8wcaNKVWzTgnQo+JVOLQQyNENs12Xc0aLVA0dZC0a37qWJTbr75r7nrtLT7A
Gy783AG8JglRsr7AOVceqBVOpRonhFDz7G2hQqHg140m6i/GzDJrPtadovCtq7nt
U6/Po7qbOj5eOSGrVPwS1gJQOT7deAL7Eeu7dOpbzl1Cwysbhg63piMNyDs4P/gM
bFsR+LTbmZiaYs5G1oDwN/WTYLeq6cxY0IftShWdGoQwZRF/woJ7VAQSWNvHY8mg
Z+EbEL3sY40+8eBk7/umT0WxQ9wYjooS/9ZowQ2ktRmt82Dwv0LXzWNTSlwhWKKt
QBtv1er/psEKFqb2zDtlea8gDlKahaVNaiOK6RuY5CM5iBa4/zEmWVXS/i07LC7Z
cW9swD4J3AEKSolWHWYQJBmCsKy+rUp5t0mQ5e/O4+nhCDbfe+Da0OArg6b/ygMu
14RdyjOENxSqKi3IkCnToch+eNzCIm3ETaS2E0nSv996G+ShqsLtROOI9x9DXiu3
nyLxAnIVp8I=
=969y
-----END PGP SIGNATURE-----
Merge tag 'md-3.3-fixes' of git://neil.brown.name/md
Two bugfixes for md.
One is a recently introduced regression that affects an unusual
configuration with a guaranteed BUG_ON. Has been tagged for -stable.
The other is minor missing functionality.
* tag 'md-3.3-fixes' of git://neil.brown.name/md:
md/raid1: perform bad-block tests for WriteMostly devices too.
md: notify the 'degraded' sysfs attribute on failure.
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.
This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.
Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently only 'notify' changes to the 'degraded' attribute
when it decreases, not when it increases.
Notifying on failure is a little awkward as it happen in
interrupt context.
So instead, notify when we remove the failed device from the array,
which is very soon afterwards.
Reported-and-tested-by: Mikhail Balabin <mbalabin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Big change is new hot-replacement.
A slot in an array can hold 2 devices - one that
wants-replacement and one that is the replacement.
Once the replacement is built - either from the
original or (in the case of errors) from elsewhere,
the wants-replacement device will be removed.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQIVAwUATwUdyDnsnt1WYoG5AQJBExAAuFQrzt7CN32nmiqaLlJ1snBC/ixOSJf7
0X88YB+utO0qNIhiRTI1AulslRGms9pChyKmZZaxU2Hvzk1pTrFspsc8RlJTIv9S
Si43due99Np08/Kf5rjYulyH0fcug0qIqlrCiBvc36pOIHZzam6IzrEvKFf0dqbe
4vtLWuylEUiSEbN5gBordfTFZcxSPI/huMUgx6hEpdA4NtaPmN57/03d49k4RYgp
f5l9htuIqdMgHwNJRUDGMooifuvILC5HXINUbsCFT6KUF/bxA75nW7W3C2BUq0V/
CVb/ZoHFYtuaGBEcMUzN34k0DbHghPeSmlvXT4XWq7+gVNqGe33nbSJsx1oLXWr1
m/b3j4Ublv+VVd7L1Rr40vTRB6wN5/2uN7SiD6d83ppPD0TuAY8YvMHgoLZQmQvh
Ak4fEz07re+tueKhHwbi+1qMIw/ciQ9O/tI7r+AsVklkAxJNAfFKW6FOEFL2cjEW
h4rbg1z9sU+xD8G01LBnJ0to/ajrJ4ch4wV/raLgi4+dJ4Tt3+/tas6WJlxHbyFF
IiHCFM0+KcxLcwNkalZYg/zr5qu7EcwpPKLnc68m+LjXlYVWcnNHwv5WnOCHHQTw
6yurGGlKqBfvsb8zJJGSRWqEtRSYjDlZiMt10/u+H40HiCiLX//vSvVbZoFiwWu1
VzgEhzhvaYw=
=j00/
-----END PGP SIGNATURE-----
Merge tag 'md-3.3' of git://neil.brown.name/md
md update for 3.3
Big change is new hot-replacement.
A slot in an array can hold 2 devices - one that
wants-replacement and one that is the replacement.
Once the replacement is built - either from the
original or (in the case of errors) from elsewhere,
the wants-replacement device will be removed.
* tag 'md-3.3' of git://neil.brown.name/md: (36 commits)
md/raid1: Mark device want_replacement when we see a write error.
md/raid1: If there is a spare and a want_replacement device, start replacement.
md/raid1: recognise replacements when assembling arrays.
md/raid1: handle activation of replacement device when recovery completes.
md/raid1: Allow a failed replacement device to be removed.
md/raid1: Allocate spare to store replacement devices and their bios.
md/raid1: Replace use of mddev->raid_disks with conf->raid_disks.
md/raid10: If there is a spare and a want_replacement device, start replacement.
md/raid10: recognise replacements when assembling array.
md/raid10: Allow replacement device to be replace old drive.
md/raid10: handle recovery of replacement devices.
md/raid10: Handle replacement devices during resync.
md/raid10: writes should get directed to replacement as well as original.
md/raid10: allow removal of failed replacement devices.
md/raid10: preferentially read from replacement device if possible.
md/raid10: change read_balance to return an rdev
md/raid10: prepare data structures for handling replacement.
md/raid5: Mark device want_replacement when we see a write error.
md/raid5: If there is a spare and a want_replacement device, start replacement.
md/raid5: recognise replacements when assembling array.
...
Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
kill_bdev as well, so brd doesn't have to open code it. Reduce
buffer_head.h requirement accordingly.
Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving. The small comment replacing it says enough.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a want_replacement device.
This requires that common md code attempt hot_add even when the array
is not formally degraded.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
hot-replace is a feature being added to md which will allow a
device to be replaced without removing it from the array first.
With hot-replace a spare can be activated and recovery can start while
the original device is still in place, thus allowing a transition from
an unreliable device to a reliable device without leaving the array
degraded during the transition. It can also be use when the original
device is still reliable but it not wanted for some reason.
This will eventually be supported in RAID4/5/6 and RAID10.
This patch adds a super-block flag to distinguish the replacement
device. If an old kernel sees this flag it will reject the device.
It also adds two per-device flags which are viewable and settable via
sysfs.
"want_replacement" can be set to request that a device be replaced.
"replacement" is set to show that this device is replacing another
device.
The "rd%d" links in /sys/block/mdXx/md only apply to the original
device, not the replacement. We currently don't make links for the
replacement - there doesn't seem to be a need.
Signed-off-by: NeilBrown <neilb@suse.de>
Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When setting the slot number on a device in an active array we
currently check that the number is not already in use.
We then call into the personality's hot_add_disk function
which performs the same test and returns the same error.
Thus the common test is not needed.
As we will shortly be changing some personalities to allow duplicates
in some cases (to support hot-replace), the common test will become
inconvenient.
So remove the common test.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The info is already available in /proc/mdstat and /sys/block in
an accessible form so there is no point in putting a road-block in
the ioctl for information gathering.
Signed-off-by: NeilBrown <neilb@suse.de>