Commit Graph

2346 Commits

Author SHA1 Message Date
Martin K. Petersen
b1bd055d39 block: Introduce blk_set_stacking_limits function
Stacking driver queue limits are typically bounded exclusively by the
capabilities of the low level devices, not by the stacking driver
itself.

This patch introduces blk_set_stacking_limits() which has more liberal
metrics than the default queue limits function. This allows us to
inherit topology parameters from bottom devices without manually
tweaking the default limits in each driver prior to calling the stacking
function.

Since there is now a clear distinction between stacking and low-level
devices, blk_set_default_limits() has been modified to carry the more
conservative values that we used to manually set in
blk_queue_make_request().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-11 16:27:11 +01:00
NeilBrown
307729c8bc md/raid1: perform bad-block tests for WriteMostly devices too.
We normally try to avoid reading from write-mostly devices, but when
we do we really have to check for bad blocks and be sure not to
try reading them.

With the current code, best_good_sectors might not get set and that
causes zero-length read requests to be send down which is very
confusing.

This bug was introduced in commit d2eb35acfd and so the patch
is suitable for 3.1.x and 3.2.x

Reported-and-tested-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Reported-and-tested-by: Art -kwaak- van Breemen <ard@telegraafnet.nl>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel.org
2012-01-11 08:35:17 +11:00
NeilBrown
f2a371c5e7 md: notify the 'degraded' sysfs attribute on failure.
We currently only 'notify' changes to the 'degraded' attribute
when it decreases, not when it increases.

Notifying on failure is a little awkward as it happen in
interrupt context.
So instead, notify when we remove the failed device from the array,
which is very soon afterwards.

Reported-and-tested-by: Mikhail Balabin <mbalabin@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-01-11 08:35:14 +11:00
Linus Torvalds
2943c83322 md update for 3.3
Big change is new hot-replacement.
 A slot in an array can hold 2 devices - one that
 wants-replacement and one that is the replacement.
 Once the replacement is built - either from the
 original or (in the case of errors) from elsewhere,
 the wants-replacement device will be removed.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUATwUdyDnsnt1WYoG5AQJBExAAuFQrzt7CN32nmiqaLlJ1snBC/ixOSJf7
 0X88YB+utO0qNIhiRTI1AulslRGms9pChyKmZZaxU2Hvzk1pTrFspsc8RlJTIv9S
 Si43due99Np08/Kf5rjYulyH0fcug0qIqlrCiBvc36pOIHZzam6IzrEvKFf0dqbe
 4vtLWuylEUiSEbN5gBordfTFZcxSPI/huMUgx6hEpdA4NtaPmN57/03d49k4RYgp
 f5l9htuIqdMgHwNJRUDGMooifuvILC5HXINUbsCFT6KUF/bxA75nW7W3C2BUq0V/
 CVb/ZoHFYtuaGBEcMUzN34k0DbHghPeSmlvXT4XWq7+gVNqGe33nbSJsx1oLXWr1
 m/b3j4Ublv+VVd7L1Rr40vTRB6wN5/2uN7SiD6d83ppPD0TuAY8YvMHgoLZQmQvh
 Ak4fEz07re+tueKhHwbi+1qMIw/ciQ9O/tI7r+AsVklkAxJNAfFKW6FOEFL2cjEW
 h4rbg1z9sU+xD8G01LBnJ0to/ajrJ4ch4wV/raLgi4+dJ4Tt3+/tas6WJlxHbyFF
 IiHCFM0+KcxLcwNkalZYg/zr5qu7EcwpPKLnc68m+LjXlYVWcnNHwv5WnOCHHQTw
 6yurGGlKqBfvsb8zJJGSRWqEtRSYjDlZiMt10/u+H40HiCiLX//vSvVbZoFiwWu1
 VzgEhzhvaYw=
 =j00/
 -----END PGP SIGNATURE-----

Merge tag 'md-3.3' of git://neil.brown.name/md

md update for 3.3

Big change is new hot-replacement.
A slot in an array can hold 2 devices - one that
wants-replacement and one that is the replacement.
Once the replacement is built - either from the
original or (in the case of errors) from elsewhere,
the wants-replacement device will be removed.

* tag 'md-3.3' of git://neil.brown.name/md: (36 commits)
  md/raid1: Mark device want_replacement when we see a write error.
  md/raid1: If there is a spare and a want_replacement device, start replacement.
  md/raid1: recognise replacements when assembling arrays.
  md/raid1: handle activation of replacement device when recovery completes.
  md/raid1: Allow a failed replacement device to be removed.
  md/raid1: Allocate spare to store replacement devices and their bios.
  md/raid1:  Replace use of mddev->raid_disks with conf->raid_disks.
  md/raid10: If there is a spare and a want_replacement device, start replacement.
  md/raid10: recognise replacements when assembling array.
  md/raid10: Allow replacement device to be replace old drive.
  md/raid10: handle recovery of replacement devices.
  md/raid10:  Handle replacement devices during resync.
  md/raid10: writes should get directed to replacement as well as original.
  md/raid10: allow removal of failed replacement devices.
  md/raid10: preferentially read from replacement device if possible.
  md/raid10:  change read_balance to return an rdev
  md/raid10: prepare data structures for handling replacement.
  md/raid5: Mark device want_replacement when we see a write error.
  md/raid5: If there is a spare and a want_replacement device, start replacement.
  md/raid5: recognise replacements when assembling array.
  ...
2012-01-08 13:28:33 -08:00
Al Viro
ff01bb4832 fs: move code out of buffer.c
Move invalidate_bdev, block_sync_page into fs/block_dev.c.  Export
kill_bdev as well, so brd doesn't have to open code it.  Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving.  The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:07 -05:00
NeilBrown
19d671695e md/raid1: Mark device want_replacement when we see a write error.
Now that WantReplacement drives are replaced cleanly, mark a drive
as want_replacement when we see a write error.  It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.

Signed-off-by:  NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00
NeilBrown
7ef449d1ec md/raid1: If there is a spare and a want_replacement device, start replacement.
When attempting to add a spare to a RAID1 array, also consider
adding it as a replacement for a want_replacement device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00
NeilBrown
c19d57980b md/raid1: recognise replacements when assembling arrays.
If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00
NeilBrown
8c7a2c2bcf md/raid1: handle activation of replacement device when recovery completes.
When recovery completes ->spare_active is called.
This checks if the replacement is ready and if so it fails
the original.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:57 +11:00
NeilBrown
b014f14c81 md/raid1: Allow a failed replacement device to be removed.
Replacement devices are stored at a different offset, so look
there too.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown
8f19ccb2fd md/raid1: Allocate spare to store replacement devices and their bios.
In RAID1, a replacement is much like a normal device, so we just
double the size of the relevant arrays and look at all possible
devices for reads and writes.

This means that the array looks like it is now double the size in some
way - we need to be careful about that.
In particular, we checking if the array is still degraded while
creating a recovery request we need to only consider the first 'half'
- i.e. the real (non-replacement) devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown
301946364e md/raid1: Replace use of mddev->raid_disks with conf->raid_disks.
In general mddev->raid_disks can change unexpectedly while
conf->raid_disks will only change in a very controlled way.  So change
some uses of one to the other.

The use of mddev->raid_disks will not cause actually problems but
this way is more consistent and safer in the long term.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown
b7044d41b5 md/raid10: If there is a spare and a want_replacement device, start replacement.
When attempting to add a spare to a RAID10 array, also consider
adding it as a replacement for a want_replacement device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:56 +11:00
NeilBrown
56a2559bb6 md/raid10: recognise replacements when assembling array.
If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown
4ca40c2ce0 md/raid10: Allow replacement device to be replace old drive.
When recovery finish and spare_active is called, check for a
replace that might have just become fully synced and mark it
as such, marking the original as failed.

Then when the original is removed, move the replacement into
its position.

This means that 'replacement' and spontaneously become NULL in some
situations.  Make sure we check for those.
It also means that 'rdev' and 'replacement' could appear to be
identical - check for that too.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown
24afd80d99 md/raid10: handle recovery of replacement devices.
If there is a replacement device, then recover to it,
reading from any drives - maybe the one being replaced, maybe not.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown
9ad1aefc8a md/raid10: Handle replacement devices during resync.
If we need to resync an array which has replacement devices,
we always write any block checked to every replacement.

If the resync was bitmap-based resync we will then complete the
replacement normally.
If it was a full resync, we mark the replacements as fully recovered
when the resync finishes so no further recovery is needed.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown
475b0321a4 md/raid10: writes should get directed to replacement as well as original.
When writing, we need to submit two writes, one to the original,
and one to the replacements - if there is a replacement.

If the write to the replacement results in a write error we just
fail the device.  We only try to record write errors to the
original.

This only handles writing new data.  Writing for resync/recovery
will come later.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:55 +11:00
NeilBrown
c8ab903ea9 md/raid10: allow removal of failed replacement devices.
Enhance raid10_remove_disk to be able to remove ->replacement
as well as ->rdev

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown
abbf098e6e md/raid10: preferentially read from replacement device if possible.
When reading (for array reads, not for recovery etc) we read from the
replacement device if it has recovered far enough.
This requires storing the chosen rdev in the 'r10_bio' so we can make
sure to drop the ref on the right device when the read finishes.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown
96c3fd1f38 md/raid10: change read_balance to return an rdev
It makes more sense to return an rdev than just an index as
read_balance() gets a reference to the rdev and so returning
the pointer make this more idiomatic.

This will be needed in a future patch when we might return
a 'replacement' rdev instead of the main rdev.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown
69335ef3bc md/raid10: prepare data structures for handling replacement.
Allow each slot in the RAID10 to have 2 devices, the want_replacement
and the replacement.

Also an r10bio to have 2 bios, and for resync/recovery allocate the
second bio if there are any replacement devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown
3a6de2924a md/raid5: Mark device want_replacement when we see a write error.
Now that WantReplacement drives are replaced cleanly, mark a drive
as WantReplacement when we see a write error.  It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by:  NeilBrown <neilb@suse.de>
2011-12-23 10:17:54 +11:00
NeilBrown
7bfec5f35c md/raid5: If there is a spare and a want_replacement device, start replacement.
When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a want_replacement device.

This requires that common md code attempt hot_add even when the array
is not formally degraded.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown
17045f52ac md/raid5: recognise replacements when assembling array.
If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown
dd054fce88 md/raid5: handle activation of replacement device when recovery completes.
When recovery completes - as reported by a call to ->spare_active,
we clear In_sync on the original and set it on the replacement.

Then when the original gets removed we move the replacement from
'replacement' to 'rdev'.

This could race with other code that is looking at these pointers,
so we use memory barriers and careful ordering to ensure that
a reader might see one device twice, but never no devices.
Then the readers guard against using both devices, which could
only happen when writing.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown
9a3e1101b8 md/raid5: detect and handle replacements during recovery.
During recovery we want to write to the replacement but not
the original.  So we have two new flags
 - R5_NeedReplace if this stripe has a replacement that needs to
   be written at some stage
 - R5_WantReplace if NeedReplace, and the data is available, and
   a 'sync' has been requested on this stripe.

We also distinguish between 'sync and replace' which need to read
all other devices, and 'replace' which only needs to read the
devices being replaced.

Note that during resync we always write to any replacement device.
It might not need to be written to, but as we don't read to compare,
we have to write to be sure.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown
977df36255 md/raid5: writes should get directed to replacement as well as original.
When writing, we need to submit two writes, one to the original, and
one to the replacement - if there is a replacement.

If the write to the replacement results in a write error, we just fail
the device.  We only try to record write errors to the original.

When writing for recovery, we shouldn't write to the original.  This
will be addressed in a subsequent patch that generally addresses
recovery.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:53 +11:00
NeilBrown
657e3e4d88 md/raid5: allow removal for failed replacement devices.
Enhance raid5_remove_disk to be able to remove ->replacement
as well as ->rdev.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown
14a75d3e07 md/raid5: preferentially read from replacement device if possible.
If a replacement device is present and has been recovered far enough,
then use it for reading into the stripe cache.

If we get an error we don't try to repair it, we just fail the device.
A replacement device that gives errors does not sound sensible.

This requires removing the setting of R5_ReadError when we get
a read error during a read that bypasses the cache.  It was probably
a bad idea anyway as we don't know that every block in the read
caused an error, and it could cause ReadError to be set for the
replacement device, which is bad.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown
995c4275a7 md/raid5: remove redundant bio initialisations.
We current initialise some fields of a bio when preparing a
stripe_head, and again just before submitting the request.

Remove the duplication by only setting the fields that lower level
devices don't touch in raid5_build_block, and only set the changeable
fields in ops_run_io.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown
ede7ee8b4d md/raid5: raid5.h cleanup
Remove some #defines that are no longer used, and replace some
others with an enum.
And remove an unused field.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown
671488cc25 md/raid5: allow each slot to have an extra replacement device
Just enhance data structures to record a second device per slot to be
used as a 'replacement' device, replacing the original.
We also have a second bio in each slot in each stripe_head.  This will
only be used when writing to the array - we need to write to both the
original and the replacement at the same time, so will need two bios.

For now, only try using the replacement drive for aligned-reads.
In this case, we prefer the replacement if it has been recovered far
enough, otherwise use the original.

This includes a small enhancement.  Previously we would only do
aligned reads if the target device was fully recovered.  Now we also
do them if it has recovered far enough.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:52 +11:00
NeilBrown
2d78f8c451 md: create externally visible flags for supporting hot-replace.
hot-replace is a feature being added to md which will allow a
device to be replaced without removing it from the array first.

With hot-replace a spare can be activated and recovery can start while
the original device is still in place, thus allowing a transition from
an unreliable device to a reliable device without leaving the array
degraded during the transition.  It can also be use when the original
device is still reliable but it not wanted for some reason.

This will eventually be supported in RAID4/5/6 and RAID10.

This patch adds a super-block flag to distinguish the replacement
device.  If an old kernel sees this flag it will reject the device.

It also adds two per-device flags which are viewable and settable via
sysfs.
   "want_replacement" can be set to request that a device be replaced.
   "replacement" is set to show that this device is replacing another
   device.

The "rd%d" links in /sys/block/mdXx/md only apply to the original
device, not the replacement.  We currently don't make links for the
replacement - there doesn't seem to be a need.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
NeilBrown
b8321b68d1 md: change hot_remove_disk to take an rdev rather than a number.
Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement).  So removing
a device based on the number won't work.  So pass the actual device
handle instead.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
NeilBrown
476a7abb9b md: remove test for duplicate device when setting slot number.
When setting the slot number on a device in an active array we
currently check that the number is not already in use.
We then call into the personality's hot_add_disk function
which performs the same test and returns the same error.

Thus the common test is not needed.

As we will shortly be changing some personalities to allow duplicates
in some cases (to support hot-replace), the common test will become
inconvenient.

So remove the common test.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
NeilBrown
915c420ddf md/bitmap: be more consistent when setting new bits in memory bitmap.
For each active region corresponding to a bit in the bitmap with have
a 14bit counter (and some flags).
This counts
   number of active writes + bit in the on-disk bitmap + delay-needed.

The "delay-needed" is because we always want a delay before clearing a
bit.  So the number here is normally number of active writes plus 2.
If there have been no writes for a while, we drop to 1.
If still no writes we clear the bit and drop to 0.

So for consistency, when setting bit from the on-disk bitmap or by
request from user-space it is best to set the counter to '2' to start
with.

In particular we might also set the NEEDED_MASK flag at this time, and
in all other cases NEEDED_MASK is only set when the counter is 2 or
more.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:51 +11:00
NeilBrown
908f4fbd26 md/raid5: be more thorough in calculating 'degraded' value.
When an array is being reshaped to change the number of devices,
the two halves can be differently degraded.  e.g. one could be
missing a device and the other not.

So we need to be more careful about calculating the 'degraded'
attribute.

Instead of just inc/dec at appropriate times, perform a full
re-calculation examining both possible cases.  This doesn't happen
often so it not a big cost, and we already have most of the code to
do it.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:50 +11:00
NeilBrown
2e61ebbcc4 md/bitmap: daemon_work cleanup.
We have a variable 'mddev' in this function, but repeatedly get the
same value by dereferencing bitmap->mddev.
There is room for simplification here...

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:50 +11:00
NeilBrown
506c9e44a8 md: allow non-privileged uses to GET_*_INFO about raid arrays.
The info is already available in /proc/mdstat and /sys/block in
an accessible form so there is no point in putting a road-block in
the ioctl for information gathering.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 10:17:26 +11:00
NeilBrown
961902c0f8 md/bitmap: It is OK to clear bits during recovery.
commit d0a4bb4927 introduced a
regression which is annoying but fairly harmless.

When writing to an array that is undergoing recovery (a spare
in being integrated into the array), writing to the array will
set bits in the bitmap, but they will not be cleared when the
write completes.

For bits covering areas that have not been recovered yet this is not a
problem as the recovery will clear the bits.  However bits set in
already-recovered region will stay set and never be cleared.
This doesn't risk data integrity.  The only negatives are:
 - next time there is a crash, more resyncing than necessary will
   be done.
 - the bitmap doesn't look clean, which is confusing.

While an array is recovering we don't want to update the
'events_cleared' setting in the bitmap but we do still want to clear
bits that have very recently been set - providing they were written to
the recovering device.

So split those two needs - which previously both depended on 'success'
and always clear the bit of the write went to all devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 09:57:48 +11:00
NeilBrown
60fc13702a md: don't give up looking for spares on first failure-to-add
Before performing a recovery we try to remove any spares that
might not be working, then add any that might have become relevant.

Currently we abort on the first spare that cannot be added.
This is a false optimisation.
It is conceivable that - depending on rules in the personality - a
subsequent spare might be accepted.
Also the loop does other things like count the available spares and
reset the 'recovery_offset' value.

If we abort early these might not happen properly.

So remove the early abort.

In particular if you have an array what is undergoing recovery and
which has extra spares, then the recovery may not restart after as
reboot as the could of 'spares' might end up as zero.

Reported-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 09:57:19 +11:00
NeilBrown
30d7a48368 md/raid5: ensure correct assessment of drives during degraded reshape.
While reshaping a degraded array (as when reshaping a RAID0 by first
converting it to a degraded RAID4) we currently get confused about
which devices are in_sync.  In most cases we get it right, but in the
region that is being reshaped we need to treat non-failed devices as
in-sync when we have the data but haven't actually written it out yet.

Reported-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 09:57:00 +11:00
NeilBrown
09cd9270ea md/linear: fix hot-add of devices to linear arrays.
commit d70ed2e4fa
broke hot-add to a linear array.
After that commit, metadata if not written to devices until they
have been fully integrated into the array as determined by
saved_raid_disk.  That patch arranged to clear that field after
a recovery completed.

However for linear arrays, there is no recovery - the integration is
instantaneous.  So we need to explicitly clear the saved_raid_disk
field.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-23 09:56:55 +11:00
Adam Kwolek
5d8c71f9e5 md: raid5 crash during degradation
NULL pointer access causes crash in raid5 module.

Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-09 14:26:11 +11:00
NeilBrown
9283d8c5af md/raid5: never wait for bad-block acks on failed device.
Once a device is failed we really want to completely ignore it.
It should go away soon anyway.

In particular the presence of bad blocks on it should not cause us to
block as we won't be trying to write there anyway.

So as soon as we can check if a device is Faulty, do so and pretend
that it is already gone if it is Faulty.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 16:27:57 +11:00
NeilBrown
8bd2f0a05b md: ensure new badblocks are handled promptly.
When we mark blocks as bad we need them to be acknowledged by the
metadata handler promptly.

For an in-kernel metadata handler that was already being done.  But
for an external metadata handler we need to alert it of the change by
sending a notification through the sysfs file.  This adds that
notification.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 16:26:08 +11:00
NeilBrown
52c64152a9 md: bad blocks shouldn't cause a Blocked status on a Faulty device.
Once a device is marked Faulty the badblocks - whether acknowledged or
not - become irrelevant.  So they shouldn't cause the device to be
marked as Blocked.

Without this patch, a process might write "-blocked" to clear the
Blocked status, but while that will correctly fail the device, it
won't remove the apparent 'blocked' status.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 16:22:48 +11:00
NeilBrown
af8a24347f md: take a reference to mddev during sysfs access.
When we are accessing an mddev via sysfs we know that the
mddev cannot disappear because it has an embedded kobj which
is refcounted by sysfs.
And we also take the mddev_lock.
However this is not enough.

The final mddev_put could have been called and the
mddev_delayed_delete is waiting for sysfs to let go so it can destroy
the kobj and mddev.
In this state there are a lot of changes that should not be attempted.

To to guard against this we:
 - initialise mddev->all_mddevs in on last put so the state can be
   easily detected.
 - in md_attr_show and md_attr_store, check ->all_mddevs under
   all_mddevs_lock and mddev_get the mddev if it still appears to
   be active.

This means that if we get to sysfs as the mddev is being deleted we
will get -EBUSY.

rdev_attr_store and rdev_attr_show are similar but already have
sufficient protection.  They check that rdev->mddev still points to
mddev after taking mddev_lock.  As this is cleared  before delayed
removal which can only be requested under the mddev_lock, this
ensure the rdev and mddev are still alive.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 15:49:46 +11:00
NeilBrown
1d23f178d5 md: refine interpretation of "hold_active == UNTIL_IOCTL".
We like md devices to disappear when they really are not needed.
However it is not possible to tell from the current state whether it
is needed or not.  We can only tell from recent history of changes.

In particular immediately after we create an md device it looks very
similar to immediately after we have finished with it.

So we always preserve a newly created md device until something
significant happens.  This state is stored in 'hold_active'.

The normal case is to keep it until an ioctl happens, as that will
normally either activate it, or explicitly de-activate it.  If it
doesn't then it was probably created by mistake and it is now time to
get rid of it.

We can also modify an array via sysfs (instead of via ioctl) and we
currently treat any change via sysfs like an ioctl as a sign that if
it now isn't more active, it should be destroyed.
However this is not appropriate as changes made via sysfs are more
gradual so we should look for a more definitive change.

So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when
the array_state is changed via sysfs.  Other changes via sysfs
are ignored.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08 15:49:12 +11:00
NeilBrown
7c8f424798 md/lock: ensure updates to page_attrs are properly locked.
Page attributes are set using __set_bit rather than set_bit as
it normally called under a spinlock so the extra atomicity is not
needed.

However there are two places where we might set or clear page
attributes without holding the spinlock.
So add the spinlock in those cases.

This might be the cause of occasional reports that bits a aren't
getting clear properly - theory is that BITMAP_PAGE_PENDING gets lost
when BITMAP_PAGE_NEEDWRITE is set or cleared.  This is an
inconvenience, not a threat to data safety.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-11-23 10:18:52 +11:00
Dan Williams
257a4b42af md/raid5: STRIPE_ACTIVE has lock semantics, add barriers
All updates that occur under STRIPE_ACTIVE should be globally visible
when STRIPE_ACTIVE clears.  test_and_set_bit() implies a barrier, but
clear_bit() does not.

This is suitable for 3.1-stable.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2011-11-08 16:22:06 +11:00
NeilBrown
9a3f530f39 md/raid5: abort any pending parity operations when array fails.
When the number of failed devices exceeds the allowed number
we must abort any active parity operations (checks or updates) as they
are no longer meaningful, and can lead to a BUG_ON in
handle_parity_checks6.

This bug was introduce by commit 6c0069c0ae
in 2.6.29.

Reported-by: Manish Katiyar <mkatiyar@gmail.com>
Tested-by: Manish Katiyar <mkatiyar@gmail.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@kernel.org
2011-11-08 16:22:01 +11:00
Stephen Rothwell
a84450604d device-mapper: using EXPORT_SYBOL in dm-space-map-checker.c needs export.h
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-07 10:29:10 -08:00
Stephen Rothwell
6f66263f8e device-mapper: dm-bufio.c needs to include module.h
since it uses the module facilities.

Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-07 10:29:10 -08:00
Paul Gortmaker
1944ce60fe drivers/md: change module.h -> export.h in persistent-data/dm-*
For the files which are not themselves modular, we can change
them to include only the smaller export.h since all they are
doing is looking for EXPORT_SYMBOL.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-07 10:29:09 -08:00
Linus Torvalds
32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds
b4fdcb02f1 Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
  block: don't call blk_drain_queue() if elevator is not up
  blk-throttle: use queue_is_locked() instead of lockdep_is_held()
  blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
  blk-throttle: Free up policy node associated with deleted rule
  block: warn if tag is greater than real_max_depth.
  block: make gendisk hold a reference to its queue
  blk-flush: move the queue kick into
  blk-flush: fix invalid BUG_ON in blk_insert_flush
  block: Remove the control of complete cpu from bio.
  block: fix a typo in the blk-cgroup.h file
  block: initialize the bounce pool if high memory may be added later
  block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
  block: drop @tsk from attempt_plug_merge() and explain sync rules
  block: make get_request[_wait]() fail if queue is dead
  block: reorganize throtl_get_tg() and blk_throtl_bio()
  block: reorganize queue draining
  block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  block: move blk_throtl prototypes to block/blk.h
  block: fix genhd refcounting in blkio_policy_parse_and_set()
  ...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
 - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
 - drivers/staging/zram/zram_drv.c
2011-11-04 17:06:58 -07:00
Linus Torvalds
43672a0784 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm:
  dm: raid fix device status indicator when array initializing
  dm log userspace: add log device dependency
  dm log userspace: fix comment hyphens
  dm: add thin provisioning target
  dm: add persistent data library
  dm: add bufio
  dm: export dm get md
  dm table: add immutable feature
  dm table: add always writeable feature
  dm table: add singleton feature
  dm kcopyd: add dm_kcopyd_zero to zero an area
  dm: remove superfluous smp_mb
  dm: use local printk ratelimit
  dm table: propagate non rotational flag
2011-11-02 17:02:37 -07:00
Paul Gortmaker
daaa5f7cbe md: Add in export.h for files using EXPORT_SYMBOL
These files were getting the defines for EXPORT_SYMBOL because
device.h was including module.h.  But we are going to put an
end to that.  So add the proper export.h include now.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:19 -04:00
Paul Gortmaker
056075c764 md: Add module.h to all files using it implicitly
A pending cleanup will mean that module.h won't be implicitly
everywhere anymore.  Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:18 -04:00
Linus Torvalds
571109f536 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
  md/raid10:  Fix bug when activating a hot-spare.
2011-10-31 15:21:29 -07:00
Jonathan E Brassow
2e727c3ca1 dm: raid fix device status indicator when array initializing
When devices in a RAID array are not in-sync, they are supposed to be
reported as such in the status output as an 'a' character, which means
"alive, but not in-sync".  But when the entire array is rebuilt 'A' is
being used, which is incorrect.  This patch corrects this to 'a'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:21:26 +00:00
Jonathan E Brassow
5a25f0eb70 dm log userspace: add log device dependency
Allow userspace dm log implementations to register their log device so it
is no longer missing from the list of device dependencies.

When device mapper targets use a device they normally call dm_get_device
which includes it in the device list returned to userspace applications
such as LVM through the DM_TABLE_DEPS ioctl.  Userspace log devices
don't use dm_get_device as userspace opens them so they are missing from
the list of dependencies.

This patch extends the DM_ULOG_CTR operation to allow userspace to
respond with the name of the log device (if appropriate) to be
registered via 'dm_get_device'.  DM_ULOG_REQUEST_VERSION is incremented.

This is backwards compatible.  If the kernel and userspace log server
have both been updated, the new information will be passed down to the
kernel and the device will be registered.  If the kernel is new, but
the log server is old, the log server will not pass down any device
information and the kernel will simply bypass the device registration
as before.  If the kernel is old but the log server is new, the log
server will see the old version number and not pass the device info.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:21:24 +00:00
Jonathan Brassow
b89544575d dm log userspace: fix comment hyphens
Fix comments: clustered-disk needs a hyphen not an underscore.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:21:22 +00:00
Joe Thornber
991d9fa02d dm: add thin provisioning target
Initial EXPERIMENTAL implementation of device-mapper thin provisioning
with snapshot support.  The 'thin' target is used to create instances of
the virtual devices that are hosted in the 'thin-pool' target.  The
thin-pool target provides data sharing among devices.  This sharing is
made possible using the persistent-data library in the previous patch.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume, simplifying administration and
allowing sharing of data between volumes (thus reducing disk usage).

Another big feature is support for arbitrary depth of recursive
snapshots (snapshots of snapshots of snapshots ...).  The previous
implementation of snapshots did this by chaining together lookup tables,
and so performance was O(depth).  This new implementation uses a single
data structure so we don't get this degradation with depth.

For further information and examples of how to use this, please read
Documentation/device-mapper/thin-provisioning.txt

Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:21:18 +00:00
Joe Thornber
3241b1d3e0 dm: add persistent data library
The persistent-data library offers a re-usable framework for the storage
and management of on-disk metadata in device-mapper targets.

It's used by the thin-provisioning target in the next patch and in an
upcoming hierarchical storage target.

For further information, please read
Documentation/device-mapper/persistent-data.txt

Signed-off-by: Joe Thornber <thornber@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:11 +00:00
Mikulas Patocka
95d402f057 dm: add bufio
The dm-bufio interface allows you to do cached I/O on devices,
holding recently-read blocks in memory and performing delayed writes.

We don't use buffer cache or page cache already present in the kernel, because:
* we need to handle block sizes larger than a page
* we can't allocate memory to perform reads or we'd have deadlocks

Currently, when a cache is required, we limit its size to a fraction of
available memory.  Usage can be viewed and changed in
/sys/module/dm_bufio/parameters/ .

The first user is thin provisioning, but more dm users are planned.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:09 +00:00
Alasdair G Kergon
3cf2e4ba74 dm: export dm get md
Export dm_get_md() for the new thin provisioning target to use.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:06 +00:00
Alasdair G Kergon
36a0456fbf dm table: add immutable feature
Introduce DM_TARGET_IMMUTABLE to indicate that the target type cannot be mixed
with any other target type, and once loaded into a device, it cannot be
replaced with a table containing a different type.

The thin provisioning pool device will use this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:04 +00:00
Alasdair G Kergon
cc6cbe141a dm table: add always writeable feature
Add a target feature flag DM_TARGET_ALWAYS_WRITEABLE to indicate that a target
does not support read-only mode.

The initial implementation of the thin provisioning target uses this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:02 +00:00
Alasdair G Kergon
3791e2fc0e dm table: add singleton feature
Introduce the concept of a singleton table which contains exactly one target.

If a target type sets the DM_TARGET_SINGLETON feature bit device-mapper
will ensure that any table that includes that target contains no others.

The thin provisioning pool target uses this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:19:00 +00:00
Mikulas Patocka
7f06965390 dm kcopyd: add dm_kcopyd_zero to zero an area
This patch introduces dm_kcopyd_zero() to make it easy to use
kcopyd to write zeros into the requested areas instead
instead of copying.  It is implemented by passing a NULL
copying source to dm_kcopyd_copy().

The forthcoming thin provisioning target uses this.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:18:58 +00:00
Namhyung Kim
fbdc86f3bd dm: remove superfluous smp_mb
Since set_current_state() contains a memory barrier in it,
an additional barrier isn't needed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:18:56 +00:00
Namhyung Kim
71a16736a1 dm: use local printk ratelimit
printk_ratelimit() shares global ratelimiting state with all
other subsystems, so its usage is discouraged. Instead,
define and use dm's local state.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:18:54 +00:00
Mandeep Singh Baines
4693c9668f dm table: propagate non rotational flag
Allow QUEUE_FLAG_NONROT to propagate up the device stack if all
underlying devices are non-rotational.  Tools like ureadahead will
schedule IOs differently based on the rotational flag.

With this patch, I see boot time go from 7.75 s to 7.46 s on my device.

Suggested-by: J. Richard Barnette <jrbarnette@chromium.org>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: dm-devel@redhat.com
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-31 20:18:50 +00:00
NeilBrown
7fcc7c8acf md/raid10: Fix bug when activating a hot-spare.
This is a fairly serious bug in RAID10.

When a RAID10 array is degraded and a hot-spare is activated, the
spare does not take up the empty slot, but rather replaces the first
working device.
This is likely to make the array non-functional.   It would normally
be possible to recover the data, but that would need care and is not
guaranteed.

This bug was introduced in commit
   2bb77736ae
which first appeared in 3.1.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-31 12:59:44 +11:00
Linus Torvalds
c3ae1f3356 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (34 commits)
  md: Fix some bugs in recovery_disabled handling.
  md/raid5: fix bug that could result in reads from a failed device.
  lib/raid6: Fix filename emitted in generated code
  md.c: trivial comment fix
  MD: Allow restarting an interrupted incremental recovery.
  md: clear In_sync bit on devices added to an active array.
  md: add proper write-congestion reporting to RAID1 and RAID10.
  md: rename "mdk_personality" to "md_personality"
  md/bitmap remove fault injection options.
  md/raid5: typedef removal: raid5_conf_t -> struct r5conf
  md/raid1: typedef removal: conf_t -> struct r1conf
  md/raid10: typedef removal: conf_t -> struct r10conf
  md/raid0: typedef removal: raid0_conf_t -> struct r0conf
  md/multipath: typedef removal: multipath_conf_t -> struct mpconf
  md/linear: typedef removal: linear_conf_t -> struct linear_conf
  md/faulty: remove typedef: conf_t -> struct faulty_conf
  md/linear: remove typedefs: dev_info_t -> struct dev_info
  md: remove typedefs: mirror_info_t -> struct mirror_info
  md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio
  md: remove typedefs: mdk_thread_t -> struct md_thread
  ...
2011-10-26 21:39:42 +02:00
NeilBrown
d890fa2b05 md: Fix some bugs in recovery_disabled handling.
In 3.0 we changed the way recovery_disabled was handle so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
  1/ one place in raid1 was missed and still sets to '1'.
  2/ We didn't explicitly set the conf-> value at array creation
     time.
     It defaulted to '0' just like the mddev value does so they
     could appear equal and thus disable recovery.
     This did not affect normal 'md' as it calls bind_rdev_to_array
     which changes the mddev value.  However the dmraid interface
     doesn't call this and so doesn't change ->recovery_disabled; so at
     array start all recovery is incorrectly disabled.

So initialise the 'conf' value to one less that the mddev value, so
the will only be the same when explicitly set that way.

Reported-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown  <neilb@suse.de>
2011-10-26 11:54:39 +11:00
NeilBrown
355840e7a7 md/raid5: fix bug that could result in reads from a failed device.
This bug was introduced in 415e72d034
which was in 2.6.36.

There is a small window of time between when a device fails and when
it is removed from the array.  During this time we might still read
from it, but we won't write to it - so it is possible that we could
read stale data.

We didn't need the test of 'Faulty' before because the test on
In_sync is sufficient.  Since we started allowing reads from the early
part of non-In_sync devices we need a test on Faulty too.

This is suitable for any kernel from 2.6.36 onwards, though the patch
might need a bit of tweaking in 3.0 and earlier.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-26 10:31:04 +11:00
Tao Ma
9562ad9ab3 block: Remove the control of complete cpu from bio.
bio originally has the functionality to set the complete cpu, but
it is broken.

Chirstoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken.  The only thing keeping
it serves is creating more confusion and possibly more bugs."

And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".

So this patch tries to remove all the work of controling complete cpu
from a bio.

Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-24 16:11:30 +02:00
Alasdair G Kergon
d136f2efdf dm kcopyd: fix job_pool leak
Fix memory leak introduced by commit a6e50b409d
(dm snapshot: skip reading origin when overwriting complete chunk).

When allocating a set of jobs from kc->job_pool, job->master_job must be
set (to point to itself) so that the mempool item gets freed when the
master_job completes.

master_job was introduced by commit c6ea41fbbe
(dm kcopyd: preallocate sub jobs to avoid deadlock)

Reported-by: Michael Leun <ml@newton.leun.net>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-23 20:55:17 +01:00
Jens Axboe
5c04b426f2 Merge branch 'v3.1-rc10' into for-3.2/core
Conflicts:
	block/blk-core.c
	include/linux/blkdev.h

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:30:42 +02:00
Chris Dunlop
751e67ca2e md.c: trivial comment fix
Trivial comment fix

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-19 17:15:15 +11:00
Andrei Warkentin
d70ed2e4fa MD: Allow restarting an interrupted incremental recovery.
If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible (seen with raid1).

Solve this problem by not updating the superblock on the
recovering device until array is not degraded any longer.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-18 12:16:48 +11:00
NeilBrown
d30519fc59 md: clear In_sync bit on devices added to an active array.
When we add a device to an active array it can be meaningful to set
the 'insync' flag.  This indicates that the device is in-sync with the
array except for locations recorded in the bitmap.
A bitmap-based recovery can then bring it completely in-sync.

Internally we move that flag to 'saved_raid_disk' but forgot to clear
In_sync like we do in add_new_disk.

So clear In_sync after moving its value to saved_raid_disk.

Reported-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-18 12:13:47 +11:00
NeilBrown
34db0cd60f md: add proper write-congestion reporting to RAID1 and RAID10.
RAID1 and RAID10 handle write requests by queuing them for handling by
a separate thread.  This is because when a write-intent-bitmap is
active we might need to update the bitmap first, so it is good to
queue a lot of writes, then do one big bitmap update for them all.

However writeback request devices to appear to be congested after a
while so it can make some guesstimate of throughput.  The infinite
queue defeats that (note that RAID5 has already has a finite queue so
it doesn't suffer from this problem).

So impose a limit on the number of pending write requests.  By default
it is 1024 which seems to be generally suitable.  Make it configurable
via module option just in case someone finds a regression.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:50:01 +11:00
NeilBrown
84fc4b56db md: rename "mdk_personality" to "md_personality"
"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:58 +11:00
NeilBrown
29d3247ea2 md/bitmap remove fault injection options.
These are too hard to use to be much more than noise.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:56 +11:00
NeilBrown
d1688a6d55 md/raid5: typedef removal: raid5_conf_t -> struct r5conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:52 +11:00
NeilBrown
e809636047 md/raid1: typedef removal: conf_t -> struct r1conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:05 +11:00
NeilBrown
e879a8793f md/raid10: typedef removal: conf_t -> struct r10conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:49:02 +11:00
NeilBrown
e373ab1091 md/raid0: typedef removal: raid0_conf_t -> struct r0conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:59 +11:00
NeilBrown
69724e28ca md/multipath: typedef removal: multipath_conf_t -> struct mpconf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:57 +11:00
NeilBrown
e849b9381f md/linear: typedef removal: linear_conf_t -> struct linear_conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:54 +11:00
NeilBrown
8f1ae43dd2 md/faulty: remove typedef: conf_t -> struct faulty_conf
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:52 +11:00
NeilBrown
a71207713a md/linear: remove typedefs: dev_info_t -> struct dev_info
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:49 +11:00
NeilBrown
0f6d02d580 md: remove typedefs: mirror_info_t -> struct mirror_info
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:46 +11:00
NeilBrown
9f2c9d12bc md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:43 +11:00
NeilBrown
2b8bf3451d md: remove typedefs: mdk_thread_t -> struct md_thread
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:48:23 +11:00
NeilBrown
fd01b88c75 md: remove typedefs: mddev_t -> struct mddev
Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:47:53 +11:00
NeilBrown
3cb0300200 md: removing typedefs: mdk_rdev_t -> struct md_rdev
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11 16:45:26 +11:00
NeilBrown
50de8df4ab md/raid0: convert some printks to pr_debug.
When md assembles a RAID0 array it prints out lots of info which
is really just for debugging, so convert that to pr_debug.
It also prints out the resulting configuration which could be
interesting, so keep that as 'printk' but tidy it up a bit.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:23:22 +11:00
NeilBrown
36a4e1fe0f md: remove PRINTK and dprintk debugging and use pr_debug
Being able to dynamically enable these make them much more useful.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:23:17 +11:00
NeilBrown
bdc04e6b15 md: remove some old DEBUGging code.
This code is not really helpful and is hard to maintain, so just
discard it.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:23:04 +11:00
NeilBrown
db298e1946 md/raid5: convert to macros into inline functions.
More type-safety.  Easier to read.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:23:00 +11:00
NeilBrown
0fc280f606 md/raid1/ avoid bio search in end_sync_read()
We know which device we just read from so we don't need to
search the bios to find out.  Just use ->read_disk.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:22:55 +11:00
Namhyung Kim
ba3ae3bee3 md/raid1: factor out common bio handling code
When normal-write and sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor those common
code out to a separate function.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:22:53 +11:00
NeilBrown
e4f869d9de md/raid5: remove pointless NULL test.
In the 'abort' branch of run(), 'conf' cannot possibly be NULL,
so remove the test.

Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:22:49 +11:00
NeilBrown
ce550c2059 md/raid1: add documentation to r1_private_data_s data structure.
There wasn't much and it is inconsistent.
Also rearrange fields to keep related fields together.

Reported-by: Aapo Laine <aapo.laine@shiftmail.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07 14:22:33 +11:00
Linus Torvalds
6367f1775e Merge branch 'for-linus' of http://people.redhat.com/agk/git/linux-dm
* 'for-linus' of http://people.redhat.com/agk/git/linux-dm:
  dm crypt: always disable discard_zeroes_data
  dm: raid fix write_mostly arg validation
  dm table: avoid crash if integrity profile changes
  dm: flakey fix corrupt_bio_byte error path
2011-10-06 08:31:47 -07:00
Milan Broz
983c7db347 dm crypt: always disable discard_zeroes_data
If optional discard support in dm-crypt is enabled, discards requests
bypass the crypt queue and blocks of the underlying device are discarded.
For the read path, discarded blocks are handled the same as normal
ciphertext blocks, thus decrypted.

So if the underlying device announces discarded regions return zeroes,
dm-crypt must disable this flag because after decryption there is just
random noise instead of zeroes.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-09-25 23:26:21 +01:00
Jonthan Brassow
8232480944 dm: raid fix write_mostly arg validation
Fix off-by-one error in validation of write_mostly.

The user-supplied value given for the 'write_mostly' argument must be an
index starting at 0.  The validation of the supplied argument failed to
check for 'N' ('>' vs '>='), which would have caused an access beyond the
end of the array.

Reported-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-09-25 23:26:19 +01:00
Mike Snitzer
876fbba1db dm table: avoid crash if integrity profile changes
Commit a63a5cf (dm: improve block integrity support) introduced a
two-phase initialization of a DM device's integrity profile.  This
patch avoids dereferencing a NULL 'template_disk' pointer in
blk_integrity_register() if there is an integrity profile mismatch in
dm_table_set_integrity().

This can occur if the integrity profiles for stacked devices in a DM
table are changed between the call to dm_table_prealloc_integrity() and
dm_table_set_integrity().

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org # 2.6.39
2011-09-25 23:26:17 +01:00
Mike Snitzer
68e58a294f dm: flakey fix corrupt_bio_byte error path
If no arguments were provided to the corrupt_bio_byte feature an error
should be returned immediately.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-09-25 23:26:15 +01:00
Daniel P. Berrange
2dba6a911c md: don't delay reboot by 1 second if no MD devices exist
The md_notify_reboot() method includes a call to mdelay(1000),
to deal with "exotic SCSI devices" which are too volatile on
reboot. The delay is unconditional. Even if the machine does
not have any block devices, let alone MD devices, the kernel
shutdown sequence is slowed down.

1 second does not matter much with physical hardware, but with
certain virtualization use cases any wasted time in the bootup
& shutdown sequence counts for alot.

* drivers/md/md.c: md_notify_reboot() - only impose a delay if
  there was at least one MD device to be stopped during reboot

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-23 19:54:04 +10:00
Wang Sheng-Hui
7e84152626 trival: md_k.h should be md.h in the beginning comment of file md.h
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21 15:37:46 +10:00
NeilBrown
2585f3ef8c md/bitmap: improve handling of 'allclean'.
The 'allclean' flag is used to cache the fact that there is nothing to
do, so we can avoid waking up and scanning the bitmap regularly.

The two sorts of pages that might need the attention of the bitmap
daemon are BITMAP_PAGE_PENDING and BITMAP_PAGE_NEEDWRITE pages.

So make sure allclean reflects exactly when there are none of those.
So:
  set it before scanning all pages with either bit set.
  clear it whenever these bits are set
  clear it when we desire not to clear one of these bits.
  don't clear it any other time.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21 15:37:46 +10:00
NeilBrown
5a537df44d md/bitmap: rename and tidy up BITMAP_PAGE_CLEAN
The flag 'BITMAP_PAGE_CLEAN' has a confusing name as it doesn't mean
that the page is clean, but rather that there are counters in the page
which allow bits in the bitmap to be cleared - i.e. maybe cleaning can
happen.

So change it to BITMAP_PAGE_PENDING and fix some irregularities:
 - Don't set it in bitmap_init_from_disk as bitmap_set_memory_bits
   sets it when needed
 - in bitmap_daemon_work, if we find a counter that is '1', but
   need_sync is set, then set BITMAP_PAGE_PENDING again (it was
   recently cleared) to ensure we don't forget about this bit.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21 15:37:46 +10:00
NeilBrown
01f96c0a99 md: Avoid waking up a thread after it has been freed.
Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
   without subsequently clearing ->thread.  A subsequent call
   to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
   disappeared either by:
      - holding the ->mutex
      - having an active request, so something else must be keeping
        the array active.
   However mddev_unlock calls md_wakeup_thread after dropping the
   mutex and without any certainty of an active request, so the
   ->thread could theoretically disappear.
   So we need a spinlock to provide some protections.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
2011-09-21 15:30:20 +10:00
Christoph Hellwig
5a7bbad27a block: remove support for bio remapping from ->make_request
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:12:01 +02:00
Jens Axboe
c20e8de27f block: rename __make_request() to blk_queue_bio()
Now that it's exported, lets put it in a more sane namespace.

Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:08:31 +02:00
Christoph Hellwig
166e1f901b block: export __make_request
Avoid the hacks need for request based device mappers currently by simply
exporting the symbol instead of trying to get it through the back door.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:08:27 +02:00
NeilBrown
27a7b260f7 md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.
0.90 metadata uses an unsigned 32bit number to count the number of
kilobytes used from each device.
This should allow up to 4TB per device.
However we multiply this by 2 (to get sectors) before casting to a
larger type, so sizes above 2TB get truncated.

Also we allow rdev->sectors to be larger than 4TB, so it is possible
for the array to be resized larger than the metadata can handle.
So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
used.

Also the sanity check at the end of super_90_load should include level
1 as it used ->size too. (RAID0 and Linear don't use ->size at all).

Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-10 17:21:28 +10:00
NeilBrown
079fa166a2 md/raid1,10: Remove use-after-free bug in make_request.
A single request to RAID1 or RAID10 might result in multiple
requests if there are known bad blocks that need to be avoided.

To detect if we need to submit another write request we test:
 	if (sectors_handled < (bio->bi_size >> 9)) {

However this is after we call **_write_done() so the 'bio' no longer
belongs to us - the writes could have completed and the bio freed.

So move the **_write_done call until after the test against
bio->bi_size.

This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862

Reported-by: Bruno Wolff III <bruno@wolff.to>
Tested-by: Bruno Wolff III <bruno@wolff.to>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-10 17:21:23 +10:00
NeilBrown
19d5f834d6 md/raid10: unify handling of write completion.
A write can complete at two different places:
1/ when the last member-device write completes, through
   raid10_end_write_request
2/ in make_request() when we remove the initial bias from ->remaining.

These two should do exactly the same thing and the comment says they
do, but they don't.

So factor the correct code out into a function and call it in both
places.  This makes the code much more similar to RAID1.

The difference is only significant if there is an error, and they
usually take a while, so it is unlikely that there will be an error
already when make_request is completing, so this is unlikely to cause
real problems.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-10 17:21:17 +10:00
NeilBrown
43220aa0f2 md/raid5: fix a hang on device failure.
Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
cannot work with internal metadata as it is the raid5d thread which
will clear the blocked flag.
This wasn't a problem in 3.0 and earlier as we only set the blocked
flag when external metadata was used then.
However we now set it always, so we need to be more careful.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-31 12:49:14 +10:00
NeilBrown
7da64a0abc md: fix clearing of 'blocked' flag in the presence of bad blocks.
When the 'blocked' flag on a device is cleared while there are
unacknowledged bad blocks we must fail the device.  This is needed for
backwards compatability of the interface.

The code currently uses the wrong test for "unacknowledged bad blocks
exist".  Change it to the right test.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-30 16:20:17 +10:00
NeilBrown
1b6afa1758 md/linear: avoid corrupting structure while waiting for rcu_free to complete.
I don't know what I was thinking putting 'rcu' after a dynamically
sized array!  The array could still be in use when we call rcu_free()
(That is the point) so we mustn't corrupt it.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25 14:43:53 +10:00
Namhyung Kim
a5bf4df0c8 md: use REQ_NOIDLE flag in md_super_write()
Queue idling is used for the anticipation of immediate
sequencial I/O's but md_super_write() is a kind of one-
shot operation, coupled with md_super_wait(), so the
idling in this case will be just a waste of time.

Specifying REQ_NOIDLE prevents it. Instead of adding
the flag to submit_bio() directly, use pre-defined
macro WRITE_FLUSH_FUA.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25 14:43:34 +10:00
NeilBrown
aeb9b21184 md: ensure changes to 'write-mostly' are reflected in metadata.
The 'write-mostly' flag can be changed through sysfs.
With 0.90 metadata, those changes are reflected in the metadata.
For 1.x metadata, they aren't.

So fix super_1_sync to record 'write-mostly' status.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25 14:43:08 +10:00
NeilBrown
5ef56c8fec md: report failure if a 'set faulty' request doesn't.
Sometimes a device will refuse to be set faulty.  e.g. RAID1 will
never let the last working device become faulty.

So check if "md_error()" did manage to set the faulty flag and fail
with EBUSY if it didn't.

Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198
Reported-by: Mike Hommey <mh+reportbug@glandium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25 14:42:51 +10:00
Mike Snitzer
ed8b752bcc dm table: set flush capability based on underlying devices
DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities
regardless of whether or not a given DM device's underlying devices
also advertised a need for them.

Block's flush-merge changes from 2.6.39 have proven to be more costly
for DM devices.  Performance regressions have been reported even when
DM's underlying devices do not advertise that they have a write cache.

Fix the performance regressions by configuring a DM device's flushing
capabilities based on those of the underlying devices' capabilities.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:08 +01:00
Milan Broz
772ae5f54d dm crypt: optionally support discard requests
Add optional parameter field to dmcrypt table and support
"allow_discards" option.

Discard requests bypass crypt queue processing. Bio is simple remapped
to underlying device.

Note that discard will be never enabled by default because of security
consequences.  It is up to the administrator to enable it for encrypted
devices.

(Note that userspace cryptsetup does not understand new optional
parameters yet.  Support for this will come later.  Until then, you
should use 'dmsetup' to enable and disable this.)

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:08 +01:00
Jonathan Brassow
327372797c dm raid: add md raid1 support
Support the MD RAID1 personality through dm-raid.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow
b12d437b73 dm raid: support metadata devices
Add the ability to parse and use metadata devices to dm-raid.  Although
not strictly required, without the metadata devices, many features of
RAID are unavailable.  They are used to store a superblock and bitmap.

The role, or position in the array, of each device must be recorded in
its superblock.  This is to help with fault handling, array reshaping,
and sanity checks.  RAID 4/5/6 devices must be loaded in a specific order:
in this way, the 'array_position' field helps validate the correctness
of the mapping when it is loaded.  It can be used during reshaping to
identify which devices are added/removed.  Fault handling is impossible
without this field.  For example, when a device fails it is recorded in
the superblock.  If this is a RAID1 device and the offending device is
removed from the array, there must be a way during subsequent array
assembly to determine that the failed device was the one removed.  This
is done by correlating the 'array_position' field and the bit-field
variable 'failed_devices'.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow
46bed2b5c1 dm raid: add write_mostly parameter
Add the write_mostly parameter to RAID1 dm-raid tables.

This allows the user to set the WriteMostly flag on a RAID1 device that
should normally be avoided for read I/O.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Jonathan Brassow
c1084561bb dm raid: add region_size parameter
Allow the user to specify the region_size.

Ensures that the supplied value meets md's constraints, viz. the number of
regions does not exceed 2^21.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:07 +01:00
Mikulas Patocka
759dea204c dm ioctl: forbid multiple device specifiers
Exactly one of name, uuid or device must be specified when referencing
an existing device.  This removes the ambiguity (risking the wrong
device being updated) if two conflicting parameters were specified.
Previously one parameter got used and any others were ignored silently.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mikulas Patocka
ba2e19b0f4 dm ioctl: introduce __get_dev_cell
Move logic to find device based on major/minor number to a separate
function __get_dev_cell (similar to __get_uuid_cell and __get_name_cell).
This makes the function __find_device_hash_cell more straightforward.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mikulas Patocka
0ddf9644cc dm ioctl: fill in device parameters in more ioctls
Move parameter filling from find_device to __find_device_hash_cell.

This patch causes ioctls using __find_device_hash_cell
(DM_DEV_REMOVE_CMD, DM_DEV_SUSPEND_CMD - resume, DM_TABLE_CLEAR_CMD)
to return device parameters, bringing them into line with the other
ioctls.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mike Snitzer
a3998799fb dm flakey: add corrupt_bio_byte feature
Add corrupt_bio_byte feature to simulate corruption by overwriting a byte at a
specified position with a specified value during intervals when the device is
"down".

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:06 +01:00
Mike Snitzer
b26f5e3d71 dm flakey: add drop_writes
Add 'drop_writes' option to drop writes silently while the
device is 'down'.  Reads are not touched.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer
dfd068b01f dm flakey: support feature args
Add the ability to specify arbitrary feature flags when creating a
flakey target.  This code uses the same target argument helpers that
the multipath target does.

Also remove the superfluous 'dm-flakey' prefixes from the error messages,
as they already contain the prefix 'flakey'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer
30e4171bfe dm flakey: use dm_target_offset and support discards
Use dm_target_offset() and support discards.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:05 +01:00
Mike Snitzer
498f0103ea dm table: share target argument parsing functions
Move multipath target argument parsing code into dm-table so other
targets can share it.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mikulas Patocka
a6e50b409d dm snapshot: skip reading origin when overwriting complete chunk
If we write a full chunk in the snapshot, skip reading the origin device
because the whole chunk will be overwritten anyway.

This patch changes the snapshot write logic when a full chunk is written.
In this case:
  1. allocate the exception
  2. dispatch the bio (but don't report the bio completion to device mapper)
  3. write the exception record
  4. report bio completed

Callbacks must be done through the kcopyd thread, because callbacks must not
race with each other.  So we create two new functions:

  dm_kcopyd_prepare_callback: allocate a job structure and prepare the callback.
  (This function must not be called from interrupt context.)

  dm_kcopyd_do_callback: submit callback.
  (This function may be called from interrupt context.)

Performance test (on snapshots with 4k chunk size):
  without the patch:
    non-direct-io sequential write (dd):    17.7MB/s
    direct-io sequential write (dd):        20.9MB/s
    non-direct-io random write (mkfs.ext2): 0.44s

  with the patch:
    non-direct-io sequential write (dd):    26.5MB/s
    direct-io sequential write (dd):        33.2MB/s
    non-direct-io random write (mkfs.ext2): 0.27s

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mikulas Patocka
d5b9dd04bd dm: ignore merge_bvec for snapshots when safe
Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate
whether the device can accept bios larger than the size its merge
function returns.  When set, use this to send large bios to snapshots
which can split them if necessary.  Snapshot I/O may be significantly
fragmented and this approach seems to improve peformance.

Before the patch, dm_set_device_limits restricted bio size to page size
if the underlying device had a merge function and the target didn't
provide a merge function.  After the patch, dm_set_device_limits
restricts bio size to page size if the underlying device has a merge
function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't
provide a merge function.

The snapshot target can't provide a merge function because when the merge
function is called, it is impossible to determine where the bio will be
remapped.  Previously this led us to impose a 4k limit, which we can
now remove if the snapshot store is located on a device without a merge
function.  Together with another patch for optimizing full chunk writes,
it improves performance from 29MB/s to 40MB/s when writing to the
filesystem on snapshot store.

If the snapshot store is placed on a non-dm device with a merge function
(such as md-raid), device mapper still limits all bios to page size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Mike Snitzer
0864901254 dm table: clean dm_get_device and move exports
There is no need for __table_get_device to be factored out.
Also move the exports to the end of their respective functions.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:04 +01:00
Alasdair G Kergon
3e8dbb7f39 dm raid: tidy includes
A dm target only needs to use include/linux dm headers.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:03 +01:00
Alasdair G Kergon
2ca4c92f58 dm ioctl: prevent empty message
Detect invalid empty messages in core dm instead of requiring every target to
check this.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:03 +01:00
Jonathan Brassow
13c87583ea dm raid: cleanup parameter handling
Re-order the parameters so they are handled consistently in the same order
where defined, parsed and output.

Only include rebuild parameters in the STATUSTYPE_TABLE output if they were
supplied in the original table line.

Correct the parameter count when outputting rebuild: there are two words,
not one.

Use case-independent checks for keywords (as in other device-mapper targets).

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:03 +01:00
Jonathan Brassow
a2d2b0345a dm snapshot: style cleanups
Coding style cleanups.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
2011-08-02 12:32:03 +01:00
Mikulas Patocka
aa3f0794d2 dm snapshot: remove unused definitions
Remove a couple of unused #defines.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:03 +01:00
Mikulas Patocka
5bf45a3dcd dm kcopyd: remove nr_pages field from job structure
The nr_pages field in struct kcopyd_job is only used temporarily in
run_pages_job() to count the number of required pages.
We can use a local variable instead.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:02 +01:00
Mikulas Patocka
4622afb3f5 dm kcopyd: remove offset field from job structure
The offset field in struct kcopyd_job is always zero so remove it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:02 +01:00
Joe Perches
e29e65aacb dm: use vzalloc
Use vzalloc() instead of vmalloc()+memset().

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:02 +01:00
Kirill A. Shutemov
6c9b27ab08 dm log: userspace use list_move
Replace list_del() followed by list_add() with list_move().

Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:02 +01:00
Akinobu Mita
c8f543e078 dm log: clean up bit little endian bitops
Using __test_and_{set,clear}_bit_le() with ignoring its return value
can be replaced with __{set,clear}_bit_le().

This also removes unnecessary casts.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Mike Snitzer
936688d7eb dm table: fix discard support
Remove 'discards_supported' from the dm_table structure.  The same
information can be easily discovered from the table's target(s) in
dm_table_supports_discards().

Before this fix dm_table_supports_discards() would skip checking the
individual targets' 'discards_supported' flag if any one target in the
table didn't set num_discard_requests > 0.  Now the per-target
'discards_supported' flag is effective at insuring the final DM device
advertises discard support.  But, to be clear, targets that don't
support discards (!num_discard_requests) will not receive discard
requests.

Also DMWARN if a target sets 'discards_supported' override but forgets
to set 'num_discard_requests'.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Alasdair G Kergon
283a8328ca dm: suppress endian warnings
Suppress sparse warnings about cpu_to_le32() by using __le32 types for
on-disk data etc.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Alasdair G Kergon
d15b774c29 dm: fix idr leak on module removal
Destroy _minor_idr when unloading the core dm module.  (Found by kmemleak.)

Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Mikulas Patocka
bb91bc7bac dm io: flush cpu cache with vmapped io
For normal kernel pages, CPU cache is synchronized by the dma layer.
However, this is not done for pages allocated with vmalloc. If we do I/O
to/from vmallocated pages, we must synchronize CPU cache explicitly.

Prior to doing I/O on vmallocated page we must call
flush_kernel_vmap_range to flush dirty cache on the virtual address.
After finished read we must call invalidate_kernel_vmap_range to
invalidate cache on the virtual address, so that accesses to the virtual
address return newly read data and not stale data from CPU cache.

This patch fixes metadata corruption on dm-snapshots on PA-RISC and
possibly other architectures with caches indexed by virtual address.

Cc: stable <stable@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:01 +01:00
Mike Snitzer
286f367dad dm mpath: fix potential NULL pointer in feature arg processing
Avoid dereferencing a NULL pointer if the number of feature arguments
supplied is fewer than indicated.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Cc: stable@kernel.org
2011-08-02 12:32:00 +01:00
Mikulas Patocka
762a80d9fc dm snapshot: flush disk cache when merging
This patch makes dm-snapshot flush disk cache when writing metadata for
merging snapshot.

Without cache flushing the disk may reorder metadata write and other
data writes and there is a possibility of data corruption in case of
power fault.

Cc: stable@kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02 12:32:00 +01:00
Linus Torvalds
6140333d36 Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (75 commits)
  md/raid10: handle further errors during fix_read_error better.
  md/raid10: Handle read errors during recovery better.
  md/raid10: simplify read error handling during recovery.
  md/raid10: record bad blocks due to write errors during resync/recovery.
  md/raid10:  attempt to fix read errors during resync/check
  md/raid10:  Handle write errors by updating badblock log.
  md/raid10: clear bad-block record when write succeeds.
  md/raid10: avoid writing to known bad blocks on known bad drives.
  md/raid10 record bad blocks as needed during recovery.
  md/raid10: avoid reading known bad blocks during resync/recovery.
  md/raid10 - avoid reading from known bad blocks - part 3
  md/raid10: avoid reading from known bad blocks - part 2
  md/raid10: avoid reading from known bad blocks - part 1
  md/raid10: Split handle_read_error out from raid10d.
  md/raid10: simplify/reindent some loops.
  md/raid5: Clear bad blocks on successful write.
  md/raid5.  Don't write to known bad block on doubtful devices.
  md/raid5: write errors should be recorded as bad blocks if possible.
  md/raid5: use bad-block log to improve handling of uncorrectable read errors.
  md/raid5: avoid reading from known bad blocks.
  ...
2011-07-28 05:50:27 -07:00
NeilBrown
58c54fcca3 md/raid10: handle further errors during fix_read_error better.
If we find more read/write errors we should record a bad block before
failing the device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown
5e5702898e md/raid10: Handle read errors during recovery better.
Currently when we get a read error during recovery, we simply abort
the recovery.

Instead, repeat the read in page-sized blocks.
On successful reads, write to the target.
On read errors, record a bad block on the destination,
and only if that fails do we abort the recovery.

As we now retry reads we need to know where we read from.  This was in
bi_sector but that can be changed during a read attempt.
So store the correct from_addr and to_addr in the r10_bio for later
access.


Signed-off-by: NeilBrown<neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown
e684e41db3 md/raid10: simplify read error handling during recovery.
If a read error is detected during recovery the code currently
fails the read device.
This isn't really necessary.  recovery_request_write will signal
a write error to end_sync_write and it will record a write
error on the destination device which will record a bad block
there or kick it from the array.

So just remove this call to do md_error.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown
1a0b7cd826 md/raid10: record bad blocks due to write errors during resync/recovery.
If we get a write error during resync/recovery don't fail the device
but instead record a bad block.  If that fails we can then fail the
device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown
f84ee364dd md/raid10: attempt to fix read errors during resync/check
We already attempt to fix read errors found during normal IO
and a 'repair' process.
It is best to try to repair them at any time they are found,
so move a test so that during sync and check a read error will
be corrected by over-writing with good data.

If both (all) devices have known bad blocks in the sync section we
won't try to fix even though the bad blocks might not overlap.  That
should be considered later.

Also if we hit a read error during recovery we don't try to fix it.
It would only be possible to fix if there were at least three copies
of data, which is not very common with RAID10.  But it should still
be considered later.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:25 +10:00
NeilBrown
bd870a16c5 md/raid10: Handle write errors by updating badblock log.
When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well be many blocks, we trying writing each
block individually and only log the ones which fail.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown
749c55e942 md/raid10: clear bad-block record when write succeeds.
If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown
d4432c23be md/raid10: avoid writing to known bad blocks on known bad drives.
Writing to known bad blocks on drives that have seen a write error
is asking for trouble.  So try to avoid these blocks.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown
e875ecea26 md/raid10 record bad blocks as needed during recovery.
When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.

If this fails, we need to abort the recovery.

To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown
40c356ce5a md/raid10: avoid reading known bad blocks during resync/recovery.
During resync/recovery limit the size of the request to avoid
reading into a bad block that does not start at-or-before the current
read address.

Similarly if there is a bad block at this address, don't allow the
current request to extend beyond the end of that bad block.

Now that we don't ever read from known bad blocks, it is safe to allow
devices with those blocks into the array.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown
8dbed5cebd md/raid10 - avoid reading from known bad blocks - part 3
When attempting to repair a read error, don't read from
devices with a known bad block.

As we are only reading PAGE_SIZE blocks, we don't try to
narrow down to smaller regions in the hope that only part of this
page is bad - it isn't worth the effort.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:24 +10:00
NeilBrown
7399c31bc9 md/raid10: avoid reading from known bad blocks - part 2
When redirecting a read error to a different device, we must
again avoid bad blocks and possibly split the request.

Spin_lock typo fixed thanks to Dan Carpenter <error27@gmail.com>

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown
856e08e237 md/raid10: avoid reading from known bad blocks - part 1
This patch just covers the basic read path:
 1/ read_balance needs to check for badblocks, and return not only
    the chosen slot, but also how many good blocks are available
    there.
 2/ read submission must be ready to issue multiple reads to
    different devices as different bad blocks on different devices
    could mean that a single large read cannot be served by any one
    device, but can still be served by the array.
    This requires keeping count of the number of outstanding requests
    per bio.  This count is stored in 'bi_phys_segments'

On read error we currently just fail the request if another target
cannot handle the whole request.  Next patch refines that a bit.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown
560f8e5532 md/raid10: Split handle_read_error out from raid10d.
raid10d() is too big and is about to get bigger, so split
handle_read_error() out as a separate function.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown
1294b9c973 md/raid10: simplify/reindent some loops.
When a loop ends with a large if, it can be neater to change the
if to invert the condition and just 'continue'.
Then the body of the if can be indented to a lower level.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown
b84db560ea md/raid5: Clear bad blocks on successful write.
On a successful write to a known bad block, flag the sh
so that raid5d can remove the known bad block from the list.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:23 +10:00
NeilBrown
73e92e51b7 md/raid5. Don't write to known bad block on doubtful devices.
If a device has seen write errors, don't write to any known
bad blocks on that device.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:22 +10:00
NeilBrown
bc2607f393 md/raid5: write errors should be recorded as bad blocks if possible.
When a write error is detected, don't mark the device as failed
immediately but rather record the fact for handle_stripe to deal with.

Handle_stripe then attempts to record a bad block.  Only if that fails
does the device get marked as faulty.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:22 +10:00
NeilBrown
7f0da59bdc md/raid5: use bad-block log to improve handling of uncorrectable read errors.
If we get an uncorrectable read error - record a bad block rather than
failing the device.
And if these errors (which may be due to known bad blocks) cause
recovery to be impossible, record a bad block on the recovering
devices, or abort the recovery.

As we might abort a recovery without failing a device we need to teach
RAID5 about recovery_disabled handling.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:22 +10:00
NeilBrown
31c176ecdf md/raid5: avoid reading from known bad blocks.
There are two times that we might read in raid5:
1/ when a read request fits within a chunk on a single
   working device.
   In this case, if there is any bad block in the range of
   the read, we simply fail the cache-bypass read and
   perform the read though the stripe cache.

2/ when reading into the stripe cache.  In this case we
   mark as failed any device which has a bad block in that
   strip (1 page wide).
   Note that we will both avoid reading and avoid writing.
   This is correct (as we will never read from the block, there
   is no point writing), but not optimal (as writing could 'fix'
   the error) - that will be addressed later.

If we have not seen any write errors on the device yet, we treat a bad
block like a recent read error.  This will encourage an attempt to fix
the read error which will either generate a write error, or will
ensure good data is stored there.  We don't yet forget the bad block
in that case.  That comes later.

Now that we honour bad blocks when reading we can allow devices with
bad blocks into the array.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:39:22 +10:00
NeilBrown
62096bce23 md/raid1: factor several functions out or raid1d()
raid1d is too big with several deep branches.
So separate them out into their own functions.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:38:13 +10:00
NeilBrown
3a9f28a511 md/raid1: improve handling of read failure during recovery.
If we cannot read a block from anywhere during recovery, there is
now a better approach than just giving up.
We can record a bad block on each device and keep going - being
careful not to clear the bad block when a write succeeds as it might -
it will be a write of incorrect data.

We have now reached the state where - for raid1 - we only call
md_error if md_set_badblocks has failed.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:33:42 +10:00
NeilBrown
d8f05d2995 md/raid1: record badblocks found during resync etc.
If we find a bad block while writing as part of resync/recovery we
need to report that back to raid1d which must record the bad block,
or fail the device.

Similarly when fixing a read error, a further error should just
record a bad block if possible rather than failing the device.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:33:00 +10:00
NeilBrown
cd5ff9a16f md/raid1: Handle write errors by updating badblock log.
When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well be many blocks, we trying writing each
block individually and only log the ones which fail.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:32:41 +10:00
NeilBrown
2ca68f5ed7 md/raid1: store behind-write pages in bi_vecs.
When performing write-behind we allocate pages to store the data
during write.
Previously we just keep a list of pages.  Now we keep a list of
bi_vec which includes offset and size.
This means that the r1bio has complete information to create a new
bio which will be needed for retrying after write errors.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:32:10 +10:00
NeilBrown
4367af5561 md/raid1: clear bad-block record when write succeeds.
If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:49 +10:00
NeilBrown
1f68f0c4b6 md/raid1: avoid writing to known-bad blocks on known-bad drives.
If we have seen any write error on a drive, then don't write to
any known-bad blocks on that drive.
If necessary, we divide the write request up into pieces just
like we do for reads, so each piece is either all written or
all not written to any given drive.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:48 +10:00
NeilBrown
de393cdea6 md: make it easier to wait for bad blocks to be acknowledged.
It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlock'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested it.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata.   This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed.  Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks.  So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:48 +10:00
NeilBrown
d7a9d443bc md: add 'write_error' flag to component devices.
If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:48 +10:00
NeilBrown
06f603851f md/raid1: avoid reading known bad blocks during resync
When performing resync/etc, keep the size of the request
small enough that it doesn't overlap any known bad blocks.
Devices with badblocks at the start of the request are completely
excluded.
If there is nowhere to read from due to bad blocks, record
a bad block on each target device.

Now that we never read from known-bad-blocks we can allow devices with
known-bad-blocks into a RAID1.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:48 +10:00
NeilBrown
d2eb35acfd md/raid1: avoid reading from known bad blocks.
Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
  1/ read_balance needs to check for bad blocks, and return not only
     the chosen device, but also how many good blocks are available
     there.
  2/ fix_read_error needs to avoid trying to read from bad blocks.
  3/ read submission must be ready to issue multiple reads to
     different devices as different bad blocks on different devices
     could mean that a single large read cannot be served by any one
     device, but can still be served by the array.
     This requires keeping count of the number of outstanding requests
     per bio.  This count is stored in 'bi_phys_segments'
  4/ retrying a read needs to also be ready to submit a smaller read
     and queue another request for the rest.

This does not yet handle bad blocks when reading to perform resync,
recovery, or check.

'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:48 +10:00
NeilBrown
9f2f383078 md: Disable bad blocks and v0.90 metadata.
v0.90 metadata cannot record bad blocks, so when loading metadata
for such a device, set shift to -1.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:47 +10:00
NeilBrown
2699b67223 md: load/store badblock list from v1.x metadata
Space must have been allocated when array was created.
A feature flag is set when the badblock list is non-empty, to
ensure old kernels don't load and trust the whole device.

We only update the on-disk badblocklist when it has changed.
If the badblocklist (or other metadata) is stored on a bad block, we
don't cope very well.

If metadata has no room for bad block, flag bad-blocks as disabled,
and do the same for 0.90 metadata.

Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28 11:31:47 +10:00
NeilBrown
34b343cff4 md: don't allow arrays to contain devices with bad blocks.
As no personality understand bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.

This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28 11:31:47 +10:00