Commit Graph

2404 Commits

Author SHA1 Message Date
Jonathan E Brassow
c039c332f2 dm raid: move sectors_per_dev calculation
In preparation for RAID10 inclusion in dm-raid, we move the sectors_per_dev
calculation later in the device creation process.  This is because we won't
know up-front how many stripes vs how many mirrors there are which will
change the calculation.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:04 +01:00
Jonathan E Brassow
f999e8fe70 dm raid: restructure parse_raid_params
In preparation for RAID10 addition to dm-raid, we change an 'if' conditional
to a 'switch' conditional to make it easier to see what is being checked for
each RAID type.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:04 +01:00
Mike Snitzer
a58a935d5a dm mpath: add retain_attached_hw_handler feature
A SCSI device handler might get attached to a device during the
initial device scan.  We do not necessarily want to override
this when loading a multipath table, so this patch adds a new
multipath feature argument "retain_attached_hw_handler".

During SCSI device scan all loaded SCSI device handlers will be
consulted for a match (via scsi_dh's provided .match).  If a match is
found that device handler will be attached.  We need a way to have
userspace multipathd's provided 'hw_handler' not override the already
attached hardware handler.

When specifying the new feature 'retain_attached_hw_handler' multipath
will use the currently attached hardware handler instead of trying to
attach the one specified during table load.  If no hardware handler is
attached the specified hardware handler will still be used.

Leverages scsi_dh_attach's ability to increment the scsi_dh's reference
count if the same scsi_dh name is provided when attaching - currently
attached scsi_dh name is determined with scsi_dh_attached_handler_name.

Depends upon commit 7e8a74b177
("[SCSI] scsi_dh: add scsi_dh_attached_handler_name").

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Babu Moger <babu.moger@netapp.com>
Reviewed-by: Chandra Seetharaman <sekharan@us.ibm.com>
Acked-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:04 +01:00
Mikulas Patocka
f9a8e0cd26 dm thin: optimize power of two block size
dm-thin will be most likely used with a block size that is a power of
two. So it should be optimized for this case.

This patch changes division and modulo operations to shifts and bit
masks if block size is a power of two.

A test that bi_sector is divisible by a block size is removed from
io_overlaps_block. Device mapper never sends bios that span a block
boundary. Consequently, if we tested that bi_size is equivalent to block
size, bi_sector must already be on a block boundary.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:03 +01:00
Mikulas Patocka
4929630901 dm thin: split discards on block boundary
This patch sets the variable "ti->split_discard_requests" for the dm thin
target so that device mapper core splits discard requests on a block
boundary.

Consequently, a discard request that spans multiple blocks is never sent
to dm-thin. The patch also removes some code in process_discard that
deals with discards that span multiple blocks.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:03 +01:00
Mikulas Patocka
7acf0277ce dm: introduce split_discard_requests
This patch introduces a new variable split_discard_requests. It can be
set by targets so that discard requests are split on max_io_len
boundaries.

When split_discard_requests is not set, discard requests are only split on
boundaries between targets, as was the case before this patch.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:03 +01:00
Mike Snitzer
55f2b8bdb0 dm thin: support for non power of 2 pool blocksize
Non power of 2 blocksize support is needed to properly align thinp IO
on storage that has non power of 2 optimal IO sizes (e.g. RAID6 10+2).

Use sector_div to support non power of 2 blocksize for the pool's
data device.  This provides comparable performance to the power of 2
math that was performed until now (as tested on modern x86_64 hardware).

The kernel currently assumes that limits->discard_granularity is a power
of two so the thin target only enables discard support if the block
size is a power of two.

Eliminate pool structure's 'block_shift', 'offset_mask' and
remaining 4 byte holes.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:02 +01:00
Mikulas Patocka
33d07c0dfa dm stripe: optimize chunk_size calculations
dm-stripe is usually used with a chunk size that is a power of two.
Use faster shifts and bit masks in such cases.

stripe_width is already optimized in a similar way.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:02 +01:00
Mikulas Patocka
8f069b41bc dm stripe: remove minimum stripe size
There is no technical limitation in device mapper that would prevent the
dm-stripe target from using a stripe size smaller than page size.

This patch removes the limit and makes stripe volumes portable across
architectures with different page size.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:01 +01:00
Mike Snitzer
eb850de608 dm stripe: support for non power of 2 chunksize
Support non-power-of-2 chunk sizes with dm striping for proper alignment
of stripe IO on storage that has non-power-of-2 optimal IO sizes (e.g.
RAID6 10+2).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:01 +01:00
Mike Snitzer
542f903814 dm: support non power of two target max_io_len
Remove the restriction that limits a target's specified maximum incoming
I/O size to be a power of 2.

Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'.
Change it from sector_t to uint32_t, which is plenty big enough, and
introduce a wrapper function dm_set_target_max_io_len() to set it.
Use sector_div() to process it now that it is not necessarily a power of 2.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:00 +01:00
Mikulas Patocka
1df05483d7 dm stripe: remove stripes_mask
The structure stripe_c contains a stripes_mask field. This field is
useless because it can be trivially calculated by subtracting one from
stripes. It is used only at one place. This patch removes it.

The patch also changes ffs(stripes) - 1 to __ffs(stripes).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:00 +01:00
Mikulas Patocka
f14fa693c9 dm stripe: fix size test
dm-stripe is supposed to ensure that all the space allocated to the
stripes is fully used and that all stripes are the same size.  This
patch fixes the test.  It checks that device length is divisible by the
chunk size and checks that the resulting quotient is divisible by the
number of stripes (which is equivalent to testing if device length is
divisible by chunk_size * stripes).

Previously, the code only tested that the number of sectors in the target
was divisible by each of the chunk size and the number of stripes
separately, which could leave entire stripes unused.

(A setup that genuinely needs some stripes to be shorter than others
can be created by concatenating striped targets.)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:08:00 +01:00
Mike Snitzer
f09996c993 dm thin: provide specific errors for two table load failure cases
Provide specific error message strings for two pool_ctr() failure cases
that currently give just "Unknown error".

Reference: test_two_pools_pointing_to_the_same_metadata_fails and
test_different_pool_cant_replace_pool in thinp-test-suite.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:07:59 +01:00
majianpeng
1a66a08ae8 dm: replace simple_strtoul
Replace obsolete simple_strtoul() with kstrtou8/kstrtouint.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:07:59 +01:00
Alasdair G Kergon
70c4861102 dm snapshot: remove redundant assignment in merge fn
Remove redundant bvm->bi_sector self-assignment in dm snapshot's
origin_merge().

Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:07:59 +01:00
Joe Thornber
8c971178a7 dm thin metadata: introduce THIN_MAX_CONCURRENT_LOCKS
Introduce THIN_MAX_CONCURRENT_LOCKS into dm-thin-metadata to
give a name to an otherwise "magic" number.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:07:58 +01:00
Joe Thornber
d973ac196b dm thin metadata: remove pointless label from __commit_transaction
Remove the pointless label 'out' from __commit_transaction in
dm-thin-metadata.c

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:07:58 +01:00
Joe Thornber
3caf6d73d4 dm persistent data: remove debug space map checker
Remove debug space map checker from dm persistent data.

The space map checker is a wrapper for other space maps that double
checks the reference counts are correct.  It holds all these reference
counts in memory rather than on disk, so uses a lot of memory and is
thus restricted to small pools.

As yet, this checker hasn't found any issues, but has caused a few of
its own due to people turning it on by default with larger pools.

Removing.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:07:58 +01:00
Mike Snitzer
17b7d63f7e dm thin: clean up compiler warning
Clean up "warning: dubious: !x & y".  Also make it clear that
__snapshotted_since() returns a bool and that dm_thin_lookup_result's
'shared' member is a flag.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:07:57 +01:00
Alasdair G Kergon
7768ed33cc dm thin: reduce endio_hook pool size
Reduce the slab size used for the dm_thin_endio_hook mempool.

Allocation has been seen to fail on machines with smaller amounts
of memory due to fragmentation.

  lvm: page allocation failure. order:5, mode:0xd0
  device-mapper: table: 253:38: thin-pool: Error creating pool's endio_hook mempool

Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-27 15:07:57 +01:00
Linus Torvalds
935173744a Three fixes for device-mapper discard processing:
- avoid a crash in dm-raid1 when discards coincide with mirror recovery;
   - avoid discarding shared data that's still needed in dm-thin;
   - don't guarantee that discarded blocks will be wiped in dm-raid1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQCV8qAAoJEK2W1qbAHj1niSAP/2K0RkgWvL0hwuaM+us0oh29
 XFou6Tb9pH+//QfKOJuClHeSfZFoHYuevvJPtwTqPlHGONE2YXeBtVmyp0k+BS69
 xoaQy+OoZFrEbhxyJFrg+lDcxVGRtvo7x9zegeRf++o/skRfRgAjzyLkI8bk4t3v
 c3vSDTVBikJXlTxa+J7EQpeW29DBiky+tIHQQx0+98u2VSlaFFP6MdLr1ROeq7yF
 +z3kEXk6qzwL9ZHTWuVCvhi7bw4i18UTrH0wxZuUXWRpz+Va5h7w+/zcQbau6D/s
 K+BmlAW/fxzZOW4guFU6pCLlVGU4BsJxUXT55UaP4Dx9UuV59EtIPsDb8/Y/pGMX
 t9xnC4GmSOjw52pW2VR2gUJwG/c5mJ9g/mdP6twQzcC4JJ+CYg4Q5lH88qzDqceS
 VCrW681nIKIVoja5n1adv6gbZax8hlR/z8ElXrqELDmXk7nKBLOLdDVSXzZ9ceX1
 RnvtAZE/zrxcslKHw52Sd37c8YRer/fgx3kQxhXd1nb096DgiWvE/taD/ixjWHQX
 Eu1KrQIelvw63/BNNTKYRF7xS0dGKsGNaXWln7cMONG28CnrWG/8f+mp+KG73x5e
 Fc8yCONHNbqmf95yx1N0MgfYlZFjBBw0+BtqmR7QVcnG3r4SaSug+F72SPb5nN/B
 ZBmwNcSBaaC952+5pMZa
 =gbLp
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper discard fixes from Alasdair G Kergon:
  - avoid a crash in dm-raid1 when discards coincide with mirror
    recovery;
  - avoid discarding shared data that's still needed in dm-thin;
  - don't guarantee that discarded blocks will be wiped in dm-raid1.

* tag 'dm-3.5-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm raid1: set discard_zeroes_data_unsupported
  dm thin: do not send discards to shared blocks
  dm raid1: fix crash with mirror recovery and discard
2012-07-20 11:51:22 -07:00
Mikulas Patocka
7c8d3a42fe dm raid1: set discard_zeroes_data_unsupported
We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if
the underlying disks support zero on discard.  So this patch sets
ti->discard_zeroes_data_unsupported.

For example, if the mirror is in the process of resynchronizing, it may
happen that kcopyd reads a piece of data, then discard is sent on the
same area and then kcopyd writes the piece of data to another leg.
Consequently, the data is not zeroed.

The flag was made available by commit 983c7db347
(dm crypt: always disable discard_zeroes_data).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-20 14:25:07 +01:00
Mikulas Patocka
650d2a06b4 dm thin: do not send discards to shared blocks
When process_discard receives a partial discard that doesn't cover a
full block, it sends this discard down to that block. Unfortunately, the
block can be shared and the discard would corrupt the other snapshots
sharing this block.

This patch detects block sharing and ends the discard with success when
sending it to the shared block.

The above change means that if the device supports discard it can't be
guaranteed that a discard request zeroes data. Therefore, we set
ti->discard_zeroes_data_unsupported.

Thin target discard support with this bug arrived in commit
104655fd4d (dm thin: support discards).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-20 14:25:05 +01:00
Mikulas Patocka
751f188dd5 dm raid1: fix crash with mirror recovery and discard
This patch fixes a crash when a discard request is sent during mirror
recovery.

Firstly, some background.  Generally, the following sequence happens during
mirror synchronization:
- function do_recovery is called
- do_recovery calls dm_rh_recovery_prepare
- dm_rh_recovery_prepare uses a semaphore to limit the number
  simultaneously recovered regions (by default the semaphore value is 1,
  so only one region at a time is recovered)
- dm_rh_recovery_prepare calls __rh_recovery_prepare,
  __rh_recovery_prepare asks the log driver for the next region to
  recover. Then, it sets the region state to DM_RH_RECOVERING. If there
  are no pending I/Os on this region, the region is added to
  quiesced_regions list. If there are pending I/Os, the region is not
  added to any list. It is added to the quiesced_regions list later (by
  dm_rh_dec function) when all I/Os finish.
- when the region is on quiesced_regions list, there are no I/Os in
  flight on this region. The region is popped from the list in
  dm_rh_recovery_start function. Then, a kcopyd job is started in the
  recover function.
- when the kcopyd job finishes, recovery_complete is called. It calls
  dm_rh_recovery_end. dm_rh_recovery_end adds the region to
  recovered_regions or failed_recovered_regions list (depending on
  whether the copy operation was successful or not).

The above mechanism assumes that if the region is in DM_RH_RECOVERING
state, no new I/Os are started on this region. When I/O is started,
dm_rh_inc_pending is called, which increases reg->pending count. When
I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
If the count is zero and the region was in DM_RH_RECOVERING state,
dm_rh_dec adds it to the quiesced_regions list.

Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
in DM_RH_RECOVERING state, it could be added to quiesced_regions list
multiple times or it could be added to this list when kcopyd is copying
data (it is assumed that the region is not on any list while kcopyd does
its jobs). This results in memory corruption and crash.

There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
do not belong to any region, so they are always added to the sync list
in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
requests. These bypasses avoid the crash possibility described above.

These bypasses were improperly implemented for REQ_DISCARD when
the mirror target gained discard support in commit
5fc2ffeabb (dm raid1: support discard).

In do_writes, REQ_DISCARD requests is always added to the sync queue and
immediately dispatched (even if the region is in DM_RH_RECOVERING).  However,
dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts.  So it violates the
rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
corruption described above.

This patch changes it so that REQ_DISCARD requests follow the same path
as REQ_FLUSH. This avoids the crash.

Reference: https://bugzilla.redhat.com/837607

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-20 14:25:03 +01:00
NeilBrown
58e94ae184 md/raid1: close some possible races on write errors during resync
commit 4367af5561
   md/raid1: clear bad-block record when write succeeds.

Added a 'reschedule_retry' call possibility at the end of
end_sync_write, but didn't add matching code at the end of
sync_request_write.  So if the writes complete very quickly, or
scheduling makes it seem that way, then we can miss rescheduling
the request and the resync could hang.

Also commit 73d5c38a95
    md: avoid races when stopping resync.

Fix a race condition in this same code in end_sync_write but didn't
make the change in sync_request_write.

This patch updates sync_request_write to fix both of those.
Patch is suitable for 3.1 and later kernels.

Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Original-version-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 15:59:18 +10:00
NeilBrown
a05b7ea03d md: avoid crash when stopping md array races with closing other open fds.
md will refuse to stop an array if any other fd (or mounted fs) is
using it.
When any fs is unmounted of when the last open fd is closed all
pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
so there will be no pending IO to worry about when the array is
stopped.

However in order to send the STOP_ARRAY ioctl to stop the array one
must first get and open fd on the block device.
If some fd is being used to write to the block device and it is closed
after mdadm open the block device, but before mdadm issues the
STOP_ARRAY ioctl, then there will be no last-close on the md device so
__blkdev_put will not call sync_blockdev.

If this happens, then IO can still be in-flight while md tears down
the array and bad things can happen (use-after-free and subsequent
havoc).

So in the case where do_md_stop is being called from an open file
descriptor, call sync_block after taking the mutex to ensure there
will be no new openers.

This is needed when setting a read-write device to read-only too.

Cc: stable@vger.kernel.org
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 15:59:18 +10:00
NeilBrown
25f7fd470b md: fix bug in handling of new_data_offset
commit c6563a8c38
    md: add possibility to change data-offset for devices.

introduced a 'new_data_offset' attribute which should normally
be the same as 'data_offset', but can be explicitly set to a different
value to allow a reshape operation to move the data.

Unfortunately when the 'data_offset' is explicitly set through
sysfs, the new_data_offset is not also set, so the two would become
out-of-sync incorrectly.

One result of this is that trying to set the 'size' after the
'data_offset' would fail because it is not permitted to set the size
when the 'data_offset' and 'new_data_offset' are different - as that
can be confusing.
Consequently when mdadm tried to do this while assembling an IMSM
array it would fail.

This bug was introduced in 3.5-rc1.

Reported-by: Brian Downing <bdowning@lavos.net>
Bisected-by: Brian Downing <bdowning@lavos.net>
Tested-by: Brian Downing <bdowning@lavos.net>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-19 15:59:18 +10:00
Linus Torvalds
fdb1335a82 md: One use-after-free bugfix for RAID1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAUACxPznsnt1WYoG5AQJxDA//dD/5ZQ+35N0x6fpeOT0hxipmdk/iCmfZ
 eAxiN34YH3Zxk1oF+ZjRUzecL+I2Q3w8SAxeS5lowuJvUbPY6A5ttBZ2xJhC6oZ2
 5MgliHN7IjBfVul0DOodCN+lYG3ve7CO5fOz/QnMGBwQFdTHMMJYQw9Qf+QwsxfC
 YXxhDwt9lJrcQhUo50Na0WM0F7lC60A8Ny6ANtiLGRrfI/IDgCg2UN16VpchOWZT
 Szn9iPOB7wCGEpzCDMJf/dbZ/mQxrccfeG1F3qM9w5WNQ73SiNswCTvwjmW6O477
 32pqrK5xxPgYvqB28t93159lZpaY2GKNC9KvQ6vVqQPVih+zJxn+6qqvdvWNI3cD
 hTiUKW1O26lcQpeI8ADNrk9PQqX5o10Ypy42SgpUv8pfgcjLiJkXIv9GG5K3KfsT
 5Ea4W2Cru33d0VGsTw6rr2w1UwALjUWEXHOJRn+P5yrGJhfIXSV84RTzOWYYbBYl
 uAo1jLdlKI/jnnXMlOCi2w19PqGb1ZZpBwg5Gs5K0cBUBQCJAUU1MS1qJXEKo1oL
 i2ZMWmQxjPXLk3fadV1eoGOSpd2zOFdfCItf39qospLrC1Ym0C9p+1obkRLAQeTf
 bZMf261ZwIhH3bTCEVucOUvT1tlYaiE7s/zabWvjCajIzWEKIrWoOQw/U3eftJeZ
 SnnFsporIXk=
 =cy9g
 -----END PGP SIGNATURE-----

Merge tag 'md-3.5-fixes' of git://neil.brown.name/md

Pull use-after-free RAID1 bugfix from NeilBrown.

* tag 'md-3.5-fixes' of git://neil.brown.name/md:
  md/raid1: fix use-after-free bug in RAID1 data-check code.
2012-07-13 17:59:33 -07:00
NeilBrown
2d4f4f3384 md/raid1: fix use-after-free bug in RAID1 data-check code.
This bug has been present ever since data-check was introduce
in 2.6.16.  However it would only fire if a data-check were
done on a degraded array, which was only possible if the array
has 3 or more devices.  This is certainly possible, but is quite
uncommon.

Since hot-replace was added in 3.3 it can happen more often as
the same condition can arise if not all possible replacements are
present.

The problem is that as soon as we submit the last read request, the
'r1_bio' structure could be freed at any time, so we really should
stop looking at it.  If the last device is being read from we will
stop looking at it.  However if the last device is not due to be read
from, we will still check the bio pointer in the r1_bio, but the
r1_bio might already be free.

So use the read_targets counter to make sure we stop looking for bios
to submit as soon as we have submitted them all.

This fix is suitable for any -stable kernel since 2.6.16.

Cc: stable@vger.kernel.org
Reported-by: Arnold Schulz <arnysch@gmx.net>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-09 11:34:13 +10:00
Linus Torvalds
6c8addcb76 md - fix build error in previous patch.
I really shouldn't do important things late in the day.  It seems
 that I get careless.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIVAwUAT/OCnTnsnt1WYoG5AQIlBxAAsEAYpXhz2m081nY9lYCF4m3CqWoUdsZh
 mI/LDVUW6d26fz9uUgTEqv4JmJ3/v2m0PZEVhGiuQF5dkQbMYz/1ddBiaRMM6vWC
 pC5+RlcFnYbx62Iayr8/bXYhMZ0lLmxa1oupjEPklchZaJ4+EhRnckbfo+0OFbET
 LGe5JFpgtyx5YtHtVJRwAzNkkbHRYQxHuLuX3kMhAONVBmZ+v9c9/0yJZr2TlmYv
 wz/khYcMvGGBfr9u60JaIRPpz+b0Hhw9Qh9BhWW4On0wRFytCKktLShTptgtf6BT
 ZnXaR2X6vpgoYEJC1nrylyBlLkjnpTUhznWdySZKMGZg8UvZTiPXxN9zZne7HyEN
 eosnFx1CiFma4aZyjXvcA2RI7ZnOyKdgnOVbrS29b3z2QpmmvOn/WwnNguFJiupC
 WR9ZOpBih/mSCpLQDMkU/F6soXy/vTDcxjfYdLQc8HFeN/BJWT2szH54yD5wtd7G
 e4H3Dqs31mxRJG+wVj1XVbcLY9VdHaFwXBf6Y85fqUc3DFJ6y7fX+TEkMIEzOtxt
 PQkgZlFSP7QkowMn/3NK1ID3a9Ivjdi3l3fYyDSj582V1iXTJqYGICmibScdJSBJ
 ohzkvnsz+fxpEXkQ0jRVN5gstPTVWXS8yrUOF326my60sMiI9Jres+lE9ynrhdZN
 OqZjo07ULrU=
 =pDob
 -----END PGP SIGNATURE-----

Merge tag 'md-3.5-fixes' of git://neil.brown.name/md

Pull raid10 build failure fix from NeilBrown:
 "I really shouldn't do important things late in the day.  It seems that
  I get careless."

* tag 'md-3.5-fixes' of git://neil.brown.name/md:
  md/raid10: fix careless build error
2012-07-03 18:05:35 -07:00
NeilBrown
10684112c9 md/raid10: fix careless build error
build error introduced by commit b357f04a67

That function doesn't get extra args until a later patch.  Bother.

Reported-by: Fengguang Wu <wfg@linux.intel.com> 
Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Tobias Klausmann <tobias.johannes.klausmann@mni.thm.de>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-04 09:35:35 +10:00
Linus Torvalds
3492ee7274 Four minor thin provisioning fixes and correct and update dm-verity
documentation.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJP8t5wAAoJEK2W1qbAHj1ngaoP/0rSfdmKnP+XFUNHKmYbXv/F
 0kLMiFLQYWepbsW1+1t/e+VmssJcmJ8DlSt0DycCag3HpgwajM4MVAic5CDjEXkX
 9ewbe2LbObY8aWdnzhe+gRN7jCKPH41u6bhBNWrcskoMksfqHlpBhnk37CVS4Z5G
 OpqqgUjSXfcI05q9fZdb9BV/SvvmPj9LDC1T9mg41/zG2vAwUTkmvd7lqTFaQH/i
 35UhWANJkr5LuUHOwPqfBQSA0psPa0Z8BvAHAzW6StNurwMT/HCk/BSlCJ26rK+L
 UcojGShCRoGO4Bf/r+IDMvwhnKtJZWPxOwYwAwzPHT4pnNQ+cp9PANm5tbZrmVWt
 fZXAttdMDAiVL/iZ57AB07rLJUNYWUEfvR2YPHuBzN+aIMIs+7ORkonK5/NoQtI3
 XzdaSaLIAjdWJsskwzZFK2bFXsFTJ/J1ptnADFlxcppT/93wQ9YOu6t2dMuxWkKa
 FCYsGikXIP6LinkNWVF6wmI5wwWXINEqMABe0PXFU0kzf8saAsFsjpWmt5j0Wsk+
 nX2+x4wqaazrg48LuNsb6MH/7IgaSgX18NI+kjtULLs08Bnq7VfV+cD6XJcyJmg9
 6Bk+1+7wjwf60o5ZYcFwIQd3L8oqG8jXSH0b48fDLzAN8POkZt3eASVAEsOVaqIR
 xulqeo55eO4OkdOSL3A+
 =3VU9
 -----END PGP SIGNATURE-----

Merge tag 'dm-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper fixes from Alasdair G Kergon:
 "Four minor thin provisioning fixes and correct and update dm-verity
  documentation."

* tag 'dm-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm: verity fix documentation
  dm persistent data: fix allocation failure in space map checker init
  dm persistent data: handle space map checker creation failure
  dm persistent data: fix shadow_info_leak on dm_tm_destroy
  dm thin: commit metadata before creating metadata snapshot
2012-07-03 11:08:16 -07:00
Mike Snitzer
b0239faaf8 dm persistent data: fix allocation failure in space map checker init
If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and memory is fragmented and a
sufficiently-large metadata device is used in a thin pool then the space
map checker will fail to allocate the memory it requires.

Switch from kmalloc to vmalloc to allow larger virtually contiguous
allocations for the space map checker's internal count arrays.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-03 12:55:37 +01:00
Mike Snitzer
62662303e7 dm persistent data: handle space map checker creation failure
If CONFIG_DM_DEBUG_SPACE_MAPS is enabled and dm_sm_checker_create()
fails, dm_tm_create_internal() would still return success even though it
cleaned up all resources it was supposed to have created.  This will
lead to a kernel crash:

general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
...
RIP: 0010:[<ffffffff81593659>]  [<ffffffff81593659>] dm_bufio_get_block_size+0x9/0x20
Call Trace:
  [<ffffffff81599bae>] dm_bm_block_size+0xe/0x10
  [<ffffffff8159b8b8>] sm_ll_init+0x78/0xd0
  [<ffffffff8159c1a6>] sm_ll_new_disk+0x16/0xa0
  [<ffffffff8159c98e>] dm_sm_disk_create+0xfe/0x160
  [<ffffffff815abf6e>] dm_pool_metadata_open+0x16e/0x6a0
  [<ffffffff815aa010>] pool_ctr+0x3f0/0x900
  [<ffffffff8158d565>] dm_table_add_target+0x195/0x450
  [<ffffffff815904c4>] table_load+0xe4/0x330
  [<ffffffff815917ea>] ctl_ioctl+0x15a/0x2c0
  [<ffffffff81591963>] dm_ctl_ioctl+0x13/0x20
  [<ffffffff8116a4f8>] do_vfs_ioctl+0x98/0x560
  [<ffffffff8116aa51>] sys_ioctl+0x91/0xa0
  [<ffffffff81869f52>] system_call_fastpath+0x16/0x1b

Fix the space map checker code to return an appropriate ERR_PTR and have
dm_sm_disk_create() and dm_tm_create_internal() check for it with
IS_ERR.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-03 12:55:35 +01:00
Mike Snitzer
25d7cd6faa dm persistent data: fix shadow_info_leak on dm_tm_destroy
Cleanup the shadow table before destroying the transaction manager.

Reference: leak was identified with kmemleak when running
test_discard_random_sectors in the thinp-test-suite.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-03 12:55:33 +01:00
Joe Thornber
0d200aefd4 dm thin: commit metadata before creating metadata snapshot
Userland sometimes sees a corrupt metadata block if metadata is changing
rapidly when a metadata snapshot is reserved for userland,  To make the
problem go away, commit before we take the metadata snapshot (which is a
sensible thing to do anyway).

The checksums mean userland spots this corruption immediately so there's
no risk of acting on incorrect data.  No corruption exists from the
kernel's point of view, and thin_check passes after pool shutdown.

I believe this is to do with shared blocks at the first level of the
{device, mapping} btree.  Prior to the metadata-snap support no sharing
at this level was possible, so this patch is only required after commit
cc8394d86f ("dm thin: provide userspace
access to pool metadata").

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-07-03 12:55:31 +01:00
NeilBrown
b357f04a67 md: fix up plugging (again).
The value returned by "mddev_check_plug" is only valid until the
next 'schedule' as that will unplug things.  This could happen at any
call to mempool_alloc.
So just calling mddev_check_plug at the start doesn't really make
sense.

So call it just before, or just after, queuing things for the thread.
As the action that happens at unplug is to wake the thread, this makes
lots of sense.
If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we
wake thread immediately.

RAID5 is a bit different.  Requests are queued for the thread and the
thread is woken by release_stripe.  So we don't need to wake the
thread on failure.
However the thread doesn't perform certain actions when there is any
active plug, so it is important to install a plug before waking the
thread.  So for RAID5 we install the plug *before* queuing the request
and waking the thread.

Without this patch it is possible for raid1 or raid10 to queue a
request without then waking the thread, resulting in the array locking
up.

Also change raid10 to only flush_pending_write when there are not
active plugs, just like raid1.

This patch is suitable for 3.0 or later.  I plan to submit it to
-stable, but I'll like to let it spend a few weeks in mainline
first to be sure it is completely safe.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 17:45:31 +10:00
NeilBrown
f456309106 md: support re-add of recovering devices.
We currently only allow a device to be re-added if it appear to be
in-sync.  This is overly restrictive as it may be desirable to re-add
a device that is in the middle of recovery.

So remove the test for "InSync" - the test on rdev->raid_disk is
sufficient to ensure that the re-add will succeed.

Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 15:59:06 +10:00
NeilBrown
32644afd89 md/raid1: fix bug in read_balance introduced by hot-replace
When we added hot_replace we doubled the number of devices
that could be in a RAID1 array.  So we doubled how far read_balance
would search.  Unfortunately we didn't double the point at which
it looped back to the beginning - so it effectively loops over
all non-replacement disks twice.
This doesn't cause bad behaviour, but it pointless and means we
never read from replacement devices.

Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 15:58:42 +10:00
Shaohua Li
fab363b5ff raid5: delayed stripe fix
There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
the two bits have relationship. A delayed stripe can be moved to hold list only
when preread active stripe count is below IO_THRESHOLD. If a stripe has both
the bits set, such stripe will be in delayed list and preread count not 0,
which will make such stripe never leave delayed list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 15:57:19 +10:00
majianpeng
2e8ac30312 md/raid456: When read error cannot be recovered, record bad block
We may not be able to fix a bad block if:
 - the array is degraded
 - the over-write fails.

In these cases we currently eject the device, but we should
record a bad block if possible.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 15:57:02 +10:00
NeilBrown
0232605d98 md: make 'name' arg to md_register_thread non-optional.
Having the 'name' arg optional and defaulting to the current
personality name is no necessary and leads to errors, as when
changing the level of an array we can end up using the
name of the old level instead of the new one.

So make it non-optional and always explicitly pass the name
of the level that the array will be.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 15:56:52 +10:00
NeilBrown
055d3747db md/raid10: fix failure when trying to repair a read error.
commit 58c54fcca3
     md/raid10: handle further errors during fix_read_error better.

in 3.1 added "r10_sync_page_io" which takes an IO size in sectors.
But we were passing the IO size in bytes!!!
This resulting in bio_add_page failing, and empty request being sent
down, and a consequent BUG_ON in scsi_lib.

[fix missing space in error message at same time]

This fix is suitable for 3.1.y and later.

Cc: stable@vger.kernel.org
Reported-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 15:55:33 +10:00
NeilBrown
5f066c632f md/raid5: fix refcount problem when blocked_rdev is set.
commit 43220aa0f2
    md/raid5: fix a hang on device failure.

fixed a hang, but introduced a refcounting in-balance so
that if the presence of bad-blocks ever caused an rdev to
be 'blocked' we would increment the refcount on the rdev and
never decrement it.

So added the needed rdev_dec_pending when md_wait_for_blocked_rdev
is not called.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 12:13:29 +10:00
majianpeng
7c2c57c9a9 md:Add blk_plug in sync_thread.
Add blk_plug in sync_thread will increase the performance of sync.
Because sync_thread did not blk_plug,so when raid sync, the bio merge
not well.

Testing environment:
SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
Controller.
OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012
x86_64 x86_64 x86_64 GNU/Linux.
RAID5: four ST31000524NS disk.

Without blk_plug:recovery speed about 63M/Sec;
Add blk_plug:recovery speed about 120M/Sec.

Using blktrace:
blktrace -d /dev/sdb -w 60  -o -|blkparse -i -

without blk_plug:
Total (8,16):
 Reads Queued:      309811,     1239MiB	 Writes Queued:           0,        0KiB
 Read Dispatches:   283583,     1189MiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:   273351,     1149MiB	 Writes Completed:        0,        0KiB
 Read Merges:        23533,    94132KiB	 Write Merges:            0,        0KiB
 IO unplugs:             0        	 Timer unplugs:           0

add blk_plug:
Total (8,16):
 Reads Queued:      428697,     1714MiB	 Writes Queued:           0,        0KiB
 Read Dispatches:     3954,     1714MiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:     3956,     1715MiB	 Writes Completed:        0,        0KiB
 Read Merges:       424743,     1698MiB	 Write Merges:            0,        0KiB
 IO unplugs:             0        	 Timer unplugs:        3384

The ratio of merge will be markedly increased.

Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 12:12:26 +10:00
majianpeng
1850753d2e md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdev
In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement
nr_pending so we lose the reference we hold on the rdev.
So atomic_inc it first to maintain the reference.

This bug was introduced by commit  73e92e51b7
    md/raid5.  Don't write to known bad block on doubtful devices.

which appeared in 3.0, so patch is suitable for stable kernels since
then.

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 12:11:54 +10:00
majianpeng
6c0544e255 md/raid5: Do not add data_offset before call to is_badblock
In chunk_aligned_read() we are adding data_offset before calling
is_badblock.  But is_badblock also adds data_offset, so that is bad.

So move the addition of data_offset to after the call to
is_badblock.

This bug was introduced by commit 31c176ecdf
     md/raid5: avoid reading from known bad blocks.
which first appeared in 3.0.  So that patch is suitable for any
-stable kernel from 3.0.y onwards.  However it will need minor
revision for most of those (as the comment didn't appear until
recently).

Cc: stable@vger.kernel.org
Signed-off-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 12:09:57 +10:00
NeilBrown
5cfb22a1f8 md/raid5: prefer replacing failed devices over want-replacement devices.
If a RAID5 has both a failed device and a device marked as
'WantReplacement', then we should preferentially replace the failed
device.
However the current code replaces whichever is found first.
So split into 2 loops, check fail failed/missing first, and only check
for WantReplacement if nothing is failed or missing.

Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 11:46:53 +10:00
NeilBrown
fc448a18ae md/raid10: Don't try to recovery unmatched (and unused) chunks.
If a RAID10 has an odd number of chunks - as might happen when there
are an odd number of devices - the last chunk has no pair and so is
not mirrored.  We don't store data there, but when recovering the last
device in an array we retry to recover that last chunk from a
non-existent location.  This results in an error, and the recovery
aborts.

When we get to that last chunk we should just stop - there is nothing
more to do anyway.

This bug has been present since the introduction of RAID10, so the
patch is appropriate for any -stable kernel.

Cc: stable@vger.kernel.org
Reported-by: Christian Balzer <chibi@gol.com>
Tested-by: Christian Balzer <chibi@gol.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2012-07-03 10:37:30 +10:00