Commit Graph

3885 Commits

Author SHA1 Message Date
Mike Snitzer
bd4aaf8f9b dm: fix dm_merge_bvec regression on 32 bit systems
A DM regression on 32 bit systems was reported against v4.2-rc3 here:
https://lkml.org/lkml/2015/7/29/401

Fix this by reverting both commit 1c220c69 ("dm: fix casting bug in
dm_merge_bvec()") and 148e51ba ("dm: improve documentation and code
clarity in dm_merge_bvec").  This combined revert is done to eliminate
the possibility of a partial revert in stable@ kernels.

In hindsight the correct fix, at the time 1c220c69 was applied to fix
the regression that 148e51ba introduced, should've been to simply revert
148e51ba.

Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Tested-by: Adam Williamson <awilliam@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
2015-08-03 22:49:59 -04:00
NeilBrown
199dc6ed51 md/raid0: update queue parameter in a safer location.
When a (e.g.) RAID5 array is reshaped to RAID0, the updating
of queue parameters (e.g. max number of sectors per bio) is
done in the wrong place.
It should be part of ->run, but it is actually part of ->takeover.
This means it happens before level_store() calls:

	blk_set_stacking_limits(&mddev->queue->limits);

and so it ineffective.  This can lead to errors from underlying
devices.

So move all the relevant settings out of create_stripe_zones()
and into raid0_run().

As this can lead to a bug-on it is suitable for any -stable
kernel which supports reshape to RAID0.  So 2.6.35 or later.
As the bug has been present for five years there is no urgency,
so no need to rush into -stable.

Fixes: 9af204cf72 ("md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover")
Cc: stable@vger.kernel.org (v2.6.35+ - please delay until after -final release).
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:12:44 +10:00
Benjamin Randazzo
25eafe1a81 md: simplify get_bitmap_file now that "file" is zeroed.
There is no point assigning '\0' to file->pathname[0] as
file is now zeroed out, so remove that branch and
simplify the code.

[Original patch combined this with the change to use
 kzalloc.  I split the two so that the change to kzalloc
 is easier to backport. - neilb]

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:12:44 +10:00
NeilBrown
49895bcc7e md/raid5: don't let shrink_slab shrink too far.
I have a report of drop_one_stripe() called from
raid5_cache_scan() apparently finding ->max_nr_stripes == 0.

This should not be allowed.

So add a test to keep max_nr_stripes above min_nr_stripes.

Also use a 'mask' rather than a 'mod' in drop_one_stripe
to ensure 'hash' is valid even if max_nr_stripes does reach zero.


Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please release with 2d5b569b66)
Reported-by: Tomas Papan <tomas.papan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 17:10:56 +10:00
Benjamin Randazzo
b6878d9e03 md: use kzalloc() when bitmap is disabled
In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a
mdu_bitmap_file_t called "file".

5769         file = kmalloc(sizeof(*file), GFP_NOIO);
5770         if (!file)
5771                 return -ENOMEM;

This structure is copied to user space at the end of the function.

5786         if (err == 0 &&
5787             copy_to_user(arg, file, sizeof(*file)))
5788                 err = -EFAULT

But if bitmap is disabled only the first byte of "file" is initialized
with zero, so it's possible to read some bytes (up to 4095) of kernel
space memory from user space. This is an information leak.

5775         /* bitmap disabled, zero the first byte and copy out */
5776         if (!mddev->bitmap_info.file)
5777                 file->pathname[0] = '\0';

Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 14:56:02 +10:00
NeilBrown
423f04d63c md/raid1: extend spinlock to protect raid1_end_read_request against inconsistencies
raid1_end_read_request() assumes that the In_sync bits are consistent
with the ->degaded count.
raid1_spare_active updates the In_sync bit before the ->degraded count
and so exposes an inconsistency, as does error()
So extend the spinlock in raid1_spare_active() and error() to hide those
inconsistencies.

This should probably be part of
  Commit: 34cab6f420 ("md/raid1: fix test for 'was read error from
  last working device'.")
as it addresses the same issue.  It fixes the same bug and should go
to -stable for same reasons.

Fixes: 76073054c9 ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-03 12:29:42 +10:00
Mike Snitzer
795e633a2d dm cache: fix device destroy hang due to improper prealloc_used accounting
Commit 665022d72f ("dm cache: avoid calls to prealloc_free_structs() if
possible") introduced a regression that caused the removal of a DM cache
device to hang in cache_postsuspend()'s call to wait_for_migrations()
with the following stack trace:

  [<ffffffff81651457>] schedule+0x37/0x80
  [<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache]
  [<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0
  [<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod]
  [<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod]
  [<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod]
  [<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod]
  [<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod]
  [<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod]
  [<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10
  [<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
  [<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0
  [<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100
  [<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70
  [<ffffffff811fd689>] SyS_ioctl+0x79/0x90
  [<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110
  [<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by accounting for the call to prealloc_data_structs()
immediately _before_ the call as opposed to after.  This is needed
because it is possible to break out of the control loop after the call
to prealloc_data_structs() but before prealloc_used was set to true.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29 14:32:09 -04:00
Mike Snitzer
3508e6590d Revert "dm cache: do not wake_worker() in free_migration()"
This reverts commit 386cb7cdee.

Taking the wake_worker() out of free_migration() will slow writeback
dramatically, and hence adaptability.

Say we have 10k blocks that need writing back, but are only able to
issue 5 concurrently due to the migration bandwidth: it's imperative
that we wake_worker() immediately after migration completion; waiting
for the next 1 second wake up (via do_waker) means it'll take a long
time to write that all back.

Reported-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-29 14:32:08 -04:00
Jens Axboe
b7c44ed9d2 block: manipulate bio->bi_flags through helpers
Some places use helpers now, others don't. We only have the 'is set'
helper, add helpers for setting and clearing flags too.

It was a bit of a mess of atomic vs non-atomic access. With
BIO_UPTODATE gone, we don't have any risk of concurrent access to the
flags. So relax the restriction and don't make any of them atomic. The
flags that do have serialization issues (reffed and chained), we
already handle those separately.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:20 -06:00
Christoph Hellwig
4246a0b63b block: add a bi_error field to struct bio
Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29 08:55:15 -06:00
Baruch Siach
6ed443c07f dm crypt: update wiki page URL
Cryptsetup moved to gitlab.  This is a leftover from commit e44f23b32d
(dm crypt: update URLs to new cryptsetup project page, 2015-04-05).

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-27 07:58:16 -04:00
Colin Ian King
134bf30c06 dm cache policy smq: fix alloc_bitset check that always evaluates as false
static analysis by cppcheck has found a check on alloc_bitset that
always evaluates as false and hence never finds an allocation failure:

[drivers/md/dm-cache-policy-smq.c:1689]: (warning) Logical conjunction
  always evaluates to false: !EXPR && EXPR.

Fix this by removing the incorrect mq->cache_hit_bits check

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-27 07:58:15 -04:00
Mike Snitzer
0a927c2f02 dm thin: return -ENOSPC when erroring retry list due to out of data space
Otherwise -EIO would be returned when -ENOSPC should be used
consistently.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-26 17:39:19 -04:00
Linus Torvalds
aca105a697 Some md fixes for 4.2
Several are tagged for -stable.
 A few aren't because they are not very, serious or
 because they are in the 'experimental' cluster code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVsbRgAAoJEDnsnt1WYoG54BcQAJlVxcGdMXNAbFAfhToH+cwY
 DcCKmnXiu3TCcK/gtr0SlUKIhv7kPE2HaGbTko3g5/uTk7SCuMSWRbjpQJr1u98U
 9VUPZV0RGLEcQXgsjG3sobEtdSYMSn1/BpdJpmsn/Q6cyTheiEJA9fEghn9F0Iw9
 3Ctc56aL0nKsnpeRTvWmKR3F995L2Ene+pIHPbSqTlQcGU+DdxQL/iY9YspMzeih
 6SWQICo59+pGSRQdKaU968nZ1a8hJCvhV2NbW0JsJMo9iGo5htCJLdxZatscZV6E
 xAppCRymW/Nl0n59wCOBhUp+0skVhtYZ1UmNdx/vUA7vcZJX85vIG7WGaaSQAmlF
 Y4nNMSaX6SWbt/oVgq+JiD39T1oFZN5N5Fh9juqcp3fshiWLO74cnTwea1LtkQZX
 LfW2HhajDUgoJqoXL6ppKDK8l7alQdcaoYn0C6SfqRZ+PLB5ERrSBHNZpKDbI9aw
 CFW93lHTtlwtwZn0S7k06mCG8ilO4iPthUVaJEeI1kUWVKY0Ju4xnVTt/7jCroUa
 Km14KdWeZsS8j9/xr6FYBjarjuh7M9SoflgUK84txE+PIZsSczyrErpBkMf2YBWQ
 0KduqQabXa1XYWGtZ7Uhj6iOOGNBcTcZsww8cBEDOJEDyCtSs5w/iPmsFCEGUuo2
 YTz6qKpYPgB1VzK2PoBG
 =6ZCK
 -----END PGP SIGNATURE-----

Merge tag 'md/4.2-fixes' of git://neil.brown.name/md

Pull md fixes from Neil Brown:
 "Some md fixes for 4.2

  Several are tagged for -stable.
  A few aren't because they are not very, serious or because they are in
  the 'experimental' cluster code"

* tag 'md/4.2-fixes' of git://neil.brown.name/md:
  md/raid5: clear R5_NeedReplace when no longer needed.
  Fix read-balancing during node failure
  md-cluster: fix bitmap sub-offset in bitmap_read_sb
  md: Return error if request_module fails and returns positive value
  md: Skip cluster setup in case of error while reading bitmap
  md/raid1: fix test for 'was read error from last working device'.
  md: Skip cluster setup for dm-raid
  md: flush ->event_work before stopping array.
  md/raid10: always set reshape_safe when initializing reshape_position.
  md/raid5: avoid races when changing cache size.
2015-07-25 11:24:58 -07:00
NeilBrown
e6030cb06c md/raid5: clear R5_NeedReplace when no longer needed.
This flag is currently never cleared, which can in rare cases
trigger a warn-on if it is still set but the block isn't
InSync.

So clear it when it isn't need, which includes if the replacement
device has failed.

Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:38:04 +10:00
Goldwyn Rodrigues
90382ed9af Fix read-balancing during node failure
During a node failure, We need to suspend read balancing so that the
reads are directed to the first device and stale data is not read.
Suspending writes is not required because these would be recorded and
synced eventually.

A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep().
area_resyncing() will respond true for the entire devices if this
flag is set and the request type is READ. The flag is cleared
in recover_done().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reported-By: David Teigland <teigland@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:59 +10:00
Goldwyn Rodrigues
33e38ac688 md-cluster: fix bitmap sub-offset in bitmap_read_sb
bitmap_read_sb is modifying mddev->bitmap_info.offset. This works for
the first bitmap read. However, when multiple bitmaps need to be opened
by the same node, it ends up corrupting the offset. Fix it by using a
local variable.

Also, bitmap_read_sb is not required in bitmap_copy_from_slot since
it is called in bitmap_create. Remove bitmap_read_sb().

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:55 +10:00
Goldwyn Rodrigues
b0c26a79d6 md: Return error if request_module fails and returns positive value
request_module() can return 256 (process exited) in some cases,
which is not as specified in the documentation before the
request_module() definition. Convert the error to -ENOENT.

The positive error number results in bitmap_create() returning
a value that is meant to be an error but doesn't look like one,
so it is dereferenced as a point and causes a crash.

(not needed for stable as this is "experimental" code)
Fixes: edb39c9ded ("Introduce md_cluster_operations to handle cluster functions")
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:51 +10:00
Goldwyn Rodrigues
f735727319 md: Skip cluster setup in case of error while reading bitmap
If the bitmap read fails, the error code set is -EINVAL. However,
we don't check for errors and go ahead with cluster_setup.
Skip the cluster setup in case of error.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:48 +10:00
NeilBrown
34cab6f420 md/raid1: fix test for 'was read error from last working device'.
When we get a read error from the last working device, we don't
try to repair it, and don't fail the device.  We simple report a
read error to the caller.

However the current test for 'is this the last working device' is
wrong.
When there is only one fully working device, it assumes that a
non-faulty device is that device.  However a spare which is rebuilding
would be non-faulty but so not the only working device.

So change the test from "!Faulty" to "In_sync".  If ->degraded says
there is only one fully working device and this device is in_sync,
this must be the one.

This bug has existed since we allowed read_balance to read from
a recovering spare in v3.0

Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Fixes: 76073054c9 ("md/raid1: clean up read_balance.")
Cc: stable@vger.kernel.org (v3.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24 13:37:21 +10:00
Goldwyn Rodrigues
d3b178adb3 md: Skip cluster setup for dm-raid
There is a bug that the bitmap superblock isn't initialised properly for
dm-raid, so a new field can have garbage in new fields.
(dm-raid does initialisation in the kernel - md initialised the
 superblock in mdadm).

This means that for dm-raid we cannot currently trust the new ->nodes
field. So:
 - use __GFP_ZERO to initialise the superblock properly for all new
    arrays
 - initialise all fields in bitmap_info in bitmap_new_disk_sb
 - ignore ->nodes for dm arrays (yes, this is a hack)

This bug exposes dm-raid to bug in the (still experimental) md-cluster
code, so it is suitable for -stable.  It does cause crashes.

References: https://bugzilla.kernel.org/show_bug.cgi?id=100491
Cc: stable@vger.kernel.org (v4.1)
Signed-off-By: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-23 09:22:00 +10:00
NeilBrown
ee5d004fd0 md: flush ->event_work before stopping array.
The 'event_work' worker used by dm-raid may still be running
when the array is stopped.  This can result in an oops.

So flush the workqueue on which it is run after detaching
and before destroying the device.

Reported-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org (2.6.38+ please delay 2 weeks after -final release)
Fixes: 9d09e663d5 ("dm: raid456 basic support")
2015-07-22 14:09:29 +10:00
NeilBrown
299b0685e3 md/raid10: always set reshape_safe when initializing reshape_position.
'reshape_position' tracks where in the reshape we have reached.
'reshape_safe' tracks where in the reshape we have safely recorded
in the metadata.

These are compared to determine when to update the metadata.
So it is important that reshape_safe is initialised properly.
Currently it isn't.  When starting a reshape from the beginning
it usually has the correct value by luck.  But when reducing the
number of devices in a RAID10, it has the wrong value and this leads
to the metadata not being updated correctly.
This can lead to corruption if the reshape is not allowed to complete.

This patch is suitable for any -stable kernel which supports RAID10
reshape, which is 3.5 and later.

Fixes: 3ea7daa5d7 ("md/raid10: add reshape support")
Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-22 14:08:24 +10:00
NeilBrown
2d5b569b66 md/raid5: avoid races when changing cache size.
Cache size can grow or shrink due to various pressures at
any time.  So when we resize the cache as part of a 'grow'
operation (i.e. change the size to allow more devices) we need
to blocks that automatic growing/shrinking.

So introduce a mutex.  auto grow/shrink uses mutex_trylock()
and just doesn't bother if there is a blockage.
Resizing the whole cache holds the mutex to ensure that
the correct number of new stripes is allocated.

This bug can result in some stripes not being freed when an
array is stopped.  This leads to the kmem_cache not being
freed and a subsequent array can try to use the same kmem_cache
and get confused.

Fixes: edbe83ab4c ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-22 14:04:15 +10:00
Linus Torvalds
3f8476fe89 - Revert a request-based DM core change that caused IO latency to
increase and adversely impact both throughput and system load
 
 - Fix for a use after free bug in DM core's device cleanup
 
 - A couple DM btree removal fixes (used by dm-thinp)
 
 - A DM thinp fix for order-5 allocation failure
 
 - A DM thinp fix to not degrade to read-only metadata mode when in
   out-of-data-space mode for longer than the 'no_space_timeout'
 
 - Fix a long-standing oversight in both dm-thinp and dm-cache by
   now exporting 'needs_check' in status if it was set in metadata
 
 - Fix an embarrassing dm-cache busy-loop that caused worker threads to
   eat cpu even if no IO was actively being issued to the cache device
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVqUfSAAoJEMUj8QotnQNa+RUH+wXrHCGI6J7RHIXVd5igP9K0
 yFZGEnLZe6Ebt5CACLcKn/qN0g97iwCrlcxFt+1Gj/GbW1GIQzs7vg38La3PZxWZ
 jAkI3JMY816bP1x3VK1HtMsk2gRaE/hh0gxK5pPLB9a+ZdEsz9UML0rs+JseOdn3
 n+454dhwOyChwz7zFEbpn+mfjoruFScGX0Y2qaSHBV/xMhmExpthw9V1yFC2v2tW
 8cAHOMDLNLHhR5adF9YxjZH8wILbyYK9oPy3iGhj/TF/Dx7saWYG4UlnL5xIOLsB
 5WK9gRrJJ/Wf0FsDdN88AaY4Bdpj4esS2JeTZpvujxeBb7ZNeJoCUqyzggURv/c=
 =hCjo
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - revert a request-based DM core change that caused IO latency to
   increase and adversely impact both throughput and system load

 - fix for a use after free bug in DM core's device cleanup

 - a couple DM btree removal fixes (used by dm-thinp)

 - a DM thinp fix for order-5 allocation failure

 - a DM thinp fix to not degrade to read-only metadata mode when in
   out-of-data-space mode for longer than the 'no_space_timeout'

 - fix a long-standing oversight in both dm-thinp and dm-cache by now
   exporting 'needs_check' in status if it was set in metadata

 - fix an embarrassing dm-cache busy-loop that caused worker threads to
   eat cpu even if no IO was actively being issued to the cache device

* tag 'dm-4.2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: avoid calls to prealloc_free_structs() if possible
  dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()
  dm cache: do not wake_worker() in free_migration()
  dm cache: display 'needs_check' in status if it is set
  dm thin: display 'needs_check' in status if it is set
  dm thin: stay in out-of-data-space mode once no_space_timeout expires
  dm: fix use after free crash due to incorrect cleanup sequence
  Revert "dm: only run the queue on completion if congested or no requests pending"
  dm btree: silence lockdep lock inversion in dm_btree_del()
  dm thin: allocate the cell_sort_array dynamically
  dm btree remove: fix bug in redistribute3
2015-07-17 20:53:57 -07:00
Jens Axboe
2bb4cd5cc4 block: have drivers use blk_queue_max_discard_sectors()
Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-17 08:41:53 -06:00
Mike Snitzer
665022d72f dm cache: avoid calls to prealloc_free_structs() if possible
If no work was performed then prealloc_data_structs() wasn't ever called
so there isn't any need to call prealloc_free_structs().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:08 -04:00
Mike Snitzer
e782eff591 dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()
Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs()
if the policy doesn't have any dirty blocks ready for writeback.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:07 -04:00
Mike Snitzer
386cb7cdee dm cache: do not wake_worker() in free_migration()
All methods that queue work call wake_worker() as you'd expect.
E.g. cell_defer, defer_bio, quiesce_migration (which is called by
writeback, promote, demote_then_promote, invalidate, discard, etc).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 22:32:06 -04:00
Mike Snitzer
255eac2005 dm cache: display 'needs_check' in status if it is set
There is currently no way to see that the needs_check flag has been set
in the metadata.  Display 'needs_check' in the cache status if it is set
in the cache metadata.

Also, update cache documentation.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:50 -04:00
Mike Snitzer
e4c78e210d dm thin: display 'needs_check' in status if it is set
There is currently no way to see that the needs_check flag has been set
in the metadata.  Display 'needs_check' in the thin-pool status if it is
set in the thinp metadata.

Also, update thinp documentation.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:50 -04:00
Mike Snitzer
bcc696fac1 dm thin: stay in out-of-data-space mode once no_space_timeout expires
This fixes an issue where running out of data space would cause the
thin-pool's metadata to become read-only.  There was no reason to make
metadata read-only -- calling set_pool_mode() with PM_READ_ONLY was a
misguided way to error all queued and future write IOs.  We can
accomplish the same by degrading from PM_OUT_OF_DATA_SPACE to
PM_OUT_OF_DATA_SPACE with error_if_no_space enabled.

Otherwise, the use of PM_READ_ONLY could cause a race where commit() was
started before the PM_READ_ONLY transition but dm_pool_commit_metadata()
would go on to fail because the block manager had transitioned to
read-only.  The return of -EPERM from dm_pool_commit_metadata(), due to
attempting to commit while in read-only mode, caused the thin-pool to
set 'needs_check' because a metadata_operation_failed().  This needless
cascade of failures makes life for users more difficult than needed.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-16 10:23:49 -04:00
Mikulas Patocka
b06075a98d dm: fix use after free crash due to incorrect cleanup sequence
Linux 4.2-rc1 Commit 0f20972f7b ("dm: factor out a common
cleanup_mapped_device()") moved a common cleanup code to a separate
function.  Unfortunately, that commit incorrectly changed the order of
cleanup, so that it destroys the mapped_device's srcu structure
'io_barrier' before destroying its workqueue.

The function that is executed on the workqueue (dm_wq_work) uses the srcu
structure, thus it may use it after being freed.  That results in a
crash in the LVM test suite's mirror-vgreduce-removemissing.sh test.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 0f20972f7b ("dm: factor out a common cleanup_mapped_device()")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-07-13 09:14:11 -04:00
Jens Axboe
77b5a08427 bcache: don't embed 'return' statements in closure macros
This is horribly confusing, it breaks the flow of the code without
it being apparent in the caller.

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
2015-07-11 09:57:32 -06:00
Mike Snitzer
621739b00e Revert "dm: only run the queue on completion if congested or no requests pending"
This reverts commit 9a0e609e3f.
(Resolved a conflict during revert due to commit bfebd1cdb4 that came
after)

This revert is motivated by a couple failure reports on request-based DM
multipath testbeds:
1) Netapp reported that their multipath fault injection test under heavy
   IO load can stall longer than 300 seconds.
2) IBM reported elevated lock contention in their testbed (likely due to
   increased back pressure due to IO not being dispatched as quickly):
   https://www.redhat.com/archives/dm-devel/2015-July/msg00057.html

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+
2015-07-08 16:16:07 -04:00
Joe Thornber
1c7518794a dm btree: silence lockdep lock inversion in dm_btree_del()
Allocate memory using GFP_NOIO when deleting a btree.  dm_btree_del()
can be called via an ioctl and we don't want to recurse into the FS or
block layer.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-06 10:45:02 -04:00
Joe Thornber
a822c83e47 dm thin: allocate the cell_sort_array dynamically
Given the pool's cell_sort_array holds 8192 pointers it triggers an
order 5 allocation via kmalloc.  This order 5 allocation is prone to
failure as system memory gets more fragmented over time.

Fix this by allocating the cell_sort_array using vmalloc.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-05 21:59:37 -04:00
Dennis Yang
4c7e309340 dm btree remove: fix bug in redistribute3
redistribute3() shares entries out across 3 nodes.  Some entries were
being moved the wrong way, breaking the ordering.  This manifested as a
BUG() in dm-btree-remove.c:shift() when entries were removed from the
btree.

For additional context see:
https://www.redhat.com/archives/dm-devel/2015-May/msg00113.html

Signed-off-by: Dennis Yang <shinrairis@gmail.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-07-05 17:43:58 -04:00
Linus Torvalds
1dc51b8288 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted VFS fixes and related cleanups (IMO the most interesting in
  that part are f_path-related things and Eric's descriptor-related
  stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
  fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups".  The
  file_table locking rule change by Eric Dumazet is a rather big and
  fundamental update even if the patch isn't huge.   - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
  9p: cope with bogus responses from server in p9_client_{read,write}
  p9_client_write(): avoid double p9_free_req()
  9p: forgetting to cancel request on interrupted zero-copy RPC
  dax: bdev_direct_access() may sleep
  block: Add support for DAX reads/writes to block devices
  dax: Use copy_from_iter_nocache
  dax: Add block size note to documentation
  fs/file.c: __fget() and dup2() atomicity rules
  fs/file.c: don't acquire files->file_lock in fd_install()
  fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
  vfs: avoid creation of inode number 0 in get_next_ino
  namei: make set_root_rcu() return void
  make simple_positive() public
  ufs: use dir_pages instead of ufs_dir_pages()
  pagemap.h: move dir_pages() over there
  remove the pointless include of lglock.h
  fs: cleanup slight list_entry abuse
  xfs: Correctly lock inode when removing suid and file capabilities
  fs: Call security_ops->inode_killpriv on truncate
  fs: Provide function telling whether file_remove_privs() will do anything
  ...
2015-07-04 19:36:06 -07:00
Joe Perches
d1aa1ab33d MAINTAINERS: BCACHE: Kent Overstreet has changed email address
Kent's email address in MAINTAINERS seems to be invalid.
This was his last sign-off address, so use that if appropriate.

Fix the S: status entry while there.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:45:01 -07:00
Pekka Enberg
958b43384e bcache: use kvfree() in various places
Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30 19:45:00 -07:00
Linus Torvalds
6aaf0da872 md updates for 4.2
A mixed bag
  - a few bug fixes
  - some performance improvement that decrease lock contention
  - some clean-up
 
 Nothing major.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJVi6weAAoJEDnsnt1WYoG50CsP/RqFbZicRSIvzXUURwP+yCP0
 3YZuURj4IXC6Cy/HLX+bZoj1p/b+GIRsZ72fWFJrd2LheaAI6WojCCLlnmXUtI/Y
 LIppF8/A2hfCNbF9cILByvrbzfndeEGK8kvootBDpvD0jlYiGePPAMQY2zx0MAyb
 T4yJ/KiziLniP6x7vqZrQ6I1MRVjeanN6RWXktFtixMpNOKUJe3PiZbUz4VDIrHR
 DaiHCbMjvRIkUWgNY8HmijEt+c8AYia7muqLj359dy2xF1hlUIdCx+61cgFD1zd8
 enKDH3xp+3B9BEgHe+AtxTAzpqSgU93tdhUjGcy/orA+yYjAAcA4ifngrzfE3VKb
 kwQgPh2JvUrubavrcto0hthS5RldrCpDXebOM4aEq+7lDHCwrZ39Qio5+1F7TLt5
 A5E3Eb7dPRdp9T3LrluX8/f7bO/Wbmxvv/RwnSLTpnGQoBWIAqCpQ+e9ro446Gsx
 /phXv3tE78fKj88LgQY/mm8ICeCppmQGLrpmjk9bkaZzqFdzQoURVmPh8QPMuJB4
 iMHpOOKLzrUlW/23rRxaIKwPuFyxlNuLAvyA3ezsymGiZ+SqSeFCEm1jN64EfMCI
 39rpfZt2pcVVOZJ9YeuzZG9wpie96yGZgnVWlP3FPjqRpboXqmtHlYA6EMRtqDAy
 mjSiGDF2bxkT1/YcjELD
 =sXTI
 -----END PGP SIGNATURE-----

Merge tag 'md/4.2' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "A mixed bag

   - a few bug fixes
   - some performance improvement that decrease lock contention
   - some clean-up

  Nothing major"

* tag 'md/4.2' of git://neil.brown.name/md:
  md: clear Blocked flag on failed devices when array is read-only.
  md: unlock mddev_lock on an error path.
  md: clear mddev->private when it has been freed.
  md: fix a build warning
  md/raid5: ignore released_stripes check
  md/raid5: per hash value and exclusive wait_for_stripe
  md/raid5: split wait_for_stripe and introduce wait_for_quiescent
  wait: introduce wait_event_exclusive_cmd
  md: convert to kstrto*()
  md/raid10: make sync_request_write() call bio_copy_data()
2015-06-29 11:10:56 -07:00
Linus Torvalds
22165fa798 - Revert block and DM core changes the removed request-based DM's
ability to handle partial request completions -- otherwise with the
   current SCSI LLDs these changes could lead to silent data corruption.
 
 - Fix two DM version bumps that were missing from the initial 4.2 DM
   pull request (enabled userspace lvm2 to know certain changes have been
   made).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVjWetAAoJEMUj8QotnQNaEngIAMVwExw0u04jqoW9rUwLDbpr
 PS2A4lh/MGtMqGGPwJp5qiwnKkgQ5/FcxRpslNQYqA6KrIlnjWJhacWl7tOrwqxn
 +WBsHIUwjcpwK2RqxSS3Petb6xDd7A3LfTQVhKV9xKZpZp8Y25a+1MPmUYKsFLBH
 DJ1d9bXPMdN1qjBXBU1rKkVxj6z8iNz/lv24eN0MGyWhfUUTc8lQg3eey3L0BzCc
 siOuupFQXaWIkbawLZrmvPPNm1iMoABC1OPZCTB1AYZYx1rqzEGUR1nZN+qWf6Wf
 rZtAPZehbRzvOaf5jC6tEfAcTF23aPEyp4LD+aAQpbuC/1IBi8a3S8z6PvR5EjA=
 =QY48
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "Apologies for not pressing this request-based DM partial completion
  issue further, it was an oversight on my part.  We'll have to get it
  fixed up properly and revisit for a future release.

   - Revert block and DM core changes the removed request-based DM's
     ability to handle partial request completions -- otherwise with the
     current SCSI LLDs these changes could lead to silent data
     corruption.

   - Fix two DM version bumps that were missing from the initial 4.2 DM
     pull request (enabled userspace lvm2 to know certain changes have
     been made)"

* tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache policy smq: fix "default" version to be 1.4.0
  dm: bump the ioctl version to 4.32.0
  Revert "block, dm: don't copy bios for request clones"
  Revert "dm: do not allocate any mempools for blk-mq request-based DM"
2015-06-26 12:35:01 -07:00
Linus Torvalds
47a469421d Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - most of the rest of MM

 - lots of misc things

 - procfs updates

 - printk feature work

 - updates to get_maintainer, MAINTAINERS, checkpatch

 - lib/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  exit,stats: /* obey this comment */
  coredump: add __printf attribute to cn_*printf functions
  coredump: use from_kuid/kgid when formatting corename
  fs/reiserfs: remove unneeded cast
  NILFS2: support NFSv2 export
  fs/befs/btree.c: remove unneeded initializations
  fs/minix: remove unneeded cast
  init/do_mounts.c: add create_dev() failure log
  kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
  fs/efs: femove unneeded cast
  checkpatch: emit "NOTE: <types>" message only once after multiple files
  checkpatch: emit an error when there's a diff in a changelog
  checkpatch: validate MODULE_LICENSE content
  checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
  checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
  checkpatch: fix processing of MEMSET issues
  checkpatch: suggest using ether_addr_equal*()
  checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
  checkpatch: remove local from codespell path
  checkpatch: add --showfile to allow input via pipe to show filenames
  ...
2015-06-26 09:52:05 -07:00
Mike Snitzer
b5451e4568 dm cache policy smq: fix "default" version to be 1.4.0
Commit bccab6a0 ("dm cache: switch the "default" cache replacement
policy from mq to smq") should've incremented the "default" policy's
version number to 1.4.0 rather than reverting to version 1.0.0.

Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26 10:14:28 -04:00
Mike Snitzer
78d8e58a08 Revert "block, dm: don't copy bios for request clones"
This reverts commit 5f1b670d0b.

Justification for revert as reported in this dm-devel post:
https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

this change should not be pushed to mainline yet.

Firstly, Christoph has a newer version of the patch that fixes silent
data corruption problem:
  https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

And the new version still depends on LLDDs to always complete requests
to the end when error happens, while block API doesn't enforce such a
requirement. If the assumption is ever broken, the inconsistency between
request and bio (e.g. rq->__sector and rq->bio) will cause silent data
corruption:
  https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html

Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26 10:11:58 -04:00
Mike Snitzer
4e6e36c371 Revert "dm: do not allocate any mempools for blk-mq request-based DM"
This reverts commit cbc4e3c135.

Reported-by: Junichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-26 10:11:07 -04:00
Rasmus Villemoes
90a9befb20 drivers/md/md.c: use strreplace()
There's no point in starting over when we meet a '/'.  This also
eliminates a stack variable and a little .text.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-25 17:00:40 -07:00
Linus Torvalds
6597ac8a51 - DM core cleanups
- blk-mq request-based DM no longer uses any mempools now that partial
     completions are no longer handled as part of cloned requests
 
 - DM raid cleanups and support for MD raid0
 
 - DM cache core advances and a new stochastic-multi-queue (smq) cache
   replacement policy
   - smq is the new default dm-cache policy
 
 - DM thinp cleanups and much more efficient large discard support
 
 - DM statistics support for request-based DM and nanosecond resolution
   timestamps
 
 - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJViE/tAAoJEMUj8QotnQNag8wIAMhcmy46qGCy6SrOew6D8EQ0
 4Ielsr5ZI67Q9pZVI1x/Zdt5GfDUGXiSEy/y8xoIuJfmsRXwnZ/gkOUyWpSUW8dH
 mgPUUpGF0aN2D/66P3chgm39rVZEt3crDY+SQu02Fm86JKlMUomFbWKgMXpsM6Ga
 HHW80zLF9Ca6D5m8xMze/ic1KBJtA9zaXcO9xnMCBym8mSaORtgwoCzKAgy43H7U
 KIBwPKR/y+c7SaKWGxXwxxhTKZDyP4FbgtiJ2Dc9yBDoD0RzbyuY/EFMEUPEbEMv
 e9fFHNEzmKlsNVY2vYXkVH4TdldWrrjkmYj/JVtz8cifoupWOOnDTjb2aqjAIww=
 =DKCC
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM core cleanups:

     * blk-mq request-based DM no longer uses any mempools now that
       partial completions are no longer handled as part of cloned
       requests

 - DM raid cleanups and support for MD raid0

 - DM cache core advances and a new stochastic-multi-queue (smq) cache
   replacement policy

     * smq is the new default dm-cache policy

 - DM thinp cleanups and much more efficient large discard support

 - DM statistics support for request-based DM and nanosecond resolution
   timestamps

 - Fixes to DM stripe, DM log-writes, DM raid1 and DM crypt

* tag 'dm-4.2-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (39 commits)
  dm stats: add support for request-based DM devices
  dm stats: collect and report histogram of IO latencies
  dm stats: support precise timestamps
  dm stats: fix divide by zero if 'number_of_areas' arg is zero
  dm cache: switch the "default" cache replacement policy from mq to smq
  dm space map metadata: fix occasional leak of a metadata block on resize
  dm thin metadata: fix a race when entering fail mode
  dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
  dm thin: range discard support
  dm thin metadata: add dm_thin_remove_range()
  dm thin metadata: add dm_thin_find_mapped_range()
  dm btree: add dm_btree_remove_leaves()
  dm stats: Use kvfree() in dm_kvfree()
  dm cache: age and write back cache entries even without active IO
  dm cache: prefix all DMERR and DMINFO messages with cache device name
  dm cache: add fail io mode and needs_check flag
  dm cache: wake the worker thread every time we free a migration object
  dm cache: add stochastic-multi-queue (smq) policy
  dm cache: boost promotion of blocks that will be overwritten
  dm cache: defer whole cells
  ...
2015-06-25 16:34:39 -07:00
Linus Torvalds
e4bc13adfd Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
2015-06-25 16:00:17 -07:00
Linus Torvalds
bfffa1cc9d Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-block
Pull core block IO update from Jens Axboe:
 "Nothing really major in here, mostly a collection of smaller
  optimizations and cleanups, mixed with various fixes.  In more detail,
  this contains:

   - Addition of policy specific data to blkcg for block cgroups.  From
     Arianna Avanzini.

   - Various cleanups around command types from Christoph.

   - Cleanup of the suspend block I/O path from Christoph.

   - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

   - Eliminating atomic inc/dec of both remaining IO count and reference
     count in a bio.  From me.

   - Fixes for SG gap and chunk size support for data-less (discards)
     IO, so we can merge these better.  From me.

   - Small restructuring of blk-mq shared tag support, freeing drivers
     from iterating hardware queues.  From Keith Busch.

   - A few cfq-iosched tweaks, from Tahsin Erdogan and me.  Makes the
     IOPS mode the default for non-rotational storage"

* 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
  cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
  cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
  cfq-iosched: move group scheduling functions under ifdef
  cfq-iosched: fix the setting of IOPS mode on SSDs
  blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
  block, cgroup: implement policy-specific per-blkcg data
  block: Make CFQ default to IOPS mode on SSDs
  block: add blk_set_queue_dying() to blkdev.h
  blk-mq: Shared tag enhancements
  block: don't honor chunk sizes for data-less IO
  block: only honor SG gap prevention for merges that contain data
  block: fix returnvar.cocci warnings
  block, dm: don't copy bios for request clones
  block: remove management of bi_remaining when restoring original bi_end_io
  block: replace trylock with mutex_lock in blkdev_reread_part()
  block: export blkdev_reread_part() and __blkdev_reread_part()
  suspend: simplify block I/O handling
  block: collapse bio bit space
  block: remove unused BIO_RW_BLOCK and BIO_EOF flags
  block: remove BIO_EOPNOTSUPP
  ...
2015-06-25 14:29:53 -07:00
Neil Brown
ab16bfc732 md: clear Blocked flag on failed devices when array is read-only.
The Blocked flag indicates that a device has failed but that this
fact hasn't been recorded in the metadata yet.  Writes to such
devices cannot be allowed until the metadata has been updated.

On a read-only array, the Blocked flag will never be cleared.
This prevents the device being removed from the array.

If the metadata is being handled by the kernel
(i.e. !mddev->external), then we can be sure that if the array is
switch to writable, then a metadata update will happen and will
record the failure.  So we don't need the flag set.

If metadata is externally managed, it is upto the external manager
to clear the 'blocked' flag.

Reported-by: XiaoNi <xni@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-25 17:16:49 +10:00
NeilBrown
9a8c0fa861 md: unlock mddev_lock on an error path.
This error path retuns while still holding the lock - bad.

Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Cc: stable@vger.kernel.org (v4.0+)
Signed-off-by: NeilBrown <neilb@suse.com>
2015-06-25 17:14:09 +10:00
NeilBrown
bd6919228d md: clear mddev->private when it has been freed.
If ->private is set when ->run is called, it is assumed to be
a 'config'  prepared as part of 'reshape'.

So it is important when we free that config, that we also clear ->private.
This is not often a problem as the mddev will normally be discarded
shortly after the config us freed.
However if an 'assemble' races with a final close, the assemble can use
the old mddev which has a stale ->private.  This leads to any of
various sorts of crashes.

So clear ->private after calling ->free().

Reported-by: Nate Clark <nate@neworld.us>
Cc: stable@vger.kernel.org (v4.0+)
Fixes: afa0f557cb ("md: rename ->stop to ->free")
Signed-off-by: NeilBrown <neilb@suse.com>
2015-06-25 17:14:09 +10:00
Miklos Szeredi
2726d56620 vfs: add seq_file_path() helper
Turn
	seq_path(..., &file->f_path, ...);
into
	seq_file_path(..., file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:01:07 -04:00
Miklos Szeredi
9bf39ab2ad vfs: add file_path() helper
Turn
	d_path(&file->f_path, ...);
into
	file_path(file, ...);

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23 18:00:05 -04:00
Mikulas Patocka
e262f34741 dm stats: add support for request-based DM devices
This makes it possible to use dm stats with DM multipath.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17 12:40:41 -04:00
Mikulas Patocka
dfcfac3e4c dm stats: collect and report histogram of IO latencies
Add an option to dm statistics to collect and report a histogram of
IO latencies.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17 12:40:40 -04:00
Mikulas Patocka
c96aec344d dm stats: support precise timestamps
Make it possible to use precise timestamps with nanosecond granularity
in dm statistics.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-17 12:40:40 -04:00
Mikulas Patocka
dd4c1b7d0c dm stats: fix divide by zero if 'number_of_areas' arg is zero
If the number_of_areas argument was zero the kernel would crash on
div-by-zero.  Add better input validation.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.12+
2015-06-17 12:40:39 -04:00
Mike Snitzer
bccab6a01a dm cache: switch the "default" cache replacement policy from mq to smq
The Stochastic multiqueue (SMQ) policy (vs MQ) offers the promise of
less memory utilization, improved performance and increased adaptability
in the face of changing workloads.  SMQ also does not have any
cumbersome tuning knobs.

Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target.  Doing so will cause all of the
mq policy's hints to be dropped.  Also, performance of the cache may
degrade slightly until smq recalculates the origin device's hotspots
that should be cached.

In the future the "mq" policy will just silently make use of "smq" and
the mq code will be removed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
2015-06-17 12:40:38 -04:00
Joe Thornber
6096d91af0 dm space map metadata: fix occasional leak of a metadata block on resize
The metadata space map has a simplified 'bootstrap' mode that is
operational when extending the space maps.  Whilst in this mode it's
possible for some refcount decrement operations to become queued (eg, as
a result of shadowing one of the bitmap indexes).  These decrements were
not being applied when switching out of bootstrap mode.

The effect of this bug was the leaking of a 4k metadata block.  This is
detected by the latest version of thin_check as a non fatal error.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-06-17 10:09:23 -04:00
Firo Yang
4e02361232 md: fix a build warning
Warning like this:

drivers/md/md.c: In function "update_array_info":
drivers/md/md.c:6394:26: warning: logical not is only applied
to the left hand side of comparison [-Wlogical-not-parentheses]
      !mddev->persistent  != info->not_persistent||

Fix it as Neil Brown said:
mddev->persistent != !info->not_persistent ||

Signed-off-by: Firo Yang <firogm@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:38 +10:00
Shaohua Li
713bc5c2de md/raid5: ignore released_stripes check
conf->released_stripes list isn't always related to where there are
free stripes pending. Active stripes can be in the list too.
And even free stripes were active very recently.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:33 +10:00
Yuanhan Liu
e9e4c377e2 md/raid5: per hash value and exclusive wait_for_stripe
I noticed heavy spin lock contention at get_active_stripe() with fsmark
multiple thread write workloads.

Here is how this hot contention comes from. We have limited stripes, and
it's a multiple thread write workload. Hence, those stripes will be taken
soon, which puts later processes to sleep for waiting free stripes. When
enough stripes(>= 1/4 total stripes) are released, all process are woken,
trying to get the lock. But there is one only being able to get this lock
for each hash lock, making other processes spinning out there for acquiring
the lock.

Thus, it's effectiveless to wakeup all processes and let them battle for
a lock that permits one to access only each time. Instead, we could make
it be a exclusive wake up: wake up one process only. That avoids the heavy
spin lock contention naturally.

To do the exclusive wake up, we've to split wait_for_stripe into multiple
wait queues, to make it per hash value, just like the hash lock.

Here are some test results I have got with this patch applied(all test run
3 times):

`fsmark.files_per_sec'
=====================

next-20150317                 this patch
-------------------------     -------------------------
metric_value     ±stddev      metric_value     ±stddev     change      testbox/benchmark/testcase-params
-------------------------     -------------------------   --------     ------------------------------
      25.600     ±0.0              92.700     ±2.5          262.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      25.600     ±0.0              77.800     ±0.6          203.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      32.000     ±0.0              93.800     ±1.7          193.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
      32.000     ±0.0              81.233     ±1.7          153.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
      48.800     ±14.5             99.667     ±2.0          104.2%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
       6.400     ±0.0              12.800     ±0.0          100.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
      63.133     ±8.2              82.800     ±0.7           31.2%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
     245.067     ±0.7             306.567     ±7.9           25.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
      17.533     ±0.3              21.000     ±0.8           19.8%     ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
     188.167     ±1.9             215.033     ±3.1           14.3%     ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
     254.500     ±1.8             290.733     ±2.4           14.2%     ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync

`time.system_time'
=====================

next-20150317                 this patch
-------------------------    -------------------------
metric_value     ±stddev     metric_value     ±stddev     change       testbox/benchmark/testcase-params
-------------------------    -------------------------    --------     ------------------------------
    7235.603     ±1.2             185.163     ±1.9          -97.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
    7666.883     ±2.9             202.750     ±1.0          -97.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
   14567.893     ±0.7             421.230     ±0.4          -97.1%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
    3697.667     ±14.0            148.190     ±1.7          -96.0%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
    5572.867     ±3.8             310.717     ±1.4          -94.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
    5565.050     ±0.5             313.277     ±1.5          -94.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
    2420.707     ±17.1            171.043     ±2.7          -92.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
    3743.300     ±4.6             379.827     ±3.5          -89.9%     ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
    3308.687     ±6.3             363.050     ±2.0          -89.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose

Where,

     1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark

     1t, 64t: where 't' means thread

     4M: means the single file size, corresponding to the '-s' option of fsmark
     40G, 30G, 120G: means the total test size

     4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means
               the size of one ramdisk. So, it would be 48G in total. And we made a
               raid on those ramdisk

As you can see, though there are no much performance gain for hard disk
workload, the system time is dropped heavily, up to 97%. And as expected,
the performance increased a lot, up to 260%, for fast device(ram disk).

v2: use bits instead of array to note down wait queue need to wake up.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:27 +10:00
Yuanhan Liu
b1b4648648 md/raid5: split wait_for_stripe and introduce wait_for_quiescent
I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.

After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.

Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.

So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.

v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
    Commit log refactor suggestion from Neil.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:21 +10:00
Alexey Dobriyan
4c9309c0cc md: convert to kstrto*()
Convert away from deprecated simple_strto*() functions.

Add "fit into sector_t" checks.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 10:00:06 +10:00
Kent Overstreet
c31df25f20 md/raid10: make sync_request_write() call bio_copy_data()
Refactor sync_request_write() of md/raid10 to use bio_copy_data()
instead of open coding bio_vec iterations.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <mlin@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-17 09:59:57 +10:00
NeilBrown
ea358cd0d2 md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync
MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
resync etc finished.  However it is possible for raid5_start_reshape
to race and start a reshape before MD_RECOVERY_DONE is cleared.  This
can lean to multiple reshapes running at the same time, which isn't
good.

To make sure it is cleared before starting a reshape, and also clear
it when reaping a thread, just to be safe.

Signed-off-by: NeilBrown  <neilb@suse.de>
2015-06-12 20:16:33 +10:00
NeilBrown
8e8e2518fc md: Close race when setting 'action' to 'idle'.
Checking ->sync_thread without holding the mddev_lock()
isn't really safe, even after flushing the workqueue which
ensures md_start_sync() has been run.

While this code is waiting for the lock, md_check_recovery could reap
the thread itself, and then start another thread (e.g. recovery might
finish, then reshape starts).  When this thread gets the lock
md_start_sync() hasn't run so it doesn't get reaped, but
MD_RECOVERY_RUNNING gets cleared.  This allows two threads to start
which leads to confusion.

So don't both if MD_RECOVERY_RUNNING isn't set, but if it is do
the flush and the test and the reap all under the mddev_lock to
avoid any race with md_check_recovery.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Cc: stable@vger.kernel.org (v4.0+)
2015-06-12 20:16:26 +10:00
NeilBrown
c008f1d356 md: don't return 0 from array_state_store
Returning zero from a 'store' function is bad.
The return value should be either len length of the string
or an error.

So use 'len' if 'err' is zero.

Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: stable@vger.kernel (v4.0+)
2015-06-12 20:16:16 +10:00
Joe Thornber
b1f11aff04 dm thin metadata: fix a race when entering fail mode
In dm_thin_find_block() the ->fail_io flag was checked outside the
metadata device's root_lock, causing dm_thin_find_block() to race with
the setting of this flag.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:06 -04:00
Mike Snitzer
fd467696e8 dm thin: fail messages with EOPNOTSUPP when pool cannot handle messages
Use EOPNOTSUPP, rather than EINVAL, error code when user attempts to
send the pool a message.  Otherwise usespace is led to believe the
message failed due to invalid argument.

Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:05 -04:00
Joe Thornber
34fbcf6257 dm thin: range discard support
Previously REQ_DISCARD bios have been split into block sized chunks
before submission to the thin target.  There are a couple of issues with
this:

 - If the block size is small, a large discard request can
   get broken up into a great many bios which is both slow and causes
   a lot of memory pressure.

 - The thin pool block size and the discard granularity for the
   underlying data device need to be compatible if we want to passdown
   the discard.

This patch relaxes the block size granularity for thin devices.  It
makes use of the recent range locking added to the bio_prison to
quiesce a whole range of thin blocks before unmapping them.  Once a
thin range has been unmapped the discard can then be passed down to
the data device for those sub ranges where the data blocks are no
longer used (ie. they weren't shared in the first place).

This patch also doesn't make any apologies about open-coding portions
of block core as a means to supporting async discard completions in the
near-term -- if/when late bio splitting lands it'll all get cleaned up.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:05 -04:00
Joe Thornber
6550f075f5 dm thin metadata: add dm_thin_remove_range()
Removes a range of blocks from the btree.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:04 -04:00
Joe Thornber
a5d895a90b dm thin metadata: add dm_thin_find_mapped_range()
Retrieve the next run of contiguously mapped blocks.  Useful for working
out where to break up IO.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:03 -04:00
Joe Thornber
4ec331c3ea dm btree: add dm_btree_remove_leaves()
Removes a range of leaf values from the tree.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:03 -04:00
Pekka Enberg
0f24b79b52 dm stats: Use kvfree() in dm_kvfree()
Use kvfree() instead of open-coding it.

Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:02 -04:00
Joe Thornber
fba10109a4 dm cache: age and write back cache entries even without active IO
The policy tick() method is normally called from interrupt context.
Both the mq and smq policies do some bottom half work for the tick
method in their map functions.  However if no IO is going through the
cache, then that bottom half work doesn't occur.  With these policies
this means recently hit entries do not age and do not get written
back as early as we'd like.

Fix this by introducing a new 'can_block' parameter to the tick()
method.  When this is set the bottom half work occurs immediately.
'can_block' is set when the tick method is called every second by the
core target (not in interrupt context).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:01 -04:00
Mike Snitzer
b61d950962 dm cache: prefix all DMERR and DMINFO messages with cache device name
Having the DM device name associated with the ERR or INFO message is
very helpful.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:01 -04:00
Joe Thornber
028ae9f76f dm cache: add fail io mode and needs_check flag
If a cache metadata operation fails (e.g. transaction commit) the
cache's metadata device will abort the current transaction, set a new
needs_check flag, and the cache will transition to "read-only" mode.  If
aborting the transaction or setting the needs_check flag fails the cache
will transition to "fail-io" mode.

Once needs_check is set the cache device will not be allowed to
activate.  Activation requires write access to metadata.  Future work is
needed to add proper support for running the cache in read-only mode.

Once in fail-io mode the cache will report a status of "Fail".

Also, add commit() wrapper that will disallow commits if in read_only or
fail mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:00 -04:00
Joe Thornber
88bf5184fa dm cache: wake the worker thread every time we free a migration object
When the cache is idle, writeback work was only being issued every
second.  With this change outstanding writebacks are streamed
constantly.  This offers a writeback performance improvement.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:13:00 -04:00
Joe Thornber
66a6363566 dm cache: add stochastic-multi-queue (smq) policy
The stochastic-multi-queue (smq) policy addresses some of the problems
with the current multiqueue (mq) policy.

Memory usage
------------

The mq policy uses a lot of memory; 88 bytes per cache block on a 64
bit machine.

SMQ uses 28bit indexes to implement it's data structures rather than
pointers.  It avoids storing an explicit hit count for each block.  It
has a 'hotspot' queue rather than a pre cache which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).

All these mean smq uses ~25bytes per cache block.  Still a lot of
memory, but a substantial improvement nontheless.

Level balancing
---------------

MQ places entries in different levels of the multiqueue structures
based on their hit count (~ln(hit count)).  This means the bottom
levels generally have the most entries, and the top ones have very
few.  Having unbalanced levels like this reduces the efficacy of the
multiqueue.

SMQ does not maintain a hit count, instead it swaps hit entries with
the least recently used entry from the level above.  The over all
ordering being a side effect of this stochastic process.  With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.

Adaptability
------------

The MQ policy maintains a hit count for each cache block.  For a
different block to get promoted to the cache it's hit count has to
exceed the lowest currently in the cache.  This means it can take a
long time for the cache to adapt between varying IO patterns.
Periodically degrading the hit counts could help with this, but I
haven't found a nice general solution.

SMQ doesn't maintain hit counts, so a lot of this problem just goes
away.  In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote.  If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels.  This lets it adapt to new IO patterns very quickly.

Performance
-----------

In my tests SMQ shows substantially better performance than MQ.  Once
this matures a bit more I'm sure it'll become the default policy.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-06-11 17:12:59 -04:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Tejun Heo
4452226ea2 writeback: move backing_dev_info->state into bdi_writeback
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear.  For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi.  To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
  changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeoring of
  wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
  uses of bdi->state are mechanically replaced with bdi->wb.state
  introducing no behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: drbd-dev@lists.linbit.com
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Linus Torvalds
0f1e5b5d19 Quite a few fixes for DM's blk-mq support thanks to extra DM multipath
testing from Junichi Nomura and Bart Van Assche.
 
 Also fix a casting bug in dm_merge_bvec() that could cause only a single
 page to be added to a bio (Joe identified this while testing dm-cache
 writeback).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVaLGPAAoJEMUj8QotnQNa1pcH/3TumkM0rvl1DK7DfWvjgfpp
 +h+whWn79lOUzan/AszRvRU4Pwr7symndnJzJ/tqQ76fm8x84obvKa0jFfB9rmlM
 sgw45alqB+AKcKAvocsuDIdzy2/K8FHBihccPWhe87AtJRlj57nTOJWO06gzExOp
 /AJ/0Vgv3pxUCL2M4XucpQOV8eUg+gpwjDMxuMEsPVzBVYJfuRvNKoERtd9LSCSV
 Y5iqdHCJiW0SWesZNrfthAAlReFzqb1c3mq55G0q0C+Z77Jf3TIPBhVfouM4NOKt
 wjQoPavzyAEiKfLFy6QFCjhSOXKCI3klqO7zcyzyygVue5UTHICYRJKn7TFtbfU=
 =xGAV
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper fixes from Mike Snitzer:
 "Quite a few fixes for DM's blk-mq support thanks to extra DM multipath
  testing from Junichi Nomura and Bart Van Assche.

  Also fix a casting bug in dm_merge_bvec() that could cause only a
  single page to be added to a bio (Joe identified this while testing
  dm-cache writeback)"

* tag 'dm-4.1-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix casting bug in dm_merge_bvec()
  dm: fix reload failure of 0 path multipath mapping on blk-mq devices
  dm: fix false warning in free_rq_clone() for unmapped requests
  dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY
  dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path
  dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED
  dm: run queue on re-queue
2015-05-29 14:39:24 -07:00
Joe Thornber
40775257b9 dm cache: boost promotion of blocks that will be overwritten
When considering whether to move a block to the cache we already give
preferential treatment to discarded blocks, since they are cheap to
promote (no read of the origin required since the data is junk).

The same is true of blocks that are about to be completely
overwritten, so we likewise boost their promotion chances.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:07 -04:00
Joe Thornber
651f5fa2a3 dm cache: defer whole cells
Currently individual bios are deferred to the worker thread if they
cannot be processed immediately (eg, a block is in the process of
being moved to the fast device).

This patch passes whole cells across to the worker.  This saves
reaquiring the cell, and also collects bios destined for the same block
together, which allows them to be mapped with a single look up to the
policy.  This reduces the overhead of using dm-cache.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:06 -04:00
Joe Thornber
3cdf93f9d8 dm bio prison: add dm_cell_promote_or_release()
Rather than always releasing the prisoners in a cell, the client may
want to promote one of them to be the new holder.  There is a race here
though between releasing an empty cell, and other threads adding new
inmates.  So this function makes the decision with its lock held.

This function can have two outcomes:
i)  An inmate is promoted to be the holder of the cell (return value of 0).
ii) The cell has no inmate for promotion and is released (return value of 1).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:06 -04:00
Joe Thornber
451b9e0071 dm cache: pull out some bitset utility functions for reuse
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:05 -04:00
Joe Thornber
20f6814b94 dm cache: pass a new 'critical' flag to the policies when requesting writeback work
We only allow non critical writeback if the origin is idle.  It is up
to the policy to decide what writeback work is critical.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:04 -04:00
Joe Thornber
066dbaa386 dm cache: track IO to the origin device using io_tracker
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:04 -04:00
Joe Thornber
77289d3207 dm cache: add io_tracker
A little class that keeps track of the volume of io that is in flight,
and the length of time that a device has been idle for.

FIXME: rather than jiffes, may be best to use ktime_t (to support faster
devices).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:03 -04:00
Joe Thornber
fb4100ae7f dm cache: fix race when issuing a POLICY_REPLACE operation
There is a race between a policy deciding to replace a cache entry,
the core target writing back any dirty data from this block, and other
IO threads doing IO to the same block.

This sort of problem is avoided most of the time by the core target
grabbing a bio prison cell before making the request to the policy.
But for a demotion the core target doesn't know which block will be
demoted, so can't do this in advance.

Fix this demotion race by introducing a callback to the policy interface
that allows the policy to grab the cell on behalf of the core target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
2015-05-29 14:19:03 -04:00
Milan Broz
54cea3f668 dm crypt: add comments to better describe crypto processing logic
A crypto driver can process requests synchronously or asynchronously
and can use an internal driver queue to backlog requests.
Add some comments to clarify internal logic and completion return codes.

Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:02 -04:00
Lidong Zhong
ed63287dd6 dm raid1: keep issuing IO after leg failure
Currently if there is a leg failure, the bio will be put into the hold
list until userspace does a remove/replace on the leg.  Doing so in a
cluster config (clvmd) is problematic because there may be a temporary
path failure that results in cluster raid1 remove/replace.  Such
recovery takes a long time due to a full resync.

Update dm-raid1 to optionally ignore these failures so bios continue
being issued without interrupton.  To enable this feature userspace
must pass "keep_log" when creating the dm-raid1 device.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Tested-by: Liuhua Wang <lwang@suse.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:02 -04:00
Geert Uytterhoeven
f4ad317aed dm log writes: use ULL suffix for 64-bit constants
On 32-bit:
drivers/md/dm-log-writes.c: In function ‘log_super’:
drivers/md/dm-log-writes.c:323: warning: integer constant is too large for ‘long’ type

Add a ULL suffix to WRITE_LOG_MAGIC to fix this.
Also add a ULL suffix to WRITE_LOG_VERSION as it's stored in a __le64
field.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:01 -04:00
Luis Henriques
e223e1de4f dm stripe: drop useless exit point from dm_stripe_init()
Signed-off-by: Luis Henriques <luis.henriques@canonical.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:01 -04:00
Heinz Mauelshagen
0cf4503174 dm raid: add support for the MD RAID0 personality
Add dm-raid access to the MD RAID0 personality to enable single zone
striping.

The following changes enable that access:
- add type definition to raid_types array
- make bitmap creation conditonal in super_validate(), because
  bitmaps are not allowed in raid0
- set rdev->sectors to the data image size in super_validate()
  to allow the raid0 personality to calculate the MD array
  size properly
- use mdddev(un)lock() functions instead of direct mutex_(un)lock()
  (wrapped in here because it's a trivial change)
- enhance raid_status() to always report full sync for raid0
  so that userspace checks for 100% sync will succeed and allow
  for resize (and takeover/reshape once added in future paches)
- enhance raid_resume() to not load bitmap in case of raid0
- add merge function to avoid data corruption (seen with readahead)
  that resulted from bio payloads that grew too large.  This problem
  did not occur with the other raid levels because it either did not
  apply without striping (raid1) or was avoided via stripe caching.
- raise version to 1.7.0 because of the raid0 API change

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:00 -04:00
Heinz Mauelshagen
c76d53f43e dm raid: a few cleanups
- ensure maximum device limit in superblock
- rename DMPF_* (print flags) to CTR_FLAG_* (constructor flags)
  and their respective struct raid_set member
- use strcasecmp() in raid10_format_to_md_layout() as in the constructor

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:19:00 -04:00
Heinz Mauelshagen
0f4106b32f dm raid: fixup documentation for discard support
Remove comment above parse_raid_params() that claims
"devices_handle_discard_safely" is a table line argument when it is
actually is a module parameter.

Also, backfill dm-raid target version 1.6.0 documentation.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Reviewed-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:59 -04:00
Mike Snitzer
49f154c732 dm thin metadata: remove in-core 'read_only' flag
Leverage the block manager's read_only flag instead of duplicating it;
access with new dm_bm_is_read_only() method.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:59 -04:00
Mike Snitzer
f8ae75253e dm thin: cleanup schedule_zero() to read more logically
The overwrite has only ever about optimizing away the need to zero a
block if the entire block was being overwritten.  As such it is only
relevant when zeroing is enabled.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
2015-05-29 14:18:58 -04:00
Mike Snitzer
8b908f8e94 dm thin: cleanup overwrite's endio restore to be centralized
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:58 -04:00
Mike Snitzer
0f20972f7b dm: factor out a common cleanup_mapped_device()
Introduce a single common method for cleaning up a DM device's
mapped_device.  No functional change, just eliminates duplication of
delicate mapped_device cleanup code.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:58 -04:00
Mike Snitzer
2d76fff18f dm: cleanup methods that requeue requests
More often than not a request that is requeued _is_ mapped (meaning the
clone request is allocated and clone->q is initialized).  Rename
dm_requeue_unmapped_original_request() to avoid potential confusion due
to function name containing "unmapped".

Also, remove dm_requeue_unmapped_request() since callers can easily call
the dm_requeue_original_request() directly.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:57 -04:00
Mike Snitzer
cbc4e3c135 dm: do not allocate any mempools for blk-mq request-based DM
Do not allocate the io_pool mempool for blk-mq request-based DM
(DM_TYPE_MQ_REQUEST_BASED) in dm_alloc_rq_mempools().

Also refine __bind_mempools() to have more precise awareness of which
mempools each type of DM device uses -- avoids mempool churn when
reloading DM tables (particularly for DM_TYPE_REQUEST_BASED).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 14:18:57 -04:00
Mike Snitzer
183f7802e7 Merge remote-tracking branch 'jens/for-4.2/core' into dm-4.2 2015-05-29 14:17:16 -04:00
Joe Thornber
1c220c69ce dm: fix casting bug in dm_merge_bvec()
dm_merge_bvec() was originally added in f6fccb ("dm: introduce
merge_bvec_fn").  In that commit a value in sectors is converted to
bytes using << 9, and then assigned to an int.  This code made
assumptions about the value of BIO_MAX_SECTORS.

A later commit 148e51 ("dm: improve documentation and code clarity in
dm_merge_bvec") was meant to have no functional change but it removed
the use of BIO_MAX_SECTORS in favor of using queue_max_sectors().  At
this point the cast from sector_t to int resulted in a zero value.  The
fallout being dm_merge_bvec() would only allow a single page to be added
to a bio.

This interim fix is minimal for the benefit of stable@ because the more
comprehensive cleanup of passing a sector_t to all DM targets' merge
function will impact quite a few DM targets.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.19+
2015-05-29 13:41:16 -04:00
Junichi Nomura
15b94a6904 dm: fix reload failure of 0 path multipath mapping on blk-mq devices
dm-multipath accepts 0 path mapping.

  # echo '0 2097152 multipath 0 0 0 0' | dmsetup create newdev

Such a mapping can be used to release underlying devices while still
holding requests in its queue until working paths come back.

However, once the multipath device is created over blk-mq devices,
it rejects reloading of 0 path mapping:

  # echo '0 2097152 multipath 0 0 1 1 queue-length 0 1 1 /dev/sda 1' \
      | dmsetup create mpath1
  # echo '0 2097152 multipath 0 0 0 0' | dmsetup load mpath1
  device-mapper: reload ioctl on mpath1 failed: Invalid argument
  Command failed

With following kernel message:
  device-mapper: ioctl: can't change device type after initial table load.

DM tries to inherit the current table type using dm_table_set_type()
but it doesn't work as expected because of unnecessary check about
whether the target type is hybrid or not.

Hybrid type is for targets that work as either request-based or bio-based
and not required for blk-mq or non blk-mq checking.

Fixes: 65803c2059 ("dm table: train hybrid target type detection to select blk-mq if appropriate")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 13:41:16 -04:00
Linus Torvalds
c492e2d464 Assorted fixes for new RAID5 stripe-batching functionality.
Unfortunately this functionality was merged a little prematurely.
 The necessary testing and code review is now complete (or as
 complete as it can be) and to code passes a variety of tests
 and looks quite sensible.
 
 Also a fix for some recent locking changes - a race was introduced
 which causes a reshape request to sometimes fail.  No data safety issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVWf/sDnsnt1WYoG5AQJ8hQ/+KIGUijacXXUXBE4QuO1DMTkltV61bk6E
 TJQ6fTuMvXeOuyGm+BoSFTJrOJiP6/PVxl4jnAkjLlvAK/JVKekG0PXv2flmD9EJ
 udK/g8d2k+4L2O0uiGdGSfOQaEaQ4OQvNmQOP9GF/FXNdyYfZbSJnxG+kzWnStGZ
 3LNEMoDok9TiUDVSJ3PgibnUHYr3zNJFjBGszfRW0HqXBRWM5TI6HQ0bWwrm61mQ
 sIOvFeS7CVOBQWW7zkY3uvz/g7dpuPlXqmDOomF+prKlU320SrpSDDBD2Qg56rXh
 8YGAzLPV8R6xB5hjGFnoHtvxF/f5Fntb3WbC5az0zv+q/phDYA9Nd2UN5APemyGB
 PJuxW4Ojq2DWIAvmf0HQEkvjJlqeugCgIQXJJ8yvIaBXJJjit1jMSEXjolM4vlLh
 h6Su/hwoyTi9NxdYpFeR6JuyHjzTyrjyBkbW8y12wVQjmDncBdKtieYZX4TvPxVz
 n7Qrk2bpFhR/icP6eYWCvt6iwU1e+5lXNb/18AYm9bJe5BE5/N1X0azrxbdZT4cl
 1DvQw2HAMBGp+nSr+R1lqO4yX+busBZUTYsaGvH4T7Ubs+UjwgTE3tPoevj6w829
 0/7r/UPfSn0XFbd5rrPY+bOBsAOIMDG5g3mj7K7+38sVeX9VOVN4sGftS5dWTr9e
 RQBTZAK0+qI=
 =Y0Vm
 -----END PGP SIGNATURE-----

Merge tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md

Pull m,ore md bugfixes gfrom Neil Brown:
 "Assorted fixes for new RAID5 stripe-batching functionality.

  Unfortunately this functionality was merged a little prematurely.  The
  necessary testing and code review is now complete (or as complete as
  it can be) and to code passes a variety of tests and looks quite
  sensible.

  Also a fix for some recent locking changes - a race was introduced
  which causes a reshape request to sometimes fail.  No data safety
  issues"

* tag 'md/4.1-rc5-fixes' of git://neil.brown.name/md:
  md: fix race when unfreezing sync_action
  md/raid5: break stripe-batches when the array has failed.
  md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
  md/raid5: be more selective about distributing flags across batch.
  md/raid5: add handle_flags arg to break_stripe_batch_list.
  md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
  md/raid5: remove condition test from check_break_stripe_batch_list.
  md/raid5: Ensure a batch member is not handled prematurely.
  md/raid5: close race between STRIPE_BIT_DELAY and batching.
  md/raid5: ensure whole batch is delayed for all required bitmap updates.
2015-05-29 10:35:21 -07:00
Mike Snitzer
e5d8de32cc dm: fix false warning in free_rq_clone() for unmapped requests
When stacking request-based dm device on non blk-mq device and
device-mapper target could not map the request (error target is used,
multipath target with all paths down, etc), the WARN_ON_ONCE() in
free_rq_clone() will trigger when it shouldn't.

The warning was added by commit aa6df8d ("dm: fix free_rq_clone() NULL
pointer when requeueing unmapped request").  But free_rq_clone() with
clone->q == NULL is valid usage for the case where
dm_kill_unmapped_request() initiates request cleanup.

Fix this false warning by just removing the WARN_ON -- it only generated
false positives and was never useful in catching the intended case
(completing clone request not being mapped e.g. clone->q being NULL).

Fixes: aa6df8d ("dm: fix free_rq_clone() NULL pointer when requeueing unmapped request")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-29 11:07:36 -04:00
NeilBrown
56ccc1125b md: fix race when unfreezing sync_action
A recent change removed the need for locking around writing
to "sync_action" (and various other places), but introduced a
subtle race.
When e.g. setting 'reshape' on a 'frozen' array, the 'frozen'
flag is cleared before 'reshape' is set, so the md thread can
get in and start trying recovery - which isn't wanted.

So instead of clearing MD_RECOVERY_FROZEN for any command
except 'frozen', only clear it when each specific command
is parsed.  This allows the handling of 'reshape' to clear
the bit while a lock is held.

Also remove some places where we set MD_RECOVERY_NEEDED,
as it is always set on non-error exit of the function.


Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 6791875e2e ("md: make reconfig_mutex optional for writes to md sysfs files.")
2015-05-28 18:04:45 +10:00
NeilBrown
626f2092c8 md/raid5: break stripe-batches when the array has failed.
Once the array has too much failure, we need to break
stripe-batches up so they can all be dealt with.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:48:59 +10:00
NeilBrown
787b76fa37 md/raid5: call break_stripe_batch_list from handle_stripe_clean_event
Now that the code in break_stripe_batch_list() is nearly identical
to the end of handle_stripe_clean_event, replace the later
with a function call.

The only remaining difference of any interest is the masking that is
applieds to dev[i].flags copied from head_sh.
R5_WriteError certainly isn't wanted as it is set per-stripe, not
per-patch.  R5_Overlap isn't wanted as it is explicitly handled.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:47:02 +10:00
NeilBrown
1b956f7a8f md/raid5: be more selective about distributing flags across batch.
When a batch of stripes is broken up, we keep some of the flags
that were per-stripe, and copy other flags from the head to all
others.

This only happens while a stripe is being handled, so many of the
flags are irrelevant.

The "SYNC_FLAGS" (which I've renamed to make it clear there are
several) and STRIPE_DEGRADED are set per-stripe and so need to be
preserved.  STRIPE_INSYNC is the only flag that is set on the head
that needs to be propagated to all others.

For safety, add a WARN_ON if others are set, except:
 STRIPE_HANDLE - this is safe and per-stripe and we are going to set
      in several cases anyway
 STRIPE_INSYNC
 STRIPE_IO_STARTED - this is just a hint and doesn't hurt.
 STRIPE_ON_PLUG_LIST
 STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched
           stripe to be on one of these lists, but it can happen
           as can be safely ignored.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:40:01 +10:00
NeilBrown
3960ce7961 md/raid5: add handle_flags arg to break_stripe_batch_list.
When we break a stripe_batch_list we sometimes want to set
STRIPE_HANDLE on the individual stripes, and sometimes not.

So pass a 'handle_flags' arg.  If it is zero, always set STRIPE_HANDLE
(on non-head stripes).  If not zero, only set it if any of the given
flags are present.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:39:30 +10:00
NeilBrown
fb642b92c2 md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list
break_stripe_batch list didn't clear head_sh->batch_head.
This was probably a bug.

Also clear all R5_Overlap flags and if any were cleared, wake up
'wait_for_overlap'.
This isn't always necessary but the worst effect is a little
extra checking for code that is waiting on wait_for_overlap.

Also, don't use wake_up_nr() because that does the wrong thing
if 'nr' is zero, and it number of flags cleared doesn't
strongly correlate with the number of threads to wake.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:36:25 +10:00
NeilBrown
4e3d62ff49 md/raid5: remove condition test from check_break_stripe_batch_list.
handle_stripe_clean_event() contains a chunk of code very
similar to check_break_stripe_batch_list().
If we make the latter more like the former, we can end up
with just one copy of this code.

This  first step removed the condition (and the 'check_') part
of the name.  This has the added advantage of making it clear
what check is being performed at the point where the function is
called.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:36:06 +10:00
NeilBrown
b15a9dbdbf md/raid5: Ensure a batch member is not handled prematurely.
If a stripe is a member of a batch, but not the head, it must
not be handled separately from the rest of the batch.

'clear_batch_ready()' handles this requirement to some
extent but not completely.  If a member is passed to handle_stripe()
a second time it returns '0' indicating the stripe can be handled,
which is wrong.
So add an extra test.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:35:47 +10:00
NeilBrown
d0852df543 md/raid5: close race between STRIPE_BIT_DELAY and batching.
When we add a write to a stripe we need to make sure the bitmap
bit is set.  While doing that the stripe is not locked so it could
be added to a batch after which further changes to STRIPE_BIT_DELAY
and ->bm_seq are ineffective.

So we need to hold off adding to a stripe until bitmap_startwrite has
completed at least once, and we need to avoid further changes to
STRIPE_BIT_DELAY once the stripe has been added to a batch.

If a bitmap_startwrite() completes after the stripe was added to a
batch, it will not have set the bit, only incremented a counter, so no
extra delay of the stripe is needed.

Reported-by: Shaohua Li <shli@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:34:40 +10:00
NeilBrown
2b6b245742 md/raid5: ensure whole batch is delayed for all required bitmap updates.
When we add a stripe to a batch, we need to be sure that
head stripe will wait for the bitmap update required for the new
stripe.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-28 11:29:14 +10:00
Mike Snitzer
45714fbed4 dm: requeue from blk-mq dm_mq_queue_rq() using BLK_MQ_RQ_QUEUE_BUSY
Use BLK_MQ_RQ_QUEUE_BUSY to requeue a blk-mq request directly from the
DM blk-mq device's .queue_rq.  This cleans up the previous convoluted
handling of request requeueing that would return BLK_MQ_RQ_QUEUE_OK
(even though it wasn't) and then run blk_mq_requeue_request() followed
by blk_mq_kick_requeue_list().

Also, document that DM blk-mq ontop of old request_fn devices cannot
fail in clone_rq() since the clone request is preallocated as part of
the pdu.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-27 17:37:23 -04:00
Mike Snitzer
4c6dd53dd3 dm mpath: fix leak of dm_mpath_io structure in blk-mq .queue_rq error path
Otherwise kmemleak reported:

unreferenced object 0xffff88009b14e2b0 (size 16):
  comm "fio", pid 4274, jiffies 4294978034 (age 1253.210s)
  hex dump (first 16 bytes):
    40 12 f3 99 01 88 ff ff 00 10 00 00 00 00 00 00  @...............
  backtrace:
    [<ffffffff81600029>] kmemleak_alloc+0x49/0xb0
    [<ffffffff811679a8>] kmem_cache_alloc+0xf8/0x160
    [<ffffffff8111c950>] mempool_alloc_slab+0x10/0x20
    [<ffffffff8111cb37>] mempool_alloc+0x57/0x150
    [<ffffffffa04d2b61>] __multipath_map.isra.17+0xe1/0x220 [dm_multipath]
    [<ffffffffa04d2cb5>] multipath_clone_and_map+0x15/0x20 [dm_multipath]
    [<ffffffffa02889b5>] map_request.isra.39+0xd5/0x220 [dm_mod]
    [<ffffffffa028b0e4>] dm_mq_queue_rq+0x134/0x240 [dm_mod]
    [<ffffffff812cccb5>] __blk_mq_run_hw_queue+0x1d5/0x380
    [<ffffffff812ccaa5>] blk_mq_run_hw_queue+0xc5/0x100
    [<ffffffff812ce350>] blk_sq_make_request+0x240/0x300
    [<ffffffff812c0f30>] generic_make_request+0xc0/0x110
    [<ffffffff812c0ff2>] submit_bio+0x72/0x150
    [<ffffffff811c07cb>] do_blockdev_direct_IO+0x1f3b/0x2da0
    [<ffffffff811c166e>] __blockdev_direct_IO+0x3e/0x40
    [<ffffffff8120aa1a>] ext4_direct_IO+0x1aa/0x390

Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
2015-05-27 17:37:22 -04:00
Junichi Nomura
3a1407559a dm: fix NULL pointer when clone_and_map_rq returns !DM_MAPIO_REMAPPED
When stacking request-based DM on blk_mq device, request cloning and
remapping are done in a single call to target's clone_and_map_rq().
The clone is allocated and valid only if clone_and_map_rq() returns
DM_MAPIO_REMAPPED.

The "IS_ERR(clone)" check in map_request() does not cover all the
!DM_MAPIO_REMAPPED cases that are possible (E.g. if underlying devices
are not ready or unavailable, clone_and_map_rq() may return
DM_MAPIO_REQUEUE without ever having established an ERR_PTR).  Fix this
by explicitly checking for a return that is not DM_MAPIO_REMAPPED in
map_request().

Without this fix, DM core may call setup_clone() for a NULL clone
and oops like this:

   BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
   IP: [<ffffffff81227525>] blk_rq_prep_clone+0x7d/0x137
   ...
   CPU: 2 PID: 5793 Comm: kdmwork-253:3 Not tainted 4.0.0-nm #1
   ...
   Call Trace:
    [<ffffffffa01d1c09>] map_tio_request+0xa9/0x258 [dm_mod]
    [<ffffffff81071de9>] kthread_worker_fn+0xfd/0x150
    [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
    [<ffffffff81071cec>] ? kthread_parkme+0x24/0x24
    [<ffffffff81071fdd>] kthread+0xe6/0xee
    [<ffffffff81093a59>] ? put_lock_stats+0xe/0x20
    [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b
    [<ffffffff814c2d98>] ret_from_fork+0x58/0x90
    [<ffffffff81071ef7>] ? __init_kthread_worker+0x5b/0x5b

Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
2015-05-27 09:48:51 -04:00
Junichi Nomura
4ae9944d13 dm: run queue on re-queue
Without kicking queue, requeued request may stay forever in
the queue if there are no other I/O activities to the device.

The original error had been in v2.6.39 with commit 7eaceaccab
("block: remove per-queue plugging"), which replaced conditional
plugging by periodic runqueue.

Commit 9d1deb83d4 in v4.1-rc1 removed the periodic runqueue
and the problem started to manifest.

Fixes: 9d1deb83d4 ("dm: don't schedule delayed run of the queue if nothing to do")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-26 09:57:36 -04:00
Linus Torvalds
a30ec4b347 md fixes for 4.1-rc4
- one serious RAID0 data corruption - caused by recent bugfix that wasn't
   reviewed properly.
 - one raid5 fix in new code (a couple more of those to come).
 - one little fix to stop static analysis complaining about silly rcu
   annotation.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVV6rmDnsnt1WYoG5AQICpQ//VleOWnZERlVQN1nYzGHqIyPUz+dV6x8W
 RfzccMzUWqaHpZY/BaTMqJxPWgt6lxqqtN5F3u+XMdM+L2Pl0NXoRTqB+FzpLucR
 d0p76FwbW2IWnycxXUZtGjm8xT942KPdqXIP+LQRKaIhguV9P9th9WbPRvlXFfB6
 pO1h1AdxVlsfn6qgBpe49pYoUuNmpwSTTJ8Gvb7YRsHXnmB/lsjQvVsWLiMqPwug
 +bR0gM07OvTfYenrY82ztu42+xAoVoeu/XhHmPPLEV3S/SA/9kVLgBcWjq2bBjw0
 REzR2i+exEtCRe2yvetDX8320IKZcZSOE8BwBB2iUm8OVjplPxw83Bk/kB2+Zr8n
 VxAa9/vCZPLNCWyzOSi2V471NeFBys9PVkVIroHT5nQZGadkw0JYt1cEbcHSBKS/
 NYmJCu1IM0o9DPzn8jW7sXjDcB0WE8VwFMJ09QSiEYQu2bAb7j2f1pcoweN2MeNF
 qOitGj/luygXg2TBvema9qPL533DUj+eSYyyO7OQBsfkfZSDM5di4bGoZBVw118Q
 kARaJ3IZffT8gsVpgSTCsviDv5QUcs0izVS1sMKB6/SDJjtLXBMERW+LkrmjBsnB
 6+CMmwNZUIzugzBgSPBVozdNQQYjAXL7LNV6Q9YQhE0RK0+10fLLxmpovJR62bRf
 PWlg23+n4+E=
 =jXKM
 -----END PGP SIGNATURE-----

Merge tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "I have a few more raid5 bugfixes pending, but I want them to get a bit
  more review first.  In the meantime:

   - one serious RAID0 data corruption - caused by recent bugfix that
     wasn't reviewed properly.

   - one raid5 fix in new code (a couple more of those to come).

   - one little fix to stop static analysis complaining about silly rcu
     annotation"

* tag 'md/4.1-rc4-fixes' of git://neil.brown.name/md:
  md/bitmap: remove rcu annotation from pointer arithmetic.
  md/raid0: fix restore to sector variable in raid0_make_request
  raid5: fix broken async operation chain
2015-05-22 15:10:07 -07:00
Christoph Hellwig
5f1b670d0b block, dm: don't copy bios for request clones
Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.

This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence.  With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.

I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:57 -06:00
Mike Snitzer
326e1dbb57 block: remove management of bi_remaining when restoring original bi_end_io
Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-22 08:58:55 -06:00
NeilBrown
8532e34390 md/bitmap: remove rcu annotation from pointer arithmetic.
Evaluating  "&mddev->disks" is simple pointer arithmetic, so
it does not need 'rcu' annotations - no dereferencing is happening.

Also enhance the comment to explain that 'rdev' in that case
is not actually a pointer to an rdev.

Reported-by: Patrick Marlier <patrick.marlier@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-21 09:14:41 +10:00
Eric Work
a81157768a md/raid0: fix restore to sector variable in raid0_make_request
The variable "sector" in "raid0_make_request()" was improperly updated
by a call to "sector_div()" which modifies its first argument in place.
Commit 47d68979cc restored this variable
after the call for later re-use.  Unfortunetly the restore was done after
the referenced variable "bio" was advanced.  This lead to the original
value and the restored value being different.  Here we move this line to
the proper place.

One observed side effect of this bug was discarding a file though
unlinking would cause an unrelated file's contents to be discarded.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 47d68979cc ("md/raid0: fix bug with chunksize not a power of 2.")
Cc: stable@vger.kernel.org (any that received above backport)
URL: https://bugzilla.kernel.org/show_bug.cgi?id=98501
2015-05-21 09:14:25 +10:00
Shaohua Li
487696957e raid5: fix broken async operation chain
ops_run_reconstruct6() doesn't correctly chain asyn operations. The tx returned
by async_gen_syndrome should be added as the dependent tx of next stripe.

The issue is introduced by commit 59fc630b8b
    RAID5: batch adjacent full stripe write

Reported-and-tested-by: Maxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-21 09:14:20 +10:00
Linus Torvalds
c91aa67eed A few fixes for md.
Most of these are related to the new "batched stripe writeout",
 but there are a few others.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVVAgfznsnt1WYoG5AQKNsBAAw+yXVnRlLfzTPOhiIolvdmdmck6ghpV7
 eTHTbDEVkDo+a+caU01mhkHtL2wEuxGqr4XHTASIM9Xjn2YLhH14HOtBPl6+IddR
 F58X+SN8/pK4tzc5P42AAbfGfqeJSBPSl49L94+Pd9BsagNYZOUx7c/WK5lKay6O
 ZQRvI3Bafg7VTzHCw931sN11yY/PS+45zpOentk5TlavRPcZfuVTh1QzSyAUvbtN
 sAFK2+ISflj8+TW0rLo44c3kkC7ARFKcq5lJt4WsQUntJTjiIROvqvaEP+IUZkGr
 y+yvCJbskJ6B03XrcuvdkFrp3Qe0yWtHCN6qKhc8AI/c/pcuz24xTYQ6V6LDyAy8
 N2z6m1Kri97qxZpOONNh31WOCmVyFmwiz7+54soEeTDg3xgTb60UiZe8T+9ZcxoY
 sN8ksVSGGapgQUUGYcidbZla6uSWNjZq9hQgDJFUwG3HhvT3E/qiUT9kqAxfPsLE
 Vxw/KQ5GkW+7ldaPmGuv29hc3THjsiLY5Bx0kBlmgao3fcT61WbNcZPJOQfMxlT7
 6qoJSiNoq612TaHj3X06FXhMGE/fwjwjOO+PpULxM47giHcKrVR/qlG4kREYa2vj
 UgV05rP/S+k3Mce4dHzvUuE57t0ADff0d/mZvhfzERjNdwTMdWgia89XLg7D06rk
 UGB/o2IwWug=
 =tapF
 -----END PGP SIGNATURE-----

Merge tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md

Pull md bugfixes from Neil Brown:
 "A few fixes for md.

  Most of these are related to the new "batched stripe writeout", but
  there are a few others"

* tag 'md/4.1-rc3-fixes' of git://neil.brown.name/md:
  md/raid5: fix handling of degraded stripes in batches.
  md/raid5: fix allocation of 'scribble' array.
  md/raid5: don't record new size if resize_stripes fails.
  md/raid5: avoid reading parity blocks for full-stripe write to degraded array
  md/raid5: more incorrect BUG_ON in handle_stripe_fill.
  md/raid5: new alloc_stripe() to allocate an initialize a stripe.
  md-raid0: conditional mddev->queue access to suit dm-raid
2015-05-11 10:33:31 -07:00
Linus Torvalds
5d5df5ee7c Two additional fixes for changes introduced via DM during the 4.1 merge
window: The first reverts a dm-crypt change that wasn't correct.  The
 second fixes a device format regression that impacted userspace.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVTRZOAAoJEMUj8QotnQNaW7EIAJlHn+S8czm1Cb4gWBn7kg+X
 vzH5NIzr/SpDX8o3R8NBdrB8rgqTm4jQZrptbmgLG+j9XoaQupuFyNCiaAw47v2G
 P/WYlodNwTkb3I48XjwCRo00MtR3cEJ8ywNzvEJUvgPkgMMIzhieHsVT9L8bZv3n
 XDs8JzZyF966U0BeCjF4oDAazUrpEvWf0h4C5L47g8C0UQI7aGwYKoSvZm3DAImP
 awbJbnqtQuoRcI0HISHrjYi1vghgnmJY6aSx3tYSJPTNRkFNqgap7eZrUacicnOH
 bUVL3snBVebK3JMJhJXgfGW/FeeP9juhEY08JNTOZ5wa6BNuru0GHeqKuI3arHY=
 =jlAN
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:
 "Two additional fixes for changes introduced via DM during the 4.1
  merge window.

  The first reverts a dm-crypt change that wasn't correct.  The second
  fixes a device format regression that impacted userspace"

* tag 'dm-4.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  init: fix regression by supporting devices with major:minor:offset format
  Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"
2015-05-08 20:38:21 -07:00
Linus Torvalds
1daac193f2 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A collection of fixes since the merge window;

   - fix for a double elevator module release, from Chao Yu.  Ancient bug.

   - the splice() MORE flag fix from Christophe Leroy.

   - a fix for NVMe, fixing a patch that went in in the merge window.
     From Keith.

   - two fixes for blk-mq CPU hotplug handling, from Ming Lei.

   - bdi vs blockdev lifetime fix from Neil Brown, fixing and oops in md.

   - two blk-mq fixes from Shaohua, fixing a race on queue stop and a
     bad merge issue with FUA writes.

   - division-by-zero fix for writeback from Tejun.

   - a block bounce page accounting fix, making sure we inc/dec after
     bouncing so that pre/post IO pages match up.  From Wang YanQing"

* 'for-linus' of git://git.kernel.dk/linux-block:
  splice: sendfile() at once fails for big files
  blk-mq: don't lose requests if a stopped queue restarts
  blk-mq: fix FUA request hang
  block: destroy bdi before blockdev is unregistered.
  block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
  elevator: fix double release of elevator module
  writeback: use |1 instead of +1 to protect against div by zero
  blk-mq: fix CPU hotplug handling
  blk-mq: fix race between timeout and CPU hotplug
  NVMe: Fix VPD B0 max sectors translation
2015-05-08 19:49:35 -07:00
NeilBrown
bb27051f9f md/raid5: fix handling of degraded stripes in batches.
There is no need for special handling of stripe-batches when the array
is degraded.

There may be if there is a failure in the batch, but STRIPE_DEGRADED
does not imply an error.

So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
degraded.
This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
check_break_stripe_batch_list() and so the bitmap bit gets cleared
when it shouldn't.

So in check_break_stripe_batch_list(), split the batch up completely -
again STRIPE_DEGRADED isn't meaningful.

Also don't set STRIPE_BATCH_ERR when there is a write error to a
replacement device.  This simply removes the replacement device and
requires no extra handling.

Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 18:47:57 +10:00
NeilBrown
738a273806 md/raid5: fix allocation of 'scribble' array.
As the new 'scribble' array is sized based on chunk size,
we need to make sure the size matches the largest of 'old'
and 'new' chunk sizes when the array is undergoing reshape.

We also potentially need to resize it even when not resizing
the stripe cache, as chunk size can change without changing
number of devices.

So move the 'resize' code into a separate function, and
consider old and new sizes when allocating.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 46d5b78562 ("raid5: use flex_array for scribble data")
2015-05-08 18:47:48 +10:00
NeilBrown
6e9eac2dce md/raid5: don't record new size if resize_stripes fails.
If any memory allocation in resize_stripes fails we will return
-ENOMEM, but in some cases we update conf->pool_size anyway.

This means that if we try again, the allocations will be assumed
to be larger than they are, and badness results.

So only update pool_size if there is no error.

This bug was introduced in 2.6.17 and the patch is suitable for
-stable.

Fixes: ad01c9e375 ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
Cc: stable@vger.kernel.org (v2.6.17+)
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 18:47:35 +10:00
NeilBrown
10d82c5f0d md/raid5: avoid reading parity blocks for full-stripe write to degraded array
When performing a reconstruct write, we need to read all blocks
that are not being over-written .. except the parity (P and Q) blocks.

The code currently reads these (as they are not being over-written!)
unnecessarily.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ea664c8245 ("md/raid5: need_this_block: tidy/fix last condition.")
2015-05-08 18:47:17 +10:00
NeilBrown
b0c783b323 md/raid5: more incorrect BUG_ON in handle_stripe_fill.
It is not incorrect to call handle_stripe_fill() when
a batch of full-stripe writes is active.
It is, however, a BUG if fetch_block() then decides
it needs to actually fetch anything.

So move the 'BUG_ON' to where it belongs.

Signed-off-by: NeilBrown  <neilb@suse.de>
Fixes: 59fc630b8b ("RAID5: batch adjacent full stripe write")
2015-05-08 18:46:52 +10:00
NeilBrown
f18c1a35f6 md/raid5: new alloc_stripe() to allocate an initialize a stripe.
The new batch_lock and batch_list fields are being initialized in
grow_one_stripe() but not in resize_stripes().  This causes a crash
on resize.

So separate the core initialization into a new function and call it
from both allocation sites.

Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: 59fc630b8b ("RAID5: batch adjacent full stripe write")
2015-05-08 18:40:01 +10:00
Heinz Mauelshagen
b6538fe329 md-raid0: conditional mddev->queue access to suit dm-raid
This patch is a prerequisite for dm-raid "raid0" support to allow
dm-raid to access the MD RAID0 personality doing unconditional
accesses to mddev->queue, which is NULL in case of dm-raid stacked on
top of MD.

Most of the conditional mddev->queue accesses made it to upstream but
this missing one, which prohibits md raid0 to set disk stack limits
(being done in dm core in case of md underneath dm).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-05-08 18:39:40 +10:00
Jens Axboe
dac56212e8 bio: skip atomic inc/dec of ->bi_cnt for most use cases
Struct bio has a reference count that controls when it can be freed.
Most uses cases is allocating the bio, which then returns with a
single reference to it, doing IO, and then dropping that single
reference. We can remove this atomic_dec_and_test() in the completion
path, if nobody else is holding a reference to the bio.

If someone does call bio_get() on the bio, then we flag the bio as
now having valid count and that we must properly honor the reference
count when it's being put.

Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:49 -06:00
Jens Axboe
c4cf5261f8 bio: skip atomic inc/dec of ->bi_remaining for non-chains
Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.

Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.

For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.

Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05 13:32:47 -06:00
Rabin Vincent
c0403ec0bb Revert "dm crypt: fix deadlock when async crypto algorithm returns -EBUSY"
This reverts Linux 4.1-rc1 commit 0618764cb2.

The problem which that commit attempts to fix actually lies in the
Freescale CAAM crypto driver not dm-crypt.

dm-crypt uses CRYPTO_TFM_REQ_MAY_BACKLOG.  This means the the crypto
driver should internally backlog requests which arrive when the queue is
full and process them later.  Until the crypto hw's queue becomes full,
the driver returns -EINPROGRESS.  When the crypto hw's queue if full,
the driver returns -EBUSY, and if CRYPTO_TFM_REQ_MAY_BACKLOG is set, is
expected to backlog the request and process it when the hardware has
queue space.  At the point when the driver takes the request from the
backlog and starts processing it, it calls the completion function with
a status of -EINPROGRESS.  The completion function is called (for a
second time, in the case of backlogged requests) with a status/err of 0
when a request is done.

Crypto drivers for hardware without hardware queueing use the helpers,
crypto_init_queue(), crypto_enqueue_request(), crypto_dequeue_request()
and crypto_get_backlog() helpers to implement this behaviour correctly,
while others implement this behaviour without these helpers (ccp, for
example).

dm-crypt (before the patch that needs reverting) uses this API
correctly.  It queues up as many requests as the hw queues will allow
(i.e. as long as it gets back -EINPROGRESS from the request function).
Then, when it sees at least one backlogged request (gets -EBUSY), it
waits till that backlogged request is handled (completion gets called
with -EINPROGRESS), and then continues.  The references to
af_alg_wait_for_completion() and af_alg_complete() in that commit's
commit message are irrelevant because those functions only handle one
request at a time, unlink dm-crypt.

The problem is that the Freescale CAAM driver, which that commit
describes as having being tested with, fails to implement the
backlogging behaviour correctly.  In cam_jr_enqueue(), if the hardware
queue is full, it simply returns -EBUSY without backlogging the request.
What the observed deadlock was is not described in the commit message
but it is obviously the wait_for_completion() in crypto_convert() where
dm-crypto would wait for the completion being called with -EINPROGRESS
in the case of backlogged requests.  This completion will never be
completed due to the bug in the CAAM driver.

Commit 0618764cb2 incorrectly made dm-crypt wait for every request,
even when the driver/hardware queues are not full, which means that
dm-crypt will never see -EBUSY.  This means that that commit will cause
a performance regression on all crypto drivers which implement the API
correctly.

Revert it.  Correct backlog handling should be implemented in the CAAM
driver instead.

Cc'ing stable purely because commit 0618764cb2 did.  If for some reason
a stable@ kernel did pick up commit 0618764cb2 it should get reverted.

Signed-off-by: Rabin Vincent <rabin.vincent@axis.com>
Reviewed-by: Horia Geanta <horia.geanta@freescale.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-05-05 12:16:43 -04:00
Mike Snitzer
aa6df8dd28 dm: fix free_rq_clone() NULL pointer when requeueing unmapped request
Commit 022333427a ("dm: optimize dm_mq_queue_rq to _not_ use kthread if
using pure blk-mq") mistakenly removed free_rq_clone()'s clone->q check
before testing clone->q->mq_ops.  It was an oversight to discontinue
that check for 1 of the 2 use-cases for free_rq_clone():
1) free_rq_clone() called when an unmapped original request is requeued
2) free_rq_clone() called in the request-based IO completion path

The clone->q check made sense for case #1 but not for #2.  However, we
cannot just reinstate the check as it'd mask a serious bug in the IO
completion case #2 -- no in-flight request should have an uninitialized
request_queue (basic block layer refcounting _should_ ensure this).

The NULL pointer seen for case #1 is detailed here:
https://www.redhat.com/archives/dm-devel/2015-April/msg00160.html

Fix this free_rq_clone() NULL pointer by simply checking if the
mapped_device's type is DM_TYPE_MQ_REQUEST_BASED (clone's queue is
blk-mq) rather than checking clone->q->mq_ops.  This avoids the need to
dereference clone->q, but a WARN_ON_ONCE is added to let us know if an
uninitialized clone request is being completed.

Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-30 10:25:21 -04:00
Christoph Hellwig
3e6180f0c8 dm: only initialize the request_queue once
Commit bfebd1cdb4 ("dm: add full blk-mq support to request-based DM")
didn't properly account for the need to short-circuit re-initializing
DM's blk-mq request_queue if it was already initialized.

Otherwise, reloading a blk-mq request-based DM table (either manually
or via multipathd) resulted in errors, see:
 https://www.redhat.com/archives/dm-devel/2015-April/msg00132.html

Fix is to only initialize the request_queue on the initial table load
(when the mapped_device type is assigned).

This is better than having dm_init_request_based_blk_mq_queue() return
early if the queue was already initialized because it elevates the
constraint to a more meaningful location in DM core.  As such the
pre-existing early return in dm_init_request_based_queue() can now be
removed.

Fixes: bfebd1cdb4 ("dm: add full blk-mq support to request-based DM")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2015-04-30 10:25:21 -04:00
NeilBrown
6cd18e711d block: destroy bdi before blockdev is unregistered.
Because of the peculiar way that md devices are created (automatically
when the device node is opened), a new device can be created and
registered immediately after the
	blk_unregister_region(disk_devt(disk), disk->minors);
call in del_gendisk().

Therefore it is important that all visible artifacts of the previous
device are removed before this call.  In particular, the 'bdi'.

Since:
commit c4db59d31e
Author: Christoph Hellwig <hch@lst.de>
    fs: don't reassign dirty inodes to default_backing_dev_info

moved the
   device_unregister(bdi->dev);
call from bdi_unregister() to bdi_destroy() it has been quite easy to
lose a race and have a new (e.g.) "md127" be created after the
blk_unregister_region() call and before bdi_destroy() is ultimately
called by the final 'put_disk', which must come after del_gendisk().

The new device finds that the bdi name is already registered in sysfs
and complains

> [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
> [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'

We can fix this by moving the bdi_destroy() call out of
blk_release_queue() (which can happen very late when a refcount
reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
device driver calls it.

Then it is only necessary for md to call blk_cleanup_queue() before
del_gendisk().  As loop.c devices are also created on demand by
opening the device node, we make the same change there.

Fixes: c4db59d31e
Reported-by: Azat Khuzhin <a3at.mail@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org (v4.0)
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-27 10:27:20 -06:00
Linus Torvalds
474095e46c md updates for 4.1
Highlights:
 
 - "experimental" code for managing md/raid1 across a cluster using
   DLM.  Code is not ready for general use and triggers a WARNING if used.
   However it is looking good and mostly done and having in mainline
   will help co-ordinate development.
 - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
   handle a full (chunk wide) stripe as a single unit.
 - RAID6 can now perform read-modify-write cycles which should
   help performance on larger arrays: 6 or more devices.
 - RAID5/6 stripe cache now grows and shrinks dynamically.  The value
   set is used as a minimum.
 - Resync is now allowed to go a little faster than the 'mininum' when
   there is competing IO.  How much faster depends on the speed of the
   devices, so the effective minimum should scale with device speed to
   some extent.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIVAwUAVTbIxDnsnt1WYoG5AQIgAA//Z9FlEpkHcJJ75WGjrXgGJPyNTfEZOkoz
 jnD8PpBY2Afp341vMatd0XKErdEAhuPQmJMAa+tbxht6pk77X3fSzXghGA8FEafg
 yazn5pfBt6NXmepV4vhl/+LNYuWRxxLSA9EDm7wEg+tiO0UEts+a0w2TSXzcT1w+
 30Yi1EjAlTaJ5yjlBHUOtTXWE43D6RKnUr6FMy2dRlnzlFyDlezRMChWo9v6OkQF
 YqJ20FcvmJdLHY/6Yif3jvpm7eQecMqdCZENTvW/mJI86zqf6E+ToCYS1VNjfDnK
 ud61iU9eshu4WtNWBG6KLuHBD0grO1NaEL7/S16w1KdNJMhYgiK8WussvIAJEesA
 5SlETM7Y/1XFq8puwlAq2/tuPfhZ+TFxnAwce/C3hMTDcYAACnS/R6INFQXqGvy3
 nX1NLogrCycX8oqxv3jTFKLVqIVwlkSlHcUGzIWjcfCF37StcXFKI5q862agyg2+
 NNocFMuXhPPM1YcB9JJSo2nCsor4e9tTdVEZlFm2B3cc8LJ9BLWUMSoi1h7VK/1g
 P7psnPIjz7/cdI2TZTFjGTZ0Kvhx/NTYp41AZealDNxeGWUNM+5xGZnUF8QRBc/E
 0dGHtEAah834BDQFvNnJtuuh/s+KwbvswjNP+njoBsHjIQIvngDABpOwpIkdqF6r
 diQ2gUPnHN0=
 =OHG6
 -----END PGP SIGNATURE-----

Merge tag 'md/4.1' of git://neil.brown.name/md

Pull md updates from Neil Brown:
 "More updates that usual this time.  A few have performance impacts
  which hould mostly be positive, but RAID5 (in particular) can be very
  work-load ensitive...  We'll have to wait and see.

  Highlights:

   - "experimental" code for managing md/raid1 across a cluster using
     DLM.  Code is not ready for general use and triggers a WARNING if
     used.  However it is looking good and mostly done and having in
     mainline will help co-ordinate development.

   - RAID5/6 can now batch multiple (4K wide) stripe_heads so as to
     handle a full (chunk wide) stripe as a single unit.

   - RAID6 can now perform read-modify-write cycles which should help
     performance on larger arrays: 6 or more devices.

   - RAID5/6 stripe cache now grows and shrinks dynamically.  The value
     set is used as a minimum.

   - Resync is now allowed to go a little faster than the 'mininum' when
     there is competing IO.  How much faster depends on the speed of the
     devices, so the effective minimum should scale with device speed to
     some extent"

* tag 'md/4.1' of git://neil.brown.name/md: (58 commits)
  md/raid5: don't do chunk aligned read on degraded array.
  md/raid5: allow the stripe_cache to grow and shrink.
  md/raid5: change ->inactive_blocked to a bit-flag.
  md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe
  md/raid5: pass gfp_t arg to grow_one_stripe()
  md/raid5: introduce configuration option rmw_level
  md/raid5: activate raid6 rmw feature
  md/raid6 algorithms: xor_syndrome() for SSE2
  md/raid6 algorithms: xor_syndrome() for generic int
  md/raid6 algorithms: improve test program
  md/raid6 algorithms: delta syndrome functions
  raid5: handle expansion/resync case with stripe batching
  raid5: handle io error of batch list
  RAID5: batch adjacent full stripe write
  raid5: track overwrite disk count
  raid5: add a new flag to track if a stripe can be batched
  raid5: use flex_array for scribble data
  md raid0: access mddev->queue (request queue member) conditionally because it is not set when accessed from dm-raid
  md: allow resync to go faster when there is competing IO.
  md: remove 'go_faster' option from ->sync_request()
  ...
2015-04-24 09:28:01 -07:00
Eric Mei
9ffc8f7cb9 md/raid5: don't do chunk aligned read on degraded array.
When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.

This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.

Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.

I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
 1) optimal, as baseline;
 2) degraded;
 3) degraded with this patch.
Kernel version is 4.0-rc3.

Each individual test I only did once so there might be some variations,
but we just focus on big trend.

Sequential Read:
  bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
   1024       1608            656              995
    512       1624            710              956
    256       1635            728              980
    128       1636            771              983
     64       1612           1119             1000
     32       1580           1420             1004
     16       1368            688              986
      8        768            647              953
      4        411            413              850

Random Read:
  bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
   1024        163            160              156
    512        274            273              272
    256        426            428              424
    128        576            592              591
     64        726            724              726
     32        849            848              837
     16        900            970              971
      8        927            940              929
      4        948            940              955

Some notes:
  * In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
  * In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
  * In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.

If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.

Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.


Signed-off-by: Eric Mei <eric.mei@seagate.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22 08:00:43 +10:00